Wednesday, 14 November 2007

A guide to server and workstation optimization

Server and workstation tuning is an ongoing process.
Believing that you are done only means that you don't know what else can be tuned.

This article should apply equally to FreeBSD 4.x and 5.x.


The method of tuning your system is heavily dependent on its function:

  • Will the system perform a lot of small network transactions?
  • Or a small number of large transactions?
  • How will disk operations factor in?

How you answer these and other questions determines what you need to do to improve the performance of your system.
There are several steps you can take before you need to start changing sysctls or re-arranging your partitions. How your applications are compiled plays a major role too. Beyond application compilation we will look at tuning the various other parts of our system including the network, disks, and other system control functions. I have tried not to duplicate the data in the tuning(7) man page, which already contains a wealth of good information on the basics of system performance tuning.


Optimizing software compiling

When source code is compiled, your compiler makes assumptions about your hardware in order to create compatible binaries. If you have an x86-compliant CPU, for example, your compiler will by default create binaries which can be run on any CPU from a 386 onwards. While this allows portability, any newer features your CPU offers (MMX, SSE, SSE2, 3DNow!, etc.) will not be used. So portability creates inefficiency. This is also why using pre-compiled binaries on your system is a sure-fire way to reduce your overall performance!

System tuning is best performed on a new system, before many packages are installed. The steps you take here will also affect any new software you install. We assume that your packages are installed from the ports collection (/usr/ports). These steps should be applicable to any other software you compile, and we will cover that later in this paper.

The first step to making sure your ports software will be compiled efficiently is to have good compiler flags set up. These are defined in /etc/make.conf. This file does not exist on new systems, but you can copy /etc/defaults/make.conf to /etc/make.conf.
Edit the file, and look for the line starting: #CPUTYPE=
Valid options for the CPUTYPE are listed in the file, in the paragraph above this line. My server is a P233/MMX, and my CPUTYPE line looks like: CPUTYPE=i586/mmx
What this does: The CPUTYPE option notifies the compiler of any special features your CPU has. The compiler will then, where possible, compile code to take advantage of these features. The disadvantage to this is that your compiled binaries may not run on different CPU types. As long as you aren't copying binaries from one server to another, this should not be a problem.
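If you are not sure exactly which CPU your machine has, a quick way to check before choosing a CPUTYPE (the output will vary with your hardware) is:

  • sysctl hw.model
  • dmesg | grep -i cpu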

Also in the file, look for the line: #CFLAGS= -O -pipe
Uncomment this line, and change it to: CFLAGS= -O2 -pipe -funroll-loops
What this does: The '-O2' flag sets the optimization level. GCC has the following possible optimization levels:

  • -O: Some optimizations are enabled, such as '-fthread-jumps' and '-fdefer-pop'
  • -O2: Nearly all supported optimizations which do not involve a speed/space tradeoff are turned on
  • -O3: Optimize even more. This option may cause the size of your binaries to increase
  • -Os: Optimize for size. Perform most of the optimizations in -O2 and some which reduce the code size

The '-pipe' option decreases the amount of time taken to compile software. When two compiler processes need to communicate data with each other, they can use temporary files on disk or pipes. As pipes do not require writing anything to disk, they can significantly decrease the time taken here.
Finally, '-funroll-loops' causes finite loops to be "unrolled". When a binary compiled with this option is run, the CPU does not have to run through every possible iteration of the loop to get its result. Instead, loops are replaced with their equivalent non-looping code. This saves one CPU register which would otherwise be tied up in tracking the iteration of the loop.
The gcc man page (man gcc) is a good resource for this.

Warning: It has been noted that for some users on FreeBSD 4.8 and 4.9, the -funroll-loops option causes SSHv2 with the base system OpenSSH to break. Installing the OpenSSH-portable port to overwrite the base install fixes this problem quickly and easily, and provides a newer version of OpenSSH:

  • cd /usr/ports/security/openssh-portable && \
    make install -DOPENSSH_OVERWRITE_BASE

The make.conf file also contains a line for CXXFLAGS. These options are similar to our CFLAGS options but are used for C++ code. If you are going to compile C++ code, you should take a look at this also.
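Putting the pieces together, the relevant /etc/make.conf entries from the example machine above would look like the following sketch; the CPUTYPE shown matches my P233/MMX and must be changed to suit your own CPU:

        # Example /etc/make.conf entries -- adjust CPUTYPE for your hardware
        CPUTYPE=i586/mmx
        CFLAGS= -O2 -pipe -funroll-loops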


Optimizing kernel compiling

Efficient kernel compiling is covered in my Kernel Tuning paper at: http://silverwraith.com/papers/freebsd-kernel.php


Optimizing network performance

How you optimize your system for networking depends on what your system will be doing. Below we will take a look at three common server applications: mail, file, and web serving.

Network throughput:

There are a number of steps which can be applied to all installations to improve network performance, and which should be taken by everyone.

Most modern network cards and switches support the ability to auto-negotiate the speed at which to communicate. While this reduces administration, it comes at the cost of network throughput. If your switch, server or workstation is set to use auto-negotiation, every few moments it stops transferring network traffic in order to renegotiate its speed. On low-bandwidth networks this performance degradation might be hard to spot, but on high-bandwidth networks it becomes very obvious: you have packet loss, you cannot achieve your full line speed, and your CPU usage is low. I would recommend that everyone read the man page for their network driver and manually define the network speed. This should, if possible, also be done on the network switch. Some simple $10 switches do not have interfaces to which you can log in to set this, but fortunately they usually do not renegotiate the network speed after the cable is plugged in, unless the network link is lost.
The network speed can either be set with ifconfig at run time, or in /etc/rc.conf for boot time. Here are two examples for /etc/rc.conf for the rl(4) and fxp(4) network drivers:

  • ifconfig_rl0="inet x.x.x.x netmask x.x.x.x media 100baseTX mediaopt full-duplex"
  • ifconfig_fxp0="inet x.x.x.x netmask x.x.x.x media 100baseTX mediaopt full-duplex"
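The same settings can also be applied immediately at runtime with ifconfig. For example, for the rl0 interface above (substitute your own interface name):

  • ifconfig rl0 media 100baseTX mediaopt full-duplex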

If you are fortunate enough to have one of the following network cards, you can take advantage of device polling (described below):

  • dc -- DEC/Intel 21143 and clone 10/100 ethernet driver
  • fxp -- Intel EtherExpress Pro/100B ethernet device driver
  • rl -- RealTek 8129/8139 fast ethernet device driver
  • sis -- SiS 900, SiS 7016 and NS DP83815 fast ethernet device driver

Note: If your card isn't listed here, do not give up hope! More drivers are being converted as demand comes in and you should look at the documentation for your driver to see if it is supported. If you're still unsure, join the freebsd-questions@freebsd.org mailing list from http://lists.freebsd.org/mailman/listinfo and ask there.

You can enable DEVICE_POLLING in your kernel. DEVICE_POLLING changes the method through which data gets from your network card to the kernel. Traditionally, each time the network card needs attention (for example when it receives a packet), it generates an interrupt request. The request causes a context switch and a call to an interrupt handler. A context switch is when the CPU and kernel have to switch between user land (the user's programs or daemons) and kernel land (dealing with device drivers, hardware, and other kernel-bound tasks). The last few years have seen significant improvements in the efficiency of context switching, but it is still an extremely expensive operation. Furthermore, the amount of time the system may have to spend dealing with an interrupt can be almost limitless. It is entirely possible for an interrupt to never free the kernel, leaving your machine unresponsive. Those of us unfortunate enough to be on the wrong side of certain denial of service attacks will know about this.

The DEVICE_POLLING option changes this behavior. It causes the kernel to poll the network card itself at certain predefined times: at defined intervals, during idle loops, or on clock interrupts. This allows the kernel to decide when it is most efficient to poll a device for updates and for how long, and ultimately results in a significant increase in performance.

If you want to take advantage of DEVICE_POLLING, you need to compile two options into your kernel:

  • options DEVICE_POLLING
  • options HZ=1000

The first line enables DEVICE_POLLING, and the second raises the clock interrupt rate to 1000 ticks per second (the default is 100). The second is needed because, in the worst case, your network card will only be polled on clock ticks: too low a rate leaves packets waiting between polls, while an excessively high rate would mean spending a lot of time polling devices, which defeats the purpose here.

Finally we need to change one sysctl to actually enable this feature. You can either enable polling at runtime or at boot. If you want to enable it at boot, add this line to the end of your /etc/sysctl.conf:

  • kern.polling.enable=1
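To enable polling immediately at runtime instead (the setting does not survive a reboot), the same sysctl can be set from the command line:

  • sysctl kern.polling.enable=1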

The DEVICE_POLLING option by default does not work with SMP enabled kernels. When the author of the DEVICE_POLLING code initially committed it, he admitted he was unsure of the benefits of the feature in a multiple-CPU environment, as only one CPU would be doing the polling. Since that time many administrators have found that there is a significant advantage to DEVICE_POLLING even in SMP enabled kernels and that it works with no problems at all. If you are compiling an SMP kernel with DEVICE_POLLING, edit the file /usr/src/sys/kern/kern_poll.c and remove the following lines:

        #ifdef SMP
        #include "opt_lint.h"
        #ifndef COMPILING_LINT
        #error DEVICE_POLLING is not compatible with SMP
        #endif
        #endif

Mail servers:

Mail servers typically have a very large number of network connections, which transfer a small amount of data for a short period of time before closing the connection. Here it is useful for us to have a large number of small network buffers.
Network buffer clusters are assigned two per connection, one for sending and one for receiving. The size of the buffer dictates how fast data will be able to funnel through the network, and in the event of a network delay how much data will be able to backlog on the server for that connection before there is a problem. Having a network buffer too small means data will be backlogged at the CPU waiting for the network to clear. This causes greater CPU overhead. Having a network buffer too large means that memory is wasted as the buffer will not be used efficiently. Finding this balance is key to tuning.

When we discuss simultaneous network connections, we refer to connections in any network state: SYN_SENT, SYN_RCVD, ESTABLISHED, TIME_WAIT, CLOSING, FIN_WAIT_1, FIN_WAIT_2, etc. Even if the network connection is in an ESTABLISHED state for only a few seconds, it can end up in any of the other states for a long time. I generally find that multiplying the number of ESTABLISHED connections by 8 leaves me with room to breathe in the event that I see an abnormally high surge of traffic inbound or outbound. I've come to this number over time through trial and error. So if you expect to have a peak of 128 servers sending you mail, having 2048 network buffer clusters would be good (128 * 2 per connection * 8). Also remember that connections can take up to two full minutes or more to close completely. So if you expect more than 128 mails in any given two-minute period, you also need to increase the number to accommodate that.

Another important value to control is the maximum number of sockets. One socket is created per network connection, and one per Unix domain socket connection. While remote servers and clients will connect to you over the network, more and more local applications are taking advantage of Unix domain sockets for inter-process communication. There is far less overhead as full TCP packets don't have to be constructed. The speed of Unix domain socket communication is also much faster as data does not have to go over the network stack but can instead go almost directly to the application. The number of sockets you'll need depends on what applications will be running. I would recommend starting with the same number as your network buffer clusters, and then tuning it as appropriate.

You can find out how many network buffer clusters are in use with the command netstat -m.
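For example, to see both current cluster usage and the configured limits (the same sysctl names that are set in /boot/loader.conf below):

  • netstat -m
  • sysctl kern.ipc.nmbclusters kern.ipc.maxsockets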

You can specify the values you want at the end of your /boot/loader.conf file as:

  • kern.ipc.nmbclusters=2048
  • kern.ipc.maxsockets=2048

Note: With any performance tuning, it is important to monitor your system after you make your changes. Did you go overboard, or underestimate what you would need? Always check and adjust accordingly. The numbers here might not be the exact ones that you need!

File servers:

Tuning the network for file servers is not unlike tuning mail servers. The main differences are:

  • File servers generally have longer-lived network connections
  • File servers usually transfer larger files than mail servers
  • File servers generally perform fewer transfers than mail servers

Again we come back to network buffer clusters. How many clients do you have? With file servers the chances of a spike in the number of connections is small, as the number of clients is fixed. Therefore we do not need to set aside large numbers of buffers to accommodate spikes. Multiplying the number of network buffers by two is good practice, and some admins prefer to multiply by four to accommodate multiple file transfers.

So if we have 128 clients connecting to the file server, we would set the number of network buffer clusters to 1024 (128 * 2 per connection * 4) in /boot/loader.conf:

  • kern.ipc.nmbclusters=1024
  • kern.ipc.maxsockets=1024

Note: With any performance tuning, it is important to monitor your system after you make your changes. Did you go overboard, or underestimate what you would need? Always check and adjust accordingly. The numbers here might not be the exact ones that you need!

Web servers:

Web servers are not unlike mail servers. Unless you are doing a lot of file serving over the Internet, you will have clients connecting to you for short periods of time. If you have more than one element on your web page, for example multiple images or frames, you can expect that the web browsers of clients will make multiple connections to you. Up to four connections per page served are certainly not uncommon. Also if your web pages use server-side scripting to connect to databases or other servers, you need to add a network connection for each of those.

Web servers, again like mail servers, go through periods of highs and lows. While on average you might serve 100 pages a minute, at your low you might serve 10 pages a minute and at peak over 1000 pages a minute. Whether you have 128MB of RAM or 1GB of RAM, you should try to be as liberal as possible in allocating memory to your network stack. Using the above example, at a peak of 1000 pages per minute, your clusters and sockets should be around 16384 (1000 pages * 2 per connection * 4 connections * 2 for growth) in /boot/loader.conf:

  • kern.ipc.nmbclusters=16384
  • kern.ipc.maxsockets=16384

Tuning Apache or other web servers is slightly outside the scope of this paper, as there is already a ton of excellent data available on the Internet which I could never hope to do justice to in this paper. A starting point I would recommend is Aleksey Tsalolikhin's notes from his Nov 2001 presentation to the Unix Users Association of Southern California on web server performance tuning: http://www.bolthole.com/uuala/webtuning.txt; it should lead you on to more wonderful things.

Note: With any performance tuning, it is important to monitor your system after you make your changes. Did you go overboard, or underestimate what you would need? Always check and adjust accordingly. The numbers here might not be the exact ones that you need!


Optimizing disk usage and throughput

Optimizing the disk subsystem on FreeBSD also depends on what you want to do with your system. It is very much installation dependent, so what I've done below is list the various factors and what they do. You can decide what is best for you.

  1. RAID:
    RAID is a method of spreading your data over multiple disks. There are two reasons why you might use RAID: for redundancy to prevent data loss, and for speed. The three most common types of RAID in use on small system installations are RAID0, RAID1 and RAID1+0 (sometimes referred to as RAID10).
    With RAID1 (also called mirroring), you use only two disks per partition, and keep the data on both disks identical. In the event that one disk is lost, you have your data on another disk. The speed advantage from RAID1 comes when reading. Your system can send multiple read requests to the disks, which will be performed in parallel. The disk whose heads are closest to the requested space will get the request to fetch the data. Writes are no faster than on a single disk. When a write request is sent, both disks must acknowledge that the write has completed before the write is finished.
    RAID0 (also called striping) spreads the data evenly over two or more disks. Data on one disk is not replicated on the others, so there is no redundancy to prevent data loss. But reads and writes are significantly faster as they happen on multiple disks at the same time. This increases your throughput and your maximum disk operations roughly in proportion to the number of disks you have; for example, four disks can give roughly four times the throughput of a single disk.
    RAID10 offers the best of both worlds and requires at least 4 disks. Half of the disks are striped with RAID0, and the stripe set is then mirrored on the remaining disks.
  2. Queue splitting:
    If you are running a mail server and feel that your system is being slowed by the speed of your disks, an alternative to RAID could be to split your queues. Most modern mail transfer agents (MTAs) have the ability to break up their large single queue directory into multiple smaller directories. These multiple directories can then be placed on different disks. There are several advantages to this:
    • A disk failure will only take out half or less of your queue
    • Your throughput on mail will not be disk bound
    • Opening 20 small directories is significantly faster than opening one huge directory
  3. Partitioning:
    Having separate partitions on separate disks can help a lot. For example, your system will always be performing different tasks at any one given time: writing log files, serving out data, and so on. The Unix directory structure is built around using different directories and partitions for different purposes. /usr is traditionally used for user data, /var is used for logs and mail spools, etc. Arrange these on different disks to best suit your needs; if you have disks of varying speeds on your system, place the most frequently used partitions on the faster disks (see the example /etc/fstab sketch after this list).
  4. IDE vs SCSI:
    Back in days of yore (the early 1990s), when disk performance was crucial, the choice was quite obviously to go for SCSI disks. SCSI provided faster throughput and less bottlenecking. SCSI disk sizes were significantly larger and more disks could fit in a single system. Times have changed, and so have the requirements of most users; the much sought-after disk sizes and faster throughputs are now available on IDE disks. SCSI disk sizes have also grown, but not as fast. SCSI disks still offer faster throughput, however. At the time of writing, the fastest IDE interfaces could push 133Mbyte/s, whereas the fastest SCSI interfaces could push 320Mbyte/s.
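As an illustration of the partitioning point above, here is a minimal /etc/fstab sketch that places /var (logs and mail spools) on a different disk from /usr. The device names (ad0, ad2) and partition letters are hypothetical and must match your own hardware:

        # Device        Mountpoint    FStype    Options    Dump    Pass#
        /dev/ad0s1b     none          swap      sw         0       0
        /dev/ad0s1a     /             ufs       rw         1       1
        /dev/ad0s1e     /usr          ufs       rw         2       2
        /dev/ad2s1e     /var          ufs       rw         2       2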
