Wednesday, 14 November 2007

Optimising FreeBSD and its kernel

This is a very simple step by step guide on optimising your FreeBSD server or workstation. It doesn't go into a great amount of detail, but after spending several months searching for one source of simple optimisation information and failing, I wrote this paper. All the suggestions listed here are known optimisations available to you if you know where to find them and have the time to do so. There is nothing secret, special or amazing in this paper, just information on how you can optimise your system.
It can mostly be applied to the other BSDs too, but not Linux. There are plenty of Linux documents out there, so go find one of those. I'm sure there are several HOWTOs. This document is true as of the release of FreeBSD 4.8. Some parts of it, such as optimising your kernel, can also be applied to some previous releases.


Before we proceed any further you should know that as with any system tuning, we have to accept the possibility that something will break. If you're going to recompile the kernel, please read the following URL first. If possible print a copy of the page and keep it with you. It'll help you if your new kernel doesn't boot:
http://www.freebsd.org/handbook/kernelconfig-trouble.html

First we need to make sure we have the right source files downloaded. The best place to start is with the original source files intended for the release of FreeBSD you are using. We assume you have a fairly recent release, not more than 2 revisions (approx 6 to 8 months) old. Execute the following command and follow the steps to get the sources needed to recompile a kernel:

  • # /stand/sysinstall
  • Choose 'Configure'
  • Choose 'Distributions'
  • Choose 'src'
  • Choose 'sys' and then choose 'OK', and 'OK' again.
  • Choose an FTP site to download the sources from. If you're prompted to 'skip over the network configuration now', choose 'Yes'. After the install is complete, choose the 'Exit' options until you exit sysinstall.

Congratulations, you now have a known working kernel source tree installed in /usr/src/sys!
Now, cd /usr/src/sys/i386/conf and follow the next steps.

You should see two files in this directory, GENERIC and LINT. These are both kernel configuration files and contain all the data the kernel needs from the user, to compile. They are very easy to read and edit.
GENERIC is the Generic kernel configuration which should work with every system. LINT contains a list of all possible kernel configuration file directives. We won't worry about LINT just yet. First let's trim the GENERIC kernel down and get comfortable with that. Follow these steps to success:

  • Copy the GENERIC file to a new file. Traditionally, this new file has the same name as the hostname of your machine.
  • Edit this new file in your favourite text editor. I prefer vi and have written a cheat sheet for new vi users at:
    http://www.silverwraith.com/papers/vi-quickref.txt
  • There are only a few pitfalls and special cases to watch out for in this file. As you can see, it is just a plain text file, and any line starting with a # is considered a comment and ignored by the compiler.
    • The word 'GENERIC' in the line 'ident GENERIC' is the name of your kernel. This can be any single alphanumeric word with no spaces or punctuation. I recommend you keep it short. Again, it is traditional to name this to be the same as your machine hostname and kernel configuration file name but it is not required. It is informational only.
    • The maxusers line does not actually limit the maximum number of users. It is used internally with an algorithm in the param.c file to determine the sizes and numbers of certain tables. This can remain at the default. As of FreeBSD 4.7 this is set to '0' by default, which causes the actual value to be auto-sized depending on the amount of memory that you have.
    • Most, if not all items in your kernel config will have a comment toward the end of the line. Anything labelled with '[KEEP THIS!]' should not be removed. FreeBSD needs these things! Anything labelled '(required!)' should also be kept if you're going to use that type of device. For example, if you're going to use SCSI devices, don't comment out 'scbus'. If you're not using any SCSI devices, then you should comment this out. Some PCI network cards require the inclusion of 'device miibus'. These are noted in your kernel configuration file. If you don't use these NICs you can also comment this out.
    • Now go through the entire file and comment out anything you don't think you'll need. Effectively every line contains a driver. Commenting it out will mean that particular piece of hardware will not work on your system, even though it may be recognised as present in the system. Thus it is bad to comment out your own network card, but good to comment out any cards you don't have. Don't worry if you comment out something you will need later. You can always recompile fairly quickly and simply.
  • After you've finished editing the configuration file, save it and exit from the editor.
  • Issue the command: config FILENAME, where FILENAME is your configuration file. This only takes a moment and it will tell you which directory is your compile directory. That is where we will compile the kernel. The config command creates the directory structure and files which the next steps use in compiling the kernel.
  • cd to the directory that config gave and issue these commands in turn, each after the previous has finished:
    • # make depend
    • # make
    • # make install
    • Hopefully the above will all work without producing an error. If you get an error, see your local FreeBSD Guru. If you got no error, consider the installation of the new kernel a success!
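The sequence above can be summarised as a short shell sketch. It only prints the commands rather than running them. The configuration name MYKERNEL and the function name are placeholders chosen for illustration, and the compile directory shown is the one config normally reports on an i386 FreeBSD 4.x source tree, so trust config's own output if it differs.

```shell
#!/bin/sh
# Dry run: print the kernel build commands for a given configuration name.
# Nothing is executed; "MYKERNEL" below is a hypothetical config name.
kernel_build_steps() {
    kernconf="$1"
    echo "cd /usr/src/sys/i386/conf"
    echo "cp GENERIC ${kernconf}"
    echo "config ${kernconf}"
    # config reports the compile directory; this path is the usual one (assumption).
    echo "cd ../../compile/${kernconf}"
    echo "make depend"
    echo "make"
    echo "make install"
}

kernel_build_steps MYKERNEL
```

Run each printed command as root only after you have finished editing the configuration file.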

That is it. Really! If something breaks and your new kernel will not boot, DON'T PANIC! Read the 'Kernel Does Not Boot' section of the URL at the top of the page.


Optimising compiling of code, the kernel and debugging kernels

This section is for the slightly more advanced compilation of code and kernels. We will also briefly mention how to compile a kernel with the debugging symbols built in. The debugging kernel can be useful when your server or workstation is kernel panicking frequently and you want to find out why. There is a great article from OnLamp.com on how to debug a kernel, available at:
http://www.onlamp.com/pub/a/bsd/2002/04/04/Big_Scary_Daemons.html


Optimising code and kernel compilation

First let's look at optimising the compilation of C code itself. One of the key tasks when doing this is to start with as simple a configuration as possible. Hopefully if you've read everything above you'll be at this stage already :-)

You should set a few specific options in /etc/make.conf that will optimise the compilation of new code on your machine. This means all code will be compiled with these options. If you set the right ones, they can significantly improve the speed and efficiency with which code is compiled and executed. They will also help to reduce memory consumption. If you have installed the portupgrade package from /usr/ports/sysutils/portupgrade, you should execute this command after setting these options and updating your ports collection. It will cause every port you have installed to have its latest version downloaded and recompiled with the optimisations. I think it is worth it:

portupgrade -ra

Updating your ports collection en masse can have other consequences which are worth realising. If the current installed version of your port has undergone a major update (eg exim3 to exim4), a straight upgrade in this way could break your configuration. Use with care! The compilation process can take a while, depending on the speed of your CPU, the amount of memory you have, your internet connection and so on, so if you're on a slow link and wish to download the distfiles for each package first and do the updating offline, you can issue this command to download the distfiles for each port first:

portupgrade -Fra

So without further ado, the list of optimisations follows. Initially you won't have an /etc/make.conf, so you can just start editing a file with that name (if you want to see every possible option to put in the file, refer to /etc/defaults/make.conf):

CPUTYPE=cpu_type

where cpu_type is one of i386, i486, i586, i586/mmx, i686, p2, p3, p4, k6, k6-2, or k7. gcc v2, which comes with FreeBSD, doesn't yet support the latest Athlon optimisations, so if you have a Thunderbird or other such CPU, just set k7 for now. gcc v3 has better support and this will be available in FreeBSD 5 when it is released toward the end of 2002. NOTE: An important point to remember: most CPUs of the same family are backward compatible. That is, the K7 is backward compatible with K6 code. Also, Intel-CPU code is mostly universally compatible across CPU families. However, if you compile code for a non-Intel family CPU type and later upgrade to a newer Intel CPU, there is a good chance you may encounter problems unless you recompile your code.

CFLAGS= -O3 -pipe -funroll-loops -ffast-math

-O3 sets optimisation level 3, where the largest number of practical optimisations can be made to your code (everything except the kernel). It does sacrifice binary size for speed. -pipe causes code to be passed between processes using pipes during compilation rather than using temporary files, which has obvious I/O advantages. -funroll-loops causes loops with a known number of iterations to be unrolled into faster straight-line code. -ffast-math breaks IEEE/ANSI strict math rules. One way it does this is by assuming that the arguments to the sqrt function are non-negative. This shouldn't be used if you're compiling code that relies on the exact implementation of IEEE or ANSI rules for math functions. Unless you're writing your own code that does just this, you shouldn't have a problem with setting this. It can speed up code that does a lot of floating-point math.

COPTFLAGS= -O2 -pipe -funroll-loops -ffast-math

This is the same as the "CFLAGS" statement, except it's used on the kernel when you compile it. We choose -O2 instead of -O3, because -O3 is known to produce broken kernels. Officially, only -O is supported, but I have never had a problem with -O2. It should be noted at this point that one difference between -O2 and -O3 is that -O3 produces slightly larger code during its optimising. In normal situations this is OK, but since we want to keep the kernel compact, it is another reason to stick with -O2. This also has the same effect as adding the 'makeoptions COPTFLAGS' lines to the kernel config as discussed below.
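Putting the three settings together, a minimal sketch of generating the resulting make.conf lines might look like this. The function name is hypothetical and the CPU type p3 is only an example; substitute the value that matches your own processor, and review the output before copying it into /etc/make.conf.

```shell
#!/bin/sh
# Print a make.conf fragment using the flags discussed above.
# make_conf_fragment is an illustrative helper, not a FreeBSD tool.
make_conf_fragment() {
    cputype="$1"
    cat <<EOF
CPUTYPE=${cputype}
CFLAGS= -O3 -pipe -funroll-loops -ffast-math
COPTFLAGS= -O2 -pipe -funroll-loops -ffast-math
EOF
}

# Example CPU type only; pick yours from the list above.
make_conf_fragment p3
```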

Kernel optimisation

In your kernel configuration, add the following line after the 'machine' (i386, i486, etc) types near the top:

makeoptions    COPTFLAGS="-O2 -pipe -funroll-loops -ffast-math"

The -O2 switch tells the compiler to optimise the code it generates, using its built-in optimisation passes; the other flags are the same ones described for COPTFLAGS in /etc/make.conf above. You could use -O3 to apply even more optimisations, but this isn't supported and -O3 is known to cause many stability issues. -O2 may or may not work for you depending on how many things you have compiled into the kernel and how non-standard your hardware is.

TOP_TABLE_SIZE=number

where number is a prime number, at least twice the number of lines in /etc/passwd. This statement, which goes in /etc/make.conf, sets the size of the hash table that top(1) uses for usernames when it runs.

options         CPU_WT_ALLOC

This should be set if you have an AMD K5/K6/K6-2 or Cyrix 6x86 chip. It lets the kernel enable cache Write Allocation for the L1 cache, which was disabled by default on these chips.
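For the TOP_TABLE_SIZE setting described above, picking "a prime at least twice the number of lines in /etc/passwd" can be worked out with a few lines of shell. This is a sketch using simple trial division; the function names are made up for illustration.

```shell
#!/bin/sh
# Succeeds if $1 is prime (trial division, fine for small numbers).
is_prime() {
    n="$1"
    if [ "$n" -lt 2 ]; then
        return 1
    fi
    i=2
    while [ $((i * i)) -le "$n" ]; do
        if [ $((n % i)) -eq 0 ]; then
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Print the smallest prime that is >= $1.
next_prime() {
    p="$1"
    until is_prime "$p"; do
        p=$((p + 1))
    done
    echo "$p"
}

# Print a TOP_TABLE_SIZE candidate for a given number of passwd lines.
top_table_size() {
    users="$1"
    next_prime $((users * 2))
}

# Use the line count of the local password file.
echo "TOP_TABLE_SIZE=$(top_table_size "$(wc -l < /etc/passwd)")"
```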

options        NFS_NOSERVER

If you are running a workstation or server where you know that you will not be acting as an NFS server, you can add the above line to your kernel configuration to disable NFS server code. NFS servers allow other servers and workstations to mount parts of their filesystems using the Network FileSystem protocol.

options         NSWAPDEV=number

Another way of saving kernel memory is to define the maximum number of swap devices. Your kernel needs to allocate a fixed amount of bit-mapped memory so that it can interleave swap devices. I set the preceding parameter to 1 on my workstation and 2 on my servers. I rarely run out of swap space on a workstation, but if I need to add more to a server, I can easily create another partition.


Building a debugging kernel

Another option you have when compiling your kernel is to build with the debugging symbols. This can be fundamental in determining the reason your kernel panics, if it does. Be warned though - compilation time can increase slightly when doing this. To make a debug kernel, add the following line to your kernel configuration:

makeoptions     DEBUG=-g

This doesn't actually install a kernel with full debugging symbols as /kernel. The /kernel that gets installed is the stripped down regular kernel, but a separate kernel.debug file is in /usr/src/sys/compile/Your_Kernel_Name/kernel.debug. If your kernel panics and leaves behind a core file, the kernel.debug file is used to get the debugging symbols when you actually do the debug. See the OnLamp article for more on this.
One final thing you need to do if you're going to be building a debug kernel is to have your system actually dump the memory to the swap partition. You can do this by adding the following line to /etc/rc.conf and rebooting:

dumpdev="/dev/ad0s1b"

Where "/dev/ad0s1b" is your swap partition as defined in /etc/fstab.
NOTE: Your swap partition needs to be at least 1MB (1024KB) bigger than your total amount of memory. When your system crashes, the memory will get dumped to your swap partition. When your system returns, your swap partition will be enabled and your disks will be fsck'd before they are mounted. During this process, fsck uses a small amount of swap space. It would be preferable if you had twice as much swap space as memory in this situation.
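The sizing rule in the note can be expressed as a quick check. This sketch takes both sizes in megabytes as arguments rather than reading them from the system; the function name and the 512/1024 figures are illustrative only.

```shell
#!/bin/sh
# Succeeds if swap (MB) is at least memory (MB) + 1 MB, per the note above.
swap_holds_dump() {
    ram_mb="$1"
    swap_mb="$2"
    [ "$swap_mb" -ge $((ram_mb + 1)) ]
}

# Example figures: 512 MB of memory and 1024 MB of swap.
if swap_holds_dump 512 1024; then
    echo "swap is large enough for a crash dump"
else
    echo "swap is too small for a crash dump"
fi
```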

A guide to server and workstation optimization

Server and workstation tuning is an ongoing process.
Believing that you are done only means that you don't know what else can be tuned.

This article should apply equally to FreeBSD 4.x and 5.x


The method of tuning your system is heavily dependent on its function:

  • Will the system perform a lot of small network transactions?
  • Or a small number of large transactions?
  • How will disk operations factor in?

How you answer these and other questions determines what you need to do to improve the performance of your system.
There are several steps you can take before you need to start changing sysctls or re-arranging your partitions. How your applications are compiled plays a major role too. Beyond application compilation we will look at tuning the various other parts of our system including the network, disks, and other system control functions. I have tried not to duplicate the data in the tuning(7) man page, which already contains a wealth of good information on the basics of system performance tuning.


Optimizing software compiling

When source code is compiled, your compiler makes assumptions about your hardware in order to create compatible binaries. If you have an x86-compliant CPU for example, your compiler will by default create binaries which can be run on any CPU from a 386 onwards. While this allows portability, any newer capabilities your CPU has (MMX, SSE, SSE2, 3DNow!, etc) will not be used. So portability creates inefficiency. This is also why using pre-compiled binaries on your system is a sure-fire way to reduce your overall performance!

System tuning is best performed on a new system, before many packages are installed. The steps you take here will also affect any new software you install. We assume that your packages are installed from the ports collection (/usr/ports). These steps should be applicable to any other software compiles and we will cover that later in this paper.

The first step to making sure your ports software will be compiled efficiently is to have good compiler flags set up. These are defined in /etc/make.conf. This file does not exist on new systems, but you can copy /etc/defaults/make.conf to /etc/make.conf.
Edit the file, and look for the line starting: #CPUTYPE=
Valid options for the CPUTYPE are listed in the file, in the paragraph above this line. My server is a P233/MMX, and my CPUTYPE line looks like: CPUTYPE=i586/mmx
What this does: The CPUTYPE option notifies the compiler of any special features your CPU has. The compiler will then, where possible, compile code to take advantage of these features. The disadvantage to this is that your compiled binaries may not run on different CPU types. As long as you aren't copying binaries from one server to another, this should not be a problem.

Also in the file, look for the line: #CFLAGS= -O -pipe
Uncomment this line, and change it to: CFLAGS= -O2 -pipe -funroll-loops
What this does: The '-O2' flag sets the optimization level. GCC has the following possible optimization levels:

  • -O: Some optimizations are enabled such as '-fthread-jumps' and '-fdefer-pop'
  • -O2: All optimizations which do not cause the size of the resulting executable to increase are turned on. This is useful for a speed/space tradeoff
  • -O3: Optimize even more. This option may cause the size of your binaries to increase
  • -Os: Optimize for size. Perform most of the optimizations in -O2 and some which reduce the code size

The '-pipe' option decreases the amount of time taken to compile software. When two compiler processes need to communicate data between each other, they can use files on the disk or pipes. As pipes do not require writing anything to disk they can significantly decrease the amount of time taken here.
Finally, '-funroll-loops' causes loops with a known number of iterations to be "unrolled". When a binary compiled with this option is run, the CPU does not have to test and branch on every iteration of the loop. Instead, loops are replaced with their equivalent non-looping code. This saves one CPU register which would otherwise be tied up in tracking the iteration of the loop.
The gcc man page (man gcc) is a good resource for this.

Warning: It has been noted that for some users on FreeBSD 4.8 and 4.9, the -funroll-loops flag causes SSHv2 with the base OpenSSH to break. Installing the OpenSSH-portable port to overwrite the base install fixes this problem quickly and easily, and provides a newer version of OpenSSH:

  • cd /usr/ports/security/openssh-portable && \
    make install -DOPENSSH_OVERWRITE_BASE

The make.conf file also contains a line for CXXFLAGS. These options are similar to our CFLAGS options but are used for C++ code. If you are going to compile C++ code, you should take a look at this also.


Optimizing kernel compiling

Efficient kernel compiling is covered in my Kernel Tuning paper at: http://silverwraith.com/papers/freebsd-kernel.php


Optimizing network performance

How you optimize your system for networking depends on what your system will be doing. Below we will take a look at two common applications for servers, Mail and File serving.

Network throughput:

There are a number of steps which can be applied to all installations to improve network performance, and should be done by everyone.

Most modern network cards and switches support the ability to auto-negotiate the speed to communicate at. While this reduces administration, it comes at the cost of network throughput. If your switch, server or workstation is set to use auto-negotiation, every few moments it stops transferring network traffic in order to renegotiate its speed. On low-bandwidth use networks this performance degradation might be hard to spot, but on high-bandwidth use networks it becomes very obvious: you have packet loss, you cannot achieve your full line speed, and your CPU usage is low. I would recommend that everyone read the man page on their network driver and manually define the network speed. This should, if possible, also be done on the network switch. Some simple $10 switches do not have interfaces to which you can log in to set this, but fortunately they usually do not renegotiate the network speed after the cable is plugged in, unless the network link is lost.
The network speed can either be set with ifconfig at run time, or in /etc/rc.conf for boot time. Here are two examples for /etc/rc.conf for the rl(4) and fxp(4) network drivers:

  • ifconfig_rl0="inet x.x.x.x netmask x.x.x.x media 100baseTX mediaopt full-duplex"
  • ifconfig_fxp0="inet x.x.x.x netmask x.x.x.x media 100BaseTX mediaopt full-duplex"
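The paragraph above mentions both run time and boot time settings but only shows the boot time form. As a rough sketch, the helper below prints both for a given interface without changing anything. The function name is made up, and the address 192.0.2.10 and netmask are documentation placeholders, not values from this paper.

```shell
#!/bin/sh
# Print the run-time ifconfig command and the matching /etc/rc.conf line
# for a fixed 100 Mbit full-duplex link. Nothing is applied to the system.
media_settings() {
    ifname="$1"
    addr="$2"
    mask="$3"
    opts="media 100baseTX mediaopt full-duplex"
    echo "ifconfig ${ifname} inet ${addr} netmask ${mask} ${opts}"
    echo "ifconfig_${ifname}=\"inet ${addr} netmask ${mask} ${opts}\""
}

# Placeholder interface and addresses; substitute your own.
media_settings fxp0 192.0.2.10 255.255.255.0
```

The first printed line is what you would run as root to change the setting immediately; the second is the line to keep in /etc/rc.conf so it survives a reboot.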

If you are fortunate enough to have one of the following network cards, you can also take advantage of device polling:

  • dc -- DEC/Intel 21143 and clone 10/100 ethernet driver
  • fxp -- Intel EtherExpress Pro/100B ethernet device driver
  • rl -- RealTek 8129/8139 fast ethernet device driver
  • sis -- SiS 900, SiS 7016 and NS DP83815 fast ethernet device driver

Note: If your card isn't listed here, do not give up hope! More drivers are being converted as demand comes in and you should look at the documentation for your driver to see if it is supported. If you're still unsure, join the freebsd-questions@freebsd.org mailing list from http://lists.freebsd.org/mailman/listinfo and ask there.

You can enable DEVICE_POLLING in your kernel. DEVICE_POLLING changes the method through which data gets from your network card to the kernel. Traditionally, each time the network card needs attention (for example when it receives a packet), it generates an interrupt request. The request causes a context switch and a call to an interrupt handler. A context switch is when the CPU and kernel have to switch between user land (the user's programs or daemons) and kernel land (dealing with device drivers, hardware, and other kernel-bound tasks). The last few years have seen significant improvements in the efficiency of context switching but it is still an extremely expensive operation. Furthermore, the amount of time the system can spend dealing with interrupts can be almost limitless. It is completely possible for interrupts to never free the kernel, leaving your machine unresponsive. Those of us unfortunate enough to be on the wrong side of certain Denial of Service attacks will know about this.

The DEVICE_POLLING option changes this behavior. It causes the kernel to poll the network card itself at certain predefined times: at defined intervals, during idle loops, or on clock interrupts. This allows the kernel to decide when it is most efficient to poll a device for updates and for how long, and ultimately results in a significant increase in performance.

If you want to take advantage of DEVICE_POLLING, you need to compile two options into your kernel:

  • options DEVICE_POLLING
  • options HZ=1000

The first line enables DEVICE_POLLING and the second raises the clock interrupt rate to 1000 times per second (the default on FreeBSD 4.x is 100). We need the second because in the worst case your network card will only be polled on clock ticks. If the clock ticks too slowly, the card is polled too rarely and packets can be delayed or dropped, which defeats the purpose here.

Finally we need to change one sysctl to actually enable this feature. You can either enable polling at runtime or at boot. If you want to enable it at boot, add this line to the end of your /etc/sysctl.conf:

  • kern.polling.enable=1
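When adding settings like this one to a configuration file, it helps to avoid appending the same line twice. Below is a small sketch of an idempotent append; ensure_line is a hypothetical helper, and the demonstration writes to a scratch file rather than to the real /etc/sysctl.conf.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    line="$1"
    file="$2"
    # -q: quiet, -x: match the whole line, -F: fixed string rather than regex
    grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Demonstrate on a scratch file instead of the real /etc/sysctl.conf.
demo="${TMPDIR:-/tmp}/sysctl.conf.demo.$$"
ensure_line "kern.polling.enable=1" "$demo"
ensure_line "kern.polling.enable=1" "$demo"   # second call adds nothing
cat "$demo"
rm -f "$demo"
```

To use it for real, pass /etc/sysctl.conf as the second argument while running as root.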

The DEVICE_POLLING option by default does not work with SMP enabled kernels. When the author of the DEVICE_POLLING code initially committed it, he admitted he was unsure of the benefits of the feature in a multiple-CPU environment, as only one CPU would be doing the polling. Since that time many administrators have found that there is a significant advantage to DEVICE_POLLING even in SMP enabled kernels and that it works with no problems at all. If you are compiling an SMP kernel with DEVICE_POLLING, edit the file /usr/src/sys/kern/kern_poll.c and remove the following lines:

#ifdef SMP
#include "opt_lint.h"
#ifndef COMPILING_LINT
#error DEVICE_POLLING is not compatible with SMP
#endif
#endif

Mail servers:

Mail servers typically have a very large number of network connections, which transfer a small amount of data for a short period of time, before closing the connection. Here it is useful for us to have a large number of small network buffers.
Network buffer clusters are assigned two per connection, one for sending and one for receiving. The size of the buffer dictates how fast data will be able to funnel through the network, and in the event of a network delay how much data will be able to backlog on the server for that connection before there is a problem. Having a network buffer too small means data will be backlogged at the CPU waiting for the network to clear. This causes greater CPU overhead. Having a network buffer too large means that memory is wasted as the buffer will not be used efficiently. Finding this balance is key to tuning.

When we discuss simultaneous network connections, we refer to connections in any network state: SYN_SENT, SYN_RECV, ESTABLISHED, TIME_WAIT, CLOSING, FIN_WAIT, FIN_WAIT_2, etc. Even if the network connection is in an ESTABLISHED state for only a few seconds, it can end up in any of the other states for a long time. I generally find that multiplying the number of ESTABLISHED connections by 8 leaves me with room to breathe in the event that I see an abnormally high surge of traffic inbound or outbound. I've come to this number over time through trial and error. So if you expect to have a peak of 128 servers sending you mail, having 2048 network buffer clusters would be good (128 * 2 per connection * 8). Also remember that connections can take up to two full minutes or more to close completely. So if you expect more than 128 mails in any given two minute period, you also need to increase the number to accommodate that.

Another important value to control is the maximum number of sockets. One socket is created per network connection, and one per unix domain socket connection. While remote servers and clients will connect to you on the network, more and more local applications are taking advantage of unix domain sockets for inter-process communication. There is far less overhead as full TCP packets don't have to be constructed. Unix domain socket communication is also much faster, as data does not have to go through the network stack but can instead go almost directly to the application. The number of sockets you'll need depends on what applications will be running. I would recommend starting with the same number as your network buffer clusters, and then tuning it as appropriate.

You can find out how many network buffer clusters are in use with the command netstat -m.

You can specify the values you want, at the end of your /boot/loader.conf file as:

  • kern.ipc.nmbclusters=2048
  • kern.ipc.maxsockets=2048
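The arithmetic behind these values is simple enough to script. The sketch below follows the rule of thumb from the text (peak connections, times two buffers per connection, times a headroom factor); the function name is invented for illustration, and the numbers are the mail server example from above.

```shell
#!/bin/sh
# Estimate network buffer clusters: peak connections * 2 per connection * headroom.
nmbclusters_estimate() {
    peak_connections="$1"
    headroom="$2"
    echo $((peak_connections * 2 * headroom))
}

# Mail server example from the text: 128 peak connections, headroom factor 8.
clusters=$(nmbclusters_estimate 128 8)
echo "kern.ipc.nmbclusters=${clusters}"
echo "kern.ipc.maxsockets=${clusters}"
```

Treat the result as a starting point and compare it against what netstat -m reports under real load.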

Note: With any performance tuning, it is important to monitor your system after you make your changes. Did you go overboard, or underestimate what you would need? Always check and adjust accordingly. The numbers here might not be the exact ones that you need!

File servers:

Tuning the network for file servers is not unlike tuning mail servers. The main differences are:

  • File servers generally have longer-lived network connections
  • File servers usually transfer larger files than mail servers
  • File servers mostly perform fewer transfers than mail servers

Again we come back to network buffer clusters. How many clients do you have? With file servers the chance of a spike in the number of connections is small, as the number of clients is fixed. Therefore we do not need to set aside large numbers of buffers to accommodate spikes. Multiplying the number of network buffers by two is good practice, and some admins prefer to multiply by four to accommodate multiple file transfers.

So if we have 128 clients connecting to the file server, we would set the number of network buffer clusters to 1024 (128 * 2 per connection * 4) in /boot/loader.conf:

  • kern.ipc.nmbclusters=1024
  • kern.ipc.maxsockets=1024

Note: With any performance tuning, it is important to monitor your system after you make your changes. Did you go overboard, or underestimate what you would need? Always check and adjust accordingly. The numbers here might not be the exact ones that you need!

Web servers:

Web servers are not unlike mail servers. Unless you are doing a lot of file serving over the Internet, you will have clients connecting to you for short periods of time. If you have more than one element on your web page, for example multiple images or frames, you can expect that the web browsers of clients will make multiple connections to you. Up to four connections per page served are certainly not uncommon. Also if your web pages use server-side scripting to connect to databases or other servers, you need to add a network connection for each of those.

Web servers, again like mail servers, go through periods of highs and lows. While on average you might serve 100 pages a minute, at your low you might serve 10 pages a minute and at peak over 1000 pages a minute. Whether you have 128MB RAM or 1GB RAM, you should try to be as liberal as possible in allocating memory to your network stack. Using the above example, at a peak of 1000 pages per minute, your clusters and sockets should be around 16384 (1000 pages * 2 per connection * 4 connections * 2 for growth = 16000, rounded up to the next power of two) in /boot/loader.conf:

  • kern.ipc.nmbclusters=16384
  • kern.ipc.maxsockets=16384
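The web server figure can be reproduced the same way. The product 1000 * 2 * 4 * 2 comes to 16000; rounding up to the next power of two to reach 16384 is an assumption about how the number above was chosen, and the function names are illustrative.

```shell
#!/bin/sh
# Print the smallest power of two that is >= $1.
next_pow2() {
    target="$1"
    p=1
    while [ "$p" -lt "$target" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

# pages per minute * 2 buffers per connection * 4 connections per page * 2 for growth,
# then rounded up to a power of two (the rounding step is an assumption).
web_clusters() {
    pages_per_min="$1"
    raw=$((pages_per_min * 2 * 4 * 2))
    next_pow2 "$raw"
}

web_clusters 1000
```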

Tuning your Apache or other web servers is slightly outside the scope of this paper, as there is already a ton of excellent data available on the internet which I could never hope to do justice to here. A starting point I would recommend is Aleksey Tsalolikhin's notes from his Nov 2001 presentation to the Unix Users Association of Southern California on web server performance tuning: http://www.bolthole.com/uuala/webtuning.txt. It should lead you on to more wonderful things.

Note: With any performance tuning, it is important to monitor your system after you make your changes. Did you go overboard, or underestimate what you would need? Always check and adjust accordingly. The numbers here might not be the exact ones that you need!


Optimizing disk usage and throughput

Optimizing the disk subsystem on FreeBSD also depends on what you want to do with your system. It is very much installation dependent, so what I've done below is list the various factors and what they do. You can decide what is best for you.

  1. RAID:
    RAID is a method of spreading your data over multiple disks. There are two reasons why you might use RAID: for redundancy to prevent data loss, and for speed. The three most common types of RAID in use on small system installations are RAID0, RAID1 and RAID1+0 (sometimes referred to as RAID10).
    With RAID1 (also called mirroring), you use only two disks per partition, and keep the data on both disks identical. In the event that one disk is lost, you have your data on another disk. The speed advantage from RAID1 comes when reading. Your system can send multiple read requests to the disks, which will be performed in parallel. The disk whose heads are closest to the requested space will get the request to fetch the data. Writes are no faster than on a single disk. When a write request is sent, both disks must acknowledge that the write has completed before the write is finished.
    RAID0 (also called striping) spreads the data evenly over two or more disks. Data on one disk is not replicated on the others, so there is no redundancy to prevent data loss. But reads and writes are significantly faster as they happen on multiple disks at the same time. This increases your throughput and your maximum disk operations relative to the number of disks you have. For example, 4 disks would give up to four times the throughput of a single disk.
    RAID10 offers the best of both worlds and requires at least 4 disks. Half of the disks are striped with RAID0, and then that stripe set is replicated as a mirror on the remaining disks.
  2. Queue splitting:
    If you are running a mail server and feel that your system is being slowed because of the speed of your disks, an alternative to RAID could be to split your queues. Most modern mail transfer agents (MTAs) have the ability to break up their large single queue directory into multiple smaller directories. These multiple directories can then be placed on different disks. There are several advantages to this:
    • A disk failure will only take out a half or less of your queue
    • Your throughput on mail will not be disk bound
    • Opening 20 small directories is significantly faster than opening one huge directory
  3. Partitioning:
    Having separate partitions on separate disks can help a lot. For example, your system will always be performing different tasks at any given time: writing log files, serving out data, and so on. The Unix directory structure is built around using different partitions for different purposes: /usr is traditionally used for user data, /var is used for logs and mail spools, etc. Arrange these on different disks to best suit your needs. If you have disks of varying speeds on your system, place the most frequently used partitions on the faster disks.
  4. IDE vs SCSI:
    Back in days of yore (in the early 1990s) when disk performance was crucial, the choice was quite obviously to go for SCSI disks. SCSI provided faster throughput and less bottlenecking. SCSI disk sizes were significantly larger and more disks could fit in a single system. Times have changed and so have the requirements of most users, and the much sought after disk sizes and faster throughputs are now available on IDE disks. SCSI disk sizes have also grown but not as fast. SCSI disks still offer faster throughput, however. At the time of writing, the fastest IDE interfaces could push 133Mbyte/s, whereas the fastest SCSI interfaces could push 320Mbyte/s.
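
To make the RAID trade-offs from item 1 concrete, here is a minimal Python sketch that estimates usable capacity and best-case throughput for the three levels described. It is an idealised model only: the function name, the disk size and the per-disk speed are assumptions made up for illustration, and real arrays will fall short of these upper bounds.

```python
def raid_estimate(level, disks, disk_size_gb, disk_mb_per_s):
    """Idealised capacity and throughput for RAID0, RAID1 and RAID10.

    Returns a dict with usable capacity (GB), best-case read and write
    throughput (Mbyte/s) and whether a single disk failure is survivable.
    """
    if level == "RAID0":
        if disks < 2:
            raise ValueError("RAID0 needs two or more disks")
        # Striping: every disk holds unique data and works in parallel.
        usable, read, write, redundant = disks, disks, disks, False
    elif level == "RAID1":
        if disks != 2:
            raise ValueError("this sketch models RAID1 as exactly two disks")
        # Mirroring: one disk of capacity, reads shared, writes hit both disks.
        usable, read, write, redundant = 1, 2, 1, True
    elif level == "RAID10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, at least four")
        # Half the disks form the stripe, the other half mirror it.
        usable, read, write, redundant = disks // 2, disks, disks // 2, True
    else:
        raise ValueError("unknown RAID level: %s" % level)
    return {
        "usable_gb": usable * disk_size_gb,
        "read_mb_per_s": read * disk_mb_per_s,
        "write_mb_per_s": write * disk_mb_per_s,
        "survives_one_disk_failure": redundant,
    }


# Hypothetical disks: 80 GB each, 50 Mbyte/s sustained.
for level, disks in (("RAID0", 4), ("RAID1", 2), ("RAID10", 4)):
    print(level, raid_estimate(level, disks, 80, 50))
```

With four disks, RAID10 gives half the capacity and half the write throughput of RAID0 in this model, which is the price paid for surviving a disk failure.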

Protecting yourself from denial of service

This article applies equally to FreeBSD 4.x and 5.x

Protecting your servers, workstations and networks can only go so far. Attacks which consume your available Internet-facing bandwidth, or overpower your router's CPU, can still take you offline. This paper is aimed at mitigating the effects of such attacks, and guiding you in what you should do if you are attacked.

Different types of attacks

Denial of Service attacks, as their name implies, set out to remove a service from functional use by its clients. Web servers will stop serving web pages, e-mail servers will stop accepting or delivering e-mail, and routers will go dark, taking you off the Internet altogether.

Denial of a particular service will come in one of two forms:

* Complete consumption of a resource such as bandwidth, memory, CPU, file handles, or any other finite asset.
* Exploiting a weakness in the service to stop it functioning or causing the service to crash.

Over the last few years, attackers have refined their methods. As developers make software more reliable and more resilient to DoS, the attack vectors have changed to target hard-to-secure parts of a service. In this paper we will discuss the first type of attack, and what we can do to protect our services from it.

Getting the most out of your service

Protecting your services from attack follows many of the same principles as tuning your services for maximum performance. The greater the load you can handle, the more resilient you are. Things change slightly when the attack changes the profile of your service.
For example, if you have a web server tuned mostly for large file transfers and the attack forces through a lot of small, short-lived transactions, you could find you run out of network memory buffers very quickly. I would recommend starting by reading the papers on tuning FreeBSD for different applications: http://silverwraith.com/papers/freebsd-tuning.php

The paper describes good ways to start tuning your servers. Also, the tuning(7) man page is an excellent resource on performance improvements.

Analyzing and Blocking Denial of Service

The first step to protecting yourself from an attack is to understand the nature of different types of attacks. As we said earlier, resource consumption attacks target your system in places which can cause bottlenecks. The most popular targets are network bandwidth, system memory, network stack memory, disk I/O, operating system limitations such as a limit on the number of open file handles, and the CPU. These bottlenecks can either be on your systems or in your network hardware.

Attacks on bandwidth

Attacks against your network bandwidth are amongst the hardest to defend against, and how you deal with them depends heavily on your network topology and how helpful your ISP is. To start with, ask yourself the following questions:

* Is the attack against a single host, or multiple hosts?
* Is the attacker hitting a small set of ports, or randomly hitting many ports?
* Does the attack consist of protocols which should normally not be used with the attacked servers?

We are fortunate today that most attacks are simple in nature. They choose one or two styles of attack and at most a small number of IP addresses. This makes sense - bandwidth is as hard for attackers to acquire as it is for us to defend. If your Internet peering bandwidth is not saturated, the accepted approach is to block traffic to the attacked host(s) at your gateway. It is a good idea to run tcpdump on the attacked servers if you can, to see what kind of attack is taking place. Look for floods of very similar packets - all TCP SYN, UDP or ICMP. Look for packets all going to a particular port.

If you find the number of source IP addresses is reasonably small, blocking the packets based on source address may be possible. However, if the source addresses vary widely, this can be an indicator that they are spoofed (forged). When this is the case, you may need to look for other similarities in the attack such as packet size, window size, fragmentation, etc. If you have the ability to block based on these less common criteria, you may want to investigate further. In today's multi-gigabit networks, it is not unusual for an Internet connection to have more bandwidth than the local LAN, and so it may be possible for you to block the attack at your Internet gateway.

More often than not though, this does not apply, and having your Internet bandwidth consumed can be a tiring and frustrating ordeal. This might be the right time to call your ISP, if you have one who is willing to work with you on these problems. Before you make the call, try to analyze the attack. This will most certainly help your ISP in selectively filtering the attack off your network. If filtering is possible, one of two common options will usually be available: selectively filtering out the attacking systems, or dropping all packets to the attacked servers. The latter is often preferred as it is easier to manage and would be more effective in the event that the attack profile changes against those hosts.

If you run Border Gateway Protocol (BGP) at your Internet gateway to announce your IP space to the Internet you may have a third option, one which is becoming popular with a lot of ISPs. UUNet, C&W, XO, and many others are allowing users to export routes as small as /32, with a special community string which causes all incoming data for the route to be dropped at the ISP's border. This is a highly effective method of dropping an attack on the floor with the least damage to yourself and your ISP. Of course this only works well if the number of hosts being attacked is small, and only if your ISP offers such functionality. Contact your ISP to find out. The obvious downside of this is that the IP addresses you export in this fashion will lose ALL connectivity to the Internet.

In general it is a good idea to keep your network clean: only allow traffic on to it which is needed for your services to operate. Allow TCP to ports 80 and 443 on your servers and allow UDP to your game servers. Allow SSH connections only from trusted hosts. All of these limit the options of the attackers when they come to visit.

Attacks on Systems and Services

If your bandwidth is not saturated, it is most likely that the attack is against your systems and the services they host, rather than your entire network. Again, the remedy depends on the nature of the attack. Systems contain all of the possible bottlenecks which can be targeted, and you may find that more than one bottleneck is exposed at any one time. Attacks on systems and their services generally fall into the following categories:

* Network subsystem limitations (very high number of packets per second)
* OS or application memory limitations (memory consumption)
* Disk or CPU limitations (large numbers of valid requests)

System-targeting attacks can be some of the most frustrating as people have a hard time defending against them. There is, however, some special magic which FreeBSD lets us use to help out.

By default, each time your network card receives a packet, it generates an interrupt to the CPU along its IRQ. The CPU will catch this and dedicate an amount of time to fetch this packet from the interface. Under normal operations this can happen several thousand times per second, which is well within the capabilities of even low end CPUs. It is quite likely with older CPUs that you will start to see performance impacts at around 25,000 to 50,000 packets per second. With packet sizes of 1500 bytes, this works out to around 37.5Mbytes/sec to 75Mbytes/sec, which is quite a lot for most older CPUs to serve anyway. Most 1GHz systems will begin to feel pressure around 75,000 packets per second. The problem is exacerbated by two factors:

* If the packet flood is from TCP SYN packets, these packets must be fully processed and then SYN ACK packets sent back to the source address of the original SYN packets. This is a reasonably expensive operation in itself. Other TCP and UDP packets to closed ports, and also ICMP packets, need to be similarly processed and have the appropriate TCP or ICMP replies sent back. While not as expensive as SYN processing, this still takes time and consumes outbound bandwidth.
* Packet size is also an important factor. You can fit more small packets into a given amount of bandwidth than you can large packets. The more packets you take in, the more CPU time is required to process them, no matter what type of packets they are.
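
The arithmetic behind those figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are mine, 1 Mbyte is taken as 1,000,000 bytes, and the 64-byte packet size is an assumed example of a small packet rather than a figure from the text.

```python
def mbytes_per_second(packets_per_second, packet_bytes):
    """Bandwidth consumed by a given packet rate, in Mbyte/s."""
    return packets_per_second * packet_bytes / 1_000_000


def packets_per_second(mbytes_per_s, packet_bytes):
    """Packet rate that fits into a given bandwidth, in packets/s."""
    return mbytes_per_s * 1_000_000 / packet_bytes


# Full-size 1500-byte packets at the rates where older CPUs start to struggle.
low = mbytes_per_second(25_000, 1500)
high = mbytes_per_second(50_000, 1500)
print(low, high)

# The same 75 Mbyte/s filled with small 64-byte packets instead.
small_packet_rate = packets_per_second(75, 64)
print(small_packet_rate)
```

The same bandwidth that carries 50,000 full-size packets per second carries over a million 64-byte packets per second, which is why floods of small packets are so much harder on the CPU.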

As we discussed previously, each time an IRQ is generated some CPU time is taken. If enough IRQs can be generated, the CPU will have no time to do anything other than serve the interrupts. Inbound packets do not get processed, applications get no CPU time, and your system is effectively dead in the water. This is known as "live-lock". Your system is still live, insofar as it has not crashed, but it is locked from performing any useful functions. Once packets stop coming in to the interface, the CPU starts to process all of the backlogged packets it has already accepted. This can take anything from a few minutes to several hours.

There are several things you can do to prevent or mitigate the effects of a high rate of packets, before you need to go out and buy any hardware upgrades. All of these are performed using FreeBSD's sysctl(8) command. Here are the settings you will need; you can place them in /etc/sysctl.conf:

* net.inet.tcp.msl=7500
net.inet.tcp.msl defines the Maximum Segment Lifetime. This is the maximum amount of time to wait for an ACK in reply to a SYN-ACK or FIN-ACK, in milliseconds. If an ACK is not received in this time, the segment can be considered "lost" and the network connection is freed.
There are two implications of this. When you are trying to close a connection, if the final ACK is lost or delayed, the socket will still close, and more quickly. However, if a client is trying to open a connection to you and their ACK is delayed more than 7500ms, the connection will not form. RFC 793 defines the MSL as 120 seconds (120000ms), however this was written in 1981 and timing issues have changed slightly since then. Today, FreeBSD's default is 30000ms. This is sufficient for most conditions, but for stronger DoS protection you will want to lower this to 7500, or maybe even less.
* net.inet.tcp.blackhole=2
net.inet.tcp.blackhole defines what happens when a TCP packet is received on a closed port. When set to '1', SYN packets arriving on a closed port will be dropped without a RST packet being sent back. When set to '2', all packets arriving on a closed port are dropped without an RST being sent back. This saves both CPU time because packets don't need to be processed as much, and outbound bandwidth as packets are not sent out.
* net.inet.udp.blackhole=1
net.inet.udp.blackhole is similar to net.inet.tcp.blackhole in its function. As the UDP protocol does not have states like TCP, there is only a need for one choice when it comes to dropping UDP packets. When net.inet.udp.blackhole is set to '1', all UDP packets arriving on a closed port will be dropped.
* net.inet.icmp.icmplim=50
The name 'net.inet.icmp.icmplim' is somewhat misleading. This sysctl controls the maximum number of ICMP "Unreachable" and also TCP RST packets that will be sent back every second. It helps curb the effects of attacks which generate a lot of reply packets.
* kern.ipc.somaxconn=32768
kern.ipc.somaxconn limits the size of the listen queue, that is, the maximum number of new connections that can be waiting for a listening socket to accept them at any one time. The default here is just 128. If an attacker can flood you with a sufficiently high number of SYN packets in a short enough period of time, the queue can fill up and further connection attempts will be dropped, thus successfully denying your users access to the service.
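
To illustrate the kind of cap that net.inet.icmp.icmplim imposes, here is a simplified per-second limiter in Python. It is a model for illustration only and not the kernel's actual code; the class and function names are hypothetical, and the flood of 1000 evenly spaced packets is invented test data.

```python
class PerSecondLimiter:
    """Allow at most `limit` replies within each whole second."""

    def __init__(self, limit):
        self.limit = limit
        self.window = None  # the whole second currently being counted
        self.sent = 0       # replies already sent in that second

    def allow(self, now):
        """Return True if a reply may be sent at time `now` (in seconds)."""
        second = int(now)
        if second != self.window:
            # A new one-second window has started: reset the counter.
            self.window = second
            self.sent = 0
        if self.sent < self.limit:
            self.sent += 1
            return True
        return False


def replies_sent(limiter, arrival_times):
    """Count how many arriving packets would be answered."""
    return sum(1 for t in arrival_times if limiter.allow(t))


# 1000 packets to closed ports, spread evenly over a single second.
flood = [i / 1000.0 for i in range(1000)]
answered = replies_sent(PerSecondLimiter(50), flood)
print(answered)
```

With a limit of 50, only 50 of the 1000 packets in that second draw a reply; the rest are dropped silently, saving both CPU time and outbound bandwidth.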

You may find these settings to either be too aggressive, or not aggressive enough. You should tune them until you receive satisfactory results.

Finally, if you are blessed enough to own one of the following network cards you can enable a kernel feature called DEVICE_POLLING:

* dc
* em
* fxp
* nge
* rl
* sis

DEVICE_POLLING changes the way that interrupts are handled. Actually with DEVICE_POLLING, they are not handled at all! DEVICE_POLLING causes interrupts to be effectively ignored. Instead, at certain times, the CPU will poll the network card, and pick up any packets that are waiting for processing. This can significantly reduce the amount of CPU time used in processing inbound traffic, but only the above cards are supported as the drivers have to be written to support DEVICE_POLLING. The FXP cards generally work best with the feature as their drivers are very well developed, as is their hardware. The hardware design and quality of RL cards is a lot lower - without sufficient CPU (usually around 1GHz), they have a hard time achieving the full 100Mbit/s at all. If you are looking for a new network card, you will get what you pay for!
You can learn more about DEVICE_POLLING at the author's home page:
http://info.iet.unipi.it/~luigi/polling/. You can also find good installation and tuning instructions there, as well as some statistics from comparative tests with DEVICE_POLLING enabled and disabled.
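
The difference between the two approaches can be shown with a toy model in Python. This is a sketch under stated assumptions, not a description of the real driver code: it assumes one interrupt per packet without polling, and with polling a fixed number of polls per second that each pick up at most a fixed burst of packets. The function names and the default values of 1000 polls per second and 150 packets per poll are illustrative assumptions.

```python
def interrupts_per_second(packet_rate):
    """Interrupt-driven model: one interrupt for every received packet."""
    return packet_rate


def polled_load(packet_rate, polls_per_second=1000, burst_max=150):
    """Polling model for one second of traffic.

    Returns (polls, packets_handled, packets_dropped). The CPU is
    disturbed a fixed number of times no matter how many packets arrive.
    """
    capacity = polls_per_second * burst_max
    handled = min(packet_rate, capacity)
    return polls_per_second, handled, packet_rate - handled


for rate in (5_000, 75_000, 200_000):
    print(rate, interrupts_per_second(rate), polled_load(rate))
```

In this model the interrupt count grows with the flood, while the polled system does a bounded amount of work per second and simply drops what it cannot pick up, which is what keeps it out of live-lock.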
Tracking the source of the attack

Attacks can come from inside and outside your network and obviously one is easier to isolate than the other. Tracking the sources of attacks requires some familiarity with packet sniffing tools such as tcpdump, ngrep, or Ethereal. Unless you have spent several months carefully profiling your network traffic and set up monitoring specifically to alert you to anomalies, the chances of discovering you are under Denial of Service conditions before someone else does are slim. More often than not it is complaints such as "The Internet is slow", or "I can't get my e-mail" that lead us to find the truth. It is important to realize two things:

* Attacks can come from inside and outside your network
* Not all service-denying events constitute a denial of service attack, and not all denial of service attacks constitute a service-denying event

What does this mean to you? It means that when you start to look for why "your Internet is slow" or why people cannot get their e-mail, remember that the source of the problem could be from any machine on your network or the Internet, and that the denial may not be deliberate.

A good place to start is the point of bottleneck. This could be the CPU on your HTTP proxy, or maybe your Internet gateway. If your bottleneck is a system process such as a proxy server, examine the logs for this. Is there a single system or small number of systems making an unusually large number of requests, or using more resources than they should? If your bottleneck is your Internet gateway (which we assume is running FreeBSD), you can use a command like this to view what IP packets are passing through your gateway:
router# tcpdump -n -i interface -c 100

This command will display a summary of the first 100 packets (-c 100) it sees on the named network interface (-i interface), and will not resolve the IP addresses to host names (-n), which can take extra time and may itself fail if you are having connectivity issues. An example output line may look like this:
04:59:53.915324 192.168.0.3.2327 > 192.168.0.10.1214: S 3199611726:3199611726(0) win 16384 (DF)

Let us look at the first few parts of this output which can be useful to us.

* 04:59:53.915324
This is the time stamp of when the packet was processed.
* 192.168.0.3.2327
This is the source IP address. The numbers after the last octet, 2327, indicate the source port number the packet was sent from.
* 192.168.0.10.1214
This is the destination IP address. The numbers after the last octet, 1214, indicate the destination port number to which the packet is going.
* S
This indicates the type of packet, in this case a SYN packet. You can learn about the process and life of a TCP connection, and the types of packets you would see here, from Daryl's TCP/IP Primer at: http://www.ipprimer.com/.
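
The field breakdown above can be expressed as a small Python parser. This is a minimal sketch that assumes the exact output layout shown in the example line; other tcpdump versions print a different layout, so the regular expression would need adjusting. The function and field names are my own.

```python
import re

# time, source address.port, ">", destination address.port, ":", flags
LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): "
    r"(?P<flags>\S+)"
)


def parse_line(line):
    """Split one tcpdump summary line into its leading fields.

    Returns a dict, or None if the line does not match the layout.
    """
    match = LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["sport"] = int(fields["sport"])
    fields["dport"] = int(fields["dport"])
    return fields


sample = ("04:59:53.915324 192.168.0.3.2327 > 192.168.0.10.1214: "
          "S 3199611726:3199611726(0) win 16384 (DF)")
packet = parse_line(sample)
print(packet)
```

Lines that do not fit the expected layout return None, so they can be skipped rather than crashing the analysis.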

What you may see during an actual attack is hard to predict, as denial of service attacks come in so many shapes and sizes. A typical attack involves flooding a listening port on your server with SYN packets. The idea is to make your system so busy processing the new connections that it cannot do anything else. Here you may see a large number of SYN packets. Usually these should be well balanced with packets of other types.
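
One way to spot such an imbalance is to count bare SYN packets per destination in a capture. The Python sketch below assumes the same tcpdump line layout as the example earlier; the function name is hypothetical, and apart from the first line the capture is invented purely for illustration.

```python
from collections import Counter


def syn_targets(lines):
    """Count bare SYN packets per destination "address.port" string."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Expected layout: time, source, ">", destination + ":", flags, ...
        if len(parts) < 5 or parts[2] != ">":
            continue
        # A SYN-ACK also shows the S flag, so skip lines carrying an ack.
        if parts[4] == "S" and "ack" not in parts:
            counts[parts[3].rstrip(":")] += 1
    return counts


capture = [
    "04:59:53.915324 192.168.0.3.2327 > 192.168.0.10.1214: S 3199611726:3199611726(0) win 16384 (DF)",
    "04:59:53.915401 10.1.2.3.4000 > 192.168.0.10.80: S 100:100(0) win 16384",
    "04:59:53.915455 10.9.8.7.4001 > 192.168.0.10.80: S 200:200(0) win 16384",
    "04:59:53.915500 192.168.0.10.80 > 10.1.2.3.4000: S 300:300(0) ack 101 win 65535",
    "04:59:53.915600 192.168.0.3.2327 > 192.168.0.10.1214: . ack 1 win 17520",
]
targets = syn_targets(capture)
print(targets.most_common(1))
```

A destination whose SYN count dwarfs every other kind of traffic to it is a good candidate for closer inspection.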

FreeBSD 5.3 is "stable" but not production-ready

By Jem Matzan on December 20, 2004 (8:00:00 AM)


Since the introduction of the FreeBSD-5 branch, FreeBSD enthusiasts have been eagerly awaiting the day when the new codebase would stabilize. After much development and four previous releases, FreeBSD-5 has finally gone stable with version 5.3. But don't mistake a stable codebase for stable software. While the development team will no longer accept major changes to the base system, FreeBSD 5.3 still has bugs and problems.

FreeBSD is a complete Unix-like operating system entirely developed by a single large team of programmers. This is in stark contrast to GNU/Linux which, as a complete operating system, has no central, cohesive developer base and is packaged in myriad different ways by myriad different distribution projects and companies; and proprietary Unixes, which are closed-source, restrictively licensed, and work on a comparatively small number of usually proprietary hardware architectures. FreeBSD has historically been clean, fast, reliable, and scalable. It's easy to use, learn, set up, and navigate from the command line, has more than 10,000 software programs in the Ports system, runs on a wide variety of hardware, and can easily be used for either a desktop or a server.

The transition to 5.x

Until the release of 5.3, the most recent "production release" was the FreeBSD-4 series, which is presently at version 4.10 and has been deemed the "Legacy" release in the wake of the 5.x branch going to STABLE. FreeBSD-5 was supposed to be a grand introduction of new technology -- a revolutionary improvement to the tried and true 4.x branch -- but soon after it left the gate, it got caught up in developer politics and failed implementations of too-ambitious theories among other questionable design decisions, causing some developers to fork the FreeBSD-4 project into a separate and more focused operating system.

The ULE (which is not an acronym; its full name is SCHED_ULE as opposed to the older SCHED_4BSD) scheduler continues to have stability and performance problems and was totally disabled instead of being made the default process scheduler in 5.3 as planned. The mix of threading subsystems still yields problems with efficiency and stability. Also, the networking subsystem may now be multithreaded and therefore faster on SMP systems, but users with some implementations of the 3Com (SysKonnect/Yukon) gigabit LAN chip are now unable to access their network at all because of new bugs that have popped up in the driver; other SysKonnect/Yukon users have problems under heavy network traffic, along with those using Intel Pro/1000 chips. Unfortunately all of our test systems use these network chips for onboard LAN; coincidentally they are two of the most popular gigabit LAN chipsets used on modern motherboards from major manufacturers. We also experienced lockups during boot if a custom-compiled kernel did not have SMP enabled on a Hyper-Threaded computer. A list of these and other errata can be found here.

Considering the long list of significant problems in FreeBSD 5.3-RELEASE, it would seem irrational to recommend that anyone switch a production server from 4.x or any previous known-working 5.x release to 5.3. Just the same, the FreeBSD project maintains a migration guide for this purpose.

A lost lead

FreeBSD 5.x enjoyed an excellent head start in the fully 64-bit AMD64 operating system arena, but now trails the pack, with only Windows XP 64-bit behind it in speed and completeness. While 64-bit GNU/Linux in the form of SUSE, Red Hat, and Gentoo have all achieved a reasonable level of accomplishment (and Debian is on its way), FreeBSD 5.3-RELEASE did not add any long-awaited features, such as full 32-bit FreeBSD binary compatibility and 64-bit Linux binary compatibility. Linux 32-bit compatibility is also not natively available, but as usual there is an unofficial, not-yet-committed hack to get it to work. In addition, there is a severe reliability problem with systems that have more than 4GB of system memory, which is a limit meant to be broken by the AMD64 architecture. After having used FreeBSD 5.2.1-RELEASE for AMD64 on an Asus K8V Deluxe AMD64 workstation for several months, we've found 5.3-RELEASE to be unusable on the same machine. Due to the driver problems with the onboard network adapter as mentioned above, this test machine cannot even be properly used with the i386 edition, essentially forcing a downgrade to 5.2.1-RELEASE.

Improvements since 5.2.1

So far we've only focused on the negative parts of FreeBSD 5.3, but there are a few significant improvements over the previous version:

  • Windows NDIS binary drivers are now natively supported in the kernel; this means better wireless NIC compatibility
  • GCC is now at 3.4.2, Binutils at 2.15, and GDB at 6.1. Also, X.org has been upgraded to 6.7, GNOME to 2.6.2 and KDE to 3.3.0
  • There have also been several bug fixes and security patches since the previous release

A mediocre release

While the FreeBSD team seems to have accomplished some of its goals for 5-STABLE, they have also introduced a number of critical bugs. Where FreeBSD used to be a highly usable, reliable, and scalable operating system, the last three releases have been increasingly substandard, culminating in a hardly usable operating system on our test machines. The FreeBSD development team has a tradition of writing good code and maintaining a high-quality operating system. Unfortunately, FreeBSD 5.3-RELEASE lends little credence to that reputation.

Project leader Scott Long's release announcement claims that the team focused especially on bug squashing and testing, but considering all of the problems we encountered on our systems (and the fact that we reported one of these serious problems on the mailing lists during the release candidate testing), Long's assertion seems optimistic at best. Here's hoping that the FreeBSD team gets its act together politically and technically, and reclaims its reputation for excellence in operating system design and development.

Review: FreeBSD 5.4

By Jem Matzan on May 31, 2005 (8:00:00 AM)


One of the oldest Unix-like operating systems, FreeBSD, continues its advancement with the sixth release in the FreeBSD-5 series. Its developers have added nothing major, but have made many modifications, fixing a number of problems introduced in previous releases. FreeBSD 5.4 is the best release since 5.1, but it still may not be ready for prime time.

FreeBSD is a complete, multi-platform, Unix-like operating system developed by a large community of developers. As with GNU/Linux, you can make FreeBSD into a server or a desktop operating system. FreeBSD handles software management through two frameworks: the package database, which contains precompiled software packages, and the Ports tree, which contains metadata that allows you to automatically download and compile programs from source code. There are more than 12,000 programs in the Ports tree. Users can install packages easily from the command line, from an ncurses-based utility called sysinstall, or through "distribution sets" designed to install several packages together.

The FreeBSD operating system is licensed under the BSD license, although some included userland programs are licensed under other free software licenses.

Significant enhancements since 5.3

I found much wrong with FreeBSD 5.3, and I was glad to see that the FreeBSD development team squashed some of the bugs that I encountered in that and previous releases. A few of the most notable changes in the x86 and AMD64 editions are:

  • Security flaws in fetch, procfs, linprocfs, telnet, sendfile, ioctl, and cvs were fixed. These security fixes were already available to FreeBSD 5.3 users.
  • The ULE process scheduler was fixed, and is now available as an alternative to the 4BSD scheduler.
  • CPU frequency scaling functionality was added to the kernel.
  • Several network card drivers were made multiprocessor-capable. One network driver was added to support USB Ethernet adapters. The Intel Pro/1000 and SysKonnect/Yukon LAN drivers were fixed.
  • OpenBSD's CARP protocol was implemented.
  • The FreeBSD IP Firewall (ipfw) was updated with new options and features.
  • Network devices can now be given aliases at boot time.
  • BIND, netcat, Heimdal, OpenSSL, and Sendmail were updated to newer versions.
  • Documentation was updated.

You can find a complete list of bug fixes, enhancements, and additions in the release notes.

Putting it through the gauntlet

Testing FreeBSD 5.4 took longer than usual because of problems I had with it. I was sad to see that most of them were leftover bugs from 5.3 and 5.2.1 that still have not been fixed. These are problems that I found in several days of testing and note-taking on two test machines with both the x86 and AMD64 editions of FreeBSD.

I didn't have any trouble with a single-CPU Athlon 64 4000+ with an MSI K8T Neo2-FIR motherboard and a Seagate SATA-V hard drive with the AMD64 edition. On a dual Opteron system running the 64-bit FreeBSD, everything worked fine in AMD64 mode, except that an annoying "AD4: TIMEOUT - WRITE_DMA" error message popped up often on my test systems that used Serial ATA hard drives with the Silicon Image SiI3512 SATA controller, slowing them to a crawl.

Using the x86 version on the Opteron system resulted in a crash on first boot, possibly because the developers don't compile SMP support into the default kernel. It's not possible to compile a custom kernel during the installation procedure, so I'm not sure how one would fix the problem.

The x86 "boot only" ISO that I downloaded did not detect my Realtek 8169S LAN card properly during installation. This is a problem because this ISO needs to connect to FTP sites to download the operating system. The standard two-disc ISOs for both x86 and AMD64 worked perfectly -- only the "boot only" ISO failed to configure the network card. And yes, the MD5 sum did match between the server and my downloaded copy, and I tried writing the ISO twice to two different kinds of disc.

I was disappointed to find that Linux binary compatibility was still 64-bit only for 64-bit FreeBSD. That means no 32-bit Linux binaries. The proprietary Nvidia driver is still 32-bit only as well, although that's more Nvidia's fault than the FreeBSD project's. Most other programs in the Ports tree will work on AMD64, but many of them still don't compile for that architecture without editing their Makefiles.

Although not technically a "bug," all of the links on the release notes Web page that lead to man pages are broken at the time of this writing.

The "boot with USB keyboard" option in the boot menu was useless; every way I tried it, I had to unplug and replug my USB keyboard to get it to work after booting FreeBSD. This is the first time that I tested FreeBSD extensively with a USB keyboard, so I don't know if this bug is a holdover from previous releases.

On the plus side, I had good results with FreeBSD's new process scheduler that was disabled in the previous release. I used SCHED_ULE (ULE doesn't stand for anything -- it completes the word "schedule" and was designed to replace the default SCHED_4BSD scheduler) to compile KDE and GNOME from scratch to see if I could break it. Several hours later, both compiled and installed without incident. I left ULE in the kernel for another day and used the system normally without running into any problems with stability or noticeable changes in performance.

Conclusions

I used to use FreeBSD as my workstation operating system; in fact, we kicked off NewsForge's "My Workstation OS" series with a piece on that subject. But instead of getting better with each release, FreeBSD seems to hang on to a lot of serious problems while concentrating on less critical issues like what is and is not under the "big giant lock" (the nickname for the old thread-locking mechanism in FreeBSD, which prevented it from being multi-threaded). From a user's and reviewer's perspective, it looks as if FreeBSD's developers are trying to optimize code that does not yet work properly.

Speaking as a former FreeBSD user, I want this operating system to work again. I was disappointed to find that that didn't happen with 5.4-RELEASE. If you have FreeBSD 4.11 production machines and are thinking of upgrading, I suggest you leave them as they are for now.

Tuesday, 13 November 2007

Find a particular port

For example:
# cd /usr/ports
# make search name=lsof
Port: lsof-4.56.4
Path: /usr/ports/sysutils/lsof
Info: Lists information about open files (similar to fstat(1))
Maint: obrien@FreeBSD.org
Index: sysutils
B-deps:
R-deps:


Monday, 12 November 2007

Which OS is Fastest -- FreeBSD Follow-Up

Jeffrey Rothman and John Buckman

In the weeks after our article "Which OS is Fastest for High-Performance Network Applications?" was published in Sys Admin (July 2001), we received many emails from readers in the FreeBSD community who were unhappy with the benchmark results.

They stated that the FreeBSD operating system, when installed "out of the box", is configured by default to be very safe and reliable, and that the FreeBSD community purposely chose reliability over speed in configuring the default operating system. They contend that few production sites run FreeBSD as pre-configured. Rather, most FreeBSD systems administrators tune the operating system by reading "man tuning", by joining the FreeBSD discussion groups, and by reading other FreeBSD documentation. These readers felt that our "out of the box" test did not represent how FreeBSD is used in the real world, and thus that our benchmark results were unfair.

Based on the FreeBSD readers’ statement that "most systems administrators tune FreeBSD before putting it in production", we agreed to apply their tuning tips, re-run our tests, and publish the results. We started an email discussion list for all interested readers to discuss, agree, and suggest performance improvement changes to our FreeBSD system configuration. We applied their 17 OS changes and recompiled the kernel. Our revised test results are shown in Figures 1-3.

File System Test -- After FreeBSD Tuning

In our originally published file system test, FreeBSD did poorly, because by default the system uses synchronous updates of file system metadata. This makes FreeBSD more reliable in the event of an unexpected system shutdown (a crash or power outage), but negatively impacts speed. We enabled asynchronous mounting of the file system, as well as the other OS tweaks that were recommended to us (see the last section below for a list of the FreeBSD performance improvements we made). The results of the new test run are graphed in Figure 1 (see in particular, "FreeBSD-tuned" vs. "FreeBSD-untuned").

In Figure 1, FreeBSD-tuned, Windows 2000, and Linux blur together because their results are fairly close. For this reason, we also graphed the hard disk benchmarks with the two slowest performers (Solaris, FreeBSD-untuned) removed so that the differences between the top three performers would be clearer (see Figure 2).

As expected, the asynchronous option greatly improved FreeBSD file system performance, bringing it in line with Linux and Windows 2000, which both have a similar feature. FreeBSD performed better (by about 30%) than the others at the 8k and 16k file sizes. However, FreeBSD performed worse with a 128k file (16% worse than Windows, 39% worse than Linux), which skewed the "total run time" results, because that file size took the longest to run. Reader Jeremiah Gowdy said this about the 128k results: "the loss of performance at 128k has to do with the allocation of space, and how the disk was newfs’ed". The total run time for the hard drive test for each OS was: Linux: 542 seconds, Windows 2000: 613 seconds, FreeBSD tuned: 630 seconds, FreeBSD untuned: 2398 seconds, and Solaris: 3990 seconds.

Real-World MailEngine Test -- After FreeBSD Tuning

As described in our original article, we ran the program "MailEngine" (http://www.lyris.com/products/mailengine/) in a 200,000-recipient email delivery test. In the original results, with each operating system left "untuned", FreeBSD was slowest, with a downtrend beginning at 1500 simultaneous email sends. We applied the 17 FreeBSD OS tweaks that were suggested to us by our FreeBSD readers, and re-ran the FreeBSD test. Our results are shown in Figure 3.

After the FreeBSD tweaks, we found that FreeBSD tuned had very similar performance to Linux (untuned) when running 1000 or fewer simultaneous sends. Overall, the tuned version of FreeBSD was 27% faster at sending email than the untuned version. FreeBSD mail sending performance peaked at 1000 to 1500 simultaneous sends, and then steadily declined as simultaneous connections increased.

In the previously published test, we had been unable to run with 3000 connections. Now, with the 17 FreeBSD OS patches (including patches that our readers felt should fix this problem), we were frequently able to run at 3000 connections, but not much beyond that, and not consistently with 3000. In our program, the bind() system call failed sometimes with the EAGAIN error, other times with an EBADF error. This did not occur in the other operating systems. Both of these errors would indicate some sort of operating system resource shortage or system limit. Some of our readers wrote that they were aware of other FreeBSD sites that went well beyond these numbers of simultaneous connections, but none of the OS patches suggested allowed us to work around this limit. With overall mailing performance declining steadily, if mailing speed were the goal, it would not make sense to load FreeBSD with more than 1500 simultaneous sends.

Conclusions about FreeBSD

For applications that are disk intensive, we recommend that systems administrators configure their FreeBSD system to use the async option (or use soft updates for more reliability). Our hard disk benchmark was 3.8 times faster with the asynchronous FreeBSD file system, and its performance was in line with Windows 2000 and Linux (slightly faster at times, and slightly slower at other times, depending on the file size). In our real-world MailEngine test, we found that a tuned version of FreeBSD was as fast as an untuned version of Linux, for connection levels of 1500 sends or fewer, with FreeBSD performance declining steadily at simultaneous connection levels above 1500.
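The "3.8 times faster" figure can be checked against the total run times reported in the file system section. The following is a minimal Python sketch of that arithmetic; the dictionary and the `speedup` helper are illustrative names, not part of the original benchmark.

```python
# Total run times (seconds) for the hard disk test, as reported in the article.
TOTAL_RUN_TIME_S = {
    "Linux": 542,
    "Windows 2000": 613,
    "FreeBSD tuned": 630,
    "FreeBSD untuned": 2398,
    "Solaris": 3990,
}


def speedup(slower: str, faster: str) -> float:
    """Return how many times shorter the run time of `faster` is than `slower`."""
    return TOTAL_RUN_TIME_S[slower] / TOTAL_RUN_TIME_S[faster]


if __name__ == "__main__":
    # Tuned vs. untuned FreeBSD: 2398 / 630, which rounds to 3.8.
    print(f"tuned vs untuned FreeBSD: {speedup('FreeBSD untuned', 'FreeBSD tuned'):.1f}x")
    # Tuned FreeBSD compared with the fastest total (Linux).
    print(f"Linux vs tuned FreeBSD: {speedup('FreeBSD tuned', 'Linux'):.2f}x")
```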

FreeBSD Tuning Tips

The following FreeBSD OS tuning tips were suggested to us by readers of our article.

In single-user mode:

tunefs -n enable /
tunefs -n enable /usr
tunefs -n enable /var

Kernel configuration change (recompile and install the kernel afterwards):

MAXUSERS 512

In /boot/loader.conf:

hw.ata.wc="1"
kern.ipc.nmbclusters="60000"

In /etc/fstab, add ",async" to the options for all hard disk file systems.

In /etc/sysctl.conf:

vfs.vmiodirenable=1
kern.ipc.maxsockbuf=2097152
kern.ipc.somaxconn=8192
kern.ipc.maxsockets=16424
kern.maxfiles=65536
kern.maxfilesperproc=32768
net.inet.tcp.rfc1323=1
net.inet.tcp.delayed_ack=0
net.inet.tcp.sendspace=65535
net.inet.tcp.recvspace=65535
net.inet.udp.recvspace=65535
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

Acknowledgements

We thank the FreeBSD readers who helped us tune FreeBSD and achieve the results above. In particular, we would like to thank Jeremiah Gowdy, Wes Peters, Mark Blackman, Brad Knowles, Nick Sayer, Robert Hough, and Tarjei Jensen.

Jeffrey Rothman is the Manager of Technical Support and head System Administrator at Lyris, and holds a Ph.D. in Computer Science from U.C. Berkeley on the topic of high-performance memory architectures for multiprocessor systems.

John Buckman is the CEO/Founder of Lyris, and the original software programmer behind their three products: ListManager, MailShield, and MailEngine.

kern.ipc.nmbclusters

kern.ipc.nmbclusters may be adjusted to increase the number of network mbufs the system is willing to allocate. Each cluster represents approximately 2K of memory, so a value of 1024 represents 2M of kernel memory reserved for network buffers. You can do a simple calculation to figure out how many you need. If you have a web server which maxes out at 1000 simultaneous connections, and each connection eats a 16K receive and 16K send buffer, you need approximately 32MB worth of network buffers to deal with it. A good rule of thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768. So for this case you would want to set kern.ipc.nmbclusters to 32768. We recommend values between 1024 and 4096 for machines with moderate amounts of memory, and between 4096 and 32768 for machines with greater amounts of memory. Under no circumstances should you specify an arbitrarily high value for this parameter; it could lead to a boot-time crash. The -m option to netstat(1) may be used to observe network cluster use. Older versions of FreeBSD do not have this tunable and require that the kernel config(8) option NMBCLUSTERS be set instead.
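
The rule of thumb above can be written as a small calculation. This is only a sketch in Python: the function name, its parameters, and the default safety factor of 2 are illustrative choices based on the text, not a FreeBSD interface. Note that the text rounds 1000 connections up to "approximately 32MB", which is why it arrives at 32768; the exact arithmetic for 1000 connections gives 32000.

```python
CLUSTER_BYTES = 2048  # each mbuf cluster is approximately 2K, per the text


def estimate_nmbclusters(connections: int,
                         recv_buf_bytes: int,
                         send_buf_bytes: int,
                         safety_factor: int = 2) -> int:
    """Estimate a kern.ipc.nmbclusters value from peak connection count.

    Total buffer demand is connections * (receive + send buffer); the
    rule of thumb multiplies that by 2 and divides by the cluster size.
    """
    total_bytes = connections * (recv_buf_bytes + send_buf_bytes) * safety_factor
    # Ceiling division, so a partial cluster still counts as a whole one.
    return -(-total_bytes // CLUSTER_BYTES)


if __name__ == "__main__":
    kib = 1024
    print(estimate_nmbclusters(1000, 16 * kib, 16 * kib))  # 32000
    print(estimate_nmbclusters(1024, 16 * kib, 16 * kib))  # 32768
```

The result is only a starting point; the warning about arbitrarily high values still applies, and netstat -m shows what the system actually uses.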

Sunday, 11 November 2007

vnode


In computing, an inode is a data structure on a traditional Unix-style file system such as UFS. An inode stores basic information about a regular file, directory, or other file system object.

Details

When a file system is created, data structures that contain information about files are created. Each file has an inode and is identified by an inode number (often "i-number" or even shorter, "ino") in the file system where it resides. Inodes store information on files such as user and group ownership, access mode (read, write, execute permissions) and type of file. There is a fixed number of inodes, which indicates the maximum number of files each filesystem can hold.

A file's inode number can be found using the ls -i command, while the ls -l command will retrieve inode information.

Non-traditional Unix-style filesystems such as ReiserFS may avoid having a table of inodes, but must store equivalent data in order to provide equivalent function. The data may be called stat data, in reference to the stat system call that provides the data to programs.

The kernel's in-memory representation of this data is called struct inode in Linux. Systems derived from BSD use the term vnode, with the v of vnode referring to the kernel's virtual file system layer.

The POSIX standard mandates filesystem behavior that is strongly influenced by traditional UNIX filesystems. Regular files are required to have the following attributes:

  • The length of the file in bytes.
  • Device ID (this identifies the device containing the file).
  • The User ID of the file's owner.
  • The Group ID of the file.
  • The file mode, which determines what users can read, write, and execute the file.
  • Timestamps telling when the inode itself was last modified (ctime, change time), the file content last modified (mtime, modification time), and last accessed (atime, access time).
  • A reference count telling how many hard links point to the inode.
  • Pointers to the disk blocks that store the file's contents.
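
Most of the attributes listed above can be read from user space through the stat system call. The following Python sketch uses the standard `os.stat` wrapper on a Unix-like system; the `describe_inode` helper and its dictionary keys are made up for illustration. The block pointers are the one item that stat does not expose.

```python
import os
import stat
import tempfile


def describe_inode(path: str) -> dict:
    """Collect the inode attributes that stat() reports for `path`."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,                  # i-number within the file system
        "device": st.st_dev,                 # ID of the device holding the file
        "size_bytes": st.st_size,            # length of the file in bytes
        "uid": st.st_uid,                    # owner's user ID
        "gid": st.st_gid,                    # group ID
        "mode": stat.filemode(st.st_mode),   # file type and permission bits
        "links": st.st_nlink,                # number of hard links
        "atime": st.st_atime,                # last access
        "mtime": st.st_mtime,                # last content modification
        "ctime": st.st_ctime,                # last inode change
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = os.path.join(workdir, "sample.txt")
        with open(sample, "w") as handle:
            handle.write("hello")
        for key, value in describe_inode(sample).items():
            print(f"{key:>10}: {value}")
```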

The term inode usually refers to inodes on block devices that manage regular files, directories, and possibly symbolic links. The concept is particularly important to the recovery of damaged file systems.

The inode number indexes a table of inodes in a known location on the device; from the inode number, the kernel can access the contents of the inode, including the data pointers, and so the contents of the file.

Inodes do not contain filenames. Unix directories are lists of "link" structures, each of which contains one filename and one inode number. The kernel can search a directory, looking for a particular filename, and convert the filename to the correct corresponding inode number if the name is found.

The stat system call retrieves a file's inode number and some of the information in the inode.

The exact reason for designating these as "i" nodes is uncertain. When asked, Unix pioneer Dennis Ritchie replied:

'In truth, I don't know either. It was just a term that we started to use. "Index" is my best guess, because of the slightly unusual file system structure that stored the access information of files as a flat array on the disk, with all the hierarchical directory information living aside from this. Thus the i-number is an index in this array, the i-node is the selected element of the array. (The "i-" notation was used in the 1st edition manual; its hyphen became gradually dropped).'

Implications

File systems built on inodes have properties that often surprise users who are new to them:

  • If multiple names link to the same inode (they are all hard links to it) then all of the names are equivalent. The first one to have been created has no special status. This is unlike sometimes more familiar symbolic links where all of the links depend on the original name.
  • An inode can even have no links at all. Normally such a file would be removed from the disk and its resources freed for reallocation (the normal process of deleting a file) but if any processes are holding the file open, they may continue to access it, and the file will only be finally deleted when the last reference to it is closed. This includes executable images which are implicitly held open by the processes executing them. For this reason, when programs are updated, it is recommended to delete the old executable first and create a new inode for the updated version, so that any instances of the old version currently executing may continue to do so unbothered.
  • Typically, it is not possible to map from an open file to the filename that was used to open it. The operating system would convert the filename to an inode number at the first possible chance, then forget the filename. This means that the getcwd() and getwd() library functions would need to search the parent directory to find a file with an inode matching the "." directory, then search the grandparent directory for that directory, and so on until reaching the "/" directory. SVR4 and Linux systems retain extra information to avoid this awkwardness.
  • Historically, it was possible to hard link directories. This made the directory structure an arbitrary directed graph instead of a tree. It was possible for a directory to be its own parent (the / directory, however, retains this status). Modern systems generally prohibit this confusing state.
  • A file's inode number will stay the same when it is moved to another directory on the same device, or when the disk is defragmented. Therefore, moving either a file's directory entry or its data (or both) is not enough to prevent a running process from accessing it, if the process ever had a chance of finding out the inode number. This also implies that completely conforming behavior of inodes is impossible to implement with many non-Unix file systems, such as FAT and its descendants, which don't have a way of storing this lasting "sameness" when both a file's directory entry and its data are moved around.
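
Several of these points can be observed directly. The sketch below, in Python on a Unix-like system with a file system that supports hard links, creates two names for one inode, removes the first name, moves the file, and finally reads from a file that no longer has any name. The function and the keys of its result are illustrative only.

```python
import os
import tempfile


def hard_link_demo() -> dict:
    """Exercise hard links, renames and unlink-while-open in a scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        first = os.path.join(workdir, "first")
        second = os.path.join(workdir, "second")
        subdir = os.path.join(workdir, "sub")
        moved = os.path.join(subdir, "moved")
        os.mkdir(subdir)

        with open(first, "w") as handle:
            handle.write("hello")

        # A second name for the same inode: both names are equivalent.
        os.link(first, second)
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino
        link_count = os.stat(first).st_nlink

        # Removing the name that was created first does not affect the other.
        os.remove(first)
        with open(second) as handle:
            after_first_removed = handle.read()

        # Moving within the same device keeps the inode number.
        inode_before = os.stat(second).st_ino
        os.rename(second, moved)
        inode_kept = os.stat(moved).st_ino == inode_before

        # An open file stays readable after its last name is removed.
        handle = open(moved)
        os.remove(moved)
        after_unlink = handle.read()
        handle.close()

    return {
        "same_inode": same_inode,
        "link_count": link_count,
        "after_first_removed": after_first_removed,
        "inode_kept": inode_kept,
        "after_unlink": after_unlink,
    }


if __name__ == "__main__":
    for key, value in hard_link_demo().items():
        print(f"{key}: {value}")
```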

Practical considerations

Many programs used by system administrators on UNIX operating systems identify a file by its i-node number; the disk integrity checker fsck and the pfiles command are two examples. The need therefore arises to translate i-node numbers to file pathnames and vice versa. This can be done with the file-finding utility find and its -inum option, or with the ls command and the appropriate option, which on many platforms is -i.
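
As a rough illustration of what find with -inum does, the following Python sketch walks a directory tree and collects every path whose inode number matches. The `find_by_inode` name is hypothetical, and the check that each match lives on the same device as the starting directory is an added assumption, since inode numbers are only unique within one file system.

```python
import os
import tempfile


def find_by_inode(root: str, inode_number: int) -> list:
    """Return sorted paths under `root` whose inode number is `inode_number`."""
    root_device = os.lstat(root).st_dev
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # do not follow symbolic links
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if st.st_ino == inode_number and st.st_dev == root_device:
                matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "report.txt")
        alias = os.path.join(workdir, "alias.txt")
        with open(original, "w") as handle:
            handle.write("data")
        os.link(original, alias)  # second name for the same inode
        inode = os.stat(original).st_ino
        # Both names are listed, because they share one inode.
        print(find_by_inode(workdir, inode))
```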

External links

OFB.biz: Open for Business