Saturday, 22 December 2007

Squeeze Your Gigabit NIC for Top Performance

October 24, 2007
By Charlie Schluting

Many new workstations and servers come with integrated gigabit network cards, but quite a few people soon discover that they can't transfer data much faster than they did with 100 Mb/s network cards. Multiple factors can affect your ability to transfer at higher speeds, and most of them revolve around operating system settings. In this article we will discuss the steps needed to make your new gigabit-enabled server approach gigabit speeds in Linux, FreeBSD, and Windows.

Hardware considerations
First and foremost we must realize that there are hardware limitations to consider. Just because someone throws a gigabit network card in a server doesn't mean the hardware can keep up.


Network cards are normally connected to the PCI bus via a free PCI slot. In older workstation and non-server-class motherboards the PCI slots are normally 32-bit, 33MHz, which means they can transfer at speeds of 133MB/s. Since the bus is shared between many parts of the computer, it's realistically limited to around 80MB/s in the best case.

Gigabit network cards provide speeds of 1000Mb/s, or 125MB/s. If the PCI bus is only capable of 80MB/s, that is a major limiting factor for gigabit network cards. The math works out to 640Mb/s, which is still quite a bit faster than most gigabit installations actually achieve, but remember this is probably the best-case scenario.

If there are other data-hungry PCI cards in the server, you'll likely see much less throughput. The only way to overcome this bottleneck is to purchase a motherboard with a 66MHz PCI slot, which can do 266MB/s. Better still, 64-bit PCI slots are capable of 532MB/s on a 66MHz bus, and these are beginning to come standard on server-class motherboards.

Assuming we're using decent hardware that can keep up with the data rates necessary for gigabit, there is another obstacle: the operating system. For testing, we used two identical servers: Intel server motherboards, Pentium 4 3.0 GHz, 1GB RAM, and an integrated 10/100/1000 Intel network card. One ran Gentoo Linux with a 2.6 SMP kernel, and the other ran FreeBSD 5.3 with an SMP kernel to take advantage of the Pentium 4's HyperThreading capabilities. We were lucky to have a gigabit-capable switch, but the same results could be obtained by connecting the two servers directly to each other.

Software Considerations
For testing speeds between two servers, we don't want to use FTP or anything that will fetch data from disk. Memory to memory transfers are a much better test, and many tools exist to do this. For our tests, we used [ttcp](http://www.pcausa.com/Utilities/pcattcp.htm).
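
If you haven't used ttcp before, a minimal memory-to-memory run looks roughly like the following (exact flags vary a bit between the classic BSD ttcp and the PCAUSA port linked above, and 'receiver-host' is just a placeholder name):

    # On the receiving server: sink generated data instead of writing to disk
    ttcp -r -s

    # On the sending server: transmit generated data from memory to the receiver
    ttcp -t -s receiver-host

When the transfer finishes, both ends report the throughput they measured.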

The first test between these two servers was not pretty. The maximum rate was around 230 Mb/s: about two times as fast as a 100Mb/s network card. This was an improvement, but far from optimal. In actuality, most people will see even worse performance out of the box. However, with a few minor setting changes, we quickly realized major speed improvements — more than a threefold improvement over the initial test.

Many people recommend raising the MTU of your network interface. This basically means telling the network card to send a larger Ethernet frame. While this may be useful when connecting two hosts directly together, it becomes less useful when connecting through a switch that doesn't support larger MTUs. At any rate, it isn't necessary: 900Mb/s can be attained at the normal 1500-byte MTU setting.
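
That said, if both NICs and the switch do support jumbo frames and you want to experiment, raising the MTU is a one-line change per interface. A hedged sketch, assuming an Intel em0 interface on FreeBSD and an eth0 interface on Linux (both interface names are placeholders):

    # FreeBSD
    ifconfig em0 mtu 9000

    # Linux
    ifconfig eth0 mtu 9000

Every device along the path must support the larger frame size, or oversized frames will simply be dropped.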

For attaining maximum throughput, the most important options involve TCP window sizes. The TCP window controls the flow of data and is negotiated during the start of a TCP connection. Using too small a size will result in slowness, since TCP can only use the smaller of the two end systems' capabilities. It is quite a bit more complex than this, but here's the information you really need to know:

Configuring Linux and FreeBSD
For both Linux and FreeBSD we're using the sysctl utility. For each of the following options, entering the command 'sysctl variable=number' should do the trick. To view a current setting, use 'sysctl variable' (or 'sysctl -a' to list everything). A consolidated example that can go into /etc/sysctl.conf follows the list below.

* Maximum window size:
  o FreeBSD:
    kern.ipc.maxsockbuf=262144
  o Linux:
    net.core.wmem_max=8388608

* Default window size:
  o FreeBSD, sending and receiving:
    net.inet.tcp.sendspace=65536
    net.inet.tcp.recvspace=65536
  o Linux, sending and receiving:
    net.core.wmem_default=65536
    net.core.rmem_default=65536

* RFC 1323:
  This enables the useful window scaling option defined in RFC 1323, which allows windows to grow dynamically beyond the sizes specified above.
  o FreeBSD:
    net.inet.tcp.rfc1323=1
  o Linux:
    net.ipv4.tcp_window_scaling=1

* Buffers:
  When sending large amounts of data, we can run the operating system out of buffers. These values should be raised before the window settings above are put to use. To increase the amount of buffer memory available ("mbuf" clusters on FreeBSD, TCP memory pages on Linux):
  o FreeBSD:
    kern.ipc.nmbclusters=32768
  o Linux:
    net.ipv4.tcp_mem=98304 131072 196608
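
To make these survive a reboot, they can go into /etc/sysctl.conf on either system. A minimal sketch of the two files using the values above (the Linux net.core.rmem_max line is our addition, by analogy with wmem_max, so the receive side gets the same maximum; it is not part of the list above):

    # /etc/sysctl.conf on Linux
    net.core.wmem_max = 8388608
    net.core.rmem_max = 8388608
    net.core.wmem_default = 65536
    net.core.rmem_default = 65536
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.tcp_mem = 98304 131072 196608

    # /etc/sysctl.conf on FreeBSD
    kern.ipc.maxsockbuf=262144
    net.inet.tcp.sendspace=65536
    net.inet.tcp.recvspace=65536
    net.inet.tcp.rfc1323=1
    kern.ipc.nmbclusters=32768

Note that on some FreeBSD releases kern.ipc.nmbclusters is a boot-time tunable and belongs in /boot/loader.conf rather than /etc/sysctl.conf.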

These quick changes will skyrocket TCP performance. Afterwards we were able to run ttcp and attain around 895 Mb/s every time – quite an impressive data rate. There are other options available for adjusting the UDP datagram sizes as well, but we're mainly focusing on TCP here.

Windows XP/2000 Server/Server 2003
The magical location for TCP settings in the registry editor is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

We need to add a DWORD value named TcpWindowSize and enter a sufficiently large size; 131400 (make sure you select 'decimal') should be enough. Tcp1323Opts should also be set to 3, which enables both RFC 1323 window scaling and timestamps.

And, similarly to Unix, we want to increase the TCP buffer sizes:

ForwardBufferMemory 80000
NumForwardPackets 60000
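
For the command-line inclined, the same four values can be set with reg.exe (all as REG_DWORD with decimal data; TCP parameters typically need a reboot to take effect). A sketch, assuming reg.exe is available on your Windows version:

    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpWindowSize /t REG_DWORD /d 131400 /f
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v Tcp1323Opts /t REG_DWORD /d 3 /f
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v ForwardBufferMemory /t REG_DWORD /d 80000 /f
    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v NumForwardPackets /t REG_DWORD /d 60000 /f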

One last important note for Windows XP users: if you've installed Service Pack 2, there is another likely culprit for poor network performance. As explained in [knowledge base article 842264](http://support.microsoft.com/?kbid=842264), Microsoft says that disabling Internet Connection Sharing after an SP2 install should fix performance issues.

The above tweaks should enable your sufficiently fast server to attain much faster data rates over TCP. If your specific application makes significant use of UDP, it will be worth looking into similar options relating to UDP datagram sizes. Remember, we obtained close to 900Mb/s with a very fast Pentium 4 machine, a server-class motherboard, and a quality Intel network card. Results may vary wildly, but adjusting the above settings is a necessary step toward realizing your server's capabilities.

Friday, 21 December 2007

A few parameters for tuning BIND performance as a recursive DNS server

recursive-clients 1000000;
version "xxxOK";
lame-ttl 900;
max-cache-size 100K; // kept small to pick up changes on the remote DNS side faster
max-ncache-ttl 60; // kept small to pick up changes on the remote DNS side faster
max-cache-ttl 60; // kept small to pick up changes on the remote DNS side faster
cleaning-interval 1; // kept small to pick up changes on the remote DNS side faster
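
For context, these directives all live inside the options block of named.conf. A minimal sketch, assuming BIND 9:

    options {
        // ... directory, listen-on, and other local settings ...
        recursive-clients 1000000;
        version "xxxOK";
        lame-ttl 900;
        max-cache-size 100K;     // tiny cache: remote changes are picked up quickly
        max-ncache-ttl 60;       // cap negative answers at 60 seconds
        max-cache-ttl 60;        // cap positive answers at 60 seconds
        cleaning-interval 1;     // expire stale cache entries every minute
    };

The trade-off is deliberate: a tiny cache and short TTL caps mean more upstream queries, but changes on the remote DNS side show up almost immediately.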

Monday, 17 December 2007

The sixty second pmc howto - advice from a BSD expert on kernel profiling

For those of you who haven't been using pmc for kernel profiling, you
definitely ought to consider it. Joseph Koshy's tools are very easy to use,
and the results are quite informative. The output mode to convert pmc samples
to something gmon understands is particularly neat.

In thirty seconds, here's how to get it working (a consolidated command listing follows the steps):

(1) Compile your kernel with device hwpmc, options HWPMC_HOOKS.

(2) Run "pmcstat -S instructions -O /tmp/sample.out" to start sampling of
instruction retirement events, saving the results to /tmp/sample.out.

(3) Exercise your kernel code.

(4) Exit pmcstat -- I hit ctrl-c, but there are probably more mature ways of
setting this up.

(5) Run "pmcstat -R /tmp/sample.out -k /zoo/tiger-2/boot/kernel/kernel -g" to
convert the results to gmon output format so it can be processed by gprof.
Obviously, you need to set the right path to your boot kernel -- by
default, pmcstat uses /boot/kernel/kernel for kernel sample results.

(6) View the results using gprof, "gprof /zoo/tiger-2/boot/kernel/kernel
p4-instr-retired/gmon.out". Again, update the path as needed.
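
Put together, the whole session looks roughly like this (a sketch using the paths from the steps above; substitute your own kernel path and sample file):

    # Kernel configuration additions, then rebuild and boot the new kernel:
    #     device   hwpmc
    #     options  HWPMC_HOOKS

    # Sample instruction retirement events into /tmp/sample.out, exercise the
    # kernel, then stop pmcstat (ctrl-c is fine):
    pmcstat -S instructions -O /tmp/sample.out

    # Convert the samples to gmon format (pmcstat defaults to /boot/kernel/kernel,
    # so point -k at the kernel that was actually running):
    pmcstat -R /tmp/sample.out -k /zoo/tiger-2/boot/kernel/kernel -g

    # View the flat profile with gprof:
    gprof /zoo/tiger-2/boot/kernel/kernel p4-instr-retired/gmon.out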

Since there is no call graph information in the sample, the first few pages of
gprof output will be of limited utility, but the summary table by function is
the bit I found most useful:

  %   cumulative    self              self     total
 time   seconds    seconds   calls   ms/call  ms/call  name
 13.6   7251.00    7251.00       0   100.00%           _mtx_lock_sleep [1]
  3.7   9213.00    1962.00       0   100.00%           _mtx_lock_spin [2]
  3.3  10956.00    1743.00       0   100.00%           bus_dmamap_load_mbuf_sg [3]
  2.7  12370.00    1414.00       0   100.00%           tcp_input [4]
  2.6  13781.00    1411.00       0   100.00%           tcp_output [5]
  2.6  15172.00    1391.00       0   100.00%           spinlock_exit [6]
  2.5  16496.00    1324.00       0   100.00%           uma_zalloc_arg [7]
  2.0  17555.00    1059.00       0   100.00%           spinlock_enter [8]
  1.9  18553.00     998.00       0   100.00%           uma_zfree_arg [9]
  1.6  19409.00     856.00       0   100.00%           em_rxeof [10]
  1.6  20260.00     851.00       0   100.00%           sleepq_signal [11]
  1.5  21071.00     811.00       0   100.00%           em_get_buf [12]
  1.4  21821.00     750.00       0   100.00%           rn_match [13]
  1.3  22531.00     710.00       0   100.00%           ether_demux [14]
  1.3  23222.00     691.00       0   100.00%           ip_output [15]
 ...

In this sample, there's a mutex contention problem, which shows up clearly.
Without the call graph, there's more work yet to do to identify the source of
the contention, but the fact that pmc has relatively low overhead, not to
mention higher accuracy, compared to the existing kernel profiling is great.
And this isn't limited to instruction retirement: you can also profile by cache
line misses, mispredicted branches, and so on. Here's an excerpt from the
resource stall sample on the same workload:

  %   cumulative    self              self     total
 time   seconds    seconds   calls   ms/call  ms/call  name
  3.9   3225.00    3225.00       0   100.00%           m_freem [1]
  3.7   6253.00    3028.00       0   100.00%           ether_input [2]
  3.5   9116.00    2863.00       0   100.00%           cpu_switch [3]
  2.9  11470.00    2354.00       0   100.00%           mb_ctor_pack [4]
  2.7  13681.00    2211.00       0   100.00%           uma_zalloc_arg [5]
  2.7  15882.00    2201.00       0   100.00%           em_rxeof [6]
  2.6  18045.00    2163.00       0   100.00%           em_intr_fast [7]
  2.6  20148.00    2103.00       0   100.00%           intr_event_schedule_thread [8]
  2.2  21941.00    1793.00       0   100.00%           if_handoff [9]
  2.2  23727.00    1786.00       0   100.00%           sleepq_signal [10]
  1.8  25222.00    1495.00       0   100.00%           _mtx_lock_sleep [11]
  1.7  26612.00    1390.00       0   100.00%           tcp_output [12]
  1.7  27997.00    1385.00       0   100.00%           bus_dmamap_load_mbuf_sg [13]
  1.6  29309.00    1312.00       0   100.00%           uma_zfree_arg [14]

So if you're doing kernel performance work, and not already using pmc, you
probably should be.

Robert N M Watson
rwatson@freebsd.org
