Senin, 17 Desember 2007

The sixty second pmc howto - saran dari pakar BSD untuk kernel profiling

For those of you who haven't been using pmc for kernel profiling, you
definitely ought to consider it. Joseph Koshy's tools are very easy to use,
and the results are quite informative. The output mode to convert pmc samples
to something gmon understands is particularly neat.

In thirty seconds, here's how to get it working:

(1) Compile your kernel with device hwpmc, options HWPMC_HOOKS.

(2) Run "pmcstat -S instructions -O /tmp/sample.out" to start sampling of
instruction retirement events, saving the results to /tmp/sample.out.

(3) Exercise your kernel code.

(4) Exit pmcstat -- I hit ctrl-c, but there are probably more mature ways of
setting this up.

(5) Run "pmcstat -R /tmp/sample.out -k /zoo/tiger-2/boot/kernel/kernel -g" to
convert the results to gmon output format so it can be processed by gprof.
Obviously, you need to set the right path to your boot kernel -- by
default, pmcstat uses /boot/kernel/kernel for kernel sample results.

(6) View the results using gprof, "gprof /zoo/tiger-2/boot/kernel/kernel
p4-instr-retired/gmon.out". Again, update the path as needed.

Since there is no call graph information in the sample, the first few pages of
gprof output will be of limited utility, but the summary table by function is
the bit I found most useful:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
13.6 7251.00 7251.00 0 100.00% _mtx_lock_sleep [1]
3.7 9213.00 1962.00 0 100.00% _mtx_lock_spin [2]
3.3 10956.00 1743.00 0 100.00% bus_dmamap_load_mbuf_sg
[3]
2.7 12370.00 1414.00 0 100.00% tcp_input [4]
2.6 13781.00 1411.00 0 100.00% tcp_output [5]
2.6 15172.00 1391.00 0 100.00% spinlock_exit [6]
2.5 16496.00 1324.00 0 100.00% uma_zalloc_arg [7]
2.0 17555.00 1059.00 0 100.00% spinlock_enter [8]
1.9 18553.00 998.00 0 100.00% uma_zfree_arg [9]
1.6 19409.00 856.00 0 100.00% em_rxeof [10]
1.6 20260.00 851.00 0 100.00% sleepq_signal [11]
1.5 21071.00 811.00 0 100.00% em_get_buf [12]
1.4 21821.00 750.00 0 100.00% rn_match [13]
1.3 22531.00 710.00 0 100.00% ether_demux [14]
1.3 23222.00 691.00 0 100.00% ip_output [15]
...

In this sample, there's a mutex contention problem, which shows up clearly.
Without the call graph, there's more work yet to do to identify the source of
contention, but the fact that pmc exhibits a relatively low overhead, not to
mention higher accuracy, than the existing kernel profiling is great. And,
this isn't limited to instruction retirement. You can also profile by cache
line miss, mis-predicted branches, etc. Here's an excerpt from the resource
stall sample on the same workload:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
3.9 3225.00 3225.00 0 100.00% m_freem [1]
3.7 6253.00 3028.00 0 100.00% ether_input [2]
3.5 9116.00 2863.00 0 100.00% cpu_switch [3]
2.9 11470.00 2354.00 0 100.00% mb_ctor_pack [4]
2.7 13681.00 2211.00 0 100.00% uma_zalloc_arg [5]
2.7 15882.00 2201.00 0 100.00% em_rxeof [6]
2.6 18045.00 2163.00 0 100.00% em_intr_fast [7]
2.6 20148.00 2103.00 0 100.00%
intr_event_schedule_thread
[8]
2.2 21941.00 1793.00 0 100.00% if_handoff [9]
2.2 23727.00 1786.00 0 100.00% sleepq_signal [10]
1.8 25222.00 1495.00 0 100.00% _mtx_lock_sleep [11]
1.7 26612.00 1390.00 0 100.00% tcp_output [12]
1.7 27997.00 1385.00 0 100.00% bus_dmamap_load_mbuf_sg
[13]
1.6 29309.00 1312.00 0 100.00% uma_zfree_arg [14]

So if you're doing kernel performance work, and not already using pmc, you
probably should be.

Robert N M Watson
rwatson@freebsd.org

Tidak ada komentar:

OFB.biz: Open for Business