Discussion: An idea to enhance cpc
Jin Yao
2010-03-16 05:27:46 UTC
Hi All,

When I use cputrack to track a process on a NUMA system (like nhm-ex),
I want to see performance events such as "RMA" (Remote Memory Access)
incurred by the process.

cputrack can tell me the RMA count that the process incurs across the
whole system, e.g. in the last 5s it incurred 1,000 RMAs across all 4
nodes (4 sockets).

But sometimes I want to know the RMA count per node, e.g. how many
RMAs the process incurs on node1 versus node2.

cputrack can't give me that result because cpc doesn't support
separating performance counter values per CPU for a thread/process.
So I want to provide a patch that enhances cpc to support this feature.

Does anybody think this would be valuable?

Thanks
Jin Yao
--
This message posted from opensolaris.org
Li, Aubrey
2010-03-19 06:51:00 UTC
Post by Jin Yao
Hi All,
When I use cputrack to track a process on a NUMA system (like nhm-ex),
I want to see performance events such as "RMA" (Remote Memory Access)
incurred by the process.
cputrack can tell me the RMA count that the process incurs across the
whole system, e.g. in the last 5s it incurred 1,000 RMAs across all 4
nodes (4 sockets).
But sometimes I want to know the RMA count per node, e.g. how many
RMAs the process incurs on node1 versus node2.
cputrack can't give me that result because cpc doesn't support
separating performance counter values per CPU for a thread/process.
So I want to provide a patch that enhances cpc to support this feature.
Does anybody think this would be valuable?
Thanks
Jin Yao
Yeah, this is also needed for NUMAtop.
Yao - I think you can post your webrev here; I guess the experts here
would rather read code than anything else, ;-)

Thanks,
-Aubrey
Jin Yao
2010-03-19 15:25:47 UTC
Let me describe how the patch would work.

1. Suppose we need to capture thread1 (we use cpc_bind_pctx to bind to thread1).

2. The kernel allocates a slot array big enough to store the required cpc
events for thread1 on all CPUs. Each slot stores the sampled values
for the events on one CPU.

3. When thread1 is migrated to cpuN, the kernel programs the hardware counters
on cpuN and starts counting.

4. When thread1 is switched off cpuN, the kernel samples and stops the hardware
counters on cpuN and adds the sampled values to the slot that represents the
required events on cpuN.

5. When the user reads thread1's counter values, the kernel copies out the
contents of the slot array (slot N holds the event counts for thread1 on cpuN).

To stay compatible with the current cpc interface, the patch introduces a new
flag, "CPC_FLAG_LWP_ON_CORES", to enable/disable the behavior described above.

It should be a simple patch. Does anybody have suggestions?

Thanks
Jin Yao
--
This message posted from opensolaris.org
Jin Yao
2010-04-01 16:03:56 UTC
Post by Jin Yao
Hi All,
When I use cputrack to track a process on a NUMA system (like nhm-ex),
I want to see performance events such as "RMA" (Remote Memory Access)
incurred by the process.
cputrack can tell me the RMA count that the process incurs across the
whole system, e.g. in the last 5s it incurred 1,000 RMAs across all 4
nodes (4 sockets).
But sometimes I want to know the RMA count per node, e.g. how many
RMAs the process incurs on node1 versus node2.
cputrack can't give me that result because cpc doesn't support
separating performance counter values per CPU for a thread/process.
So I want to provide a patch that enhances cpc to support this feature.
Does anybody think this would be valuable?
Thanks
Jin Yao
It looks like the patch idea hasn't attracted much interest so far. Please
allow me to give another example to show its value.

We ran the stream benchmark twice on a 4-socket system. The stream
benchmark uses OpenMP for parallelism and creates the specified number of
threads to do the computation; all computing threads must synchronize at
a barrier.

We see a big performance variation (10%) between the two runs.
1. In the better run, all the computing threads stay on their home
lgroup.
2. In the worse run, some threads are migrated from their home lgroup
to other lgroups. A DTrace script confirms thread migration between
lgroups in the worse run.

Our guess is that the threads on their home lgroup run faster than the
migrated-off threads, and the fast threads have to wait for the slow
threads to finish their work during the barrier phase.

If this guess is true, the migrated-off threads should incur a lot of
RMAs (Remote Memory Accesses) from other nodes when accessing the memory
on their home lgroup.

Unfortunately, we don't have data to support this, because with the
current cpc implementation we can only get a thread's total RMA count
across all CPUs/nodes. There is no way to separate the counts per
CPU/node for a thread, so we don't know how many RMAs the migrated-off
threads incur on the other lgroups. That's why I want to provide a small
patch to enhance cpc.

By the way, this raises another question: why does the scheduler migrate
these threads from their home lgroup to other lgroups? That may be an
interesting topic worth digging into later.

Thanks
Jin Yao
--
This message posted from opensolaris.org
Kuriakose Kuruvilla
2010-04-05 19:16:28 UTC
Hi Jin Yao

1. Regarding your RMA example, wouldn't you be able to use the dtrace
cpc provider to get this information?

2. How do you propose handling the case where both the proposed
per-hardware thread data and overflow profiling are enabled?

Thanks
/kuriakose
Jin Yao
2010-04-06 13:03:42 UTC
Post by Kuriakose Kuruvilla
Hi Jin Yao
1. Regarding your RMA example, wouldn't you be able to use the dtrace
cpc provider to get this information?
2. How do you propose handling the case where both the proposed
per-hardware thread data and overflow profiling are enabled?
Thanks
/kuriakose
Thanks for the suggestions, Kuriakose.

I just tried the dtrace cpc provider to check whether it can satisfy my
requirements. I wrote a DTrace script and tested it on nhm-ep.

#!/usr/sbin/dtrace -s
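/*
 * Each cpc probe fires once per <count> occurrences of the named event;
 * the probe name format is <event>-<mode>-<count>, where mode "all"
 * counts both user and kernel events.
 */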

cpc:::mem_load_retired.llc_miss-all-5000
{
        @llc_miss[pid, tid, cpu] = count();
}

cpc:::mem_uncore_retired.remote_dram-all-5000
{
        @rma[pid, tid, cpu] = count();
}

cpc:::mem_uncore_retired.local_dram-all-5000
{
        @lma[pid, tid, cpu] = count();
}

cpc:::instr_retired.any-all-1000000
{
        @ir[pid, tid, cpu] = count();
}

cpc:::cpu_clk_unhalted.thread-all-1000000
{
        @clk[pid, tid, cpu] = count();
}

END
{
        printf("\nllc_miss");
        printa(@llc_miss);
        printf("\nrma");
        printa(@rma);
        printf("\nlma");
        printa(@lma);
        printf("\nir");
        printa(@ir);
        printf("\nclk");
        printa(@clk);
}

The script does work and most of the output is the information I need,
except that there is no output for "clk". Did I miss something in the
script above? Or do I need to manually assign "cpu_clk_unhalted.thread"
to a specific pic?

In short, the dtrace cpc provider can give me the information that my
patch was intended to provide.

Thanks
Jin Yao
--
This message posted from opensolaris.org
Kuriakose Kuruvilla
2010-04-07 23:16:52 UTC
Post by Jin Yao
The script does work and most of the output is the information I need,
except that there is no output for "clk". Did I miss something in the
script above? Or do I need to manually assign "cpu_clk_unhalted.thread"
to a specific pic?
Worked for me. Manually assigning the event to a pic is not necessary.
Post by Jin Yao
In short, the dtrace cpc provider can give me the information that my
patch was intended to provide.
One thing to note is that the dtrace cpc provider samples the system
information (e.g. pid) at the point the overflow is delivered. It is not
necessarily the case that all of the events leading up to that overflow
were caused by the process running when the overflow was delivered.
Nevertheless, it is a quick way to test your hypothesis regarding the
remote accesses.

/kuriakose
