Kishore Kumar Pusukuri
2011-02-12 04:48:19 UTC
Hi,
I am running a multithreaded application with 20 threads on a 24-core
AMD Opteron (ccNUMA) machine running Solaris 10. When I run the application with
each thread bound to its own core using pbind (one thread per core), performance
degrades dramatically: roughly an 80% performance loss with binding. To understand
this, I used "prstat -m" and found that without binding (the default case)
the lock-contention percentage (LCK field) is around 13%, while with binding it is around 30%.
Moreover, the latency percentage (LAT field) is almost zero without binding but around 37% with binding.
Please find the LCK and LAT fields of the prstat output below.
Configuration    USR   LCK   LAT
---------------------------------
No-Binding        86    13   0.1
Binding           32    30    37
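For reference, this is roughly how the binding is done. I used the pbind(1M)
command externally; the sketch below is only a programmatic equivalent using
processor_bind(2), binding the calling LWP to a given CPU id. It is an
illustration of the setup, not the code in myprogram.

/* Sketch: bind the calling LWP to a specific processor on Solaris.
 * Roughly equivalent in effect to running "pbind -b <cpu_id> <pid>/<lwpid>".
 * Illustrative only; the actual binding in my tests was done with pbind(1M). */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int bind_self_to_cpu(processorid_t cpu)
{
    processorid_t old_binding;

    /* P_LWPID + P_MYID: operate on the calling LWP. */
    if (processor_bind(P_LWPID, P_MYID, cpu, &old_binding) != 0) {
        perror("processor_bind");
        return -1;
    }
    printf("bound to CPU %d (previous binding: %d)\n",
           (int)cpu, (int)old_binding);
    return 0;
}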
So with binding the application spends most of its time either contending for
locks or waiting in the run queue. BTW, there is no significant difference in the
cache miss ratio measured with cpustat(1).
Could the following explain it? If not, please let me know how to track down
the reason for the behavior above.
Since the application has heavy inter-thread communication, some threads
have to wait for locks, so the binding configuration increases memory
traffic among the chips. Moreover, because of the extra memory latency, the delay-loop
time (the backoff delay before retrying a lock) keeps growing exponentially,
and therefore the threads spend most of their time waiting for locks (see the
sketch of the kind of backoff loop I have in mind just below).
In the default (no-binding) configuration, however, the load is balanced well
by migrating threads among the cores, so the threads get a chance to
share the lock data structures, which improves performance compared with the
binding configuration.
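To clarify what I mean by the delay loop: I am thinking of a test-and-test-and-set
spin lock with exponential backoff, along these lines. This is a minimal sketch
with my own names and constants (spin_t, BACKOFF_MIN_US, BACKOFF_MAX_US) and GCC
__sync builtins; it is not the actual lock implementation used by myprogram.

/* Sketch: test-and-test-and-set lock with exponential backoff.
 * Purely illustrative -- names and constants are mine, not from myprogram. */
#include <unistd.h>          /* usleep() */

typedef volatile unsigned int spin_t;

#define BACKOFF_MIN_US   1
#define BACKOFF_MAX_US   1024   /* cap so the delay cannot grow unbounded */

static void spin_lock(spin_t *lock)
{
    unsigned int delay_us = BACKOFF_MIN_US;

    for (;;) {
        /* Try to grab the lock: 0 -> unlocked, 1 -> locked. */
        if (__sync_lock_test_and_set(lock, 1) == 0)
            return;

        /* Lock was held: back off, then spin on a plain read until it
         * looks free, doubling the delay each time we lose the race.
         * On a ccNUMA machine the cache-line transfer latency adds to
         * every failed attempt, which is the effect I suspect above. */
        usleep(delay_us);
        while (*lock != 0)
            ;
        if (delay_us < BACKOFF_MAX_US)
            delay_us *= 2;
    }
}

static void spin_unlock(spin_t *lock)
{
    __sync_lock_release(lock);   /* store 0 with release semantics */
}

With this kind of lock, every failed acquisition doubles the wait, so once the
remote-memory latency makes threads miss the lock more often, the time spent in
the backoff loop can snowball, which would show up as high LCK/LAT in prstat.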
Please find the per-thread "prstat -Lm" output for both configurations below:
No-Binding (Default) Configuration
==================================
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
15637 user 93 0.2 0.0 0.0 0.0 6.6 0.0 0.1 186 34 437 0 myprogram/13
15637 user 92 0.2 0.0 0.0 0.0 8.0 0.0 0.1 176 36 399 0 myprogram/11
15637 user 91 0.2 0.0 0.0 0.0 8.8 0.0 0.1 201 34 398 0 myprogram/10
15637 user 89 0.2 0.0 0.0 0.0 11 0.0 0.2 253 34 450 0 myprogram/12
15637 user 87 0.2 0.0 0.0 0.0 13 0.0 0.1 194 34 414 0 myprogram/17
15637 user 87 0.2 0.0 0.0 0.0 13 0.0 0.1 187 34 416 0 myprogram/9
15637 user 86 0.2 0.0 0.0 0.0 13 0.0 0.1 188 34 420 0 myprogram/21
15637 user 86 0.1 0.0 0.0 0.0 14 0.0 0.1 227 45 454 0 myprogram/3
15637 user 86 0.2 0.0 0.0 0.0 14 0.0 0.1 215 37 443 0 myprogram/15
15637 user 86 0.2 0.0 0.0 0.0 14 0.0 0.1 212 35 435 0 myprogram/7
15637 user 85 0.2 0.0 0.0 0.0 14 0.0 0.3 258 43 520 0 myprogram/2
15637 user 85 0.2 0.0 0.0 0.0 15 0.0 0.1 213 34 454 0 myprogram/5
15637 user 85 0.2 0.0 0.0 0.0 15 0.0 0.1 216 80 438 0 myprogram/19
15637 user 85 0.2 0.0 0.0 0.0 15 0.0 0.1 248 36 464 0 myprogram/6
15637 user 84 0.2 0.0 0.0 0.0 15 0.0 0.1 257 35 474 0 myprogram/14
15637 user 84 0.2 0.0 0.0 0.0 16 0.0 0.1 241 31 445 0 myprogram/18
15637 user 83 0.2 0.0 0.0 0.0 17 0.0 0.2 256 30 467 0 myprogram/16
15637 user 83 0.2 0.0 0.0 0.0 17 0.0 0.2 265 30 476 0 myprogram/8
15637 user 83 0.2 0.0 0.0 0.0 17 0.0 0.2 257 31 467 0 myprogram/20
15637 user 81 0.2 0.0 0.0 0.0 18 0.0 0.2 259 30 488 0 myprogram/4
15637 user 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 myprogram/1
Binding (thread-to-core) Configuration
=======================================
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
15687 user 6.1 0.0 0.0 0.0 0.0 41 0.0 53 33 8 54 0 myprogram/13
15687 user 5.7 0.0 0.0 0.0 0.0 32 0.0 62 31 10 38 0 myprogram/11
15687 user 5.5 0.0 0.0 0.0 0.0 37 0.0 57 26 15 35 0 myprogram/10
15687 user 5.4 0.0 0.0 0.0 0.0 47 0.0 47 34 6 78 0 myprogram/21
15687 user 5.4 0.0 0.0 0.0 0.0 35 0.0 60 28 16 43 0 myprogram/17
15687 user 5.2 0.0 0.0 0.0 0.0 42 0.0 53 33 6 59 0 myprogram/6
15687 user 5.2 0.0 0.0 0.0 0.0 36 0.0 59 31 8 36 0 myprogram/15
15687 user 5.2 0.0 0.0 0.0 0.0 56 0.0 39 36 7 72 0 myprogram/2
15687 user 5.1 0.0 0.0 0.0 0.0 51 0.0 44 34 6 62 0 myprogram/5
15687 user 5.0 0.0 0.0 0.0 0.0 50 0.0 45 33 6 54 0 myprogram/16
15687 user 5.0 0.0 0.0 0.0 0.0 39 0.0 56 31 8 43 0 myprogram/7
15687 user 4.9 0.0 0.0 0.0 0.0 38 0.0 57 33 7 41 0 myprogram/19
15687 user 4.8 0.0 0.0 0.0 0.0 32 0.0 63 29 11 47 0 myprogram/12
15687 user 4.7 0.0 0.0 0.0 0.0 43 0.0 53 31 8 36 0 myprogram/14
15687 user 4.6 0.0 0.0 0.0 0.0 36 0.0 59 32 8 46 0 myprogram/8
15687 user 4.5 0.0 0.0 0.0 0.0 51 0.0 45 33 5 63 0 myprogram/20
15687 user 4.5 0.0 0.0 0.0 0.0 57 0.0 38 32 6 60 0 myprogram/18
15687 user 4.4 0.0 0.0 0.0 0.0 59 0.0 37 31 7 66 0 myprogram/9
15687 user 4.3 0.0 0.0 0.0 0.0 43 0.0 53 30 6 41 0 myprogram/3
15687 user 4.3 0.0 0.0 0.0 0.0 43 0.0 53 33 5 57 0 myprogram/4
15687 user 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 myprogram/1
--
This message posted from opensolaris.org