- 1. Introduction
- 2. Types of Thread Scheduling
- 2.1. Compact Scheduling
- 2.2. Round-Robin Scheduling
- 2.3. Stupid Scheduling
- 3. Defining Affinity
- 3.1. The Linux-Portable Way (taskset)
- 3.2. The Other Linux-Portable Way (numactl)
- 3.3. Using OpenMP Runtime Extensions
- 3.4. getfreesocket
Although a compute node or workstation may appear to have 16 cores and 64 GB of DRAM, these resources are not uniformly accessible to your applications. The best application performance is usually obtained by keeping your code’s parallel workers (e.g., threads or MPI processes) as close to the memory on which they are operating as possible. While you might like to think that the Linux thread scheduler would do this automatically for you, the reality is that most HPC applications benefit greatly from a little bit of help in manually placing threads on different processor cores.
To get an idea of what your multithreaded application is doing while it is running, you can use the
Assuming your executable is called
application.x, you can easily see what cores each thread is using by issuing the following command in bash:
$ for i in $(pgrep application.x); do ps -mo pid,tid,fname,user,psr -p $i;done
The PSR field is the OS identifier for the core each TID (thread id) is utilizing.
2. Types of Thread Scheduling
Certain types of unevenly loaded applications can experience serious performance degradation caused by the Linux scheduler treating high-performance application codes in the same way it would treat a system daemon that might spend most of its time idle.
These sorts of scheduling issues are best described with diagrams. Let’s assume we have compute nodes with two processor sockets, and each processor has four cores:
When you run a multithreaded application with four threads (or even four serial applications), Linux will schedule those threads for execution by assigning each one to a CPU core. Without being explicitly told how to do this scheduling, Linux may decide to
- run thread0 to thread3 on core0 to core3 on socket0
- run thread0 and thread1 on core0 and core1 on socket0, and run thread2 and thread3 on socket1
- run thread0 and thread1 on core0 only, run thread2 on core1, run thread3 on core2, and leave core3 completely unutilized
- any number of other nonsensical allocations involving assigning multiple threads to a single core while other cores sit idle
It should be obvious that option #3 and #4 are very bad for performance, but the fact is that Linux will happily schedule your multithreaded job (or multiple single-thread jobs) this way if your threads behave in a way that is confusing to the operating system.
2.1. Compact Scheduling
Option #1 is often referred to as “compact” scheduling and is depicted in the diagram to the right. It keeps all of your threads running on a single physical processor if possible, and this is what you would want if all of the threads in your application need to repeatedly access different parts of a large array. This is because all of the cores on the same physical processor can access the memory banks associated with (or “owned by”) that processor at the same speed. However, cores cannot access memory stored on memory banks owned by a different processor as quickly; this is phenomenon is called NUMA (non-uniform memory access). If your threads all need to access data stored in the memory owned by one processor, it is often best to put all of your threads on the processor who owns that memory.
2.2. Round-Robin Scheduling
Option #2 is called “scatter” or “round-robin” scheduling and is ideal if your threads are largely independent of each other and don’t need to access a lot of memory that other threads need. The benefit to round-robin thread scheduling is that not all threads have to share the same memory channel and cache, effectively doubling the memory bandwidth and cache sizes available to your application. The tradeoff is that memory latency becomes higher as threads have to start accessing memory that might be owned by another processor.
2.3. Stupid Scheduling
Option #3 and #4 are what I call “stupid” scheduling (see diagram to the right) and can often be the default behavior of the Linux thread scheduler if you don’t tell Linux where your threads should run. This happens because in traditional Linux server environments, most of the proceses that are running at any given time aren’t doing anything. To conserve power, Linux will put a lot of these quiet processes on the same processor or cores, then move them to their own dedicated core when they wake up and have to start processing.
If your application is running at full bore 100% of the time, Linux will probably keep it on its own dedicated CPU core. However, if your application has an uneven load (e.g., threads are mostly idle while the last thread finishes), Linux will see that the application is mostly quiet and pack all the quiet threads (e.g., t0 and t1 in the diagram to the right) on to the same CPU core. This wouldn’t be so bad, but the cost of moving a thread from one core to another requires context switches which get very expensive when done hundreds or thousands of times a minute.
3. Defining affinity
3.1. The Linux-Portable Way (taskset)
If you want to launch a job (e.g.,
simulation.x) on a certain set of cores (e.g.,
$ taskset -c 0,2,4,6 simulation.x
If your process is already running, you can define thread affinity while in flight. It also lets you bind specific TIDs to specific processors at a level of granularity greater than specifying
-c 0,2,4,6because Linux may still schedule two threads on
core2 and nothing on
core0. For example,
$ for i in $(pgrep application.x);do ps -mo pid,tid,fname,user,psr -p $i;done
PID TID COMMAND USER PSR
21654 - applicat glock -
- 21654 - glock 0
- 21655 - glock 2
- 21656 - glock 2
- 21657 - glock 6
- 21658 - glock 4
$ taskset -p -c 0 21654
$ taskset -p -c 0 21655
$ taskset -p -c 2 21656
$ taskset -p -c 4 21657
$ taskset -p -c 6 21658
This sort of scheduling will happen under certain conditions, so specifying a set of cpus to a set of threads without specifically assigning each thread to a physical core may not always behave optimally.
3.2. The Other Linux-Portable Way (numactl)
The emerging standard for easily binding processes to processors on Linux-based supercomputers is
numactl. It can operate on a coarser-grained basis (i.e., CPU sockets rather than individual CPU cores) than taskset (only CPU cores) because it is aware of the processor topology and how the CPU cores map to CPU sockets. Using
numactl is typically easier–after all, the common goal is to confine a process to a numa pool (or “cpu node”) rather than specific CPU cores. To that end,
numactl also lets you bind a processor’s memory locality to prevent processes from having to jump across NUMA pools (called “memory nodes” in
Whereas if you wanted to bind a specific process to one processor socket with
taskset you would have to
$ taskset -c 0,2,4,6 simulation.x
the same operation is greatly simplified with
$ numactl --cpunodebind=0 simulation.x
If you want to also restrict
simulation.x’s memory use to the numa pool associated with cpu node 0, you can do
$ numactl --cpunodebind=0 --membind=0 simulation.x
$ numactl -C 0 -N 0 simulation.x
You can see what cpu nodes and their corresponding memory nodes are available on your system by using numactl -H:
$ numactl -H
available: 2 nodes (0-1)
node 0 size: 32728 MB
node 0 free: 12519 MB
node 1 size: 32768 MB
node 1 free: 16180 MB
node 0 1
0: 10 21
1: 21 10
numactl also lets you supply specific cores (like
taskset) with the –physcpubind or -C. Unlike
numactl does not appear to let you change the CPU affinity of a process that is already running.
An alternative syntax to
numactl -C is something like
$ numactl -C +0,1,2,3 simulation.x
By prefixing your list of cores with a
+, you can have numactl bind to relative cores. When combined with cpusets (which are enabled by default for all jobs on Gordon), the above command will use the 0th, 1st, 2nd, and 3rd core of the job’s given cpuset instead of literally core 0,1,2,3.
3.3. Using OpenMP Runtime Extensions
OpenMP 4.0 now includes standardized controls for binding threads to cores. I haven’t caught up with these changes but I will document them here once I do.
Multithreaded programs compiled with Intel Compilers can utilize Intel’s Thread Affinity Interface for OpenMP applications. Set and export the KMP_AFFINITY env variable to express binding preferences.KMP_AFFINITY has three principal binding strategies:
compact fills up one socket before allocating to other sockets
scatter evenly spreads threads across all sockets and cores
explicit allows you define exactly which cores/sockets to use
KMP_AFFINITY=compact will preferentially bind all your threads, one per core, to a single socket before it tries binding them to other sockets. Unfortunately, it will start at
socket0 regardless of if other processes (such as another SMP job) is already bound to that socket. You can explicitly specify an offset to force the job to bind to a specific socket, but you need to know exactly what is running on what cores and sockets on your node in order to specify this in your submit script.
You can also explicitly define which cores your job should use. Combined with a little knowledge of your system’s CPU topology ([Intel’s Processor Topology Enumeration tool][intel’s processor enumeration tool] is great for this). If you wanted to run on cores 0, 2, 4, and 6, you would do
GNU’s implementation of OpenMP has a environment variable similar to KMP_AFFINITY calledGOMP_CPU_AFFINITY. Incidentally, Intel’s OpenMP supports GOMP_CPU_AFFINITY, so using this variable may be a relatively portable way to specify thread affinity at runtime. The equivalent GOMP_CPU_AFFINITY for the KMP_AFFINITY I gave above would be:
I wrote a small perl script called
getfreesocket that uses
KMP_AFFINITY=explicit (or GOMP_CPU_AFFINITY) and some probing of the Linux OS at runtime to bind SMP jobs to free processor sockets. It should be invoked in a job script something like this:
nprocs=$(grep '^physical id' /proc/cpuinfo | sort -u | wc -l)
ncores=$(grep '^processor' /proc/cpuinfo | sort -u | wc -l)
if [ "z$freesock" == "z" ]
echo "Not enough free processors! aborting"
GOMP_CPU_AFFINITY="$(echo $freesock | sed -e 's/,/ /g')"
export KMP_AFFINITY OMP_NUM_THREADS GOMP_CPU_AFFINITY
This was a very simple solution to get single-socket jobs to play nicely on the shared batch system we were using at the Interfacial Molecular Science Laboratory. While
numactl is an easier way to accomplish some of this, it still requires that you know what other processes are sharing your node and on what CPU cores they are running. I’ve experienced problems with Linux’s braindead thread scheduling so this
getfreesocket finds completely unused sockets that can be fed into
This is not as great an issue if your resource manager supports launching jobs within cpusets. Your resource manager will provide a cpuset, and using relative specifiers for numactl cores (e.g., numactl -C +0-3) will bind to the free socket provided by the batch environment. Of course, this will not specifically bind one thread to one core, so using KMP_AFFINITY or GOMP_CPU_AFFINITY may remain necessary.