Impermanence in Linux – Exclusive (By Hari Iyer)

Impermanence, also called Anicca or Anitya, is one of the essential doctrines of Buddhism and one of its three marks of existence. The doctrine asserts that all of conditioned existence, without exception, is “transient, evanescent, inconstant.”

On Linux, the root of all randomness is something called the kernel entropy pool. This is a large (4,096 bit) number kept privately in the kernel’s memory. There are 2^4096 possibilities for this number, so it can contain up to 4,096 bits of entropy. There is one caveat: the kernel needs to be able to fill that memory from a source with 4,096 bits of entropy. And that’s the hard part: finding that much randomness.

The entropy pool is used in two ways: random numbers are generated from it, and it is replenished with entropy by the kernel. When random numbers are generated from the pool, the entropy of the pool is diminished (because the person receiving the random number has some information about the pool itself). Since every random number handed out lowers the pool’s entropy, the pool must be replenished.

Replenishing the pool is called stirring: new sources of entropy are stirred into the mix of bits in the pool.

This is the key to how random number generation works on Linux. If randomness is needed, it’s derived from the entropy pool. When available, other sources of randomness are used to stir the entropy pool and make it less predictable. The details are a little mathematical, but it’s interesting to understand how the Linux random number generator works as the principles and techniques apply to random number generation in other software and systems.

The kernel keeps a rough estimate of the number of bits of entropy in the pool. You can check the value of this estimate through the following command:

cat /proc/sys/kernel/random/entropy_avail

A healthy Linux system with a lot of entropy available will return a value close to the full 4,096 bits of entropy. If the value returned is less than 200, the system is running low on entropy.
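As a minimal sketch, the check can be wrapped in a small shell function. The function name entropy_status and its "low"/"ok" output are made up for illustration; only the 200-bit threshold and the /proc path come from the text above.

```shell
# entropy_status BITS -> prints "low" when BITS is under 200, otherwise "ok".
# (Hypothetical helper, not a standard tool.)
entropy_status() {
    if [ "$1" -lt 200 ]; then
        echo "low"
    else
        echo "ok"
    fi
}

# Report on the running system if the kernel exposes the estimate.
avail_file=/proc/sys/kernel/random/entropy_avail
if [ -r "$avail_file" ]; then
    bits=$(cat "$avail_file")
    echo "entropy estimate: $bits bits ($(entropy_status "$bits"))"
fi
```

Keeping the threshold test separate from the file read makes it easy to try the function with any number.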

The kernel is watching you

I mentioned that the system takes other sources of randomness and uses this to stir the entropy pool. This is achieved using something called a timestamp.

Most systems have precise internal clocks. Every time that a user interacts with a system, the value of the clock at that time is recorded as a timestamp. Even though the year, month, day and hour are generally guessable, the millisecond and microsecond are not, and therefore the timestamp contains some entropy. Timestamps obtained from the user’s mouse and keyboard, along with timing information from the network and disk, each have different amounts of entropy.

How does the entropy found in a timestamp get transferred to the entropy pool? Simple, use math to mix it in. Well, simple if you like math.

Just mix it in

A fundamental property of entropy is that it mixes well. If you take two unrelated random streams and combine them, the new stream cannot have less entropy. Taking a number of low entropy sources and combining them results in a high entropy source.

All that’s needed is the right combination function: a function that can be used to combine two sources of entropy. One of the simplest such functions is the logical exclusive or (XOR). This truth table shows how bits x and y coming from different random streams are combined by the XOR function.

x  y  x XOR y
0  0     0
0  1     1
1  0     1
1  1     0

Even if one source of bits does not have much entropy, there is no harm in XORing it into another, unrelated source: the entropy of the result never decreases. In the Linux kernel, a combination of XORs is used to mix timestamps into the main entropy pool.
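The shell’s arithmetic makes it easy to try XOR mixing by hand. This is only a sketch of the operation itself: the function name xor_mix and the two byte values are invented for illustration and have nothing to do with the kernel’s actual mixing code.

```shell
# xor_mix A B -> prints A XOR B as a decimal number.
xor_mix() {
    echo $(( $1 ^ $2 ))
}

# Mix a hypothetical pool byte (0xA5) with a hypothetical timestamp byte (0x3C).
printf 'mixed byte: 0x%02x\n' "$(xor_mix 0xA5 0x3C)"   # prints: mixed byte: 0x99

# XORing the result with the same timestamp byte again restores the pool byte,
# which is why the mixed-in source has to be unpredictable to an attacker.
printf 'unmixed byte: 0x%02x\n' "$(xor_mix 0x99 0x3C)" # prints: unmixed byte: 0xa5
```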

Generating random numbers

Cryptographic applications require very high entropy. If a 128 bit key is generated with only 64 bits of entropy then it can be guessed in 2^64 attempts instead of 2^128 attempts. That is the difference between needing a thousand computers running for a few years to brute force the key versus needing all the computers ever created running for longer than the history of the universe to do so.

Cryptographic applications require close to one bit of entropy per bit. If the system’s pool has fewer than 4,096 bits of entropy, how does the system return a fully random number? One way to do this is to use a cryptographic hash function.

A cryptographic hash function takes an input of any size and outputs a fixed size number. Changing one bit of the input will change the output completely. Hash functions are good at mixing things together. This mixing property spreads the entropy from the input evenly through the output. If the input has more bits of entropy than the size of the output, the output will be highly random. This is how highly entropic random numbers are derived from the entropy pool.

The hash function used by the Linux kernel is the standard SHA-1 cryptographic hash. By hashing the entire pool and applying some additional arithmetic, 160 random bits are created for use by the system. When this happens, the system lowers its estimate of the entropy in the pool accordingly.
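To see the fixed-size output property in action, the sketch below pipes inputs of different lengths through the coreutils sha1sum command. It only illustrates SHA-1 itself; it is not how the kernel extracts bits from its pool, and the function name hash_pool is made up.

```shell
# hash_pool: read any amount of data on stdin, print its SHA-1 digest in hex.
hash_pool() {
    sha1sum | cut -d ' ' -f 1
}

short=$(printf 'a' | hash_pool)
long=$(printf 'a much longer input string than the first one' | hash_pool)

# Both digests are 40 hex digits long, i.e. 160 bits, whatever the input size.
echo "short input -> $short (${#short} hex digits)"
echo "long input  -> $long (${#long} hex digits)"
```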

Above I said that applying a hash like SHA-1 could be dangerous if there wasn’t enough entropy in the pool. That’s why it’s critical to keep an eye on the available system entropy: if it drops too low, the output of the random number generator could have less entropy than it appears to have.

Running out of entropy

One of the dangers for a system is running out of entropy. When the system’s entropy estimate drops to around the 160 bit level, the length of a SHA-1 hash, things get tricky, and how they affect programs and performance depends on which of the two Linux random number generators is used.

Linux exposes two interfaces for random data that behave differently when the entropy level is low. They are /dev/random and /dev/urandom. When the entropy pool becomes predictable, both interfaces for requesting random numbers become problematic.

When the entropy level is too low, /dev/random blocks and does not return until the level of entropy in the system is high enough. This guarantees high entropy random numbers. If /dev/random is used in a time-critical service and the system runs low on entropy, the delays could be detrimental to the quality of service.

On the other hand, /dev/urandom does not block. It continues to return the hashed value of its entropy pool even though there is little to no entropy in it. This low-entropy data is not suited for cryptographic use.
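Both interfaces are device files, so reading random bytes is an ordinary file read. The sketch below reads from /dev/urandom because it never blocks; the function name random_hex is made up for illustration.

```shell
# random_hex N -> print N bytes from /dev/urandom as 2*N hex digits.
random_hex() {
    head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
}

# 16 bytes = 128 bits, printed as 32 hex digits.
random_hex 16
```

Pointing the same function at /dev/random instead would show the blocking behavior described above on a system whose entropy estimate is low.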

The solution to the problem is to simply add more entropy into the system.

Hardware random number generation to the rescue?

Intel’s Ivy Bridge family of processors has an interesting feature called “Secure Key.” These processors contain a special piece of hardware that generates random numbers. The single assembly instruction RDRAND returns allegedly high entropy random data derived on the chip.

It has been suggested that Intel’s hardware number generator may not be fully random. Since it is baked into the silicon, that assertion is hard to audit and verify. As it turns out, even if the numbers generated have some bias, it can still help as long as this is not the only source of randomness in the system. Even if the random number generator itself had a back door, the mixing property of randomness means that it cannot lower the amount of entropy in the pool.

On Linux, if a hardware random number generator is present, the kernel uses the XOR function to mix the output of RDRAND into the hash of the entropy pool. This happens in the kernel’s random number generator source code (the XOR operator is ^ in C).

Third party entropy generators

Hardware number generation is not available everywhere, and the sources of randomness polled by the Linux kernel itself are somewhat limited. For this situation, a number of third party random number generation tools exist. Examples of these are haveged, which relies on processor cache timing, audio-entropyd and video-entropyd which work by sampling the noise from an external audio or video input device. By mixing these additional sources of locally collected entropy into the Linux entropy pool, the entropy can only go up.


Source – UNIX, Destination – Windows Cygwin (SSH Password-less Authentication)

On Windows Server

Create a local Windows user (say, MyUser), then register the same user in Cygwin:

cd C:\cygwin

Cygwin.bat

 

Administrator@MYWINDOWSHOST ~

$ /bin/mkpasswd -l -u MyUser >>/etc/passwd

MyUser@MYWINDOWSHOST ~

$ ls

MyUser@MYWINDOWSHOST ~

$ ls -al

total 24

drwxr-xr-x+ 1 MyUser        None    0 Mar 17 12:54 .

drwxrwxrwt+ 1 Administrator None    0 Mar 17 12:54 ..

-rwxr-xr-x  1 MyUser        None 1494 Oct 29 15:34 .bash_profile

-rwxr-xr-x  1 MyUser        None 6054 Oct 29 15:34 .bashrc

-rwxr-xr-x  1 MyUser        None 1919 Oct 29 15:34 .inputrc

-rwxr-xr-x  1 MyUser        None 1236 Oct 29 15:34 .profile

MyUser@MYWINDOWSHOST ~

$ ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/home/MyUser/.ssh/id_rsa):

Created directory ‘/home/MyUser/.ssh’.

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /home/MyUser/.ssh/id_rsa.

Your public key has been saved in /home/MyUser/.ssh/id_rsa.pub.

The key fingerprint is:

7d:40:12:1c:7b:c1:7f:39:ac:f5:1a:c5:73:ae:81:34 MyUser@MYWINDOWSHOST

The key’s randomart image is:

+--[ RSA 2048]----+
|       .++o      |
|        .+..     |
|        . o. . o |
|         o .E *.+|
|        S ...* =o|
|           .o o o|
|               = |
|              o  |
|                 |
+-----------------+

MyUser@MYWINDOWSHOST ~

$ cd .ssh

MyUser@MYWINDOWSHOST ~/.ssh

$ ls

id_rsa  id_rsa.pub

MyUser@MYWINDOWSHOST ~/.ssh

$ touch authorized_keys

On UNIX Server (the source)

Generate the key pair on the source host:

 $ ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/home/MyUser/.ssh/id_rsa):

Created directory ‘/home/MyUser/.ssh’.

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /home/MyUser/.ssh/id_rsa.

Your public key has been saved in /home/MyUser/.ssh/id_rsa.pub.

The key fingerprint is:

7d:40:12:1c:7b:c1:7f:39:ac:f5:1a:c5:73:ae:81:34 MyUser@MYWINDOWSHOST

The key’s randomart image is:

+--[ RSA 2048]----+
|       .++o      |
|        .+..     |
|        . o. . o |
|         o .E *.+|
|        S ...* =o|
|           .o o o|
|               = |
|              o  |
|                 |
+-----------------+

MyUser@MYUNIXHOST ~

$ cd .ssh

MyUser@MYUNIXHOST ~/.ssh

$ ls

id_rsa  id_rsa.pub

Then, from the home directory on the UNIX server, append the public key (id_rsa.pub) to authorized_keys on the destination:

cat .ssh/id_rsa.pub | ssh MyUser@MYWINDOWSHOST 'cat >> .ssh/authorized_keys'

 

ssh  -v MyUser@MYWINDOWSHOST

You should now be able to log in without a password.
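If the key has to be installed more than once, a small helper keeps authorized_keys free of duplicate lines. This is a sketch under assumptions: the function name add_key and the example paths are made up, and the chmod values reflect the common sshd habit of rejecting key files that other users can write to, which may behave differently under Cygwin.

```shell
# add_key PUBKEY_FILE AUTH_FILE
# Append the public key to AUTH_FILE unless that exact line is already there.
add_key() {
    mkdir -p "$(dirname "$2")"
    touch "$2"
    # -x: whole-line match, -F: fixed string, -f: take the pattern from the key file
    if ! grep -qxF -f "$1" "$2"; then
        cat "$1" >> "$2"
    fi
    chmod 700 "$(dirname "$2")"
    chmod 600 "$2"
}

# Usage on the destination host (paths are assumptions):
#   add_key /tmp/id_rsa.pub "$HOME/.ssh/authorized_keys"
```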

Credits :- Priyanka Padad (Operations Expert)

Hack #1 -> Define CD Base Directory Using CDPATH

If you are frequently performing cd to subdirectories of a specific parent
directory, you can set the CDPATH to the parent directory and perform
cd to the subdirectories without giving the parent directory path as
explained below.

# pwd
/home/ramesh
# cd mail
-bash: cd: mail: No such file or directory

[Note: The above cd is looking for mail directory under
current directory]

# export CDPATH=/etc
# cd mail
/etc/mail

[Note: The above cd is looking for mail under /etc and not
under current directory]

# pwd
/etc/mail

To make this change permanent, add

export CDPATH=/etc  to your ~/.bash_profile

Similar to the PATH variable, you can add more than one directory entry
in the CDPATH variable, separating them with : , as shown below.

export CDPATH=.:~:/etc:/var

This hack can be very helpful under the following situations:
• Oracle DBAs frequently working under $ORACLE_HOME, can set
the CDPATH variable to the oracle home
• Unix sysadmins frequently working under /etc, can set the
CDPATH variable to /etc
• Developers frequently working under project directory
/home/projects, can set the CDPATH variable to /home/projects
• End-users frequently accessing the subdirectories under their
home directory, can set the CDPATH variable to ~ (home
directory)
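The behavior is easy to try without touching your real environment. The sketch below builds a throwaway directory tree and performs the cd inside a subshell, so neither CDPATH nor the current directory of your session changes. The function name cd_via_cdpath and the directory names are made up for illustration.

```shell
# cd_via_cdpath PARENT NAME -> cd to NAME via CDPATH=PARENT and print where we landed.
cd_via_cdpath() {
    (
        CDPATH=$1
        # cd prints the directory it found through CDPATH; silence that and use pwd.
        cd "$2" >/dev/null && pwd
    )
}

# Throwaway tree: <tmpdir>/projects/app
base=$(mktemp -d)
mkdir -p "$base/projects/app"

# "app" is not under the current directory, but CDPATH finds it.
cd_via_cdpath "$base/projects" app

rm -rf "$base"
```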

MySQL – Enterprise – Installation – Linux

Phase #1 –  PreRequisites

MAKE SURE A MOUNT POINT /MySql IS CREATED BEFORE RUNNING THIS SCRIPT.

Creating the symbolic links for parallel database updates

ln -s /data /MySql/mysqldb
ln -s /data /MySql/mysql_db

Soft Links Created.
User and Group Adding.

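groupadd -g 27 mysql
echo 'System Group mysql created with GID 27.'
# useradd -p expects an already-encrypted password, so the password is set separately with chpasswd.
useradd -m -d /var/lib/mysql -g mysql -G mysql -u 27 mysql
echo 'mysql:root123' | chpasswd
echo 'System User mysql created with UID 27 home dir=/var/lib/mysql.'
# crond reads its allow list from /etc/cron.allow.
echo 'root' >>/etc/cron.allow
echo 'mysql' >>/etc/cron.allow
service crond restart
echo 'added the user mysql to the cron'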

DIRECTORY STRUCTURE CREATION

mkdir -p /MySql/mysqldb/configfiles
mkdir -p /MySql/mysqldb/datadump
mkdir -p /MySql/mysqldb/software_depot
mkdir -p /MySql/mysqldb/dbbackup
mkdir -p /MySql/mysqldb/archival
echo 'DIRECTORY STRUCTURE COMPLETE'

CONTAINER CREATION

mkdir -p /MySql/mysql_db/mysql/2345/var/lib/mysql
mkdir -p /MySql/mysql_db/mysql/2345/tmp
mkdir -p /MySql/mysql_db/mysql/2345/var/log/binlogs
echo 'CONTAINER STRUCTURE COMPLETE.'

SOFTWARE DEPOT PRE-REQUISITES

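mkdir -p /MySql/mysqldb/software_depot/meb
# -r is needed because /tmp/meb/bin is a directory.
cp -r /tmp/meb/bin /MySql/mysqldb/software_depot/meb/bin
mkdir -p /opt/product/meb
# /opt/product/meb already exists, so the link is created as /opt/product/meb/bin.
ln -s /MySql/mysqldb/software_depot/meb/bin /opt/product/meb
# mysqlbackup is a binary, so run it directly rather than through sh.
/opt/product/meb/bin/mysqlbackup --help
echo 'SUCCESSFULLY LINKED MEB'
chown -R mysql:mysql /opt/ /MySql/mysqldb/ /MySql/mysql_db/
echo 'PRE-REQUISITES COMPLETED SUCCESSFULLY NOW KINDLY INSTALL MYSQL-SERVER RPM AND MYSQL-CLIENT RPM'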

Phase #2 – Installation

Install the MySQL server and client RPMs named at the end of Phase #1.

[Screenshot: Capture7]

 

Phase #3 – Configuration – my.cnf

RUN ONLY AS MYSQL USER.

cd /MySql/mysqldb/configfiles

 

[mysqld]

#This Option tells the server to load the plugin and prevent it from being removed while the server is running.
audit-log=FORCE_PLUS_PERMANENT

#Audit Log File Location in the Container.
audit_log_file=/MySql/mysql_db/mysql/2345/var/log/audit_2345.log

#Audit Log Policy Parameter
audit_log_policy=LOGINS

#Rotate/Refresh the Log File after it reaches the size 1GB
audit_log_rotate_on_size=1073741824

#The number of TCP/IP connections that are queued at once. If you have many remote users connecting to your database simultaneously, you may need to increase this value. The trade-off for a high value is slightly increased memory and CPU usage.
back_log=128

#The size of the cache to hold the SQL statements for the binary log during a transaction. A binary log cache is allocated for each client if the server supports any transactional storage engines and if the server has the binary log enabled (--log-bin option). If you often use large, multiple-statement transactions, you can increase this cache size to get better performance. The Binlog_cache_use and Binlog_cache_disk_use status variables can be useful for tuning the size of this variable.
binlog_cache_size=1M

#Use charset_name as the default server character set.
character-set-server=utf8

#Use collation_name as the default server collation.
collation-server=utf8_general_ci

#The number of seconds that the mysqld server waits for a connect packet before responding with Bad handshake.
connect_timeout=10

#***********MYSQL DATA DIRECTORY ****************
datadir=/MySql/mysql_db/mysql/2345/var/lib/mysql

#************DEFAULT STORAGE ENGINE ***************
default-storage-engine=innodb
ft_min_word_len=2
general_log=0

#General Log File Path.
general_log_file=/MySql/mysql_db/mysql/2345/var/log/general_2345.log

group_concat_max_len=500000
innodb_additional_mem_pool_size=16M
innodb_buffer_pool_instances=5
innodb_buffer_pool_size=8G
innodb_file_per_table=1
innodb_flush_method=O_DIRECT
innodb_log_buffer_size=32M
innodb_log_file_size=500M
innodb_thread_concurrency=64
interactive_timeout=900

#Binary Logs Index File Path.
log-bin-index=/MySql/mysql_db/mysql/2345/var/log/binlogs/logbin_2345.index
log_bin_trust_function_creators=1

#Binary Log File Path.
log-bin=/MySql/mysql_db/mysql/2345/var/log/binlogs/bin_2345.log

#Error Log File Path.
log-error=/MySql/mysql_db/mysql/2345/var/log/mysqld_2345.log
log-queries-not-using-indexes
log-slow-slave-statements
log_warnings
long_query_time=0.05
max_allowed_packet=1G
max_binlog_size=1073741824
max_connect_errors=4294967295

#The number of simultaneous connections allowed by the database server. If some users are being denied access during busy times, you may need to increase this value. The trade-off is a more heavily loaded server. In other words, CPU usage, memory usage, and disk I/O will increase.
max-connections=4096
max_heap_table_size=64M
net_read_timeout=120
net_write_timeout=3600
old_passwords=0
open_files_limit=4096

#Process ID File Path.
pid-file=/MySql/mysql_db/mysql/2345/var/lib/mysql/mysql_2345.pid

#Port Number Used By MySql.
port=2345

query-cache-limit=1M
query_cache_size=64M
read_buffer_size=1M
read_rnd_buffer_size=8M

#Relay Log Index File Path
relay-log-index=/MySql/mysql_db/mysql/2345/var/log/binlogs/relaylog_2345.index

#Relay Log Information File Path.
relay-log-info-file=/MySql/mysql_db/mysql/2345/var/log/binlogs/relaylog_2345.info

#Relay Log File Path
relay-log=/MySql/mysql_db/mysql/2345/var/log/binlogs/relay_2345.log
server-id=222345
skip-character-set-client-handshake
skip-name-resolve
skip-slave-start
slave_net_timeout=60
slow_query_log=1

#Slow Query Log File Path.
slow_query_log_file=/MySql/mysql_db/mysql/2345/var/log/slowqueries_2345.log

#MySQL Socket Path
socket=/MySql/mysql_db/mysql/2345/var/lib/mysql_2345.sock
table-definition-cache=2048
table_open_cache=4096
thread_cache_size=16

#MySql Temp Directory.
tmpdir=/MySql/mysql_db/mysql/2345/tmp
tmp_table_size=64M
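Save all of the settings above, from [mysqld] onward, as my-2345.cnf in this directory; that is the file name Start-Stop.sh passes to mysqld_safe.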

Phase #4 – Start/Stop Service and Login

Start-Stop.sh

#!/bin/bash
# Start or stop the MySQL instance that listens on port 2345.

set -x

cnf=/MySql/mysqldb/configfiles/my-2345.cnf
log=/MySql/mysql_db/mysql/2345/var/log/mysqld_2345.log
errlog=/MySql/mysql_db/mysql/2345/var/log/mysqld_err_2345.log
# Same socket path as the socket= setting in the configuration file.
sock=/MySql/mysql_db/mysql/2345/var/lib/mysql_2345.sock

echo "Do you want to start or stop the MySQL daemon? [Type 'start' or 'stop' followed by ENTER]:- "
read -r action

# Strings are compared with =; -eq is for integers only.
if [ "$action" = "start" ]
then
    /usr/bin/mysqld_safe --defaults-file="$cnf" &
    echo 'CHECKING FOR ERRORS'
    # Count the ERROR lines in the error log; a missing log counts as 0.
    err=$(grep -c ERROR "$log" 2>/dev/null)
    err=${err:-0}
    if [ "$err" -eq 0 ]
    then
        echo 'NO ERRORS YIPPIE'
        rm -f "$log"
    else
        echo 'CHECK FOR THESE ERRORS'
        grep ERROR "$log" >>"$errlog"
        cat "$errlog"
        rm -f "$log"
        echo 'RE-RUN THE SCRIPT NOW IF YOU HAVE ERRORS.'
    fi
elif [ "$action" = "stop" ]
then
    # Count the running mysqld processes of this instance, ignoring the grep itself.
    count=$(ps -eaf | grep mysqld | grep 2345 | grep -v grep | wc -l)
    if [ "$count" -gt 0 ]
    then
        echo "Please enter the MySQL user. [Give the entry followed by ENTER]:- "
        read -r user
        /usr/bin/mysqladmin --socket="$sock" --port=2345 -u"$user" -p shutdown
    else
        echo "MYSQL PROCESS NOT RUNNING"
    fi
else
    echo "INVALID INPUT PLEASE TRY AGAIN"
fi

Login.sh

#!/bin/bash
# Optionally change a MySQL user's password, then optionally log in.

sock=/MySql/mysql_db/mysql/2345/var/lib/mysql_2345.sock

##  PASSWORD CHANGE SECTION ##
echo "Do you want to change the password for a user? [Type Y or N followed by ENTER]:- "
read -r answer

# Strings are compared with =; -eq is for integers only.
if [ "$answer" = "Y" ]
then
    echo "Enter the user whose password should change [Type the username followed by ENTER]:- "
    read -r user
    echo "Enter the new password for $user [Type the password followed by ENTER]:- "
    read -rs password    # -s keeps the password off the screen
    /usr/bin/mysqladmin --socket="$sock" --port=2345 -u "$user" password "$password"
elif [ "$answer" = "N" ]
then
    echo "PASSWORD WILL NOT BE CHANGED"
else
    echo "Please Provide a Valid Input"
fi

## LOGIN SECTION ##
echo "Do you want to log in to MySQL? [Type Y or N followed by ENTER]:- "
read -r login

if [ "$login" = "Y" ]
then
    # Ask for credentials only when a login was actually requested.
    echo "Please enter the user [Type the username followed by ENTER]:- "
    read -r user
    echo "Please enter the password for $user [Type the password followed by ENTER]:- "
    read -rs password
    /usr/bin/mysql -A -v --socket="$sock" --port=2345 -u"$user" -p"$password"
elif [ "$login" = "N" ]
then
    echo "OHK FINE WILL NOT LOGIN"
else
    echo "Please Provide a Valid Input"
fi

############################################

Process Affinity – Linux

  • 1. Introduction
  • 2. Types of Thread Scheduling
    • 2.1. Compact Scheduling
    • 2.2. Round-Robin Scheduling
    • 2.3. Stupid Scheduling
  • 3. Defining Affinity
    • 3.1. The Linux-Portable Way (taskset)
    • 3.2. The Other Linux-Portable Way (numactl)
    • 3.3. Using OpenMP Runtime Extensions
    • 3.4. getfreesocket

1. Introduction

Although a compute node or workstation may appear to have 16 cores and 64 GB of DRAM, these resources are not uniformly accessible to your applications. The best application performance is usually obtained by keeping your code’s parallel workers (e.g., threads or MPI processes) as close to the memory on which they are operating as possible. While you might like to think that the Linux thread scheduler would do this automatically for you, the reality is that most HPC applications benefit greatly from a little bit of help in manually placing threads on different processor cores.

To get an idea of what your multithreaded application is doing while it is running, you can use the ps command.

Assuming your executable is called application.x, you can easily see what cores each thread is using by issuing the following command in bash:

$ for i in $(pgrep application.x); do ps -mo pid,tid,fname,user,psr -p $i;done

The PSR field is the OS identifier for the core each TID (thread id) is utilizing.

2. Types of Thread Scheduling

Certain types of unevenly loaded applications can experience serious performance degradation caused by the Linux scheduler treating high-performance application codes in the same way it would treat a system daemon that might spend most of its time idle.

These sorts of scheduling issues are best described with diagrams. Let’s assume we have compute nodes with two processor sockets, and each processor has four cores:

[Figure: topology of a dual-socket, quad-core node]

When you run a multithreaded application with four threads (or even four serial applications), Linux will schedule those threads for execution by assigning each one to a CPU core. Without being explicitly told how to do this scheduling, Linux may decide to

  1. run thread0 to thread3 on core0 to core3 on socket0
  2. run thread0 and thread1 on core0 and core1 on socket0, and run thread2 and thread3 on socket1
  3. run thread0 and thread1 on core0 only, run thread2 on core1, run thread3 on core2, and leave core3 completely unutilized
  4. any number of other nonsensical allocations involving assigning multiple threads to a single core while other cores sit idle

It should be obvious that options #3 and #4 are very bad for performance, but the fact is that Linux will happily schedule your multithreaded job (or multiple single-thread jobs) this way if your threads behave in a way that is confusing to the operating system.

[Figure: compact scheduling]

2.1. Compact Scheduling

Option #1 is often referred to as “compact” scheduling and is depicted in the compact scheduling diagram. It keeps all of your threads running on a single physical processor if possible, and this is what you would want if all of the threads in your application need to repeatedly access different parts of a large array. This is because all of the cores on the same physical processor can access the memory banks associated with (or “owned by”) that processor at the same speed. However, cores cannot access memory stored on memory banks owned by a different processor as quickly; this phenomenon is called NUMA (non-uniform memory access). If your threads all need to access data stored in the memory owned by one processor, it is often best to put all of your threads on the processor that owns that memory.

2.2. Round-Robin Scheduling

[Figure: scatter or round-robin scheduling]

Option #2 is called “scatter” or “round-robin” scheduling and is ideal if your threads are largely independent of each other and don’t need to access a lot of memory that other threads need. The benefit to round-robin thread scheduling is that not all threads have to share the same memory channel and cache, effectively doubling the memory bandwidth and cache sizes available to your application. The tradeoff is that memory latency becomes higher as threads have to start accessing memory that might be owned by another processor.

2.3. Stupid Scheduling

[Figure: stupid scheduling]

Options #3 and #4 are what I call “stupid” scheduling (see the stupid scheduling diagram) and can often be the default behavior of the Linux thread scheduler if you don’t tell Linux where your threads should run. This happens because in traditional Linux server environments, most of the processes that are running at any given time aren’t doing anything. To conserve power, Linux will put a lot of these quiet processes on the same processor or cores, then move them to their own dedicated core when they wake up and have to start processing.

If your application is running at full bore 100% of the time, Linux will probably keep it on its own dedicated CPU core. However, if your application has an uneven load (e.g., threads are mostly idle while the last thread finishes), Linux will see that the application is mostly quiet and pack all the quiet threads (e.g., t0 and t1 in the stupid scheduling diagram) on to the same CPU core. This wouldn’t be so bad, but moving a thread from one core to another requires context switches, which get very expensive when done hundreds or thousands of times a minute.

3. Defining affinity

3.1. The Linux-Portable Way (taskset)

If you want to launch a job (e.g., simulation.x) on a certain set of cores (e.g., core0, core2, core4, and core6), issue

$ taskset -c 0,2,4,6 simulation.x

If your process is already running, you can define thread affinity while it is in flight with taskset -p. This also lets you bind specific TIDs to specific processors, which gives finer control than specifying -c 0,2,4,6, because with a core list alone Linux may still schedule two threads on core2 and nothing on core0. For example,

$ for i in $(pgrep application.x);do ps -mo pid,tid,fname,user,psr -p $i;done
  PID   TID COMMAND  USER     PSR
21654     - applicat glock      -
    - 21654 -        glock      0
    - 21655 -        glock      2
    - 21656 -        glock      2
    - 21657 -        glock      6
    - 21658 -        glock      4
 
$ taskset -p -c 0 21654
$ taskset -p -c 0 21655
$ taskset -p -c 2 21656
$ taskset -p -c 4 21657
$ taskset -p -c 6 21658

This sort of scheduling will happen under certain conditions, so specifying a set of cpus to a set of threads without specifically assigning each thread to a physical core may not always behave optimally.
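Spotting doubled-up threads by eye gets tedious with many TIDs. As a rough sketch, the helper below reads "TID PSR" pairs and reports every core that hosts more than one thread; the function name busy_cores is made up, and the sample input is copied from the ps listing above.

```shell
# busy_cores: read "TID PSR" pairs on stdin and print each core (PSR)
# that is running more than one thread, with its thread count.
busy_cores() {
    awk '{ n[$2]++ }
         END { for (c in n) if (n[c] > 1) print "core " c ": " n[c] " threads" }'
}

# TID and PSR columns from the listing above: TIDs 21655 and 21656 share core 2.
printf '%s\n' '21654 0' '21655 2' '21656 2' '21657 6' '21658 4' | busy_cores
# prints: core 2: 2 threads
```

On a live system the input could presumably come from something like ps -L -o tid=,psr= -p PID, but check that your ps accepts those options before relying on it.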

3.2. The Other Linux-Portable Way (numactl)

The emerging standard for easily binding processes to processors on Linux-based supercomputers is numactl. It can operate on a coarser-grained basis (i.e., CPU sockets rather than individual CPU cores) than taskset (only CPU cores) because it is aware of the processor topology and how the CPU cores map to CPU sockets. Using numactl is typically easier; after all, the common goal is to confine a process to a numa pool (or “cpu node”) rather than specific CPU cores. To that end, numactl also lets you bind a process’s memory locality to prevent processes from having to jump across NUMA pools (called “memory nodes” in numactl parlance).

Whereas if you wanted to bind a specific process to one processor socket with taskset you would have to list every core on that socket,

$ taskset -c 0,2,4,6 simulation.x

the same operation is greatly simplified with numactl:

$ numactl --cpunodebind=0 simulation.x

If you want to also restrict simulation.x’s memory use to the numa pool associated with cpu node 0, you can do

$ numactl --cpunodebind=0 --membind=0 simulation.x

or just

$ numactl -N 0 -m 0 simulation.x

You can see what cpu nodes and their corresponding memory nodes are available on your system by using numactl -H:

$ numactl -H
available: 2 nodes (0-1)
node 0 size: 32728 MB
node 0 free: 12519 MB
node 1 size: 32768 MB
node 1 free: 16180 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10
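The numactl -H listing is also easy to post-process. As a sketch, the hypothetical helper below picks the node with the most free memory, which could then be fed to --cpunodebind and --membind. It assumes the "node N free: X MB" line format shown above.

```shell
# freest_node: read `numactl -H` output on stdin and print the number of
# the node that reports the most free memory.
freest_node() {
    awk '$1 == "node" && $3 == "free:" {
             if (best == "" || $4 + 0 > max) { max = $4 + 0; best = $2 }
         }
         END { print best }'
}

# Sample lines taken from the listing above.
printf '%s\n' \
    'node 0 size: 32728 MB' 'node 0 free: 12519 MB' \
    'node 1 size: 32768 MB' 'node 1 free: 16180 MB' | freest_node
# prints: 1

# Possible use (assumes numactl is installed):
#   node=$(numactl -H | freest_node)
#   numactl --cpunodebind="$node" --membind="$node" simulation.x
```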

numactl also lets you supply specific cores (like taskset) with the --physcpubind option or its short form -C. Unlike taskset, though, numactl does not appear to let you change the CPU affinity of a process that is already running.

An alternative syntax to numactl -C is something like

$ numactl -C +0,1,2,3 simulation.x

By prefixing your list of cores with a +, you can have numactl bind to relative cores. When combined with cpusets (which are enabled by default for all jobs on Gordon), the above command will use the 0th, 1st, 2nd, and 3rd core of the job’s given cpuset instead of literally core 0,1,2,3.

3.3. Using OpenMP Runtime Extensions

OpenMP 4.0 now includes standardized controls for binding threads to cores. I haven’t caught up with these changes but I will document them here once I do.

Multithreaded programs compiled with Intel Compilers can utilize Intel’s Thread Affinity Interface for OpenMP applications. Set and export the KMP_AFFINITY env variable to express binding preferences. KMP_AFFINITY has three principal binding strategies:

  • compact fills up one socket before allocating to other sockets
  • scatter evenly spreads threads across all sockets and cores
  • explicit allows you to define exactly which cores/sockets to use

Using KMP_AFFINITY=compact will preferentially bind all your threads, one per core, to a single socket before it tries binding them to other sockets. Unfortunately, it will start at socket0 regardless of whether other processes (such as another SMP job) are already bound to that socket. You can explicitly specify an offset to force the job to bind to a specific socket, but you need to know exactly what is running on what cores and sockets on your node in order to specify this in your submit script.

You can also explicitly define which cores your job should use, given a little knowledge of your system’s CPU topology (Intel’s Processor Topology Enumeration tool is great for this). If you wanted to run on cores 0, 2, 4, and 6, you would do

export KMP_AFFINITY='proclist=[0,2,4,6],explicit'

GNU’s implementation of OpenMP has an environment variable similar to KMP_AFFINITY called GOMP_CPU_AFFINITY. Incidentally, Intel’s OpenMP supports GOMP_CPU_AFFINITY, so using this variable may be a relatively portable way to specify thread affinity at runtime. The equivalent GOMP_CPU_AFFINITY for the KMP_AFFINITY I gave above would be:

export GOMP_CPU_AFFINITY='0,2,4,6'

3.4. getfreesocket

I wrote a small perl script called getfreesocket that uses KMP_AFFINITY=explicit (or GOMP_CPU_AFFINITY) and some probing of the Linux OS at runtime to bind SMP jobs to free processor sockets. It should be invoked in a job script something like this:

#!/bin/bash

NPROCS=1
BINARY=${HOME}/bin/whatever

nprocs=$(grep '^physical id' /proc/cpuinfo  | sort -u | wc -l)
ncores=$(grep '^processor' /proc/cpuinfo | sort -u | wc -l)
coresperproc=$((ncores/nprocs))
OMP_NUM_THREADS=$((NPROCS*coresperproc))

freesock=$(getfreesocket -explicit=${NPROCS})
if [ "z$freesock" == "z" ]
then
  echo "Not enough free processors!  aborting"
  exit 1
else
  KMP_AFFINITY="granularity=fine,proclist=[$freesock],explicit"
  GOMP_CPU_AFFINITY="$(echo $freesock | sed -e 's/,/ /g')"
fi

export KMP_AFFINITY OMP_NUM_THREADS GOMP_CPU_AFFINITY

${BINARY}

This was a very simple solution to get single-socket jobs to play nicely on the shared batch system we were using at the Interfacial Molecular Science Laboratory. While numactl is an easier way to accomplish some of this, it still requires that you know what other processes are sharing your node and on what CPU cores they are running. I’ve experienced problems with Linux’s braindead thread scheduling, so getfreesocket finds completely unused sockets that can be fed into taskset, KMP_AFFINITY, or numactl.

This is not as great an issue if your resource manager supports launching jobs within cpusets. Your resource manager will provide a cpuset, and using relative specifiers for numactl cores (e.g., numactl -C +0-3) will bind to the free socket provided by the batch environment. Of course, this will not specifically bind one thread to one core, so using KMP_AFFINITY or GOMP_CPU_AFFINITY may remain necessary.

Concepts Of Linux Programming – Files and Filesystem

The file is the most basic and fundamental abstraction in Linux. Linux follows the everything-is-a-file philosophy (although not as strictly as some other systems, such as Plan 9). Consequently, much interaction occurs via reading of and writing to files, even when the object in question is not what you would consider a normal file.

In order to be accessed, a file must first be opened. Files can be opened for reading, writing, or both. An open file is referenced via a unique descriptor, a mapping from the metadata associated with the open file back to the specific file itself. Inside the Linux kernel, this descriptor is handled by an integer (of the C type int) called the file descriptor, abbreviated fd. File descriptors are shared with user space, and are used directly by user programs to access files. A large part of Linux system programming consists of opening, manipulating, closing, and otherwise using file descriptors.
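The open, use, close life cycle of a descriptor can be tried directly from bash. This is only a sketch: it assumes bash and GNU coreutils, picks descriptor number 3 arbitrarily, and uses a scratch file created by mktemp.

```shell
#!/bin/bash
# Sketch: bind a file to an explicit descriptor, read through it, then close it.
tmp=$(mktemp)                      # scratch file; the path is chosen by mktemp
printf 'first\nsecond\n' > "$tmp"

exec 3< "$tmp"                     # open the file for reading on descriptor 3
read -r line1 <&3                  # read one line through the descriptor
read -r line2 <&3                  # the next read continues where the last stopped
exec 3<&-                          # close descriptor 3

echo "line1=$line1 line2=$line2"
rm -f "$tmp"
```

The second read picks up after the first because both go through the same open file, which carries its own position.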

Regular files

What most of us call “files” are what Linux labels regular files. A regular file contains bytes of data, organized into a linear array called a byte stream. In Linux, no further organization or formatting is specified for a file. The bytes may have any values, and they may be organized within the file in any way. At the system level, Linux does not enforce a structure upon files beyond the byte stream. Some operating systems, such as VMS, provide highly structured files, supporting concepts such as records. Linux does not.

Any of the bytes within a file may be read from or written to. These operations start at a specific byte, which is one’s conceptual “location” within the file. This location is called the file position or file offset. The file position is an essential piece of the metadata that the kernel associates with each open file. When a file is first opened, the file position is zero. Usually, as bytes in the file are read from or written to, byte-by-byte, the file position increases in kind. The file position may also be set manually to a given value, even a value beyond the end of the file. Writing a byte to a file position beyond the end of the file will cause the intervening bytes to be padded with zeros. While it is possible to write bytes in this manner to a position beyond the end of the file, it is not possible to write bytes to a position before the beginning of a file. Such a practice sounds nonsensical, and, indeed, would have little use. The file position starts at zero; it cannot be negative. Writing a byte to the middle of a file overwrites the byte previously located at that offset. Thus, it is not possible to expand a file by writing into the middle of it. Most file writing occurs at the end of the file. The file position’s maximum value is bounded only by the size of the C type used to store it, which is 64 bits on a modern Linux system.
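The zero padding and overwrite behavior described above can be observed with dd, which seeks to a given offset before writing. This is a minimal sketch assuming bash and GNU coreutils; the offsets and byte values are arbitrary.

```shell
#!/bin/bash
# Sketch: write past the end of a file, then overwrite a byte in the middle.
tmp=$(mktemp)
printf 'AB' > "$tmp"                                   # file is 2 bytes long

# Seek to offset 8 and write one byte there, without truncating the file.
printf 'Z' | dd of="$tmp" bs=1 seek=8 conv=notrunc 2>/dev/null
size=$(stat -c %s "$tmp")                              # length is now 9 (offsets 0..8)

# Dump every byte as hex; the gap between 'B' and 'Z' reads back as zeros.
hex=$(od -An -tx1 "$tmp" | tr -d ' \n')
echo "size=$size hex=$hex"

# Writing into the middle replaces a byte; it does not grow the file.
printf 'x' | dd of="$tmp" bs=1 seek=1 conv=notrunc 2>/dev/null
size_after=$(stat -c %s "$tmp")
echo "size after overwrite=$size_after"
rm -f "$tmp"
```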

The size of a file is measured in bytes and is called its length. The length, in other words, is simply the number of bytes in the linear array that make up the file. A file’s length can be changed via an operation called truncation. A file can be truncated to a new size smaller than its original size, which results in bytes being removed from the end of the file. Confusingly, given the operation’s name, a file can also be “truncated” to a new size larger than its original size. In that case, the new bytes (which are added to the end of the file) are filled with zeros. A file may be empty (that is, have a length of zero), and thus contain no valid bytes. The maximum file length, as with the maximum file position, is bounded only by limits on the sizes of the C types that the Linux kernel uses to manage files. Specific filesystems, however, may impose their own restrictions, imposing a smaller ceiling on the maximum length.
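Both directions of truncation can be seen with the truncate utility from GNU coreutils. The sketch below assumes bash and a scratch file; the sizes are arbitrary.

```shell
#!/bin/bash
# Sketch: shrink a file, then "truncate" it to a larger size.
tmp=$(mktemp)
printf 'hello world' > "$tmp"          # 11 bytes

truncate -s 5 "$tmp"                   # shrink: bytes are removed from the end
short=$(cat "$tmp")                    # only the first five bytes remain

truncate -s 8 "$tmp"                   # grow: three bytes are added at the end
grown_size=$(stat -c %s "$tmp")
tail_hex=$(tail -c 3 "$tmp" | od -An -tx1 | tr -d ' \n')   # the new bytes, as hex

echo "short=$short grown_size=$grown_size tail=$tail_hex"
rm -f "$tmp"
```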

A single file can be opened more than once, by a different or even the same process. Each open instance of a file is given a unique file descriptor. Conversely, processes can share their file descriptors, allowing a single descriptor to be used by more than one process. The kernel does not impose any restrictions on concurrent file access. Multiple processes are free to read from and write to the same file at the same time. The results of such concurrent accesses rely on the ordering of the individual operations, and are generally unpredictable. User-space programs typically must coordinate amongst themselves to ensure that concurrent file accesses are properly synchronized.

Although files are usually accessed via filenames, they actually are not directly associated with such names. Instead, a file is referenced by an inode (originally short for information node), which is assigned an integer value unique to the filesystem (but not necessarily unique across the whole system). This value is called the inode number, often abbreviated as i-number or ino. An inode stores metadata associated with a file, such as its modification timestamp, owner, type, length, and the location of the file’s data—but no filename! The inode is both a physical object, located on disk in Unix-style filesystems, and a conceptual entity, represented by a data structure in the Linux kernel.
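One way to see that the name is not part of the inode is to rename a file and compare inode numbers before and after. This sketch assumes GNU stat (the -c format option) and uses made-up filenames in a scratch directory.

```shell
#!/bin/bash
# Sketch: a rename changes the name, not the inode behind it.
dir=$(mktemp -d)
echo "data" > "$dir/a"

ino_before=$(stat -c %i "$dir/a")      # inode number under the old name
mv "$dir/a" "$dir/b"                   # rename within the same filesystem
ino_after=$(stat -c %i "$dir/b")       # inode number under the new name

# Print some of the metadata the inode carries.
stat -c 'inode=%i size=%s owner=%U type=%F' "$dir/b"
echo "before=$ino_before after=$ino_after"
rm -rf "$dir"
```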

Directories and links

Accessing a file via its inode number is cumbersome (and also a potential security hole), so files are always opened from user space by a name, not an inode number. Directories are used to provide the names with which to access files. A directory acts as a mapping of human-readable names to inode numbers. A name and inode pair is called a link. The physical on-disk form of this mapping—for example, a simple table or a hash—is implemented and managed by the kernel code that supports a given filesystem. Conceptually, a directory is viewed like any normal file, with the difference that it contains only a mapping of names to inodes. The kernel directly uses this mapping to perform name-to-inode resolutions.

When a user-space application requests that a given filename be opened, the kernel opens the directory containing the filename and searches for the given name. From the filename, the kernel obtains the inode number. From the inode number, the inode is found. The inode contains metadata associated with the file, including the on-disk location of the file’s data.

Initially, there is only one directory on the disk, the root directory. This directory is usually denoted by the path /. But, as we all know, there are typically many directories on a system. How does the kernel know which directory to look in to find a given filename?

As mentioned previously, directories are much like regular files. Indeed, they even have associated inodes. Consequently, the links inside of directories can point to the inodes of other directories. This means directories can nest inside of other directories, forming a hierarchy of directories. This, in turn, allows for the use of the pathnames with which all Unix users are familiar, such as /home/blackbeard/concorde.png.

When the kernel is asked to open a pathname like this, it walks each directory entry (called a dentry inside of the kernel) in the pathname to find the inode of the next entry. In the preceding example, the kernel starts at /, gets the inode for home, goes there, gets the inode for blackbeard, runs there, and finally gets the inode for concorde.png. This operation is called directory or pathname resolution. The Linux kernel also employs a cache, called the dentry cache, to store the results of directory resolutions, providing for speedier lookups in the future given temporal locality.
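The walk can be imitated from user space by looking up one component at a time. This is only an illustration of the idea, not how the kernel implements it: the resolve function is a hypothetical helper, it assumes bash and GNU stat, and it does not follow symbolic links the way the kernel would.

```shell
#!/bin/bash
# Sketch: visit each component of a pathname and print the inode found there.
resolve() {
    local path=$1 current=/ part parts
    printf '/ inode=%s\n' "$(stat -c %i /)"
    IFS=/ read -r -a parts <<< "$path"        # split the pathname on "/"
    for part in "${parts[@]}"; do
        [ -n "$part" ] || continue            # skip the empty field before the leading "/"
        current=${current%/}/$part
        printf '%s inode=%s\n' "$current" "$(stat -c %i "$current")"
    done
}

dir=$(mktemp -d)                              # scratch tree with made-up names
mkdir -p "$dir/home/blackbeard"
touch "$dir/home/blackbeard/concorde.png"
target="$dir/home/blackbeard/concorde.png"

resolve "$target"

# The inode reached by the walk matches a direct lookup of the full pathname.
last_line=$(resolve "$target" | tail -n 1)
walked_ino=${last_line##*=}
direct_ino=$(stat -c %i "$target")
rm -rf "$dir"
```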

A pathname that starts at the root directory is said to be fully qualified, and is called an absolute pathname. Some pathnames are not fully qualified; instead, they are provided relative to some other directory (for example, todo/plunder). These paths are called relative pathnames. When provided with a relative pathname, the kernel begins the pathname resolution in the current working directory. From the current working directory, the kernel looks up the directory todo. From there, the kernel gets the inode for plunder. Together, the combination of a relative pathname and the current working directory is fully qualified.

Although directories are treated like normal files, the kernel does not allow them to be opened and manipulated like regular files. Instead, they must be manipulated using a special set of system calls. These system calls allow for the adding and removing of links, which are the only two sensible operations anyhow. If user space were allowed to manipulate directories without the kernel’s mediation, it would be too easy for a single simple error to corrupt the filesystem.

Hard links

Conceptually, nothing covered thus far would prevent multiple names resolving to the same inode. Indeed, this is allowed. When multiple links map different names to the same inode, we call them hard links.

Hard links allow for complex filesystem structures with multiple pathnames pointing to the same data. The hard links can be in the same directory, or in two or more different directories. In either case, the kernel simply resolves the pathname to the correct inode. For example, a specific inode that points to a specific chunk of data can be hard linked from /home/bluebeard/treasure.txt and /home/blackbeard/to_steal.txt.

Deleting a file involves unlinking it from the directory structure, which is done simply by removing its name and inode pair from a directory. Because Linux supports hard links, however, the filesystem cannot destroy the inode and its associated data on every unlink operation. What if another hard link existed elsewhere in the filesystem? To ensure that a file is not destroyed until all links to it are removed, each inode contains a link count that keeps track of the number of links within the filesystem that point to it. When a pathname is unlinked, the link count is decremented by one; only when it reaches zero are the inode and its associated data actually removed from the filesystem.
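The link count can be watched as names are added and removed. The following sketch assumes bash and GNU coreutils (ln, stat -c); the filenames live in a scratch directory and are illustrative only.

```shell
#!/bin/bash
# Sketch: two names for one inode, and what unlinking one of them does.
dir=$(mktemp -d)
echo "doubloons" > "$dir/treasure.txt"
ln "$dir/treasure.txt" "$dir/to_steal.txt"        # add a second name for the same inode

links_two=$(stat -c %h "$dir/treasure.txt")       # link count with both names present
same_inode=no
[ "$dir/treasure.txt" -ef "$dir/to_steal.txt" ] && same_inode=yes

rm "$dir/treasure.txt"                            # unlink one name
links_one=$(stat -c %h "$dir/to_steal.txt")       # link count after the unlink
survivor=$(cat "$dir/to_steal.txt")               # the data is still reachable

echo "links: $links_two -> $links_one, same inode: $same_inode, data: $survivor"
rm -rf "$dir"
```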

Symbolic links

Hard links cannot span filesystems because an inode number is meaningless outside of the inode’s own filesystem. To allow links that can span filesystems, and that are a bit simpler and less transparent, Unix systems also implement symbolic links (often shortened to symlinks).

Symbolic links look like regular files. A symlink has its own inode and data chunk, which contains the complete pathname of the linked-to file. This means symbolic links can point anywhere, including to files and directories that reside on different filesystems, and even to files and directories that do not exist. A symbolic link that points to a nonexistent file is called a broken link.
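A short experiment shows a symlink storing a pathname and then becoming a broken link once its target is removed. It assumes bash and GNU coreutils, with made-up names in a scratch directory.

```shell
#!/bin/bash
# Sketch: create a symlink, read through it, then break it.
dir=$(mktemp -d)
echo "map" > "$dir/real.txt"
ln -s "$dir/real.txt" "$dir/shortcut"      # the link's data is the target pathname

target=$(readlink "$dir/shortcut")         # the stored pathname
via_link=$(cat "$dir/shortcut")            # reading follows the link to the target

rm "$dir/real.txt"                         # remove the target; the link remains
broken=no
# -L: the link itself exists; ! -e: what it points to does not.
[ -L "$dir/shortcut" ] && [ ! -e "$dir/shortcut" ] && broken=yes

echo "target=$target via_link=$via_link broken=$broken"
rm -rf "$dir"
```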

Symbolic links incur more overhead than hard links because resolving a symbolic link effectively involves resolving two files: the symbolic link and then the linked-to file. Hard links do not incur this additional overhead—there is no difference between accessing a file linked into the filesystem more than once and one linked only once. The overhead of symbolic links is minimal, but it is still considered a negative.

Symbolic links are also more opaque than hard links. Using hard links is entirely transparent; in fact, it takes effort to find out that a file is linked more than once! Manipulating symbolic links, on the other hand, requires special system calls. This lack of transparency is often considered a positive, as the link structure is explicitly made plain, with symbolic links acting more as shortcuts than as filesystem-internal links.

Special files

Special files are kernel objects that are represented as files. Over the years, Unix systems have supported a handful of different special files. Linux supports four: block device files, character device files, named pipes, and Unix domain sockets. Special files are a way to let certain abstractions fit into the filesystem, continuing the everything-is-a-file paradigm. Linux provides a system call to create a special file.

Device access in Unix systems is performed via device files, which act and look like normal files residing on the filesystem. Device files may be opened, read from, and written to, allowing user space to access and manipulate devices (both physical and virtual) on the system. Unix devices are generally broken into two groups: character devices and block devices. Each type of device has its own special device file.

A character device is accessed as a linear queue of bytes. The device driver places bytes onto the queue, one by one, and user space reads the bytes in the order that they were placed on the queue. A keyboard is an example of a character device. If the user types “peg,” for example, an application would want to read from the keyboard device the p, the e, and, finally, the g, in exactly that order. When there are no more characters left to read, the device returns end-of-file (EOF). Missing a character, or reading them in any other order, would make little sense. Character devices are accessed via character device files.

A block device, in contrast, is accessed as an array of bytes. The device driver maps the bytes over a seekable device, and user space is free to access any valid bytes in the array, in any order—it might read byte 12, then byte 7, and then byte 12 again. Block devices are generally storage devices. Hard disks, floppy drives, CD-ROM drives, and flash memory are all examples of block devices. They are accessed via block device files.
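Device files can be inspected like any other entry in the filesystem. The sketch below assumes bash, GNU coreutils, and that /dev/null and /dev/zero are present, as they are on a typical Linux system; whether any block device files are visible depends on the machine or container.

```shell
#!/bin/bash
# Sketch: tell character and block device files apart.
null_type=$(stat -c %F /dev/null)          # file type as reported by stat
echo "/dev/null is a: $null_type"

# A character device is read as a stream of bytes; /dev/zero yields zero bytes.
zero_hex=$(head -c 4 /dev/zero | od -An -tx1 | tr -d ' \n')
echo "first four bytes of /dev/zero: $zero_hex"

# Count the block device files directly under /dev (possibly none).
block_count=0
for dev in /dev/*; do
    if [ -b "$dev" ]; then
        block_count=$((block_count + 1))
    fi
done
echo "block device files found: $block_count"
```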

Named pipes (often called FIFOs, short for “first in, first out”) are an interprocess communication (IPC) mechanism that provides a communication channel over a file descriptor, accessed via a special file. Regular pipes are the method used to “pipe” the output of one program into the input of another; they are created in memory via a system call and do not exist on any filesystem. Named pipes act like regular pipes but are accessed via a file, called a FIFO special file. Unrelated processes can access this file and communicate.
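A FIFO can be created with mkfifo and used by two processes that share nothing but its pathname. This is a minimal sketch assuming bash and GNU coreutils; the FIFO name is made up and lives in a scratch directory.

```shell
#!/bin/bash
# Sketch: pass a message between two processes through a FIFO special file.
dir=$(mktemp -d)
fifo="$dir/channel"
mkfifo "$fifo"                          # create the FIFO special file

is_fifo=no
[ -p "$fifo" ] && is_fifo=yes           # -p tests for a named pipe

echo "ahoy" > "$fifo" &                 # writer: blocks until a reader opens the FIFO
read -r message < "$fifo"               # reader: receives what the writer sent
wait                                    # reap the background writer

echo "is_fifo=$is_fifo message=$message"
rm -rf "$dir"
```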

Sockets are the final type of special file. Sockets are an advanced form of IPC that allow for communication between two different processes, not only on the same machine, but even on two different machines. In fact, sockets form the basis of network and Internet programming. They come in multiple varieties, including the Unix domain socket, which is the form of socket used for communication within the local machine. Whereas sockets communicating over the Internet might use a hostname and port pair for identifying the target of communication, Unix domain sockets use a special file residing on a filesystem, often simply called a socket file.

Filesystems and namespaces

Linux, like all Unix systems, provides a global and unified namespace of files and directories. Some operating systems separate different disks and drives into separate namespaces: for example, a file on a floppy disk might be accessible via the pathname A:\plank.jpg, while the hard drive is located at C:\. In Unix, that same file on a floppy might be accessible via the pathname /media/floppy/plank.jpg or even via /home/captain/stuff/plank.jpg, right alongside files from other media. That is, on Unix, the namespace is unified.

A filesystem is a collection of files and directories in a formal and valid hierarchy. Filesystems may be individually added to and removed from the global namespace of files and directories. These operations are called mounting and unmounting. Each filesystem is mounted to a specific location in the namespace, known as a mount point. The root directory of the filesystem is then accessible at this mount point. For example, a CD might be mounted at /media/cdrom, making the root of the filesystem on the CD accessible at /media/cdrom. The first filesystem mounted is located in the root of the namespace, /, and is called the root filesystem. Linux systems always have a root filesystem. Mounting other filesystems at other mount points is optional.
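Mounting itself requires privileges, but the current set of mount points can be read by any process from /proc/self/mounts. The sketch below assumes bash, awk, and a mounted /proc; it only lists what is already mounted.

```shell
#!/bin/bash
# Sketch: list mount points and find the filesystem type mounted at "/".
# Each line of /proc/self/mounts is: device, mount point, type, options, ...
while read -r device mountpoint fstype _; do
    printf '%-30s %-12s %s\n' "$mountpoint" "$fstype" "$device"
done < /proc/self/mounts

# If "/" appears more than once, the last entry is the one currently visible.
root_fs=$(awk '$2 == "/" { fs = $3 } END { print fs }' /proc/self/mounts)
echo "root filesystem type: $root_fs"
```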

Filesystems usually exist physically (i.e., are stored on disk), although Linux also supports virtual filesystems that exist only in memory, and network filesystems that exist on machines across the network. Physical filesystems reside on block storage devices, such as CDs, floppy disks, compact flash cards, or hard drives. Some such devices are partitionable, which means that they can be divided up into multiple filesystems, all of which can be manipulated individually. Linux supports a wide range of filesystems (certainly anything that the average user might hope to come across), including media-specific filesystems (for example, ISO9660), network filesystems (NFS), native filesystems (ext4), filesystems from other Unix systems (XFS), and even filesystems from non-Unix systems (FAT).

The smallest addressable unit on a block device is the sector. The sector is a physical attribute of the device. Sectors come in various powers of two, with 512 bytes being quite common. A block device cannot transfer or access a unit of data smaller than a sector and all I/O must occur in terms of one or more sectors.

Likewise, the smallest logically addressable unit on a filesystem is the block. The block is an abstraction of the filesystem, not of the physical media on which the filesystem resides. A block is usually a power-of-two multiple of the sector size. In Linux, blocks are generally larger than the sector, but they must be no larger than the page size (the smallest unit addressable by the memory management unit, a hardware component). Common block sizes are 512 bytes, 1 kilobyte, and 4 kilobytes.
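The three sizes discussed here (sector, block, and page) can be queried on a running system. This sketch assumes bash, GNU stat, getconf, and a mounted /sys; the sysfs loop prints nothing where no block devices are exposed, for example in some containers.

```shell
#!/bin/bash
# Sketch: report the page size, a filesystem block size, and sector sizes.
page=$(getconf PAGESIZE)               # memory page size in bytes
fs_block=$(stat -f -c %S /)            # block size of the filesystem holding "/"
echo "page size:             $page bytes"
echo "filesystem block size: $fs_block bytes"

# Logical sector size of each block device, where sysfs exposes it.
for f in /sys/block/*/queue/logical_block_size; do
    [ -r "$f" ] || continue            # also skips the unmatched glob pattern
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: logical sector size $(cat "$f") bytes"
done
```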

Historically, Unix systems have only a single shared namespace, viewable by all users and all processes on the system. Linux takes an innovative approach and supports per-process namespaces, allowing each process to optionally have a unique view of the system’s file and directory hierarchy. By default, each process inherits the namespace of its parent, but a process may elect to create its own namespace with its own set of mount points and a unique root directory.
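Each process's mount namespace is visible as a symbolic link under /proc, which makes the default inheritance easy to check. This is a sketch assuming bash and a mounted /proc; creating a new namespace (for example with unshare --mount from util-linux) usually needs extra privileges, so it is mentioned only in a comment.

```shell
#!/bin/bash
# Sketch: a child process inherits its parent's mount namespace by default.
parent_ns=$(readlink /proc/$$/ns/mnt)                 # this shell's mount namespace
child_ns=$(bash -c 'readlink /proc/$$/ns/mnt')        # a child shell's mount namespace

echo "parent: $parent_ns"
echo "child:  $child_ns"

# A process started under "unshare --mount" would report a different identifier here.
```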

Simple utility to allocate memory on a Linux Machine

1. What can I use this for?

  • Test swap
  • Test behaviors on a machine when there is little memory available

————————————————————————-

2. Usage

————————————————————

Installation

cd /tmp
vim memtest.c
<enter the contents in the file and save it>
vim Makefile
<enter the contents in the file and save it>
sudo make install

————————————————————————–

Makefile

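# Installation prefix; override with, for example, make PREFIX=/opt install
PREFIX ?= /usr/local

all: memtest

memtest: memtest.c
	$(CC) memtest.c -o memtest

install: memtest
	install -m 0755 memtest $(PREFIX)/bin/

clean:
	rm -f *.o memtest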

memtest.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <stdbool.h>
#include <unistd.h>

/* Percentage sizes need sysconf() to report physical and available pages. */
#if defined(_SC_PHYS_PAGES) && defined(_SC_AVPHYS_PAGES) && defined(_SC_PAGE_SIZE)
#define MEMORY_PERCENTAGE
#endif

#ifdef MEMORY_PERCENTAGE
size_t getTotalSystemMemory(void){
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    return (size_t)pages * (size_t)page_size;
}

size_t getFreeSystemMemory(void){
    long pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    return (size_t)pages * (size_t)page_size;
}
#endif

/* Allocate `total` bytes in pieces of `chunk` bytes. The buffers are never
   freed on purpose: they stay allocated until the process exits. */
bool eat(long total, int chunk){
    long i;
    for(i = 0; i < total; i += chunk){
        char *buffer = malloc(chunk);
        if(buffer == NULL){
            return false;
        }
        /* Touch every byte so the kernel actually backs the pages. */
        memset(buffer, 0, chunk);
    }
    return true;
}

int main(int argc, char *argv[]){

#ifdef MEMORY_PERCENTAGE
    printf("Currently total memory: %zu\n", getTotalSystemMemory());
    printf("Currently avail memory: %zu\n", getFreeSystemMemory());
#endif

    int i;
    for(i = 0; i < argc; i++){
        char *arg = argv[i];
        if(strcmp(arg, "-h") == 0 || strcmp(arg, "-?") == 0 || argc == 1){
            printf("Usage: memtest <size>\n");
            printf("Size can be specified in megabytes or gigabytes in the following way:\n");
            printf("#          # Bytes      example: 1024\n");
            printf("#M         # Megabytes  example: 15M\n");
            printf("#G         # Gigabytes  example: 2G\n");
#ifdef MEMORY_PERCENTAGE
            printf("#%%         # Percent    example: 50%%\n");
#endif
            printf("\n");
        }else if(i > 0){
            int len = strlen(arg);
            if(len == 0){
                printf("Invalid size format\n");
                exit(1);
            }
            char unit = arg[len - 1];   /* last character selects the unit */
            long size = -1;
            int chunk = 1024;
            if(!isdigit((unsigned char)unit)){
                if(unit == 'M' || unit == 'G'){
                    arg[len - 1] = 0;   /* strip the unit before parsing the number */
                    size = atol(arg) * (unit == 'M' ? 1024L * 1024L : 1024L * 1024L * 1024L);
                }
#ifdef MEMORY_PERCENTAGE
                else if(unit == '%'){
                    /* atol() stops at the '%' sign, so it need not be stripped. */
                    size = (atol(arg) * (long)getFreeSystemMemory()) / 100;
                }
#endif
                else{
                    printf("Invalid size format\n");
                    exit(1);
                }
            }else{
                size = atol(arg);       /* plain number: size in bytes */
            }
            printf("Eating %ld bytes in chunks of %d...\n", size, chunk);
            if(eat(size, chunk)){
                printf("Done, press any key to free the memory\n");
                getchar();
            }else{
                printf("ERROR: Could not allocate the memory\n");
            }
        }
    }

    return 0;
}

————————————————————–

Running

memtest <size>

Size is in number of bytes, megabytes or gigabytes.

—————————————————————————–

Examples

memtest 1024
memtest 10M
memtest 4G