POSIX Thread (pthread) Programming

Thread Synchronization

The synchronization problem in parallel threads

An important issue in parallel programming is the asynchronous update of the same variable by several threads

Asynchronous Update problem:

Threads that execute in parallel and update a common resource (e.g., shared variables) must be serialized: the updates must happen in series, one after another
In general:

Some access operations conflict, and conflicting access operations cannot be executed simultaneously
However, not all access operations are conflicting

Conflicting operations

There are 2 access operations on shared variables: reading and writing

The conflicts are:

Reading by one thread and writing by another thread
Writing by one thread and writing by another thread
NOTE: Reading by one thread and reading by another thread do not conflict

Example: 2 parallel threads try to update a shared (global) variable - some updates can be lost :

    Thread 1 on CPU 1     Thread 2 on CPU 2     Memory
    =================     =================     ==========
                                                N = 1234
    Read N --> 1234
    Add 1  --> 1235       Read N --> 1234
    Write N                                     N = 1235
                          Add 1  --> 1235
                          Write N               N = 1235

Observation: both threads added 1 to N, but N ends at 1235 instead of 1236. One of the two updates is lost.
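
The interleaving in the example can be replayed deterministically inside a single thread. The following is only a sketch for illustration, not a real multi-threaded program: the function name simulate_lost_update and the variables reg1 and reg2 (which stand in for the registers of CPU 1 and CPU 2) are assumptions and do not appear in the course programs.

```cpp
// Replays the interleaving of the example step by step in ONE thread.
// reg1 and reg2 are hypothetical stand-ins for the registers of CPU 1 and CPU 2.
int simulate_lost_update(int N)
{
   int reg1, reg2;

   reg1 = N;            // Thread 1: Read N
   reg1 = reg1 + 1;     // Thread 1: Add 1
   reg2 = N;            // Thread 2: Read N (memory still holds the old value)
   N = reg1;            // Thread 1: Write N
   reg2 = reg2 + 1;     // Thread 2: Add 1
   N = reg2;            // Thread 2: Write N (overwrites the value written by thread 1)

   return N;
}
```

Calling simulate_lost_update(1234) returns 1235 rather than 1236, although two additions were performed.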

Commonly used synchronization mechanisms in shared-memory programming

Different computing environments will demand the use of different synchronization mechanisms (for efficiency reasons)

These are the commonly used synchronization mechanisms in shared-memory parallel programs:

Mutex locks (mutual exclusion locks)
Read/Write Locks
Binary and counting semaphores
Barriers
Condition variables (test-and-set operation)

Synchronization methods available in PThreads
- Not all of the above synchronization mechanisms are available in the PThreads library
- Available synchronization methods in POSIX Threads include mutex locks, read/write locks and condition variables
- Binary semaphores and barriers can be implemented using condition variables
- I will only discuss mutex locks and read/write locks.
  These are the most useful synchronization techniques for parallel numerical programs that need to update shared variables
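
The remark that a barrier can be implemented using a condition variable can be illustrated with a small sketch. The type SimpleBarrier and the functions barrier_init, barrier_wait and run_barrier_demo are hypothetical names chosen for this illustration; only the pthread_mutex_* and pthread_cond_* calls are part of the PThreads library. The sketch assumes that every pthread_create call succeeds.

```cpp
#include <cassert>
#include <pthread.h>
#include <vector>

// Hypothetical barrier built from one mutex and one condition variable.
struct SimpleBarrier {
   pthread_mutex_t mutex;
   pthread_cond_t  cond;
   int      parties;      // # threads that must arrive
   int      waiting;      // # threads that have arrived in the current round
   unsigned generation;   // Round number, incremented when the barrier opens
};

void barrier_init(SimpleBarrier *b, int parties)
{
   pthread_mutex_init(&b->mutex, NULL);
   pthread_cond_init(&b->cond, NULL);
   b->parties = parties;
   b->waiting = 0;
   b->generation = 0;
}

void barrier_wait(SimpleBarrier *b)
{
   pthread_mutex_lock(&b->mutex);
   unsigned my_generation = b->generation;
   b->waiting = b->waiting + 1;
   if (b->waiting == b->parties)
   {  // Last thread to arrive: open the barrier and wake up everybody
      b->generation = b->generation + 1;
      b->waiting = 0;
      pthread_cond_broadcast(&b->cond);
   }
   else
   {  // Wait until the round changes (the loop guards against spurious wakeups)
      while (my_generation == b->generation)
         pthread_cond_wait(&b->cond, &b->mutex);
   }
   pthread_mutex_unlock(&b->mutex);
}

struct BarrierDemoArg {
   SimpleBarrier *barrier;
   int *arrived;          // SHARED counter of threads that reached the barrier
   int  seen;             // Value of *arrived observed after the barrier
};

static void *barrier_demo_worker(void *arg)
{
   BarrierDemoArg *a = (BarrierDemoArg *) arg;
   pthread_mutex_lock(&a->barrier->mutex);     // Barrier mutex also protects the counter
   *a->arrived = *a->arrived + 1;
   pthread_mutex_unlock(&a->barrier->mutex);
   barrier_wait(a->barrier);
   a->seen = *a->arrived;                      // No thread writes the counter any more
   return NULL;
}

// Returns how many of the n threads saw all n arrivals after the barrier.
int run_barrier_demo(int n)
{
   SimpleBarrier b;
   int arrived = 0, ok = 0;
   std::vector<pthread_t> tid(n);
   std::vector<BarrierDemoArg> arg(n);

   barrier_init(&b, n);
   for (int i = 0; i < n; i = i + 1)
   {
      arg[i].barrier = &b;  arg[i].arrived = &arrived;  arg[i].seen = 0;
      int rc = pthread_create(&tid[i], NULL, barrier_demo_worker, &arg[i]);
      assert(rc == 0);                         // Sketch: a failed create would hang the others
   }
   for (int i = 0; i < n; i = i + 1)
   {
      pthread_join(tid[i], NULL);
      if (arg[i].seen == n) ok = ok + 1;
   }
   pthread_cond_destroy(&b.cond);
   pthread_mutex_destroy(&b.mutex);
   return ok;
}
```

In this sketch no thread gets past barrier_wait() until all n threads have incremented the counter, so run_barrier_demo(n) is expected to return n.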
MUTEX LOCKS

Mutex Locks: Theory

A mutex lock variable has 2 values (states): locked and unlocked

A mutex lock is a synchronization object with 2 operations :

Lock
- If the mutex lock is in the unlocked state, the lock operation completes immediately
  (i.e., the function returns and the thread continues with the instruction following the lock command).
  The value (state) of the mutex lock is changed to locked
- If the mutex lock is in the locked state, the thread that executes the lock command blocks until the value (state) of the mutex lock becomes unlocked
  (i.e., the thread stops execution until the mutex lock becomes unlocked).
  When the value (state) of the mutex lock becomes unlocked, the lock command completes and changes the state of the mutex lock to locked
Unlock
- If the mutex lock is in the locked state, the state is changed to unlocked
- If the mutex lock is in the unlocked state, this operation has no effect.
NOTE:

Mutex Locks in PThreads
- Defining a mutex lock variable in PThreads:
      pthread_mutex_t N_mutex;
- Initializing a mutex variable:
      int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
  - mutex: the mutex lock that you want to initialize (pass the address !)
  - attr: the set of initial properties of the mutex lock.
- The most common mutex lock is one that is initially unlocked.
  This kind of mutex lock is created using the (default) attribute NULL
  Example:
      pthread_mutex_init(&N_mutex, NULL);
- Locking a mutex:
      int pthread_mutex_lock(pthread_mutex_t *mutex);
  - NOTE: the error code returned by pthread_mutex_lock() is usually ignored
- Unlocking a mutex: (remember: only the thread that has locked the mutex can unlock it !)
      int pthread_mutex_unlock(pthread_mutex_t *mutex);
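
The life cycle of a mutex (define, initialize, lock, unlock, destroy) can be sketched in one small function. The names counter, counter_mutex, add_1000 and run_mutex_demo are assumptions made for this illustration. The sketch also checks the return codes with assert() and calls pthread_mutex_destroy(), which is a PThreads call that the notes above do not discuss.

```cpp
#include <cassert>
#include <pthread.h>

static int counter;                       // SHARED variable (hypothetical)
static pthread_mutex_t counter_mutex;     // Mutex controlling access to counter

static void *add_1000(void *arg)
{
   for (int i = 0; i < 1000; i = i + 1)
   {
      pthread_mutex_lock(&counter_mutex);
      counter = counter + 1;              // Only one thread at a time gets here
      pthread_mutex_unlock(&counter_mutex);
   }
   return NULL;
}

// Runs num_threads workers (1..16) and returns the final value of counter.
int run_mutex_demo(int num_threads)
{
   pthread_t tid[16];
   int rc;

   assert(num_threads >= 1 && num_threads <= 16);

   counter = 0;
   rc = pthread_mutex_init(&counter_mutex, NULL);   // NULL = default attributes
   assert(rc == 0);

   for (int i = 0; i < num_threads; i = i + 1)
   {
      rc = pthread_create(&tid[i], NULL, add_1000, NULL);
      assert(rc == 0);
   }

   for (int i = 0; i < num_threads; i = i + 1)
      pthread_join(tid[i], NULL);

   rc = pthread_mutex_destroy(&counter_mutex);      // Release the mutex when done
   assert(rc == 0);

   return counter;
}
```

Because every update is enclosed in a lock/unlock pair, run_mutex_demo(4) returns 4000. A mutex with default attributes can also be initialized statically with PTHREAD_MUTEX_INITIALIZER instead of calling pthread_mutex_init().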

Using mutex lock to synchronize access of shared variables

Consider the asynchronous update example:

    #include <iostream>
    #include <cstdlib>
    #include <pthread.h>
    using namespace std;

    int N;                          // SHARED variable
    pthread_t tid[100];

    // Each thread executes the following function:
    void *worker(void *arg)
    {
       int i;

       for (i = 0; i < 10000; i = i + 1)
       {
          N = N + 1;                // Unsynchronized update: increments can be lost
       }

       cout << "Added 10000 to N" << endl;
       return(NULL);     /* Thread exits (dies) */
    }

    /* =======================
       MAIN
       ======================= */
    int main(int argc, char *argv[])
    {
       int i, num_threads;

       num_threads = atoi(argv[1]);    // At most 100 (the size of tid[])
       N = 0;                          // Initialize N before any thread runs

       /* ------
          Create threads
          ------ */
       for (i = 0; i < num_threads; i = i + 1)
       {
          if ( pthread_create(&tid[i], NULL, worker, NULL) )
          {
             cout << "Cannot create thread" << endl;
             exit(1);
          }
       }

       // Wait for all threads to terminate
       for (i = 0; i < num_threads; i = i + 1)
          pthread_join(tid[i], NULL);

       cout << "N = " << N << endl << endl;
       exit(0);
    }

Example Program: (Caveat of multi-threaded programs)
- Compile with: CC -mt thread02.C

We solve the conflicting access problem by synchronizing the update operations to the shared variable N

Whenever a thread wants to update a shared variable, it must enclose the update operation between a "lock - unlock" pair.

Example:

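    int N;                       // SHARED variable
    pthread_mutex_t N_mutex;     // Mutex controlling access to N
                                 // (initialize it in main() with pthread_mutex_init(&N_mutex, NULL)
                                 //  before creating the threads)

    void *worker(void *arg)
    {
       int i;

       for (i = 0; i < 10000; i = i + 1)
       {
          pthread_mutex_lock(&N_mutex);
          N = N + 1;
          pthread_mutex_unlock(&N_mutex);
       }

       return(NULL);     /* Thread exits (dies) */
    }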

Effect:

Many threads are executing simultaneously
The statement pthread_mutex_lock(&N_mutex);
ensures that exactly one thread at a time is successful in locking the mutex variable N_mutex.
This thread is then the only thread that updates the variable N until it unlocks the mutex, thus ensuring that N is updated sequentially (one thread after another)

Example Program: (Demo above code)
Compare the behavior of this program with the one that does not use a mutex to control access to N.

NOTE:

Make sure you unlock a mutex after you are done accessing the shared variable(s).

A common error in parallel programs is forgetting the unlock call (especially if the call is made after many statements)...
The result is deadlock: every thread that subsequently tries to lock the mutex blocks forever.

Example Program: (Not unlocking a mutex)
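
Since the programs in these notes are C++ programs, one way to make a forgotten unlock less likely is to let a destructor perform the unlock. The following is a sketch of that idea and not part of the PThreads library: the class MutexGuard, the variables total and total_mutex, and the functions add_if_positive, read_total and mutex_is_free are all hypothetical names.

```cpp
#include <pthread.h>

// Hypothetical helper: locks in the constructor, unlocks in the destructor.
class MutexGuard
{
public:
   explicit MutexGuard(pthread_mutex_t *m) : mutex_(m) { pthread_mutex_lock(mutex_); }
   ~MutexGuard() { pthread_mutex_unlock(mutex_); }

private:
   MutexGuard(const MutexGuard &);              // Copying is not allowed
   MutexGuard &operator=(const MutexGuard &);
   pthread_mutex_t *mutex_;
};

static pthread_mutex_t total_mutex = PTHREAD_MUTEX_INITIALIZER;
static int total = 0;                           // SHARED variable (hypothetical)

// Adds amount to total; non-positive amounts are rejected with an early return.
bool add_if_positive(int amount)
{
   MutexGuard guard(&total_mutex);              // Locks total_mutex

   if (amount <= 0)
      return false;                             // The destructor still unlocks the mutex

   total = total + amount;
   return true;                                 // The destructor unlocks here as well
}

int read_total()
{
   MutexGuard guard(&total_mutex);
   return total;
}

// Returns true if total_mutex can be locked right now (and releases it again).
bool mutex_is_free()
{
   if (pthread_mutex_trylock(&total_mutex) != 0)
      return false;
   pthread_mutex_unlock(&total_mutex);
   return true;
}
```

The early return in add_if_positive() is exactly the kind of exit path where a hand-written unlock call is easy to forget; with the guard, mutex_is_free() still reports true afterwards. C++11 and later provide std::lock_guard, which follows the same pattern for std::mutex.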

A lesson in parallel design: parallel numerical integration (estimate Pi)

One of the ways that we can estimate the value of Pi is to compute the definite integral :

Integrate( f(x) = 2.0 / sqrt(1 - x*x) , x = 0 to x = 1 )

Maple:
    > integrate(2.0 / sqrt(1 - x*x), x=0..1);
                                      3.141592654

We can use the rectangle rule to approximate the integral
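
Written out, the rectangle rule used in the program below divides the interval [0, 1] into N pieces of width w and evaluates f in the middle of each piece. The symbol x_i is introduced here only for illustration:

```latex
\int_0^1 f(x)\,dx \;\approx\; \sum_{i=1}^{N} w\, f(x_i),
\qquad w = \frac{1}{N},
\qquad x_i = w\left(i - \tfrac{1}{2}\right),
\qquad f(x) = \frac{2}{\sqrt{1 - x^2}}
```

Note that f is unbounded at x = 1. Because the midpoints x_i are always strictly less than 1, the sum never evaluates f at that point.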

Example:

    #include <iostream>
    #include <cmath>
    using namespace std;

    double f(double a)
    {
       return( 2.0 / sqrt(1 - a*a) );
    }

    int main(int argc, char *argv[])
    {
       int i;
       int N;
       double sum;
       double x, w;

       N = ...;                   // Will determine the accuracy
                                  // of the approximation
       w = 1.0/N;                 // Width of interval
       sum = 0.0;

       for (i = 1; i <= N; i = i + 1)
       {
          x = w*(i - 0.5);        // Get the middle of the interval
          sum = sum + w*f(x);     // Sum the area...
       }

       cout << sum;
    }

Example Program: (Demo above code)
Compile with:
Run the program with:

Parallel numerical integration - part 1

To obtain a parallel program, we must find the places in the program where computation steps can be performed concurrently
The best place to look for opportunities for parallelism is in for-loops

Example:

    double f(double a)
    {
       return( 2.0 / sqrt(1 - a*a) );
    }

    int main(int argc, char *argv[])
    {
       int i;
       int N;
       double sum;
       double x, w;

       N = ...;                   // Will determine the accuracy of approximation
       w = 1.0/N;
       sum = 0.0;

       for (i = 0; i < N; i = i + 1)
       {
          x = w*(i + 0.5);        // We can make x non-shared..
          sum = sum + w*f(x);     // sum is SHARED !!!
       }

       cout << sum;
    }

Observation:

We can perform the summation sum = sum + w*f(x)
in parallel
But: we must use a mutex variable to synchronize the updates to the variable sum

Division of labor

Block-wise division (2 threads)

Thread 0 computes the "first half" of partial sum
w*f(0w + 0.5w) + w*f(1w +0.5w) + w*f(2w +0.5w) + ...
Thread 1 computes the "second half" of partial sum
w*f(0.5+0w + 0.5w) + w*f(0.5+1w + 0.5w) + w*f(0.5+2w + 0.5w) + ...

Schematically:

         values added by         values added by
            thread 0                thread 1
    |<--------------------->|<--------------------->|

Interleaved division (easier to code !)

Thread 0 computes the even-numbered terms of the sum
w*f(0w + 0.5w) + w*f(2w + 0.5w) + w*f(4w + 0.5w) + w*f(6w + 0.5w) + w*f(8w + 0.5w) + ...
Thread 1 computes the odd-numbered terms of the sum
w*f(1w + 0.5w) + w*f(3w + 0.5w) + w*f(5w + 0.5w) + w*f(7w + 0.5w) + w*f(9w + 0.5w) + ...

Schematically:

    values added by thread 0
    |   |   |   |   |   |   |   |   |   |   |   |   |   |
    V   V   V   V   V   V   V   V   V   V   V   V   V   V
    |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
      ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^
      |   |   |   |   |   |   |   |   |   |   |   |   |   |
      values added by thread 1

Choice of labor division:
We will choose the interleaved labor division for the ease of programming
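
The two ways to divide the labor can be compared by listing the interval indices that each thread handles. The functions blockwise_indices and interleaved_indices are hypothetical helpers written for this illustration; t is the thread number (0, 1, ...). The sketch also shows why the interleaved division is easier to code: it needs no computation of block boundaries and no special case when N is not a multiple of the number of threads.

```cpp
#include <vector>

// Block-wise division: thread t gets one contiguous block of indices.
std::vector<int> blockwise_indices(int N, int num_threads, int t)
{
   int chunk = (N + num_threads - 1) / num_threads;   // Block size, rounded up
   int begin = t * chunk;
   int end   = begin + chunk;
   if (end > N)
      end = N;                                        // The last block may be shorter

   std::vector<int> idx;
   for (int i = begin; i < end; i = i + 1)
      idx.push_back(i);
   return idx;
}

// Interleaved division: thread t starts at index t and skips num_threads items.
std::vector<int> interleaved_indices(int N, int num_threads, int t)
{
   std::vector<int> idx;
   for (int i = t; i < N; i = i + num_threads)
      idx.push_back(i);
   return idx;
}
```

For N = 10 and 2 threads, blockwise_indices(10, 2, 1) yields 5, 6, 7, 8, 9 and interleaved_indices(10, 2, 1) yields 1, 3, 5, 7, 9.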

First parallelization attempt:

    /*** Shared variables, but not updated.... ***/
    int    N;                    // # intervals
    double w;                    // width of one interval
    int    num_threads;          // # threads

    /*** Shared variables, updated !!! ***/
    double sum;
    pthread_mutex_t sum_mutex;   // Mutex to control access to sum

    void *PIworker(void *arg);   // Worker thread function (given below)

    int main(int argc, char *argv[])
    {
       int       Start[100];     // Start index values for each thread
       pthread_t tid[100];       // Used for pthread_join()
       int i;

       N = ...;                  // Read N in from keyboard...
       w = 1.0/N;                // "Broadcast" w
       num_threads = ...;        // Skip distance for each thread (at most 100)

       sum = 0.0;                // Initialized shared variable
       pthread_mutex_init(&sum_mutex, NULL);   // Init mutex

       /**** Make worker threads... ****/
       for (i = 0; i < num_threads; i = i + 1)
       {
          Start[i] = i;          // Start index for thread i
          if ( pthread_create(&tid[i], NULL, PIworker, &Start[i]) )
          {
             cout << "Cannot create thread" << endl;
             exit(1);
          }
       }

       /**** Wait for worker threads to finish... ****/
       for (i = 0; i < num_threads; i = i + 1)
          pthread_join(tid[i], NULL);

       cout << sum;
    }

Worker thread:
    void *PIworker(void *arg)
    {
       int i, myStart;
       double x;

       /*** Get the parameter (which is my starting index) ***/
       myStart = * (int *) arg;

       /*** Compute sum, skipping every "num_threads" items ***/
       for (i = myStart; i < N; i = i + num_threads)
       {
          x = w * ((double) i + 0.5);      // next x

          pthread_mutex_lock(&sum_mutex);
          sum = sum + w*f(x);              // Add to sum
          pthread_mutex_unlock(&sum_mutex);
       }

       return(NULL);     /* Thread exits (dies) */
    }

Example Program: (Parallel Pi - version 1)
- Compile the program using:
- Try running the program on compute.mathcs.emory.edu using:
- Then compare the performance numbers with the non-parallel version:
  The parallel version is slower than the sequential version !!!

Synchronization bottleneck

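When different threads wait for each other (to update a common resource), they can create a synchronization bottleneck: in version 1, every iteration of every thread locks and unlocks sum_mutex, so the updates to sum are executed one at a time.
A key design goal in parallel programs is to minimize synchronization among the threads
A technique to reduce synchronization is to accumulate partial results in a local (non-shared) variable and to update the shared variable only once, at the end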

Example:

Worker thread:
    void *PIworker(void *arg)
    {
       int i, myStart;
       double x;
       double tmp_sum;

       /*** Get the parameter (which is my starting index) ***/
       myStart = * (int *) arg;

       tmp_sum = 0.0;                      // Local (non-shared) partial sum

       /*** Compute sum, skipping every "num_threads" items ***/
       for (i = myStart; i < N; i = i + num_threads)
       {
          x = w * ((double) i + 0.5);      // next x
          tmp_sum = tmp_sum + w*f(x);      // No mutex lock needed !
       }

       pthread_mutex_lock(&sum_mutex);
       sum = sum + tmp_sum;                // Synch only ONCE !!!
       pthread_mutex_unlock(&sum_mutex);

       return(NULL);     /* Thread exits (dies) */
    }

Example Program: (Demo above code)
- Compile the program using:
- Try running the program on compute.mathcs.emory.edu using:
- NOW compare the performance numbers with the non-parallel version:
  - Try: time compute_pi 50000000
  - And: time thread_compute_pi_mt2 50000000 8
  What a difference it can make where you put the synchronization points in a parallel program....
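
The same idea can be taken one step further. The sketch below is a variation on the worker above, not the program from these notes: each thread writes its partial sum into its own result slot, and the main thread adds the slots after pthread_join(), so no mutex is needed at all. The struct PiTask and the functions pi_slot_worker and parallel_pi are hypothetical names.

```cpp
#include <cmath>
#include <pthread.h>
#include <vector>

// Hypothetical per-thread record: input parameters plus a private result slot.
struct PiTask {
   int    start;      // First index handled by this thread
   int    stride;     // Skip distance (= number of threads)
   int    N;          // # intervals
   double w;          // Width of one interval
   double partial;    // Partial sum, written ONLY by the owning thread
};

static double f(double a)
{
   return 2.0 / std::sqrt(1 - a*a);
}

static void *pi_slot_worker(void *arg)
{
   PiTask *t = (PiTask *) arg;
   double tmp_sum = 0.0;

   for (int i = t->start; i < t->N; i = i + t->stride)
      tmp_sum = tmp_sum + t->w * f(t->w * (i + 0.5));

   t->partial = tmp_sum;      // One write per thread, into its own slot
   return NULL;
}

double parallel_pi(int N, int num_threads)
{
   std::vector<pthread_t> tid(num_threads);
   std::vector<PiTask>    task(num_threads);
   std::vector<bool>      started(num_threads);

   for (int i = 0; i < num_threads; i = i + 1)
   {
      task[i].start   = i;
      task[i].stride  = num_threads;
      task[i].N       = N;
      task[i].w       = 1.0 / N;
      task[i].partial = 0.0;

      started[i] = (pthread_create(&tid[i], NULL, pi_slot_worker, &task[i]) == 0);
      if (!started[i])
         pi_slot_worker(&task[i]);      // Cannot create thread: compute this part here
   }

   double sum = 0.0;
   for (int i = 0; i < num_threads; i = i + 1)
   {
      if (started[i])
         pthread_join(tid[i], NULL);    // After the join, task[i].partial is final
      sum = sum + task[i].partial;
   }
   return sum;
}
```

A call such as parallel_pi(1000000, 4) gives an approximation of Pi. The design choice here is that pthread_join() already provides the synchronization that is needed: the main thread reads a slot only after the thread that owns the slot has terminated.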

READ/WRITE LOCKS

Read/Write Locks: Theory

A read/write lock variable has 3 values (states): unlocked, read locked and write locked

A read/write lock is a synchronization object with 3 operations :

Read Lock

If the read/write lock is in the unlocked state:
The read lock command completes and the value (state) of the read/write lock is changed to read locked
If the read/write lock is in the read locked state:
The read lock command completes. The value (state) of the read/write lock remains read locked, but a read lock count is increased (so we know how many read lock operations have been performed)
If the read/write lock is in the write locked state:
The thread that executes the read lock command blocks until the state of the read/write lock becomes unlocked
(When the state of the read/write lock does become unlocked, the read lock command will complete and change the state of the read/write lock to read locked)

Write Lock

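If the read/write lock is in the unlocked state:
The write lock command completes and the value (state) of the read/write lock is changed to write locked
If the read/write lock is in the read locked state:
The thread that executes the write lock command blocks until the state of the read/write lock becomes unlocked
(When the state of the read/write lock does become unlocked, the write lock command will complete and change the state of the read/write lock to write locked)
If the read/write lock is in the write locked state:
The thread that executes the write lock command blocks until the state of the read/write lock becomes unlocked
(When the state of the read/write lock does become unlocked, the write lock command will complete and change the state of the read/write lock to write locked)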

Unlock

Release the read lock or the write lock
Only the owner of the lock can release the lock
(the function will return an error code when a non-owner thread invokes unlock )

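Difference between mutex locks and read/write locks: a mutex lock admits only one thread at a time, whereas a read/write lock admits several threads at the same time as long as all of them hold a read lock (a write lock is exclusive, like a mutex lock)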

Read/Write Locks in PThreads
- Defining a read/write lock variable in PThreads:
      pthread_rwlock_t rwlock;
- Initializing a read/write lock variable:
      int pthread_rwlock_init(pthread_rwlock_t *rwlock, const pthread_rwlockattr_t *attr);
  - rwlock: the read/write lock that you want to initialize (pass the address !)
  - attr: the set of initial properties of the read/write lock.
  The most common read/write lock is one that is initially unlocked.
  This kind of read/write lock is created using the (default) attribute NULL:
      pthread_rwlock_init(&rwlock, NULL);
- Read lock a "read/write lock":
      int pthread_rwlock_rdlock(pthread_rwlock_t *rwlock);
  - NOTE: if a thread already holds a write lock on a read/write lock, and performs a pthread_rwlock_rdlock() on that lock, then the outcome is undefined (in other words: do NOT try !)
- Write lock a "read/write lock":
      int pthread_rwlock_wrlock(pthread_rwlock_t *rwlock);
  - NOTE: if a thread already holds a read lock or write lock on a read/write lock, and performs a pthread_rwlock_wrlock() on that lock, then the outcome is undefined (in other words: do NOT try !)
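
The read/write lock calls can be put together in a small sketch. The names shared_value, value_lock, writer, reader and run_rwlock_demo are assumptions made for this illustration. Writers take the write lock for each update, readers take the read lock to copy the current value. The sketch also uses pthread_rwlock_unlock() and pthread_rwlock_destroy(), which are PThreads calls.

```cpp
#include <cassert>
#include <pthread.h>

static int shared_value;                 // SHARED variable (hypothetical)
static pthread_rwlock_t value_lock;      // Read/write lock controlling access

static void *writer(void *arg)
{
   for (int i = 0; i < 1000; i = i + 1)
   {
      pthread_rwlock_wrlock(&value_lock);     // Exclusive access
      shared_value = shared_value + 1;
      pthread_rwlock_unlock(&value_lock);
   }
   return NULL;
}

static void *reader(void *arg)
{
   int *snapshot = (int *) arg;

   pthread_rwlock_rdlock(&value_lock);        // Can be held by many readers at once
   *snapshot = shared_value;
   pthread_rwlock_unlock(&value_lock);
   return NULL;
}

// Runs the writers and readers (at most 16 threads in total)
// and returns the final value of shared_value.
int run_rwlock_demo(int num_writers, int num_readers)
{
   pthread_t tid[16];
   int snapshot[16];
   int n = num_writers + num_readers;
   int rc;

   assert(n >= 1 && n <= 16);

   shared_value = 0;
   rc = pthread_rwlock_init(&value_lock, NULL);    // NULL = default attributes
   assert(rc == 0);

   for (int i = 0; i < n; i = i + 1)
   {
      if (i < num_writers)
         rc = pthread_create(&tid[i], NULL, writer, NULL);
      else
         rc = pthread_create(&tid[i], NULL, reader, &snapshot[i]);
      assert(rc == 0);
   }

   for (int i = 0; i < n; i = i + 1)
      pthread_join(tid[i], NULL);

   pthread_rwlock_destroy(&value_lock);
   return shared_value;
}
```

With 2 writers, run_rwlock_demo(2, 3) returns 2000. The value that each reader copies depends on the timing, but it is always a value that was completely written, because a reader cannot hold the read lock while a writer holds the write lock.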