Integrate( f(x) = 2.0 / sqrt(1 - x*x) , x = 0 to x = 1 ) |
Maple: > integrate(2.0 / sqrt(1 - x*x), x=0..1); 3.141592654 |
Reason:
integral of 2 / sqrt( 1 - x^2 ) = 2 × arcsin( x )
arcsin(1) = 1/2 Pi
arcsin(0) = 0
|
|
How to approximate:
|
(See the figure above !!!)
|
We can see that:
Integral ~= w * f(0.5w) + w * f(1.5w) + w + f(2.5w) + ...... |
In pseudo code:
Integral = 0;
for (i = 0; i < N; i++)
{
x = (i+0.5)*w;
Integral = Integral + w * f(x); // w * f(x) is area of rectangle
}
|
The entire program in C/C++:
double f(double a)
{
return( 2.0 / sqrt(1 - a*a) );
}
int main(int argc, char *argv[])
{
int i;
int N;
double sum;
double x, w;
N = ...; // Will determine the accuracy of the approximation
w = 1.0/N;
sum = 0.0;
for (i = 0; i < N; i = i + 1)
{
x = (i + 0.5) * w;
sum = sum + w*f(x);
}
cout << sum;
}
|
Compile with:
Run the program with:
Often, a small amount of (shared) information is updated within every execution of the loop.
The program can be speed up when non-conlficting operations are performed concurrently (in parallel), while conlficting operations to the shared information (variable) are performed sequentially
double f(double a)
{
return( 2.0 / sqrt(1 - a*a) );
}
int main(int argc, char *argv[])
{
int i;
int N;
double sum;
double x, w;
N = ...; // Will determine the accuracy of approximation
w = 1.0/N;
sum = 0.0;
for (i = 0; i < N; i = i + 1)
{
x = w*(i + 0.5); // We can make x non-shared..
sum = sum + w*f(x); // sum is SHARED !!!
}
cout << sum;
}
|
      w*(f(0.5w) + f(1.5w) + f(2.5w) + ... + f(0.5-0.5w) )
and thread 2 compute the "second half" of partial sum
      w*(f(0.5+0.5w) + f(0.5+1.5w) + f(0.5+2.5w) + ... + f(1-0.5w) )
Pictorially:
values added by values added by
thread 1 thread 2
|<--------------------->|<--------------------->|
|
      w*(f(0.5w) + f((2 + 0.5)w) + f((4+0.5)w) + ... )
and thread 2 compute the "odd stepped" partial sum
      w*(f((1+0.5)w) + f((3+0.5)w) + f((5+0.5)w) + ... ) Pictorially:
values added by thread 1
| | | | | | | | | | | | | |
V V V V V V V V V V V V V V
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | | | | | | |
values added by thread 2
|
It turns out that the "even stepped" and "odd stepped" approach of partition is more easier to program in many instances !!!
NOTE: We don't access any array, so the paging problem is not applicable !!!
The division of labor is as follows for N threads:
|
/*** Shared variables, but not updated.... ***/
int N; // # intervals
double w; // width of one interval
int num_threads; // # threads
/*** Shared variables, updated !!! ***/
double sum;
pthread_mutex_t sum_mutex; // Mutex to control access to sum
int main(int argc, char *argv[])
{
int Start[100]; // Start index values for each thread
pthread_t tid[100]; // Used for pthread_join()
int i;
N = ...; // Read N in from keyboard...
w = 1.0/N; // "Broadcast" w
num_threads = ... // Skip distance for each thread
sum = 0.0; // Initialized shared variable
/**** Make worker threads... ****/
for (i = 1; i <= N; i = i + 1)
{
Start[i] = i; // Start index for thread i
if ( pthread_create(&tid[i], NULL, PIworker, &Start[i]) )
{
cout << "Cannot create thread" << endl;
exit(1);
}
}
/**** Wait for worker threads to finish... ****/
for (i = 0; i < num_threads; i = i + 1)
pthread_join(tid[i], NULL);
cout << sum;
}
|
Worker thread:
void *PIworker(void *arg)
{
int i, myStart;
double x;
/*** Get the parameter (which is my starting index) ***/
myStart = * (int *) arg;
/*** Compute sum, skipping every "num_threads" items ***/
for (i = myStart; i < N; i = i + num_threads)
{
x = w * ((double) i + 0.5); // next x
pthread_mutex_lock(&sum_mutex);
sum = sum + w*f(x); // Add to sum
pthread_mutex_unlock(&sum_mutex);
}
return(NULL); /* Thread exits (dies) */
}
|
Are you surprised ???
Worker thread:
void *PIworker(void *arg)
{
int i, myStart;
double x;
double tmp; // local non-shared variable
/*** Get the parameter (which is my starting index) ***/
myStart = * (int *) arg;
/*** Compute sum, skipping every "num_threads" items ***/
for (i = myStart; i < N; i = i + num_threads)
{
x = w * ((double) i + 0.5); // next x
tmp = tmp + w*f(x); // No synchr. needed
}
pthread_mutex_lock(&sum_mutex);
sum = sum + tmp; // Synch only ONCE !!!
pthread_mutex_unlock(&sum_mutex);
return(NULL); /* Thread exits (dies) */
}
|
What a difference it can make where you put the synchronization points in a parallel program....