When a new tuple arrives from stream 1, all tuples in stream 2 must be searched to identify the matching tuples, and vice versa.
In other words, every arriving tuple forces a scan of all tuples stored so far for the other stream.
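A minimal Python sketch of this symmetric search, assuming each tuple is a `(key, payload)` pair and that two tuples match when their keys are equal. The class name `SymmetricJoin` and its `insert` method are hypothetical and introduced only for illustration:

```python
class SymmetricJoin:
    """Naive symmetric join: every arriving tuple scans the whole other stream."""

    def __init__(self):
        # Tuples seen so far on stream 1 (index 0) and stream 2 (index 1).
        self.seen = ([], [])

    def insert(self, stream, tup):
        """Store `tup` arriving on `stream` (1 or 2) and return the new join results."""
        own = self.seen[stream - 1]
        other = self.seen[2 - stream]
        # Linear scan over ALL tuples stored for the other stream.
        if stream == 1:
            matches = [(tup, o) for o in other if o[0] == tup[0]]
        else:
            matches = [(o, tup) for o in other if o[0] == tup[0]]
        own.append(tup)
        return matches


join = SymmetricJoin()
join.insert(1, (7, "a"))           # nothing on stream 2 yet, so no result
print(join.insert(2, (7, "x")))    # matches the stored stream-1 tuple with key 7
```

The cost of each `insert` grows with the number of tuples stored for the other stream, which is why the full scan becomes expensive on long streams.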
We will see later why some items need to be weighted more than others in a join sample.
    input:  n = number of items in the input set
            r = number of items to select

    int picked[r];
    int i;

    i = 0;
    while ( i < r )
    {
        x = TRUNC( n*random() );     // x is a random number in [0..n-1]

        if ( x ∈ picked[0..i-1] )
        {
            already picked; try again;
        }
        else
        {
            picked[i++] = x;
        }
    }
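The pseudocode above can be tried out with the following Python sketch. It assumes the goal is to pick `r` distinct positions out of `0..n-1`; the function name `pick_distinct` and the auxiliary set `seen` are illustrative choices, not part of the original algorithm:

```python
import random


def pick_distinct(n, r, rng=random):
    """Pick r distinct indices from 0..n-1 by retrying on duplicates."""
    if not 0 <= r <= n:
        raise ValueError("need 0 <= r <= n")
    picked = []
    seen = set()               # same contents as `picked`, for fast lookup
    while len(picked) < r:
        x = rng.randrange(n)   # uniform integer in [0..n-1]
        if x in seen:
            continue           # already picked; try again
        seen.add(x)
        picked.append(x)
    return picked


print(pick_distinct(10, 4))
```

Note that this approach needs `n` before it can start, so the size of the input set must be known in advance.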
The paper can be obtained here: click here
The paper presents more sophisticated algorithms (Algorithm X, Y and Z).
I present only the simplest one, Algorithm R, which is sufficient for our purpose.
    N = 0;                // N = number of items processed so far

    // Sample[0..r-1] is the random sample obtained at the end of the algorithm

    /* ----------------------------------------------------
       Step 1: fill the output with the first r items
       ---------------------------------------------------- */
    for (j = 0; j < r; j++)
    {
        Sample[j] = read_next_tuple();   // Get next tuple from input
        N++;
    }

    /* ------------------------------------------------------
       Step 2: replace items with new items probabilistically
       ------------------------------------------------------ */
    while ( not EOF )
    {
        t = read_next_tuple();           // Get next tuple from input
        N = N+1;                         // N = total number of items read so far

        M = TRUNC( N*random() );         // Generate random number in [0..(N-1)]

        // Replace item M if M < r  (P[success] = r/N)
        if ( M < r )
        {
            Sample[M] = t;
        }
    }
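For experimentation, here is a small Python sketch of the same reservoir idea. It assumes the input is any Python iterable whose length is not known up front; the function name `reservoir_sample` is hypothetical, and `randrange` stands in for `TRUNC( N*random() )`:

```python
import random


def reservoir_sample(items, r, rng=random):
    """Return a uniform random sample of r items from an iterable of unknown length."""
    sample = []
    n = 0                          # number of items processed so far
    for t in items:
        n += 1
        if n <= r:
            sample.append(t)       # Step 1: keep the first r items
        else:
            m = rng.randrange(n)   # uniform integer in [0..n-1]
            if m < r:              # happens with probability r/n
                sample[m] = t      # Step 2: overwrite slot m
    return sample


print(reservoir_sample(iter(range(1000)), 10))
```

The input is read exactly once and only `r` items are held in memory at any time. If the input has fewer than `r` items, this sketch simply returns all of them.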
It is very intuitive...
    N = 0;                // N = number of tuples selected so far

    // NOTE: on average, the algorithm selects a fraction f of the input tuples

    while ( not EOF )
    {
        t = read_next_tuple();    // Get next tuple from input

        // Select the tuple if the coin toss is a success...
        // random() returns a random number in (0..1)
        if ( random() < f )
        {
            Sample[N] = t;
            N = N+1;
        }
    }
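The coin-toss selection can be sketched in Python as follows, assuming `f` is a probability between 0 and 1 and the input is any iterable; the function name `bernoulli_sample` is an illustrative choice:

```python
import random


def bernoulli_sample(items, f, rng=random):
    """Keep each item independently with probability f (0 <= f <= 1)."""
    sample = []
    for t in items:
        if rng.random() < f:   # rng.random() is uniform in [0.0, 1.0)
            sample.append(t)
    return sample


print(len(bernoulli_sample(range(10000), 0.1)))   # close to 1000, but varies per run
```

Unlike the reservoir approach, the sample size here is not fixed: it fluctuates around `f` times the number of input tuples.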