When a new tuple arrives from stream 1, all tuples in stream 2 must be searched to identify the matching tuples, and vice versa.
You must look through all tuples in the other stream...
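To make the cost of this concrete, here is a minimal sketch of a symmetric nested-loop join over two streams. The function name `symmetric_nested_loop_join`, the `(stream_id, tuple)` input format and the choice of the first field as the join key are assumptions made for illustration only; they are not part of the notes.

```python
def symmetric_nested_loop_join(arrivals, key=lambda t: t[0]):
    """Join two interleaved streams on key(t).

    arrivals: iterable of (stream_id, tuple), with stream_id 1 or 2
              (hypothetical input format, chosen for this sketch).
    Returns a list of (tuple_from_stream_1, tuple_from_stream_2) pairs.
    """
    seen = {1: [], 2: []}      # every tuple received so far, per stream
    results = []
    for stream_id, t in arrivals:
        other = 2 if stream_id == 1 else 1
        # The new tuple is compared against ALL stored tuples of the other stream
        for u in seen[other]:
            if key(u) == key(t):
                results.append((t, u) if stream_id == 1 else (u, t))
        seen[stream_id].append(t)   # keep it for future arrivals on the other stream
    return results
```

Each arrival costs time proportional to the number of tuples stored for the other stream, and both stored lists grow without bound. That is the motivation for working with a sample instead of the full join.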
We will see later why some items need to be weighted more than others in a join sample.
   input: n = number of items in the input set
          r = number of items to select (r <= n)

   int picked[r];
   int i;

   i = 0;
   while ( i < r )
   {
      x = TRUNC( random() * n );    // x is a random number in [0..n-1]

      if ( x ∈ picked[0..i-1] )
      {
         already picked; try again;
      }
      else
      {
         picked[i++] = x;
      }
   }
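The rejection idea (draw a random index among the n items and try again if it was already picked) can be sketched in runnable form as follows. The function name `sample_indices_by_rejection` and the extra set used for fast membership tests are assumptions of this sketch, not part of the original pseudocode.

```python
import random

def sample_indices_by_rejection(n, r, rng=random):
    """Pick r distinct indices from [0..n-1] by drawing and rejecting repeats."""
    if r > n:
        raise ValueError("cannot pick more distinct items than the input holds")
    picked = []       # indices in the order they were selected
    chosen = set()    # same indices, kept in a set for O(1) membership tests
    while len(picked) < r:
        x = rng.randrange(n)    # x is a random integer in [0..n-1]
        if x in chosen:
            continue            # already picked; try again
        chosen.add(x)
        picked.append(x)
    return picked
```

Note that this approach needs to know n in advance, which is not the case when the input is a stream.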
The paper can be obtained here: click here
The paper presents more sophisticated algorithms (Algorithms X, Y and Z).
I present the simplest one, Algorithm R, which will serve our purpose...
   N = 0;          // N = number of items processed

   Sample[0..r-1] is the random sample obtained at the end of the algorithm

   /* ----------------------------------------------------
      Step 1: fill the output with the initial r items
      ---------------------------------------------------- */
   for (j = 0; j < r; j++)
   {
      Sample[j] = read_next_tuple();    // Get next tuple from input
      N++;
   }

   /* ------------------------------------------------------
      Step 2: replace items with new items probabilistically
      ------------------------------------------------------ */
   while ( not EOF )
   {
      t = read_next_tuple();      // Get next tuple from input
      N = N+1;                    // N = total number of items read so far

      M = TRUNC( N*random() );    // Generate random number in [0..(N-1)]

      // Replace item M if M < r  (P[success] = r / N)
      if ( M < r )
      {
         Sample[M] = t;
      }
   }
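A runnable sketch of the reservoir idea above, for readers who want to experiment with it. The function name `reservoir_sample` and the use of a Python iterable in place of read_next_tuple() are assumptions of this sketch. Unlike the pseudocode, it also tolerates inputs with fewer than r items.

```python
import random

def reservoir_sample(stream, r, rng=random):
    """Return a uniform random sample of r items from an iterable of unknown length."""
    sample = []
    n = 0                        # number of items read so far
    for t in stream:
        n += 1
        if n <= r:
            sample.append(t)     # Step 1: the first r items fill the sample
        else:
            m = rng.randrange(n) # Step 2: random integer in [0..n-1]
            if m < r:            # happens with probability r / n
                sample[m] = t    # the new item replaces item m
    return sample
```

Why this is uniform: item number N enters the sample with probability r/N. An item already in the sample survives the arrival of item N with probability 1 - (r/N)(1/r) = (N-1)/N, so if it was present with probability r/(N-1) before, it is present with probability r/N afterwards. For example, `reservoir_sample(range(1000), 10)` returns 10 distinct numbers, and every number in 0..999 has the same chance of 10/1000 of being among them.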
It is very intuitive...
   N = 0;          // N = number of items selected so far

   NOTE: the algorithm selects an average of f * (number of input tuples) tuples...

   while ( not EOF )
   {
      t = read_next_tuple();    // Get next tuple from input

      // Select tuple if coin toss is a success...
      // random() returns a random number in (0..1)
      if ( random() < f )
      {
         Sample[N] = t;
         N = N+1;
      }
   }
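The coin-toss selection above can be sketched in runnable form as follows. The function name `bernoulli_sample` and the use of a Python iterable as the input are assumptions of this sketch.

```python
import random

def bernoulli_sample(stream, f, rng=random):
    """Keep each item of the stream independently with probability f."""
    sample = []
    for t in stream:
        # rng.random() returns a float in [0, 1), so the test succeeds with probability f
        if rng.random() < f:
            sample.append(t)
    return sample
```

The size of the result is itself random: for an input of 10000 tuples and f = 0.1 it is about 1000 on average, but it varies from run to run. This is the main difference from the reservoir approach, which always returns exactly r items.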