When a new tuple arrives from stream 1, all tuples in stream 2 must be searched to identify the matching tuples, and vice versa.
In short, every arrival requires a scan of all tuples in the other stream.
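The search described above can be sketched as a symmetric nested-loop join. This is a minimal illustration, not a definitive implementation: the function name `symmetric_join`, the `(stream_id, tuple)` input format, and the equality match on the first field are all assumptions made for the example, and the sketch keeps every tuple seen so far in memory.

```python
def symmetric_join(arrivals, key=lambda t: t[0]):
    """Join two streams tuple by tuple.

    arrivals: iterable of (stream_id, tuple) pairs, stream_id in {1, 2}
              (hypothetical input format chosen for this sketch).
    key:      function extracting the join attribute (assumed: first field).
    Yields (tuple_from_stream_1, tuple_from_stream_2) for every match.
    """
    seen = {1: [], 2: []}          # tuples received so far on each stream
    for stream_id, tup in arrivals:
        other = 2 if stream_id == 1 else 1
        # Scan ALL tuples of the other stream to find the matches.
        for candidate in seen[other]:
            if key(candidate) == key(tup):
                yield (tup, candidate) if stream_id == 1 else (candidate, tup)
        seen[stream_id].append(tup)


arrivals = [
    (1, ("a", 10)),
    (2, ("a", 20)),
    (2, ("b", 30)),
    (1, ("b", 40)),
    (1, ("a", 50)),
]
results = list(symmetric_join(arrivals))
```

The work per arrival grows with the number of tuples stored for the other stream, which is why the cost of the full scan matters.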
We will see later why some items need to be weighted more than others in a join sample.
input: n = number of items in the input set
       r = number of items to select (r <= n)
int picked[r];   // indices selected so far
int i;           // number of indices picked so far
i = 0;
while ( i < r )
{
   x = TRUNC( random() * n );   // x is a random integer in [0..n-1]
   if ( x ∈ picked[0..i-1] )
   {
      // already picked; try again
   }
   else
   {
      picked[i++] = x;
   }
}
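A runnable sketch of the same pick-and-retry idea is given here in Python. The function name `pick_without_replacement` and the optional `rng` parameter (used only to make runs reproducible) are assumptions for the example, not part of the original pseudocode.

```python
import random


def pick_without_replacement(n, r, rng=random):
    """Return r distinct indices drawn uniformly from 0..n-1.

    Draws a random index and retries whenever it was already picked.
    rng is any object with a randrange() method (hypothetical parameter).
    """
    if not 0 <= r <= n:
        raise ValueError("need 0 <= r <= n")
    picked = []
    while len(picked) < r:
        x = rng.randrange(n)       # random integer in [0..n-1]
        if x in picked:
            continue               # already picked; try again
        picked.append(x)
    return picked


example = pick_without_replacement(10, 4, random.Random(1))
```

Note that this approach needs n to be known in advance, and the number of retries grows as r approaches n.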
The paper presents more sophisticated algorithms (Algorithms X, Y and Z).
I present the simplest one, Algorithm R, which will serve our purpose:
N = 0; // N = number of items processed so far
// Sample[0..r-1] holds the random sample obtained at the end of the algorithm
/* ----------------------------------------------------
   Step 1: fill the output with the initial r items
   ---------------------------------------------------- */
for (j = 0; j < r; j++)
{
   Sample[j] = read_next_tuple(); // Get next tuple from input
   N++;
}
/* ------------------------------------------------------
   Step 2: replace items with new items probabilistically
   ------------------------------------------------------ */
while ( not EOF )
{
   t = read_next_tuple(); // Get next tuple from input
   N = N+1; // N = total number of items read so far
   M = TRUNC( N*random() ); // Generate random integer in [0..(N-1)]
   // Replace item M if M < r (P[success] = r/N)
   if ( M < r )
   {
      Sample[M] = t;
   }
}
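The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the input is any Python iterable standing in for `read_next_tuple()`, and the names `reservoir_sample` and `rng` are chosen for the example. Unlike the pseudocode, the sketch also tolerates an input with fewer than r items, in which case it returns all of them.

```python
import random


def reservoir_sample(stream, r, rng=random):
    """Return a uniform random sample of r items from a stream of unknown length."""
    sample = []
    n = 0                              # number of items processed so far
    for t in stream:
        n += 1
        if n <= r:
            # Step 1: fill the sample with the first r items.
            sample.append(t)
        else:
            # Step 2: item n replaces a sample slot with probability r/n.
            m = rng.randrange(n)       # random integer in [0..n-1]
            if m < r:
                sample[m] = t
    return sample


example = reservoir_sample(range(100), 5, random.Random(0))
```

The stream is read exactly once and only r items are stored, so the total number of items does not need to be known beforehand.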
It is very intuitive: toss a biased coin for each tuple and keep the tuple when the toss succeeds.
N = 0; // N = number of tuples selected so far
// NOTE: with sampling fraction f, the algorithm selects on average f * (number of input tuples) tuples
while ( not EOF )
{
   t = read_next_tuple(); // Get next tuple from input
   // Select the tuple if the coin toss succeeds...
   // random() returns a random number in (0..1)
   if ( random() < f )
   {
      Sample[N] = t;
      N = N+1;
   }
}
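A runnable Python sketch of the coin-toss procedure is shown here. The function name `bernoulli_sample` and the `rng` parameter are assumptions for the example; `f` is taken to be the desired sampling fraction between 0 and 1.

```python
import random


def bernoulli_sample(stream, f, rng=random):
    """Keep each item of the stream independently with probability f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    sample = []
    for t in stream:
        # rng.random() returns a float in [0.0, 1.0); success has probability f.
        if rng.random() < f:
            sample.append(t)
    return sample


example = bernoulli_sample(range(10000), 0.3, random.Random(7))
```

The size of the result is itself random: it averages f times the input length but varies from run to run, in contrast to a fixed-size sample of r items.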