When a new tuple arrives from stream 1, all tuples in stream 2 must be searched to identify the matching tuples, and vice versa: every arrival requires a look through all tuples of the other stream.
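To make the cost concrete, here is a minimal sketch of such a join over two streams. The function name `symmetric_stream_join`, the `(stream_id, tuple)` event format and the choice of the first field as the joining attribute are assumptions made for illustration only.

```python
def symmetric_stream_join(events, key=lambda t: t[0]):
    """Join two interleaved streams on key(t).

    events yields (stream_id, tuple) pairs with stream_id 1 or 2
    (a hypothetical input format chosen for this sketch).
    """
    seen = {1: [], 2: []}          # all tuples received so far, per stream
    for stream_id, tup in events:
        other = 2 if stream_id == 1 else 1
        # Every arrival scans ALL stored tuples of the other stream.
        for candidate in seen[other]:
            if key(candidate) == key(tup):
                # Always emit (tuple from stream 1, tuple from stream 2).
                yield (tup, candidate) if stream_id == 1 else (candidate, tup)
        seen[stream_id].append(tup)
```

Both lists in `seen` grow without bound, which is why sampling becomes attractive on long streams.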
The paper can be obtained here: click here
The paper presents more sophisticated algorithms (Algorithms X, Y and Z).
I present the simplest one, Algorithm R, which will serve our purpose...
N = 0; // N = Number of items processed
// Sample[0..r-1] is the random sample obtained at the end of the algorithm
// random() returns a uniform random number in [0, 1)
/* ----------------------------------------------------
   Step 1: fill the output with the initial r items
   ---------------------------------------------------- */
for (j = 0; j < r; j++)
{
   Sample[j] = read_next_tuple(); // Get next tuple from input
   N++;
}
/* ------------------------------------------------------
   Step 2: replace items with new items probabilistically
   ------------------------------------------------------ */
while ( not EOF )
{
   t = read_next_tuple(); // Get next tuple from input
   N = N+1; // N = total number of items
   M = TRUNC( N*random() ); // Generate random num [0..(N-1)]
   // Replace item M if M < r (P[success] = r/N)
   if ( M < r )
   {
      Sample[M] = t;
   }
}
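The pseudocode above can be sketched in Python as follows. This is only an illustration: the function name `reservoir_sample` and the `rng` parameter (added so that runs can be seeded) are my own choices, not something taken from the paper.

```python
import random

def reservoir_sample(stream, r, rng=random):
    """Uniform sample of r items (without replacement) from a stream."""
    sample = []
    n = 0                          # number of items processed so far
    for t in stream:
        n += 1
        if n <= r:
            # Step 1: the first r items fill the sample.
            sample.append(t)
        else:
            # Step 2: item n replaces a random slot with probability r/n.
            m = rng.randrange(n)   # uniform integer in [0, n-1]
            if m < r:
                sample[m] = t
    return sample
```

Usage: `reservoir_sample(range(1000), 10)` returns 10 distinct numbers. If the stream has fewer than r items, this sketch simply returns all of them.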
|
|
|
The algorithm can be found in the same paper: click here
|
N = 0; // N = Number of items processed
// Sample[0..r-1] is the random sample (with replacement)
while ( not EOF )
{
   t = read_next_tuple(); // Get next tuple from input
   N = N+1; // N = current number of tuples
   for (j = 0; j < r; j++)
   {
      // Replace item Sample[j] with probability 1/N
      // (1.0/N: floating point division, not integer division)
      if ( random() < 1.0/N )
      {
         Sample[j] = t;
      }
   }
}
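A Python sketch of the loop above, under the assumption that `random()` returns a value in [0, 1). The name `sample_with_replacement` and the `rng` parameter are hypothetical.

```python
import random

def sample_with_replacement(stream, r, rng=random):
    """Each of the r slots is an independent uniform sample of size 1."""
    sample = [None] * r
    n = 0                              # number of tuples read so far
    for t in stream:
        n += 1
        for j in range(r):
            # Slot j is overwritten with probability 1/n.
            # For n = 1 this always succeeds, so every slot gets filled.
            if rng.random() < 1.0 / n:
                sample[j] = t
    return sample
```

Because the slots are independent, the same tuple can appear in several slots, which is exactly what "with replacement" means here.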
|
It is very intuitive...
N = 0; // N = Number of tuples selected so far
// NOTE: the algorithm will select an avg. of f*n tuples,
//       where n = number of tuples in the input
while ( not EOF )
{
   t = read_next_tuple(); // Get next tuple from input
   // Select tuple if coin toss is success (probability f)
   if ( random() < f )
   {
      Sample[N] = t;
      N = N+1;
   }
}
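The coin-toss loop above is short enough to fit in one line of Python. The name `bernoulli_sample` and the `rng` parameter are assumptions of this sketch.

```python
import random

def bernoulli_sample(stream, f, rng=random):
    """Keep each tuple independently with probability f (0 <= f <= 1)."""
    # The sample size is random: about f * n for an input of n tuples.
    return [t for t in stream if rng.random() < f]
```

Note the difference with the two previous algorithms: the size of the sample is not fixed in advance, only its expected value is.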
|
|
Example:
|
Example:
|
|
The join operation will produce no output tuples !!!
Input 1:    Input 2:    Output:
---------   ----------  ------------
(a1,b0)     (a2,c0)
(a2,b1)     (a1,c1)     (a1,b0,a1,c1), (a1,b0,a1,c2), ..., (a1,b0,a1,ck),
(a2,b2)     (a1,c2)     (a2,b1,a2,c0), (a2,b2,a2,c0), ..., (a2,bk,a2,c0)
...         ...
(a2,bk)     (a1,ck)
|
There is only a single tuple (a1,b0) in R1 with A = a1, and only a single tuple (a2,c0) in R2 with A = a2.
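The example inputs can be rebuilt in a few lines to see the problem. The helper names `build_skewed_inputs` and `join` are hypothetical, and the join is a plain nested loop on the first field.

```python
def build_skewed_inputs(k):
    """R1 = (a1,b0) plus k tuples with a2; R2 = (a2,c0) plus k tuples with a1."""
    r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
    r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
    return r1, r2

def join(r1, r2):
    """Nested-loop equi-join on the first field of each tuple."""
    return [t1 + t2 for t1 in r1 for t2 in r2 if t1[0] == t2[0]]
```

The full join has 2k output tuples, but `join(s1, s2)` is empty for any samples s1 and s2 that miss both (a1,b0) and (a2,c0). A uniform sample of r tuples taken without replacement from the k+1 tuples of R1 contains (a1,b0) only with probability r/(k+1).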
|
Answer:
|
|
Example:
|
|
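W = 0; // W = Total weight of the tuples read so far
// Sample[0..r-1] is the weighted sample (with repetition)
while ( not EOF )
{
   t = read_next_tuple(); // Get next tuple from input
   W = W + w(t); // Sum of weights so far...
   for (j = 0; j < r; j++)
   {
      // Replace item Sample[j] with probability w(t)/W
      if ( random() < w(t)/W )
      {
         Sample[j] = t;
      }
   }
}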
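A Python sketch of the weighted loop above. The names `weighted_sample_with_replacement`, `weight` and `rng` are hypothetical, and skipping tuples with weight 0 is my own assumption (it also avoids a division by zero when the stream starts with such tuples).

```python
import random

def weighted_sample_with_replacement(stream, r, weight, rng=random):
    """Each slot ends up holding tuple t with probability w(t) / (sum of all weights)."""
    sample = [None] * r
    total = 0.0                    # W = sum of the weights read so far
    for t in stream:
        w = weight(t)
        if w <= 0:
            continue               # assumption: weight 0 means "never select"
        total += w
        for j in range(r):
            # Slot j is overwritten with probability w(t)/W.
            if rng.random() < w / total:
                sample[j] = t
    return sample
```

With all weights equal to 1, this reduces to the unweighted sampling with replacement shown earlier.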
|
where the input data in the stream may not be known in advance...
|
|
|
He studied the problem of constructing a sample of a join output when a lot of information on the inputs is available:
M = largest frequency of the joining attribute (A) values
    in R2 (remember: we know the statistics !)
m2(v) = frequency of the value v of attribute A in R2
r = output sample size;
NtuplesOutput = 0;
while (NtuplesOutput < r)
{
   Sample a tuple t1 ∈ R1,
      uniformly at random;
   Sample a random tuple t2 ∈ R2 from among tuples t ∈ R2
      that have t.A = t1.A
   Output tuple t1 * t2 with probability m2(t2.A)/M,
      and with probability 1 - m2(t2.A)/M, reject the sample
      (reject = discard this pair (t1, t2) and draw a new t1)
   NtuplesOutput += number of tuples output (0 or 1);
}
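The accept/reject loop above can be sketched as follows. In this sketch a dictionary built from R2 plays the role of both the statistics and the random access to R2. The function name `rejection_join_sample`, the `key` and `rng` parameters, and the early return when nothing can match are my own assumptions.

```python
import random
from collections import defaultdict

def rejection_join_sample(r1, r2, r, key=lambda t: t[0], rng=random):
    """Draw r tuples of the join of r1 and r2, each uniform over the join output."""
    index = defaultdict(list)      # A value -> tuples of R2 with that value
    for t in r2:
        index[key(t)].append(t)
    # Guard (assumption): an empty join would make the loop run forever.
    if not any(key(t) in index for t in r1):
        return []
    M = max(len(v) for v in index.values())   # largest frequency in R2
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)                   # uniform over R1
        matches = index.get(key(t1))
        if not matches:
            continue                          # m2 = 0: always rejected
        t2 = rng.choice(matches)              # uniform among matching tuples
        # Accept with probability m2(t2.A)/M, otherwise drop the pair and retry.
        if rng.random() < len(matches) / M:
            out.append(t1 + t2)
    return out
```

The acceptance test is what removes the bias: without it, join tuples whose A value is rare in R2 would be picked too often.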
|
And we are mostly interested in stream data...
|
Use Weighted Sampling algorithm (weights from statistics in R2)
to produce a Weighted Sample with Repetition set S1 ⊂ R1
(The weight of sampling w(t) of tuple t is
the frequency count of t.A in R2 !!!)
(The weighted sampling algorithm will end and produce
a set of "r" tuples.)
For each output tuple in sample S1 do
{
let t1 = next tuple in S1;
Sample a random tuple t2 ∈ R2 among
the tuples t ∈ R2 that have t.A = t1.A
(Note: because R2 is indexed, we can find all tuples
t ∈ R2 that have t.A = t1.A
very easily by using the index !)
Output t1 * t2 (join)
}
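A sketch of the two steps above in Python. Here the dictionary `index` stands in for both the index on R2 and the statistics on R2 (the frequency of a value is the length of its list), which is an assumption of this sketch. The names `stream_sample_join`, `key` and `rng` are hypothetical.

```python
import random
from collections import defaultdict

def stream_sample_join(r1_stream, r2, r, key=lambda t: t[0], rng=random):
    """One pass over R1, then one index lookup in R2 per sampled tuple."""
    index = defaultdict(list)              # stand-in for the index on R2
    for t in r2:
        index[key(t)].append(t)

    # Step 1: weighted sample S1 of R1, with w(t) = frequency of t.A in R2.
    s1 = [None] * r
    total = 0
    for t1 in r1_stream:
        w = len(index.get(key(t1), ()))
        if w == 0:
            continue                       # t1 joins with nothing in R2
        total += w
        for j in range(r):
            if rng.random() < w / total:
                s1[j] = t1

    # Step 2: for each t1 in S1, pick one matching t2 and output t1 * t2.
    return [t1 + rng.choice(index[key(t1)]) for t1 in s1 if t1 is not None]
```

Unlike the accept/reject approach, no candidate is thrown away: R1 is read once, as a stream, and exactly r join tuples come out whenever the join is not empty.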
|
|
Use Weighted Sampling algorithm (weights from statistics in R2)
to produce a Weighted Sample with Repetition set S1 ⊂ R1
(The weight of sampling w(t) of tuple t is
the frequency count of t.A in R2 !!!)
(The weighted sampling algorithm will end and produce
a set of "r" tuples.)
Let S1 consist of the tuples (s1, s2, ..., sr)
For each tuple si ∈ S1 do
{
Compute:
Xi = si * R2
// Join tuple si with every tuple in R2
// NOTE: We do this, because R2 is not indexed,
// so we can't find tuples in R2
// with a given value t.A quickly
Pick 1 tuple from Xi, uniformly at random
Output the selected tuple
}
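A sketch of the steps above in Python. Instead of materializing every Xi, this version scans R2 once and keeps, for each si, a reservoir of size 1 over the tuples that join with it. That is my own implementation choice, and it picks 1 tuple uniformly from each Xi. The names `group_sample_join`, `m2` (a dictionary with the frequency of each A value in R2), `key` and `rng` are hypothetical.

```python
import random

def group_sample_join(r1_stream, r2_stream, r, m2, key=lambda t: t[0], rng=random):
    """Sample r join tuples using statistics m2 on R2, but no index on R2."""
    # Step 1: weighted sample S1 of R1, with w(t) = m2[t.A].
    s1 = [None] * r
    total = 0
    for t1 in r1_stream:
        w = m2.get(key(t1), 0)
        if w == 0:
            continue
        total += w
        for j in range(r):
            if rng.random() < w / total:
                s1[j] = t1

    # Step 2: one scan of R2; picked[i] is a uniform choice from Xi = si * R2.
    picked = [None] * r
    seen = [0] * r                 # number of tuples of Xi met so far
    for t2 in r2_stream:
        for i, si in enumerate(s1):
            if si is not None and key(si) == key(t2):
                seen[i] += 1
                # Keep the new match with probability 1/seen[i].
                if rng.randrange(seen[i]) == 0:
                    picked[i] = t2
    return [si + t2 for si, t2 in zip(s1, picked) if t2 is not None]
```

The price of the missing index is visible in step 2: every tuple of R2 is compared with every tuple of S1.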
|