Cutting down query processing cost through Random Sampling

Motivation...
- The most time-consuming operation in database processing is the join.
- The join operation pairs up tuples from 2 sources based on key values in some attribute(s).
- Example:
  When a new tuple arrives from stream 1, all tuples in stream 2 must be searched to identify the matching tuples, and vice versa.
- Obviously, it takes a lot of processing time to go through all the tuples...
- If you want to find the exact answer, there is no shortcut.
  You must look through all tuples in the other stream...
- However, often, all we need is an approximate answer.
- Example:

Approximate the answer of queries using Random Sampling

The first method that we will study to reduce query processing cost (and so meet real-time processing requirements) is a simple statistical technique: Random Sampling.

Principle of Random Sampling:

A small representative (random) sample is taken from each input.
The join operation is performed on the samples

Results:

The output will be much smaller than (and therefore different from) the original result.

Question:

Is the (different) output representative of the original (accurate) output?
I.e.: can we scale the smaller output to obtain an approximate answer for the original output?

Further question:

NOTE: we know that the answer is approximate, and we are willing to sacrifice accuracy for efficiency ...
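
To make the scaling idea concrete, here is a minimal Python sketch. It assumes that each input is sampled independently by a coin toss with probabilities f1 and f2, so a matching pair survives in both samples with probability f1*f2 and the sample join count is scaled by 1/(f1*f2). The function names, the parameters f1 and f2, and the count-only join are assumptions made for illustration; they are not taken from these notes.

```python
import random
from collections import Counter

def join_size(left, right):
    """Exact number of matching pairs in an equi-join on the key values."""
    right_counts = Counter(right)
    return sum(right_counts[k] for k in left)

def estimate_join_size(left, right, f1, f2, rng):
    """Join coin-toss samples of both inputs and scale the count by 1/(f1*f2)."""
    left_sample = [k for k in left if rng.random() < f1]
    right_sample = [k for k in right if rng.random() < f2]
    return join_size(left_sample, right_sample) / (f1 * f2)

if __name__ == "__main__":
    rng = random.Random(42)
    left = [rng.randrange(100) for _ in range(5000)]
    right = [rng.randrange(100) for _ in range(5000)]
    print("exact   :", join_size(left, right))
    print("estimate:", estimate_join_size(left, right, 0.1, 0.1, rng))
```

The estimate varies from run to run; how close it stays to the exact answer depends on how the key values are distributed in the two inputs.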

Random Sampling
- There are 2 types of sampling algorithms:
- In unweighted sampling, every item in the set is weighted "equally"
- In weighted sampling, items in the set are weighted differently
  We will see later why some items need to be weighted more than others in a join sample
- In this webpage, we will first study unweighted sampling

Unweighted Random Sampling

Unweighted Random Sample:

An unweighted random sample of size r from a set of n items that is given a priori can be found relatively easily:

input: n = number of items in the input set
       r = number of items selected  (r <= n)

int picked[r];
int i;

i = 0;
while ( i < r )
{
   x = TRUNC( random() * n );     // x is a random number in [0..n-1]

   if ( x ∈ picked[0..i-1] )
   {
      already picked; try again;
   }
   else
   {
      picked[i++] = x;
   }
}
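
The procedure above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name sample_indices and the optional rng parameter are assumptions introduced here, and a set is used for the "already picked" test.

```python
import random

def sample_indices(n, r, rng=random):
    """Pick r distinct indices from range(n) by rejection, each equally likely."""
    if r > n:
        raise ValueError("cannot pick more items than the set contains")
    picked = []
    seen = set()              # fast membership test for "already picked"
    while len(picked) < r:
        x = rng.randrange(n)  # random index in [0..n-1]
        if x in seen:
            continue          # already picked; try again
        seen.add(x)
        picked.append(x)
    return picked

if __name__ == "__main__":
    print(sample_indices(10, 4))
```

Python's standard library offers the same result directly through random.sample(range(n), r).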

Sequential Random Sampling in a stream

Picking a given number (r) of tuples from an input stream presents a new problem in Random Sampling

Problem description:

You are given an input stream of tuples
You need to select r tuples randomly from the input stream
You must decide whether to select each item as it arrives.

Difficulty of this problem:

Selecting r tuples from the input is easy if you know in advance how many tuples there will be in the input.

The main difficulty in this problem is:

We cannot store every tuple from the input
(But we do need to store some tuples, just not every tuple)
We want to construct the random sample by looking at the current input tuple, processing that tuple, and discarding it (we do not look at it again).
We do not know in advance how many tuples there will be in the input...

Unweighted Sequential Random Sampling in a stream

The paper "Random Sampling with a Reservoir" presents an algorithm that constructs a Random Sample of r items from an input stream
Each item in the input stream is selected with the same probability (i.e., unweighted)
The paper also presents more sophisticated algorithms (Algorithms X, Y and Z).
I present the simplest one, Algorithm R, which will serve our purpose.

Uniform Random Sample Without Replacement algorithm:

N = 0;      // N = number of items processed

Sample[0..r-1] is the random sample obtained at the end of the algorithm

/* ----------------------------------------------------
   Step 1: fill the output with the initial r items
   ---------------------------------------------------- */
for (j = 0; j < r; j++)
{
   Sample[j] = read_next_tuple();   // Get next tuple from input
   N++;
}

/* ------------------------------------------------------
   Step 2: replace items with new items probabilistically
   ------------------------------------------------------ */
while ( not EOF )
{
   t = read_next_tuple();           // Get next tuple from input
   N = N+1;                         // N = total number of items read so far

   M = TRUNC( N*random() );         // Generate random number in [0..(N-1)]

   // Replace item M if M < r  (P[success] = r/N)
   if ( M < r )
   {
      Sample[M] = t;
   }
}
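
The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name reservoir_sample and the rng parameter are introduced here, the stream is modeled as any Python iterable, and a stream shorter than r simply yields all of its items.

```python
import random

def reservoir_sample(stream, r, rng=random):
    """Algorithm R: keep a uniform random sample of r items from an iterable."""
    sample = []
    for n, item in enumerate(stream, start=1):  # n = items read so far
        if n <= r:
            sample.append(item)        # Step 1: fill the reservoir
        else:
            m = rng.randrange(n)       # random number in [0..n-1]
            if m < r:                  # happens with probability r/n
                sample[m] = item       # Step 2: overwrite slot m
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1000), 10))
```

Only the r sampled items are held in memory, and each input item is examined exactly once.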

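Claims:

The algorithm selects r items.
Each of the N items in the input stream ends up in the sample with the same probability r/N.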

Proof:

It should be obvious that r items will be selected (assuming that there are at least r items in the input stream)

The x^th item is in the final sample if:

The random number M_x generated at the time that the x^th item arrives is < r
(so that the x^th item is placed in the sample)
The (x+1)^th, (x+2)^th, ..., N^th items do not replace the x^th item
(so that the x^th item stays in the sample)

These probabilities are:

Prob[ M_x < r ] = r/x   (for x > r; the first r items are always placed in the sample)
Prob[ (x+1)^th item replaces the x^th item ] = (r/(x+1)) * (1/r) = 1/(x+1)
Prob[ (x+1)^th item does not replace the x^th item ] = 1 - 1/(x+1) = x/(x+1)
Prob[ (x+2)^th item replaces the x^th item ] = (r/(x+2)) * (1/r) = 1/(x+2)
Prob[ (x+2)^th item does not replace the x^th item ] = 1 - 1/(x+2) = (x+1)/(x+2)
And so on.

Then:

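Prob[ x^th item is selected and is not replaced by later items ]
= r/x * [ (x/(x+1)) * ((x+1)/(x+2)) * ... * ((N-1)/N) ]
= r/x * x/N
= r/N

(For x <= r the first factor is 1 and the product starts at r/(r+1), which again gives r/N.)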
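
The product in this proof can be checked with exact rational arithmetic. The sketch below is only a numerical sanity check of the derivation, not part of the proof; the function name inclusion_probability is an assumption introduced here, and it assumes r <= N.

```python
from fractions import Fraction

def inclusion_probability(x, r, N):
    """Probability that the x-th item (1-based) is in the final sample of Algorithm R."""
    # The first r items enter the sample with certainty; later ones with r/x.
    p = Fraction(1) if x <= r else Fraction(r, x)
    # Each later item k overwrites one particular slot with probability 1/k,
    # so the x-th item survives the arrival of item k with probability (k-1)/k.
    for k in range(max(x, r) + 1, N + 1):
        p *= Fraction(k - 1, k)
    return p

if __name__ == "__main__":
    r, N = 3, 10
    for x in range(1, N + 1):
        print(x, inclusion_probability(x, r, N))
```

Every position x yields the same value r/N, because the intermediate factors cancel pairwise.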

A simpler Sequential Random Sampling algorithm for a stream

The following algorithm constructs a Random Sample from an input stream using a coin-tossing method.
Each item is selected with the same probability f (so the sample is unweighted)
It is very intuitive...

Uniform Random Sample using the coin-toss algorithm:

N = 0;      // N = number of items selected so far

NOTE: the algorithm will select an average of f * (number of input tuples) tuples...

while ( not EOF )
{
   t = read_next_tuple();           // Get next tuple from input

   // Select tuple if coin toss is a success...
   // random() returns a random number in (0..1)
   if ( random() < f )
   {
      Sample[N] = t;
      N = N+1;
   }
}
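
The coin-toss method can be sketched in Python as follows. This is a minimal illustration: the function name coin_toss_sample and the rng parameter are assumptions introduced here, and the stream is modeled as any Python iterable.

```python
import random

def coin_toss_sample(stream, f, rng=random):
    """Keep each item independently with probability f (0 <= f <= 1)."""
    sample = []
    for item in stream:
        if rng.random() < f:   # random() is uniform in [0.0, 1.0)
            sample.append(item)
    return sample

if __name__ == "__main__":
    picked = coin_toss_sample(range(1000), 0.05)
    print(len(picked), "tuples selected")
```

The number of selected tuples differs from run to run; only its average, f times the stream length, is fixed in advance.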

Problem with this algorithm:
In the next webpage, we will discuss two improved algorithms that can fix this problem.