Basic Algorithms for Finding Heavy Elements in Databases
Algorithms to compute an initial solution set F
The following algorithms are the basis for better algorithms to find the solution set F.
Scaled-Sampling
S = uniform random sample of R of size s (e.g., a Concise Sample)

for ( each e ∈ S )
{
   // N = number of items in R, |S| = number of items in the sample
   count[e] = (count of e in S) × N / |S|;

   if ( count[e] ≥ T )
   {
      add e to solution set F;
   }
}
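The pseudocode above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: R fits in a list, N is taken to be the number of items in R, and Python's `random.sample` stands in for the Concise Sample. The function name `scaled_sampling` and the `seed` parameter are illustrative and do not come from the notes.

```python
import random
from collections import Counter

def scaled_sampling(R, s, T, seed=0):
    """Return the elements whose scaled sample count reaches the threshold T."""
    rng = random.Random(seed)      # fixed seed (an assumption) for repeatable runs
    S = rng.sample(R, s)           # uniform random sample of R, without replacement
    scale = len(R) / len(S)        # N / |S|
    sample_counts = Counter(S)     # count of each e inside the sample
    return {e for e, c in sample_counts.items() if c * scale >= T}

R = ["a"] * 900 + ["b"] * 100
F = scaled_sampling(R, s=500, T=500)
```

In this example any sample of 500 items contains at least 400 copies of "a" and at most 100 copies of "b", so F is {"a"} regardless of the seed. With smaller samples the result depends on which items happen to be drawn.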
Properties of the algorithm:
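Elements selected are likely to be heavy.
However:
Scaled-Sampling can generate false positives (a light element that is over-represented in the sample S).
Scaled-Sampling can generate false negatives (a heavy element that is under-represented in the sample S).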
Positive trait:
The output size can be very small if the selection probability is set low enough.
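A small worked example may help show how both kinds of error arise. The sketch below uses a hand-picked (not random) sample and made-up numbers: the relation size N = 1000, the threshold T = 50, the true counts, and the helper name `scaled_estimates` are all assumptions chosen for illustration.

```python
from collections import Counter

def scaled_estimates(S, N):
    """Scale each element's count in the sample S by N / |S|."""
    scale = N / len(S)
    return {e: c * scale for e, c in Counter(S).items()}

# Hypothetical relation with N = 1000 items and threshold T = 50.
# Assumed true counts in R: x = 40 (light), y = 60 (heavy), z = 900 (heavy).
N, T = 1000, 50
true_counts = {"x": 40, "y": 60, "z": 900}

# Hand-picked sample of size 100 that over-represents x
# and under-represents y.
S = ["x"] * 5 + ["y"] * 4 + ["z"] * 91

estimates = scaled_estimates(S, N)               # x: 50.0, y: 40.0, z: 910.0
F = {e for e, est in estimates.items() if est >= T}
heavy = {e for e, c in true_counts.items() if c >= T}

false_positives = F - heavy    # {"x"}: reported although its true count is 40
false_negatives = heavy - F    # {"y"}: missed although its true count is 60
```

Here the scale factor is 10, so one extra or one missing occurrence in the sample shifts the estimate by 10, which is enough to move x and y across the threshold.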
Coarse-Count (probabilistic counting)
h1(x) = hash function that maps a value x of R to 1..m
A[1..m] = m counter variables
/* -----------------------------
initialize counters
----------------------------- */
for ( i = 1; i <= m; i++ )
A[i] = 0;
/* ------------------------------
Hash-count items in R
------------------------------ */
for ( each e ∈ R )
   A[ h1( e ) ] ++;    // Note: two items e1 and e2 that hash
                       // to the same value are counted together
/* ------------------------------
Form solution set F
------------------------------ */
for ( each e ∈ R )
{
   if ( A[ h1( e ) ] ≥ T )
      add e to F;
}
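The three steps above (initialize counters, hash-count, form F) can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name `coarse_count` is illustrative, the hash function is passed in by the caller, and counters are indexed 0..m-1 rather than 1..m as in the pseudocode.

```python
def coarse_count(R, m, T, h1):
    """Return every item of R whose hash-bucket counter reaches T."""
    A = [0] * m                    # m counters, all initialized to 0
    for e in R:                    # first scan: hash-count the items in R
        A[h1(e)] += 1
    return {e for e in R if A[h1(e)] >= T}   # second scan: form F

# Toy data (an assumption): item 1 occurs 5 times, 2 twice, 5 once.
R = [1] * 5 + [2] * 2 + [5]
T = 5

# With m = 4, items 1 and 5 share bucket 1, so that counter reaches 6.
F_small = coarse_count(R, 4, T, lambda x: x % 4)   # {1, 5}

# With m = 8, items 1 and 5 land in different buckets.
F_large = coarse_count(R, 8, T, lambda x: x % 8)   # {1}
```

Item 5 occurs only once, yet it lands in F when m = 4 because it is counted together with item 1. Using more counters reduces such collisions in this toy case.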
Properties of the algorithm:
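Coarse-Counting can generate false positives: a light item that shares a counter with other items is reported whenever that counter reaches T.
Coarse-Counting does not generate false negatives:
If an item e occurs at least T times, then A[ h1(e) ] ≥ T, because every occurrence of e increments the same counter A[ h1(e) ].
However:
F generated by Coarse-Counting can be very large.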