Basic Algorithms for Finding Heavy Elements in Databases

Algorithms to compute an initial solution set F

The following algorithms form the basis for more refined algorithms that compute the solution set F.

Scaled-Sampling

S = uniform random sample of R of size s (e.g.: Concise Sample)
for ( each e ∈ S )
{
   count[e] = (count of e in S) × N / |S|;

   if ( count[e] ≥ T )
   {
      add e to solution set F;
   }
}
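The sampling procedure above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes N is the number of tuples in R, that R fits in a list, and that a plain uniform sample without replacement stands in for the sample S (the notes mention a Concise Sample as one option). The function name `scaled_sampling` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def scaled_sampling(R, s, T, seed=None):
    """Return a candidate set F of heavy elements estimated from a sample.

    R    : list of items (assumption: the whole relation fits in memory)
    s    : sample size, 1 <= s <= len(R)
    T    : threshold on the (estimated) count
    seed : optional seed, only to make runs repeatable
    """
    rng = random.Random(seed)
    S = rng.sample(R, s)             # uniform random sample without replacement
    scale = len(R) / len(S)          # N / |S|, assuming N = |R|

    F = set()
    for e, c in Counter(S).items():  # c = count of e inside the sample
        if c * scale >= T:           # scaled count is only an estimate
            F.add(e)
    return F
```

Because the scaled count is an estimate, a light element that happens to be over-represented in S can enter F, and a heavy element that is under-represented can be missed. When s equals len(R) the scale factor is 1 and the result is exact.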

Properties of the algorithm:

Elements selected are likely to be heavy...
However...
Scaled-Sampling can generate false positives
Scaled-Sampling can generate false negatives
Positive trait:

Coarse-Count (probabilistic counting)

h1(x) = hash function that maps a value x of R to 1..m
A[1..m] = m counter variables

/* -----------------------------
   Initialize counters
   ----------------------------- */
for ( i = 1; i <= m; i++ )
   A[i] = 0;

/* ------------------------------
   Hash-count items in R
   ------------------------------ */
for ( each e ∈ R )
   A[ h1( e ) ]++;     // Note: 2 items e1 and e2 that hash
                       // to the same value will be counted together...

/* ------------------------------
   Form solution set F
   ------------------------------ */
for ( each e ∈ R )
{
   if ( A[ h1( e ) ] ≥ T )
      add e to F;
}
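The two passes above can be sketched in Python as follows. This is an illustrative sketch under some assumptions: buckets are indexed 0..m-1 rather than 1..m, R is a list that can be scanned twice, and the default hash is simply Python's built-in `hash` reduced modulo m. The function name `coarse_count` and the optional `h1` parameter are hypothetical.

```python
def coarse_count(R, m, T, h1=None):
    """Return the candidate set F using m shared hash counters.

    R  : list of items, scanned twice
    m  : number of counters
    T  : threshold
    h1 : optional hash function mapping an item to 0..m-1
         (assumption: defaults to hash(x) % m)
    """
    if h1 is None:
        h1 = lambda x: hash(x) % m

    A = [0] * m                # initialize counters

    for e in R:                # pass 1: hash-count items in R
        A[h1(e)] += 1          # colliding items share one counter

    F = set()
    for e in R:                # pass 2: form solution set F
        if A[h1(e)] >= T:
            F.add(e)
    return F
```

For example, with R = [1, 1, 1, 1, 1, 2, 2, 5], m = 4 and T = 5, the items 1 and 5 share bucket 1, whose counter reaches 6. The call returns {1, 5}: item 1 is truly heavy, while item 5 occurs only once and is a false positive caused by the collision.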

Properties of the algorithm:

Coarse-Counting can generate false positives
Coarse-Counting does not generate false negatives !!!!
If an item e occurs at least T times, then A[ h1(e) ] ≥ T
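The reason can be written out in one line. Here count(e') is assumed to denote the number of occurrences of the value e' in R, and the sum ranges over the distinct values of R that hash to the same counter as e:

```latex
A[h_1(e)] \;=\; \sum_{e' \,:\, h_1(e') = h_1(e)} \mathrm{count}(e')
          \;\ge\; \mathrm{count}(e) \;\ge\; T
```

Every term in the sum is non-negative and e itself is one of the terms, so collisions can only raise the counter, never lower it.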
However:

Improving the basic algorithms
- Next webpage.....