|
Input set:
11 21 24 61 81 39 89 56 12 51
After sorting:
11 12 21 24 39 51 56 61 81 89
The 0.1-quantile = 11
The 0.2-quantile = 12
etc.
Special case:
The median = 0.5 quantile = 39
|
Then the sorted elements are scanned to find the one at position ⌊ φ × N ⌋
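The sort-and-scan procedure can be sketched in a few lines of Python. This is only an illustration: the function name `exact_quantile` is hypothetical, and clamping the rank to at least 1 is an assumption made here so that a very small φ still returns the smallest element.

```python
import math

def exact_quantile(values, phi):
    """Return the element at (1-based) position floor(phi * N) of the sorted input."""
    ranked = sorted(values)                        # O(N log N): needs all the data
    rank = max(1, math.floor(phi * len(ranked)))   # assumption: clamp to rank 1
    return ranked[rank - 1]

data = [11, 21, 24, 61, 81, 39, 89, 56, 12, 51]
print(exact_quantile(data, 0.1))   # 11
print(exact_quantile(data, 0.2))   # 12
print(exact_quantile(data, 0.5))   # 39
```

Note that this approach has to keep the whole input in memory, which is what the summary structures discussed later try to avoid.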
|
|
|
Input set:
11 21 24 61 81 39 89 56 12 51
After sorting:
11 12 21 24 39 51 56 61 81 89
Special case:
The median = 0.5 quantile = 39
The ε-approximate 0.5-quantiles for ε = 0.1 (ranks 5 ± εN = 4..6) = {24, 39, 51}
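The set of acceptable answers can be computed directly from the sorted input. The sketch below is illustrative only: the function name `approximate_quantiles` is hypothetical, and it assumes the target rank is ⌊φN⌋ and that any element whose rank is within εN of the target is acceptable.

```python
import math

def approximate_quantiles(values, phi, eps):
    """Return every element whose rank is within eps*N of rank floor(phi*N)."""
    ranked = sorted(values)
    n = len(ranked)
    target = max(1, math.floor(phi * n))   # rank of the exact phi-quantile
    slack = eps * n                        # allowed rank error
    return [x for rank, x in enumerate(ranked, start=1)
            if abs(rank - target) <= slack]

data = [11, 21, 24, 61, 81, 39, 89, 56, 12, 51]
print(approximate_quantiles(data, 0.5, 0.1))   # [24, 39, 51]
```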
|
|
|
|
Example: ε = 0.1
Input set:
11 21 24 61 81 39 89 56 12 51
After sorting:
11 12 21 24 39 51 56 61 81 89
      ^
      |
      #3
|
|
Input:   45 89 98 12 13 55 14 24 26
After sorting:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  12 13 14 24 26 45 55 89 98
|
Goal:
|
(Because every possible element can be queried and no error is allowed!)
Original input:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  12 13 14 24 26 45 55 89 98
Retain only these elements:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:     13       26       89
Approximate answers to quantile queries:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  13 13 13 26 26 26 89 89 89
|
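The idea of answering every rank query from a few retained elements can be sketched as follows. The function name `approximate_answers` is hypothetical, and the rule "answer with the retained element whose rank is closest" is an assumption that reproduces the table above.

```python
def approximate_answers(ranked, retained_ranks):
    """For each rank 1..N, answer with the retained element of the nearest rank."""
    answers = []
    for r in range(1, len(ranked) + 1):
        nearest = min(retained_ranks, key=lambda k: abs(k - r))
        answers.append(ranked[nearest - 1])
    return answers

ranked = [12, 13, 14, 24, 26, 45, 55, 89, 98]
print(approximate_answers(ranked, [2, 5, 8]))
# [13, 13, 13, 26, 26, 26, 89, 89, 89]
```

Every answer is off by at most one rank position, while only 3 of the 9 elements are stored.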
Example usage:
|
Conclusion:
|
(you will see this fact when we discuss the algorithm)
Original input:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  12 13 14 24 26 45 55 89 98
Retain only these elements:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:     13       26       89
Approximate answers to quantile queries:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  13 13 13 26 26 26 89 89 89
|
( [v_1, min_1, max_1], [v_2, min_2, max_2], ..., [v_m, min_m, max_m] )
where:
v_i = the value that covers the φ-quantile range
min_i = start position of the φ-quantile range
max_i = ending position of the φ-quantile range
|
Suppose the input stream is:
12 13 14 24 26 45 55 89 98 ...(more data coming)
(For ease of understanding, here is the sorted list of the input numbers:
12 13 14 24 26 45 55 89 98
)
The algorithm represents the current state with:
[13, 1, 3] [26, 4, 6] [89, 7, 9]
|
Now suppose the next arriving value is 17:
The input stream is now:
12 13 14 24 26 45 55 89 98 17 ...(more data coming)
(For ease of understanding, here is the sorted list of the input numbers:
12 13 14 17 24 26 45 55 89 98
)
The algorithm must modify the state in the data structure to:
[13, 1, 3] [17, 4, 4] [26, 5, 7] [89, 8, 10]
           ^^^^^^^^^^      ^^^^^      ^^^^^
           inserts 17, but must also change the indices
           in all later entries !!!
|
This data structure requires a large number of operations per inserted value: every entry after the insertion point must have its start and end positions updated.
Although it is useful, it is not efficient.
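The cost of an insertion in this representation can be made concrete with a small sketch. The function name `insert_ranked` is hypothetical, and the sketch assumes that the new value takes the first position of the range it falls into, as in the example with 17 above.

```python
def insert_ranked(summary, v):
    """Insert v into a list of [value, min, max]; return how many entries were shifted."""
    pos = len(summary)
    for k, (value, _, _) in enumerate(summary):
        if value > v:              # first entry whose value is larger than v
            pos = k
            break
    if pos < len(summary):
        start = summary[pos][1]    # v takes over the start position of that entry
    else:
        start = summary[-1][2] + 1 if summary else 1
    summary.insert(pos, [v, start, start])
    shifted = 0
    for entry in summary[pos + 1:]:    # every later entry moves one position up
        entry[1] += 1
        entry[2] += 1
        shifted += 1
    return shifted

state = [[13, 1, 3], [26, 4, 6], [89, 7, 9]]
print(insert_ranked(state, 17))   # 2 entries had to be rewritten
print(state)   # [[13, 1, 3], [17, 4, 4], [26, 5, 7], [89, 8, 10]]
```

In the worst case (a new smallest value) every entry in the summary is rewritten.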
( [v_1, g_1], [v_2, g_2], ..., [v_m, g_m] )
where:
v_i = the value that covers the φ-quantile range
g_i = number of positions covered by the value
|
Now suppose the next arriving value is 17:
The input stream is now:
12 13 14 24 26 45 55 89 98 17 ...(more data coming)
(For ease of understanding, here is the sorted list of the input numbers:
12 13 14 17 24 26 45 55 89 98
)
The algorithm must modify the state in the data structure to:
[13, 3] [17, 1] [26, 3] [89, 3]
        ^^^^^^^     ^^^     ^^^
        inserts 17, but the other information
        does not need to be updated !!!
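A short sketch shows why the [v, g] representation is cheaper: an insertion adds one entry and leaves all others alone, and the positions are recovered on demand as running sums of the g values. The function names `insert_gap` and `rank_ranges` are hypothetical.

```python
import bisect

def insert_gap(summary, v):
    """Insert [v, 1] at its sorted position; no other entry is modified."""
    values = [entry[0] for entry in summary]
    summary.insert(bisect.bisect_right(values, v), [v, 1])

def rank_ranges(summary):
    """Recover (value, start, end) positions from the g values by a running sum."""
    ranges, end = [], 0
    for v, g in summary:
        ranges.append((v, end + 1, end + g))
        end += g
    return ranges

state = [[13, 3], [26, 3], [89, 3]]
insert_gap(state, 17)
print(state)               # [[13, 3], [17, 1], [26, 3], [89, 3]]
print(rank_ranges(state))  # [(13, 1, 3), (17, 4, 4), (26, 5, 7), (89, 8, 10)]
```

The recovered ranges match the [v, min, max] state shown earlier for the same input.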
|
How to read the data structure:
|
|
Original input:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  12 13 14 24 26 45 55 89 98
Retain only these elements:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:     13       26       89
Coverage provided by each entry:
Rank:     1  2  3  4  5  6  7  8  9
-------+----------------------------
Ranked:  13 13 13 26 26 26 89 89 89
|
|
Example: in the above summary
Graphically:
|
|
( [v_0, g_0, Δ_0], [v_1, g_1, Δ_1], [v_2, g_2, Δ_2], ..., [v_{s-1}, g_{s-1}, Δ_{s-1}] )
where:
v_i = the value that covers the φ-quantile range
g_i = rmin(v_i) - rmin(v_{i-1})   (see definition above)
Δ_i = rmax(v_i) - rmin(v_i)       (see definition above)
|
|
|
|
|
(v_0, g_0, Δ_0)   (v_1, g_1, Δ_1)   (v_2, g_2, Δ_2)
(5, 1, 4)         (7, 3, 3)         (10, 4, 0)
rmin(v_0) = 1             rmax(v_0) = 1 + 4 = 5
rmin(v_1) = 1 + 3 = 4     rmax(v_1) = 4 + 3 = 7
rmin(v_2) = 4 + 4 = 8     rmax(v_2) = 8 + 0 = 8
|
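The arithmetic in this example follows one rule: rmin is the running sum of the g values, and rmax is rmin plus Δ. A minimal sketch (the function name `rank_bounds` is hypothetical):

```python
def rank_bounds(summary):
    """Return (v, rmin, rmax) for each tuple (v, g, delta) of the summary."""
    bounds, rmin = [], 0
    for v, g, delta in summary:
        rmin += g                          # rmin(v_i) = g_0 + ... + g_i
        bounds.append((v, rmin, rmin + delta))   # rmax(v_i) = rmin(v_i) + delta_i
    return bounds

print(rank_bounds([(5, 1, 4), (7, 3, 3), (10, 4, 0)]))
# [(5, 1, 5), (7, 4, 7), (10, 8, 8)]
```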
|
|
|
Case 1: r > n-e
|
Case 2: r ≤ n-e
|
|
|
N = 0;
while ( not EOS )
{
   /* -----------------------
      Delete phase
      ----------------------- */
   if ( N mod ( 1/(2 ε) ) == 0 )
      delete elements from summary;
   v = next value in stream;
   /* -----------------------
      Insert phase
      ----------------------- */
   insert v into summary;
   N++;
}
|
Important:
|
v = next value in input;
/* --------------------------------------------
   Find insert position for v in S
   -------------------------------------------- */
Find a tuple (v_i, g_i, Δ_i) ∈ S such that: v_{i-1} ≤ v < v_i
if ( v < v_0 || v ≥ v_{s-1} )
   Δ = 0;                  // New min or max value
else
   Δ = g_i + Δ_i - 1;
INSERT "(v, 1, Δ)" into S between v_{i-1} and v_i;
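The insert phase can be sketched in Python as follows. The function name `gk_insert`, the list-of-lists layout `[v, g, Δ]`, and the use of `bisect` to locate the position are choices made for this illustration, not part of the pseudocode.

```python
import bisect

def gk_insert(S, v):
    """Insert v into the ordered summary S, a list of [value, g, delta]."""
    values = [t[0] for t in S]
    i = bisect.bisect_right(values, v)      # position with v_{i-1} <= v < v_i
    if i == 0 or i == len(S):
        delta = 0                           # new minimum or maximum: rank is exact
    else:
        delta = S[i][1] + S[i][2] - 1       # inherit uncertainty of the successor
    S.insert(i, [v, 1, delta])

S = [[5, 1, 0], [10, 4, 2]]
gk_insert(S, 7)
print(S)   # [[5, 1, 0], [7, 1, 5], [10, 4, 2]]
```

Only the new tuple is written; the g and Δ values of the existing tuples are left untouched.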
|
|
Proof:
|
|
Proof:
|
|
So as long as we maintain this property, the information in the summary will allow us to answer any φ-quantile query with ε accuracy.
The delete operation, however, is the most complex part of the algorithm.
I will discuss deleting one tuple first...
|
Proof:
|
|
Proof:
|
|
|
|
*** ε is the error margin (a parameter of the algorithm)
S = {};    // S contains the summary structure, which is:
           //    <(v_0, g_0, Δ_0), (v_1, g_1, Δ_1) ... >
           // NOTE: S is an ordered list !!!
N = 0;     // Number of items processed
while ( not EOS )
{
   /* ---------------------------------------------
      Delete phase:
      executed once every ⌊1/(2×ε)⌋ insertions
      --------------------------------------------- */
   if ( N % ⌊1/(2×ε)⌋ == 0 )
   {
      /* --------------------------------------------------
         Delete unnecessary entries in summary
         (while keeping the smallest and largest elements)
         -------------------------------------------------- */
      for ( i = s-1; i ≥ 2; i = j - 1 )    // s = current number of entries in S
      {
         j = i-1;
         while ( j ≥ 1 && g_j + ... + g_i + Δ_i < 2εN )
         {
            j--;
         }
         j++;      // We went one index too far in the while...
         if ( j < i )
         {
            replace entries j, .., i with the entry (v_i, g_j + ... + g_i, Δ_i);
         }
      }
   }
   /* ------------------------------------
      Insert phase
      ------------------------------------ */
   v = next value in input;
   /* --------------------------------------------
      Find insert position for v in S
      -------------------------------------------- */
   Find a tuple (v_i, g_i, Δ_i) ∈ S such that: v_{i-1} ≤ v < v_i
   if ( v is inserted at the head or tail of S )
      Δ = 0;
   else
      Δ = g_i + Δ_i - 1;     // This is the allowable "wiggle room"
   INSERT "(v, 1, Δ)" into S between v_{i-1} and v_i;
   N++;
}
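Putting the two phases together, the pseudocode can be turned into a short Python sketch. The class name `GKSummary`, its method names, and the use of `bisect` are choices made here for illustration. The `query` method is an assumption as well: it returns a tuple whose rmin and rmax both lie within εN of the requested rank ⌊φN⌋.

```python
import bisect
import math

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.S = []          # ordered list of [v, g, delta]
        self.N = 0           # number of items processed
        self.period = max(1, math.floor(1 / (2 * eps)))

    def _compress(self):
        """Delete phase: merge runs of tuples, keeping the smallest and largest."""
        S, limit = self.S, 2 * self.eps * self.N
        i = len(S) - 1
        while i >= 2:
            j = i - 1
            total = S[i][1] + S[i][2]               # g_i + delta_i
            while j >= 1 and total + S[j][1] < limit:
                total += S[j][1]                    # now g_j + ... + g_i + delta_i
                j -= 1
            j += 1                                  # went one index too far
            if j < i:
                merged_g = sum(t[1] for t in S[j:i + 1])
                S[j:i + 1] = [[S[i][0], merged_g, S[i][2]]]
            i = j - 1

    def insert(self, v):
        if self.N % self.period == 0:
            self._compress()
        values = [t[0] for t in self.S]             # O(s) copy; fine for a sketch
        i = bisect.bisect_right(values, v)          # v_{i-1} <= v < v_i
        if i == 0 or i == len(self.S):
            delta = 0                               # head or tail of S
        else:
            delta = self.S[i][1] + self.S[i][2] - 1
        self.S.insert(i, [v, 1, delta])
        self.N += 1

    def query(self, phi):
        """Assumed query rule: rmin and rmax within eps*N of rank floor(phi*N)."""
        r = max(1, math.floor(phi * self.N))
        tol = self.eps * self.N
        rmin = 0
        for v, g, delta in self.S:
            rmin += g
            if r - rmin <= tol and rmin + delta - r <= tol:
                return v
        return None

gk = GKSummary(0.1)
stream = [(i * 37) % 1009 for i in range(1000)]     # 1000 distinct values
for x in stream:
    gk.insert(x)
print(len(gk.S), "tuples summarize", gk.N, "values")
print("approximate median:", gk.query(0.5))
```

The summary stores far fewer tuples than input values, yet the rank of every answer stays within εN of the requested rank.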
|
|
|