|
Input set:     11 21 24 61 81 39 89 56 12 51
After sorting: 11 12 21 24 39 51 56 61 81 89

The 0.1-quantile = 11
The 0.2-quantile = 12
etc.

Special case: The median = 0.5-quantile = 39
Then the sorted elements are scanned to find the one at position ⌊ φ × N ⌋
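The sort-then-index procedure above can be sketched in a few lines of Python. The function name `exact_quantile` is made up for this illustration, and the small `1e-9` guard against floating-point rounding is an added assumption, not part of the original description.

```python
import math

def exact_quantile(values, phi):
    """Return the element at 1-based position floor(phi * N) of the sorted input."""
    ordered = sorted(values)
    n = len(ordered)
    # The tiny offset guards against products such as 0.29 * 100 that land
    # just below an integer; the clamp keeps very small phi at position 1.
    position = max(1, math.floor(phi * n + 1e-9))
    return ordered[position - 1]

data = [11, 21, 24, 61, 81, 39, 89, 56, 12, 51]
print(exact_quantile(data, 0.1))  # 11
print(exact_quantile(data, 0.2))  # 12
print(exact_quantile(data, 0.5))  # 39 (the median)
```

This exact method needs the whole input in memory and sorted, which is what a streaming summary tries to avoid.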
|
|
|
Input set:     11 21 24 61 81 39 89 56 12 51
After sorting: 11 12 21 24 39 51 56 61 81 89

Special case: The median = 0.5-quantile = 39
The ε-approximate of the 0.5-quantile = {24, 39, 51}
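As a rough illustration, the set of acceptable answers can be computed by widening the target position by εN ranks on each side. This sketch assumes ε = 0.1, which is the value that yields {24, 39, 51} for N = 10; the function name `approximate_quantile_set` is hypothetical.

```python
import math

def approximate_quantile_set(values, phi, eps):
    """All elements whose rank lies within eps*N of position floor(phi*N)."""
    ordered = sorted(values)
    n = len(ordered)
    target = math.floor(phi * n + 1e-9)   # exact 1-based position
    slack = math.floor(eps * n + 1e-9)    # allowed rank error
    lo = max(1, target - slack)
    hi = min(n, target + slack)
    return ordered[lo - 1:hi]

data = [11, 21, 24, 61, 81, 39, 89, 56, 12, 51]
print(approximate_quantile_set(data, 0.5, 0.1))  # [24, 39, 51]
```

Any one of the returned elements is a valid ε-approximate answer.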
|
|
|
Example: ε = 0.1
Input set:     11 21 24 61 81 39 89 56 12 51
After sorting: 11 12 21 24 39 51 56 61 81 89
                     ^
                     |
                     #3
|
Input: 45 89 98 12 13 55 14 24 26

After sorting:

   Rank:  |  1  2  3  4  5  6  7  8  9
   -------+----------------------------
   Input: | 12 13 14 24 26 45 55 89 98
Goal:
|
(Because every possible element can be queried and you cannot make any error !)
Original input:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: | 12 13 14 24 26 45 55 89 98

Retain only these elements:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: |    13       26       89

Approximate answers to quantile queries:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: | 13 13 13 26 26 26 89 89 89
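The thinning in this example can be reproduced with a short sketch: keep one element out of every 2e+1 consecutive ranks (here e = 1, so ranks 2, 5 and 8), then answer a rank query with the nearest retained rank. The names `thin_out` and `approx_at_rank` are invented for this illustration, and the sketch assumes N is a multiple of 2e+1, as it is in the example.

```python
def thin_out(values, e):
    """Keep (rank, value) pairs at ranks e+1, e+1+(2e+1), ... of the sorted input."""
    ordered = sorted(values)
    return [(rank, ordered[rank - 1])
            for rank in range(e + 1, len(ordered) + 1, 2 * e + 1)]

def approx_at_rank(kept, r):
    """Answer a query for rank r with the retained element of nearest rank."""
    rank, value = min(kept, key=lambda rv: abs(rv[0] - r))
    return value

data = [45, 89, 98, 12, 13, 55, 14, 24, 26]
kept = thin_out(data, 1)
print(kept)                                           # [(2, 13), (5, 26), (8, 89)]
print([approx_at_rank(kept, r) for r in range(1, 10)])
# [13, 13, 13, 26, 26, 26, 89, 89, 89]
```

Every answer is off by at most one rank, while only a third of the elements are stored.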
Example usage:
|
Conclusion:
|
(you will see this fact when we discuss the algorithm)
Original input:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: | 12 13 14 24 26 45 55 89 98

Retain only these elements:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: |    13       26       89

Approximate answers to quantile queries:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: | 13 13 13 26 26 26 89 89 89
( [v1,min1,max1], [v2,min2,max2], ..., [vm,minm,maxm] )

where:
   vi   = the value that covers the φ-quantile range
   mini = start position of the φ-quantile range
   maxi = ending position of the φ-quantile range

Suppose the input stream is:

   12 13 14 24 26 45 55 89 98 ...(more data coming)

(For ease of understanding, here is the sorted list of the input numbers:

   12 13 14 24 26 45 55 89 98 )

The algorithm represents the current state with:

   [13, 1, 3] [26, 4, 6] [89, 7, 9]
Now suppose the next arriving value is 17:
The input stream is now:

   12 13 14 24 26 45 55 89 98 17 ...(more data coming)

(For ease of understanding, here is the sorted list of the input numbers:

   12 13 14 17 24 26 45 55 89 98 )

The algorithm must modify the state in the data structure to:

   [13, 1, 3] [17, 4, 4] [26, 5, 7] [89, 8, 10]
              ^^^^^^^^^^      ^^^^^      ^^^^^
              inserts 17  but must also change indices in later entries !!!

This data structure requires a large number of operations per inserted value: every entry after the insertion point must have both of its positions updated.
Although it is useful, it is not efficient.
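To make the cost concrete, here is a minimal sketch of inserting into the absolute-position representation. The function `insert_absolute` and its return value (the number of entries that had to be shifted) are assumptions made for this illustration; the new value is simply given the single position right after its predecessor's range, which is enough to reproduce the example.

```python
def insert_absolute(state, v):
    """state: list of [value, min_pos, max_pos] sorted by value.
    Inserts v and returns how many existing entries had to be updated."""
    idx = 0
    while idx < len(state) and state[idx][0] <= v:
        idx += 1
    # The new entry covers one position, right after the previous range.
    start = state[idx - 1][2] + 1 if idx > 0 else 1
    shifted = 0
    for entry in state[idx:]:          # every later entry moves up by one
        entry[1] += 1
        entry[2] += 1
        shifted += 1
    state.insert(idx, [v, start, start])
    return shifted

state = [[13, 1, 3], [26, 4, 6], [89, 7, 9]]
print(insert_absolute(state, 17))      # 2 entries had to be rewritten
print(state)  # [[13, 1, 3], [17, 4, 4], [26, 5, 7], [89, 8, 10]]
```

In the worst case (a new minimum) all m entries are touched, so each insertion costs O(m) updates.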
( [v1,g1], [v2,g2], ..., [vm,gm] )

where:
   vi = the value that covers the φ-quantile range
   gi = number of positions covered by the value
Now suppose the next arriving value is 17:
The input stream is now:

   12 13 14 24 26 45 55 89 98 17 ...(more data coming)

(For ease of understanding, here is the sorted list of the input numbers:

   12 13 14 17 24 26 45 55 89 98 )

The algorithm must modify the state in the data structure to:

   [13, 3] [17, 1] [26, 3] [89, 3]
           ^^^^^^^     ^^^     ^^^
           inserts 17  but the other information does not need to be updated !!!
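A small sketch of the [v, g] representation shows why it is cheaper: an insertion touches no other entry, and the absolute positions can still be recovered on demand as running sums of the g values. The names `insert_relative` and `ranges` are hypothetical helpers for this illustration.

```python
from itertools import accumulate

def insert_relative(state, v):
    """state: list of [value, g] sorted by value. No other entry is modified."""
    idx = 0
    while idx < len(state) and state[idx][0] <= v:
        idx += 1
    state.insert(idx, [v, 1])

def ranges(state):
    """Recover (value, first_position, last_position) from the g counts."""
    ends = list(accumulate(g for _, g in state))
    return [(value, end - g + 1, end) for (value, g), end in zip(state, ends)]

state = [[13, 3], [26, 3], [89, 3]]
insert_relative(state, 17)
print(state)          # [[13, 3], [17, 1], [26, 3], [89, 3]]
print(ranges(state))  # [(13, 1, 3), (17, 4, 4), (26, 5, 7), (89, 8, 10)]
```

The positions printed by `ranges` match the absolute-position state shown earlier, but they are computed only when needed.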
How to read the data structure:
|
|
Original input:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: | 12 13 14 24 26 45 55 89 98

Retain only these elements:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: |    13       26       89

Coverage provided by each entry:

   Rank:   |  1  2  3  4  5  6  7  8  9
   --------+----------------------------
   Ranked: | 13 13 13 26 26 26 89 89 89
|
Example: in the above summary
Graphically:
|
( [v0,g0,Δ0], [v1,g1,Δ1], [v2,g2,Δ2], ..., [vs-1,gs-1,Δs-1] )

where:
   vi = the value that covers the φ-quantile range
   gi = see definition above
   Δi = see definition above
|
|
|
|
(v0, g0, Δ0)   (v1, g1, Δ1)   (v2, g2, Δ2)
(5,  1,  4)    (7,  3,  3)    (10, 4,  0)

rmin(v0) = 1            rmax(v0) = 1 + 4 = 5
rmin(v1) = 1 + 3 = 4    rmax(v1) = 4 + 3 = 7
rmin(v2) = 4 + 4 = 8    rmax(v2) = 8 + 0 = 8
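The arithmetic in this example follows one pattern: rmin(vi) is the running sum g0 + ... + gi, and rmax(vi) = rmin(vi) + Δi. A short sketch, with `rank_bounds` as a made-up helper name, computes both bounds for every tuple.

```python
from itertools import accumulate

def rank_bounds(summary):
    """summary: list of (v, g, delta). Returns (v, rmin, rmax) for each tuple."""
    rmins = list(accumulate(g for _, g, _ in summary))
    return [(v, rmin, rmin + delta)
            for (v, _, delta), rmin in zip(summary, rmins)]

summary = [(5, 1, 4), (7, 3, 3), (10, 4, 0)]
for v, rmin, rmax in rank_bounds(summary):
    print(f"v={v}: rmin={rmin}, rmax={rmax}")
# v=5: rmin=1, rmax=5
# v=7: rmin=4, rmax=7
# v=10: rmin=8, rmax=8
```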
|
|
|
Case 1: r > n-e
|
Case 2: r ≤ n-e
|
|
|
N = 0;

while ( not EOS )
{
   /* -----------------------
      Delete phase
      ----------------------- */
   if ( N mod ( 1/(2 ε) ) == 0 )
      delete elements from summary;

   v = next value in stream;

   /* -----------------------
      Insert phase
      ----------------------- */
   insert v into summary;
   N++;
}
Important:
|
v = next value in input;

/* --------------------------------------------
   Find insert position for v in S
   -------------------------------------------- */
Find a tuple (vi, gi, Δi) ∈ S such that:   vi-1 ≤ v < vi

if ( v < v0 || v > vs-1 )
   Δ = 0;               // New min or max value
else
   Δ = gi + Δi - 1;

INSERT "(v, 1, Δ)" into S between vi-1 and vi;
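The insert step can be sketched in Python as follows. The function name `gk_insert` and the sample tuples are invented for this illustration; `bisect_right` is used as one way to locate the successor tuple vi with vi-1 ≤ v < vi.

```python
import bisect

def gk_insert(summary, v):
    """summary: list of [value, g, delta] sorted by value; inserts [v, 1, delta]."""
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)      # index of the first value > v
    if i == 0 or i == len(summary):
        delta = 0                           # new minimum or maximum
    else:
        g_i, delta_i = summary[i][1], summary[i][2]
        delta = g_i + delta_i - 1           # inherit the successor's uncertainty
    summary.insert(i, [v, 1, delta])        # lands between v_{i-1} and v_i

summary = [[5, 1, 0], [10, 3, 2]]           # made-up state for the demo
gk_insert(summary, 7)                       # interior insert
gk_insert(summary, 3)                       # new minimum
gk_insert(summary, 12)                      # new maximum
print(summary)
# [[3, 1, 0], [5, 1, 0], [7, 1, 4], [10, 3, 2], [12, 1, 0]]
```

Note that the existing tuples are left untouched; only the new tuple is created.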
|
Proof:
|
|
Proof:
|
|
So as long as we maintain this property, the information in the summary will allow us to answer any φ-quantile query with ε accuracy
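A query against the summary can be sketched as below. This assumes that the property in question is the usual Greenwald-Khanna invariant gi + Δi ≤ 2εN for every tuple, and that the first and last tuples are the exact minimum and maximum; the function name `gk_query` and the sample summary are made up for the illustration.

```python
def gk_query(summary, r, eps, n):
    """Return a value whose true rank is within eps*n of rank r, or None.
    summary: list of (v, g, delta) sorted by value."""
    tol = eps * n
    rmin = 0
    for v, g, delta in summary:
        rmin += g                      # lower bound on the rank of v
        rmax = rmin + delta            # upper bound on the rank of v
        if r - rmin <= tol and rmax - r <= tol:
            return v
    return None                        # only reachable if the invariant is broken

# Hypothetical summary of the sorted input 12 13 14 24 26 45 55 89 98 (N = 9).
summary = [(12, 1, 0), (14, 2, 0), (26, 2, 0), (55, 2, 0), (98, 2, 0)]
print([gk_query(summary, r, eps=0.2, n=9) for r in range(1, 10)])
# [12, 12, 14, 14, 26, 26, 55, 55, 98]
```

Each answer has a true rank within 1 of the requested rank, which is inside the allowed error εN = 1.8.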
But it is also the most complex part of the algorithm
I will discuss deleting one tuple first...
|
Proof:
|
|
Proof:
|
|
|
|
*** ε is the error margin (a parameter of the algorithm)

S = {};       // S contains the summary structure, which is:
              //    <(v0, g0, Δ0), (v1, g1, Δ1) ... >
              // NOTE: S is an ordered list !!!
N = 0;        // Number of items processed

while ( not EOS )
{
   /* ---------------------------------------------
      Delete phase:
      executed once every 1/(2×ε) insertions
      --------------------------------------------- */
   if ( N % ⌊1/(2×ε)⌋ == 0 )
   {
      /* --------------------------------------------------
         Delete unnecessary entries in summary
         (while keeping the smallest and largest elements)
         -------------------------------------------------- */
      for ( i = s-1; i ≥ 2; i = j - 1 )
      {
         j = i-1;

         while ( j ≥ 1 && gj + ... + gi + Δi < 2εN )
         {
            j--;
         }

         j++;      // We went one index too far in the while...

         if ( j < i )
         {
            replace entries j, .., i with the entry (vi, gj + ... + gi, Δi);
         }
      }
   }

   /* ------------------------------------
      Insert phase
      ------------------------------------ */
   v = next value in input;

   /* --------------------------------------------
      Find insert position for v in S
      -------------------------------------------- */
   Find a tuple (vi, gi, Δi) ∈ S such that:   vi-1 ≤ v < vi

   if ( v is inserted at the head or tail of S )
      Δ = 0;
   else
      Δ = gi + Δi - 1;     // This is the allowable "wiggle room"

   INSERT "(v, 1, Δ)" into S between vi-1 and vi;

   N++;
}
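The pseudocode above can be turned into a runnable Python sketch. The class name `GKSummary`, its method names, and the `query` method are assumptions added for illustration; in particular, `query` picks the tuple whose rank bounds deviate least from the requested rank, which is one simple way to use the summary rather than something the pseudocode specifies.

```python
import bisect
import math
import random

class GKSummary:
    """Sketch of the summary S: an ordered list of [value, g, delta] entries."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0                                       # items processed so far
        self.s = []                                      # the summary S
        self.period = max(1, math.floor(1 / (2 * eps)))  # delete-phase interval

    def _delete_phase(self):
        s, limit = self.s, 2 * self.eps * self.n
        i = len(s) - 1
        while i >= 2:
            j = i - 1
            total = s[i][1] + s[i][2]                    # g_i + delta_i
            # Absorb entries i-1, i-2, ... while the merged entry stays below
            # the limit; index 0 (the minimum) is never absorbed.
            while j >= 1 and total + s[j][1] < limit:
                total += s[j][1]
                j -= 1
            j += 1                                       # lowest index actually merged
            if j < i:
                s[j:i + 1] = [[s[i][0], total - s[i][2], s[i][2]]]
            i = j - 1

    def insert(self, v):
        if self.n % self.period == 0:
            self._delete_phase()
        i = bisect.bisect_right([t[0] for t in self.s], v)  # v_{i-1} <= v < v_i
        if i == 0 or i == len(self.s):
            delta = 0                                    # new minimum or maximum
        else:
            delta = self.s[i][1] + self.s[i][2] - 1      # the "wiggle room"
        self.s.insert(i, [v, 1, delta])
        self.n += 1

    def query(self, phi):
        """Return the stored value whose rank bounds are closest to floor(phi*N)."""
        r = min(self.n, max(1, math.floor(phi * self.n)))
        best, best_err, rmin = None, None, 0
        for v, g, delta in self.s:
            rmin += g
            err = max(r - rmin, rmin + delta - r)        # worst-case rank error
            if best_err is None or err < best_err:
                best, best_err = v, err
        return best

random.seed(0)
stream = list(range(1, 1001))      # the value k has true rank k
random.shuffle(stream)
gk = GKSummary(eps=0.05)
for x in stream:
    gk.insert(x)
print(len(gk.s), gk.query(0.5))
```

Because the demo stream is a permutation of 1..1000, the returned value can be compared directly with the requested rank: for the median it should lie within εN = 50 of 500, while the summary holds fewer tuples than the number of items seen.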
|
|