(where SqError(a,b) = the minimum sum of squared errors for the input values xa ... xb, which is also the minimum error of a single-bucket histogram on these values).
Input data:
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19
OPT[k][i]:
   i:  1     2     3     4     5     6     7     8     9     10    11     12     13
 k
 1     0     0.5   2.0   5.0   10.0  17.5  28.0  42.0  60.0  82.5  110.0  143.0  182.0 ...
 2     0     0.0   0.5   1.0   2.5   4.0   7.0   10.0  15.0  20.0  27.5   35.0   45.5 ...
 3     0     0     0.0   0.5   1.0   1.5   3.0   4.5   6.0   9.0   12.0   15.0   20.0 ...
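The k = 1 row of the table follows directly from the definition of SqError. The Python sketch below is not part of the original notes: the name sq_error and the direct computation from the bucket mean are assumptions chosen for illustration.

```python
def sq_error(xs, a, b):
    """Squared error of one bucket covering x_a .. x_b (1-based, inclusive).

    The bucket is represented by its mean, which minimizes the squared error.
    """
    bucket = xs[a - 1:b]
    mean = sum(bucket) / len(bucket)
    return sum((x - mean) ** 2 for x in bucket)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19]

# OPT[1][i] = SqError(1, i): a single bucket over the first i values
opt1 = [sq_error(data, 1, i) for i in range(1, 14)]
print(opt1)
```

The printed list can be compared entry by entry against the k = 1 row.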
|
Example:
Input data:
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19
OPT[k][i]:
   i:  1     2     3     4     5     6     7     8     9     10    11     12     13
 k
 1     0     0.5   2.0   5.0   10.0  17.5  28.0  42.0  60.0  82.5  110.0  143.0  182.0 ...
 2     0     0.0   0.5   1.0   2.5   4.0   7.0   10.0  15.0  20.0  27.5   35.0   45.5 ...
 2'    0     0     0     0     2.5   2.5   7.0   7.0   7.0   20.0  20.0   20.0   20.0
 3     0     0     0.0   0.5   1.0   1.5   3.0   4.5   6.0   9.0   12.0   15.0   20.0 ...
Input data:
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19
OPT[k][i]:
   i:  1     2     3     4     5     6     7     8     9     10    11     12     13
 k
 1     0     0.5   2.0   5.0   10.0  17.5  28.0  42.0  60.0  82.5  110.0  143.0  182.0 ...
 2     0     0.0   0.5   1.0   2.5   4.0   7.0   10.0  15.0  20.0  27.5   35.0   45.5 ...
 2'    0     0     0     0     2.5   2.5   7.0   7.0   7.0   20.0  20.0   20.0   20.0
 Err:  0     0     0.5   1.0   0     1.5   0     3.0   8.0   0     7.5    15.0   25.5
 3     0     0     0.0   0.5   1.0   1.5   3.0   4.5   6.0   9.0   12.0   15.0   20.0 ...
The error can be reduced by using 5 buckets to approximate OPT[2][i]:
Input data:
   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19
OPT[k][i]:
   i:  1     2     3     4     5     6     7     8     9     10    11     12     13
 k
 1     0     0.5   2.0   5.0   10.0  17.5  28.0  42.0  60.0  82.5  110.0  143.0  182.0 ...
 2     0     0.0   0.5   1.0   2.5   4.0   7.0   10.0  15.0  20.0  27.5   35.0   45.5 ...
 2'    0     0     0     0     2.5   2.5   7.0   7.0   7.0   20.0  20.0   35.0   35.0
 Err:  0     0     0.5   1.0   0     1.5   0     3.0   8.0   0     7.5    0.0    10.5
 3     0     0     0.0   0.5   1.0   1.5   3.0   4.5   6.0   9.0   12.0   15.0   20.0 ...
Notation: OPT[k][i] and Sol[k][i]:
Input data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19
OPT[k][i]:
i: 1 2 3 4 5 6 7 8 9 10 11 12 13
k
1 0 0.5 2.0 5.0 10.0 17.5 28.0 42.0 60.0 82.5 110.0 143.0 182.0 ...
--> OPT[2][i] 0 0.0 0.5 1.0 2.5 4.0 7.0 10.0 15.0 20.0 27.5 35.0 45.5 ...
--> Sol[2][i] 0 0 0 0 2.5 2.5 7.0 7.0 7.0 20.0 20.0 35.0 35.0
Err: 0 0 0.5 1.0 0 1.5 0 3.0 8.0 0 7.5 0.0 10.5
3 0 0 0.0 0.5 1.0 1.5 3.0 4.5 6.0 9.0 12.0 15.0 20.0 ...
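The Sol and Err rows can be recomputed from the OPT[2][i] row once the bucket start positions are fixed. The sketch below is illustrative only: the function name step_approximation and the choice of passing the start positions as a set are assumptions, not part of the notes.

```python
# OPT[2][i] for i = 1..13, copied from the table
opt2 = [0, 0.0, 0.5, 1.0, 2.5, 4.0, 7.0, 10.0, 15.0, 20.0, 27.5, 35.0, 45.5]


def step_approximation(row, starts):
    """Approximate a non-decreasing row by a step function.

    starts: 1-based positions where a bucket begins. Every position takes
    the row value found at the start of its bucket. Position 1 must be a start.
    """
    sol = []
    current = None
    for i, value in enumerate(row, start=1):
        if i in starts:
            current = value
        sol.append(current)
    err = [v - s for v, s in zip(row, sol)]
    return sol, err


sol4, err4 = step_approximation(opt2, {1, 5, 7, 10})       # 4 buckets
sol5, err5 = step_approximation(opt2, {1, 5, 7, 10, 12})   # 5 buckets
print(sol5)
print(err5)
print(max(err4), max(err5))
```

Adding the fifth bucket at position 12 lowers the largest approximation error, which is the effect the two tables illustrate.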
|
How to ensure accuracy:
|
Histogram with p buckets used to approximate OPT[p][..]:
x OPT[p][bjp]
|
x OPT[p][ajp] |
| |
| |
| |
----+-----------------------+-------
ajp <-----------------> bjp
|
In other words:
|
OPT[p][bjp] - OPT[p][ajp]
----------------------------- < δ
OPT[p][ajp]
<==> OPT[p][bjp] - OPT[p][ajp] < δ × OPT[p][ajp]
<==> OPT[p][bjp] < (1 + δ) OPT[p][ajp]
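One way to obtain buckets that satisfy this bound is a single left-to-right pass that opens a new bucket whenever the bound would be violated. The sketch below is an assumption-laden illustration: the name build_buckets, the (a, b, sol) tuple layout, the use of a non-strict bound (a value equal to (1 + δ) times the bucket's first value stays in the bucket), and δ = 1 are all chosen here for the example.

```python
def build_buckets(row, delta):
    """Cover a non-decreasing row with buckets (a, b, sol), 1-based inclusive.

    sol is the row value at the left boundary a. A new bucket is opened as
    soon as a value exceeds (1 + delta) * sol of the current bucket.
    """
    buckets = [[1, 1, row[0]]]
    for i in range(2, len(row) + 1):
        value = row[i - 1]
        if value > (1 + delta) * buckets[-1][2]:
            buckets.append([i, i, value])   # bound violated: start a new bucket
        else:
            buckets[-1][1] = i              # bound holds: extend current bucket
    return [tuple(b) for b in buckets]


# OPT[2][i] for i = 1..13 from the running example
opt2 = [0, 0.0, 0.5, 1.0, 2.5, 4.0, 7.0, 10.0, 15.0, 20.0, 27.5, 35.0, 45.5]
buckets = build_buckets(opt2, delta=1.0)
for a, b, sol in buckets:
    print(a, b, sol)
```

Because the values grow by more than a factor (1 + δ) from one bucket to the next, the number of buckets grows only logarithmically with the largest value in the row (apart from the leading zeros).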
|
Requirement:
|
Input data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19
OPT[k][i]:
i: 1 2 3 4 5 6 7 8 9 10 11 12 13
k
1 0 0.5 2.0 5.0 10.0 17.5 28.0 42.0 60.0 82.5 110.0 143.0 182.0 ...
--> OPT[2][i] 0 0.0 0.5 1.0 2.5 4.0 7.0 10.0 15.0 20.0 27.5 35.0 45.5 ...
--> Sol[2][i] 0 0.0 0.5 0.5 2.5 2.5 7.0 7.0 15.0 15.0 27.5 27.5 27.5 ...
+------+ +-------+ +---------+ +---------++----------+ +-----------------+
|
|
Problem: construct an approximate V-optimal histogram with B = 3 buckets
Histogram with 1 bucket:
Values: 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
-------+------+------+------+------+------+------+------+---
Min Error: 0.0 | 2.0 | 2.0 | 8.75 | 10.0 | 13.3 | 63.7 | 161.5|
|
Input: 4 2 3 6 5 6 12 16 |
[ 4 2 ] [ 3 ] ===> MinError[1][2] + 0
[ 4 ] [ 2 3 ] ===> MinError[1][1] + (2 - 2.5)^2 + (3 - 2.5)^2
| |
+-------+
1 bucket optimal
histogram
Using the result from the 1 bucket optimal histogram:
[ 4 2 ] [ 3 ] ===> 2.0 + 0 = 2.0
[ 4 ] [ 2 3 ] ===> 0.0 + 0.5 = 0.5 <---- Min
|
Result: 0.5 > 2 * 0.0 --> start a new bucket with value = 0.5
Input: 4 2 3 6 5 6 12 16 |
[ 4 2 3 ] [ 6 ] ===> MinError[1][3] + 0
[ 4 2 ] [ 3 6 ] ===> MinError[1][2] + (3 - 4.5)^2 + (6 - 4.5)^2
[ 4 ] [ 2 3 6 ] ===> MinError[1][1] + (2 - 3.66)^2 + (3 - 3.66)^2 + (6 - 3.66)^2
| |
+---------+
1 bucket optimal
histogram
Using the result from the 1 bucket optimal histogram:
[ 4 2 3 ] [ 6 ] ===> 2.0 + 0 = 2.0 <--- Min
[ 4 2 ] [ 3 6 ] ===> 2.0 + 4.5 = 6.5
[ 4 ] [ 2 3 6 ] ===> 0.0 + 8.666 = 8.666
|
Result: 2.0 > 2 * 0.5 --> start a new bucket with value = 2.0
Input: 4 2 3 6 5 6 12 16 |
[ 4 2 3 6 ] [ 5 ] ===> MinError[1][4] + 0
[ 4 2 3 ] [ 6 5 ] ===> MinError[1][3] + (6 - 5.5)^2 + (5 - 5.5)^2
[ 4 2 ] [ 3 6 5 ] ===> MinError[1][2] + (3 - 4.66)^2 + (6 - 4.66)^2 + (5 - 4.66)^2
[ 4 ] [ 2 3 6 5 ] ===> MinError[1][1] + (2 - 4)^2 + (3 - 4)^2 + (6 - 4)^2 + (5 - 4)^2
| |
+-----------+
1 bucket optimal
histogram
Using the result from the 1 bucket optimal histogram:
[ 4 2 3 6 ] [ 5 ] ===> 8.75 + 0 = 8.75
[ 4 2 3 ] [ 6 5 ] ===> 2.0 + 0.5 = 2.5 <--- Min
[ 4 2 ] [ 3 6 5 ] ===> 2.0 + 4.666 = 6.666
[ 4 ] [ 2 3 6 5 ] ===> 0.0 + 10 = 10
|
Result: 2.5 < 2 * 2.0 --> use 2.0 to approximate (extend the current bucket)
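The steps above can be replayed for every prefix of the input. The sketch below is a hedged illustration, not the notes' own code: sq_error and exact_two_bucket_error are names introduced here, and δ = 1 is assumed because the walk-through compares each new error against 2 times the current bucket value.

```python
def sq_error(xs, a, b):
    """Squared error of one bucket covering x_a .. x_b (1-based, inclusive)."""
    bucket = xs[a - 1:b]
    mean = sum(bucket) / len(bucket)
    return sum((x - mean) ** 2 for x in bucket)


def exact_two_bucket_error(xs, i):
    """min over j of MinError[1][j] + SqError(j+1, i), with last bucket [j+1..i]."""
    return min(sq_error(xs, 1, j) + sq_error(xs, j + 1, i) for j in range(1, i))


data = [4, 2, 3, 6, 5, 6, 12, 16]
delta = 1.0          # assumed: (1 + delta) = 2, as in the walk-through

current = 0.0        # value of the current bucket of the approximation
approx = [current]   # approximate 2-bucket error for the prefix 1..1
for i in range(2, len(data) + 1):
    err = exact_two_bucket_error(data, i)
    if err > (1 + delta) * current:
        current = err            # start a new bucket with this value
    approx.append(current)       # otherwise keep using the current value

print([round(v, 1) for v in approx])
```

For i = 3, 4 and 5 the loop makes the same decisions as the three steps worked out by hand.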
Important difference between Guha's algorithm and Jagadish's V-optimal histogram:
|
Input: 4 2 3 6 5 6 12 16 |
{ 4 2 3 } [ 6 ] ===> MinError[2][3] + 0
{ 4 2 } [ 3 6 ] ===> MinError[2][2] + (3 - 4.5)^2 + (6 - 4.5)^2
{ 4 } [ 2 3 6 ] ===> MinError[2][1] + (2 - 3.66)^2 + (3 - 3.66)^2 + (6 - 3.66)^2
| |
+---------+
2 bucket optimal
histogram
Using the result from the 2 bucket optimal histogram:
{ 4 2 3 } [ 6 ] ===> 0.5 + 0 = 0.5 <---- Min
{ 4 2 } [ 3 6 ] ===> 0.0 + 4.5 = 4.5
{ 4 } [ 2 3 6 ] ===> 0.0 + 8.666 = 8.666
|
Result: 0.5 > (1 + δ) * 0.0 ==> Start a new bucket !
Input: 4 2 3 6 5 6 12 16 |
{ 4 2 3 6 } [ 5 ] ===> MinError[2][4] + 0
{ 4 2 3 } [ 6 5 ] ===> MinError[2][3] + (6 - 5.5)^2 + (5 - 5.5)^2
{ 4 2 } [ 3 6 5 ] ===> MinError[2][2] + (3 - 4.66)^2 + (6 - 4.66)^2 + (5 - 4.66)^2
{ 4 } [ 2 3 6 5 ] ===> MinError[2][1] + (2 - 4)^2 + (3 - 4)^2 + (6 - 4)^2 + (5 - 4)^2
| |
+-----------+
2 bucket optimal
histogram
Using the result from the 2 bucket optimal histogram:
{ 4 2 3 6 } [ 5 ] ===> 2.0 + 0 = 2.0
{ 4 2 3 } [ 6 5 ] ===> 0.5 + 0.5 = 1.0 <--- Min
{ 4 2 } [ 3 6 5 ] ===> 0.0 + 4.666 = 4.666
{ 4 } [ 2 3 6 5 ] ===> 0.0 + 10.0 = 10.0
|
Result: 1.0 <= (1 + δ) * 0.5 - extend the current bucket
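The two 3-bucket steps above read the 2-bucket errors from a stored approximation rather than recomputing them. The sketch below illustrates that lookup under stated assumptions: the list q2 encodes the approximate 2-bucket errors of these notes as hypothetical (a, b, sol) records, and the names lookup and three_bucket_step are introduced here for illustration.

```python
from bisect import bisect_right

# Approximate 2-bucket errors as (a, b, sol) records: every position in
# a..b is assigned the value sol (assumed encoding of the notes' values).
q2 = [(1, 2, 0.0), (3, 3, 0.5), (4, 6, 2.0), (7, 8, 13.3)]


def lookup(q, j):
    """Return sol of the bucket in q that contains position j (1-based)."""
    starts = [a for a, _, _ in q]
    return q[bisect_right(starts, j) - 1][2]


def sq_error(xs, a, b):
    """Squared error of one bucket covering x_a .. x_b (1-based, inclusive)."""
    bucket = xs[a - 1:b]
    mean = sum(bucket) / len(bucket)
    return sum((x - mean) ** 2 for x in bucket)


def three_bucket_step(xs, q, i):
    """min over j of Lookup(Q[2], j) + SqError(j+1, i)."""
    return min(lookup(q, j) + sq_error(xs, j + 1, i) for j in range(1, i))


data = [4, 2, 3, 6, 5, 6, 12, 16]
print(three_bucket_step(data, q2, 4))
print(three_bucket_step(data, q2, 5))
```

The lookup uses a binary search over the bucket start positions, so each probe costs a logarithmic number of comparisons in the number of stored buckets.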
+-----------------------+
| a (left boundary) |
+-----------------------+
| b (right boundary) |
+-----------------------+
| Sol (histogram value) |
+-----------------------+
|
Approximate V-optimal histogram with 2 buckets:
Values: 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
-------+------+------+------+------+------+------+------+---
Appr Error: 0.0 | 0.0 | 0.5 | 2.0 | 2.0 | 2.0 | 13.3 | 13.3 |
|
is represented as follows:
|
/* ------------------------------------------------
Help function to compute Error in a bucket
------------------------------------------------ */
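SqError(int a, int b)     // Error of a single bucket on items (a..b)
{
s2 = PP[b] - PP[a-1];     // sum of x[j]^2 for j = a..b
s1 = P[b] - P[a-1];       // sum of x[j]   for j = a..b
return (s2 - s1*s1/(b-a+1));
}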
/* ----------------------------------------------
Prepare arrays to compute error efficiently
---------------------------------------------- */
P[0] = 0;
PP[0] = 0;
for (i = 1; i <= N; i++)
{
P[i] = P[i-1] + x[i];          // P[i]  = x[1] + ... + x[i]
PP[i] = PP[i-1] + x[i]*x[i];   // PP[i] = x[1]^2 + ... + x[i]^2
}
// Note: We don't approximate OPT[1][..]
// OPT[1][..] is computed exactly using SqError(a,b)
/* ------------------------------------------------------------------
Now we compute the approximate V-opt. histogram with B buckets
Output:
BestError[i][k] = best error of histogram
using k buckets
on data points (1..i)
------------------------------------------------------------------ */
// The dynamic algorithm uses these variables:
//
// k = # buckets
// i = current item - items processed are: (1..i)
// Initialization
/* -----------------------------------------------------
Set up (B-1) linked lists Q[k] (k = 2..B) to store
the approximate solution histograms
Each Q[k] is a linked list of records of the form:
Q[k].a = left point of bucket
Q[k].b = right point of bucket
Q[k].Sol = approximate solution at point Q[k].a
----------------------------------------------------- */
for (k = 2; k <= B; k++)
{
Q[k] -> (a = 1, b = 1, Sol = 0); // List element
}
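for (k = 2; k <= B; k++)      // k = 1 is exact: OPT[1][i] = SqError(1,i)
{
// Find approximate optimal histogram using k buckets
currApprox = Q[k] -> Sol      // initially equal to 0
for (i = 2; i <= N; i++)      // item 1 is covered by the initial bucket
{
minError = INFINITE; // Start value
// Try every possible size for the last bucket
for (j = 1; j <= i-1; j++) // Last bucket is [j+1..i]
{
// Error of the first (k-1) buckets on items (1..j):
// exact for k = 2, looked up in Q[k-1] otherwise
prev = (k == 2) ? SqError(1,j) : Lookup(Q[k-1], j);
if ( prev + SqError(j+1,i) < minError )
{
minError = prev + SqError(j+1,i);
// Better division found
}
}
if ( minError > (1 + δ) * currApprox )
{
Add new bucket (a = i, b = i, Sol = minError)
to Q[k]
currApprox = minError;     // the new bucket is now the current one
}
else
{
Extend the last bucket of Q[k]: b = i
}
}
}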
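A runnable version of the procedure, assembled from the pieces above, might look as follows. This is a sketch under assumptions rather than a reference implementation: it tries every split point j, computes the 1-bucket errors exactly, stores each Q[k] as a Python list of [a, b, sol] records instead of a linked list, and uses the hypothetical function name approx_vopt.

```python
from bisect import bisect_right


def approx_vopt(xs, B, delta):
    """Return Q[B] as (a, b, sol) records: approximate B-bucket errors on prefixes."""
    if B < 2:
        raise ValueError("B must be at least 2; OPT[1][i] is SqError(1, i)")
    n = len(xs)
    P = [0.0] * (n + 1)    # P[i]  = x_1 + ... + x_i
    PP = [0.0] * (n + 1)   # PP[i] = x_1^2 + ... + x_i^2
    for i, x in enumerate(xs, start=1):
        P[i] = P[i - 1] + x
        PP[i] = PP[i - 1] + x * x

    def sq_error(a, b):
        s1 = P[b] - P[a - 1]
        s2 = PP[b] - PP[a - 1]
        return s2 - s1 * s1 / (b - a + 1)

    def lookup(q, j):
        starts = [rec[0] for rec in q]
        return q[bisect_right(starts, j) - 1][2]

    prev_q = None                       # k - 1 = 1 is handled exactly
    for k in range(2, B + 1):
        q = [[1, 1, 0.0]]               # initial bucket, Sol = 0
        for i in range(2, n + 1):
            best = min(
                (sq_error(1, j) if prev_q is None else lookup(prev_q, j))
                + sq_error(j + 1, i)    # last bucket is [j+1..i]
                for j in range(1, i)
            )
            if best > (1 + delta) * q[-1][2]:
                q.append([i, i, best])  # start a new bucket
            else:
                q[-1][1] = i            # extend the current bucket
        prev_q = q
    return [tuple(rec) for rec in prev_q]


data = [4, 2, 3, 6, 5, 6, 12, 16]
q2 = approx_vopt(data, B=2, delta=1.0)
q3 = approx_vopt(data, B=3, delta=1.0)
print(q2)
print(q3)
```

With δ = 1 the records in q2 correspond to the approximate 2-bucket errors listed earlier for this input; a smaller δ produces more records and a tighter approximation.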