In other words, after you construct the histogram, when someone asks you for the value of an item in bucket $b_r$ (say $x_{s+1}$), your answer must be based on the histogram, and therefore you will answer $h_r$.
It is common to use $|h_r - x_{s+1}|$ (the absolute value) or $(h_r - x_{s+1})^2$ as the expression for the error.
The expression $(h_r - x_{s+1})^2$ is preferred because it is differentiable everywhere (the absolute value is not differentiable at 0).
$$\mathrm{SqError}(b_r) = \sum_{k=s_r}^{e_r} (h_r - x_k)^2$$
[Figure: what the error $\mathrm{SqError}(b_r)$ represents]
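As a small illustration of the bucket error, here is a minimal Python sketch. The function name `bucket_sq_error`, the sample data, and the 0-based inclusive indices are assumptions made for this example only (the text itself numbers items from 1).

```python
def bucket_sq_error(xs, s, e, h):
    """Squared error of representing items xs[s..e] (inclusive, 0-based) by the value h."""
    return sum((h - xs[k]) ** 2 for k in range(s, e + 1))


xs = [1, 3, 5, 11]  # hypothetical data

# Bucket over the first three items, represented by h = 3:
# (3-1)^2 + (3-3)^2 + (3-5)^2 = 8
print(bucket_sq_error(xs, 0, 2, 3))
```

Trying other values of `h` for the same bucket shows that the error changes with `h`, which is what motivates the question that follows.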
What value should $h_r$ have so that $\mathrm{SqError}(b_r)$ is minimized?
$$\mathrm{SqError}(\text{histogram}) = \sum_{r=1}^{B} \left( \sum_{k=s_r}^{e_r} (h_r - x_k)^2 \right)$$
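The total error is just the bucket errors added up. The sketch below assumes a hypothetical representation of a histogram as a list of `(s, e, h)` triples (0-based inclusive range plus bucket value); neither that representation nor the sample data comes from the text.

```python
def histogram_sq_error(xs, buckets):
    """Total squared error of a histogram.

    buckets is a list of (s, e, h): items xs[s..e] (inclusive, 0-based)
    are all represented by the single value h.
    """
    total = 0
    for s, e, h in buckets:
        total += sum((h - xs[k]) ** 2 for k in range(s, e + 1))
    return total


xs = [1, 3, 5, 11, 12, 13]  # hypothetical data

# Two buckets: [1, 3, 5] represented by 3 and [11, 12, 13] represented by 12.
# Bucket errors are 8 and 2, so the total is 10.
print(histogram_sq_error(xs, [(0, 2, 3), (3, 5, 12)]))
```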
What are the ranges of the $B$ buckets so that $\mathrm{SqError}(\text{histogram})$ is minimized?
So, let us now assume that the optimum bucket division has been found already (so the bucket ranges are fixed)...
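$$\mathrm{SqError}(\text{histogram}) = \sum_{r=1}^{B} \left( \sum_{k=s_r}^{e_r} (h_r - x_k)^2 \right)$$

is in fact the following sum:

$$\mathrm{SqError}(\text{histogram}) = \sum_{k=s_1}^{e_1} (h_1 - x_k)^2 + \sum_{k=s_2}^{e_2} (h_2 - x_k)^2 + \cdots + \sum_{k=s_B}^{e_B} (h_B - x_k)^2$$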
$$\mathrm{SqError}(\text{histogram}) = \sum_{k=s_1}^{e_1} (h_1 - x_k)^2 + \sum_{k=s_2}^{e_2} (h_2 - x_k)^2 + \cdots + \sum_{k=s_B}^{e_B} (h_B - x_k)^2$$

is minimized if the individual sums are minimized.
(This sounds subtle, but just think how much more complex the minimization process would be if the boundaries were NOT fixed.
When the boundaries are not fixed, some values of $x_k$ are subtracted from different $h_r$'s and we could not make the above statement.)
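Let $S_r$ be the error made in bucket $b_r$:

$$S_r = \sum_{k=s_r}^{e_r} (h_r - x_k)^2$$

It may be clearer if you write the sum out:

$$S_r = (h_r - x_{s_r})^2 + (h_r - x_{s_r+1})^2 + (h_r - x_{s_r+2})^2 + \cdots + (h_r - x_{e_r})^2$$

To simplify the notation, rename the $p$ items in the bucket to $c_1, c_2, \ldots, c_p$:

$$S_r = (h_r - c_1)^2 + (h_r - c_2)^2 + (h_r - c_3)^2 + \cdots + (h_r - c_p)^2$$

Since $S_r$ depends only on $h_r$, we find that the derivative is (if it helps, imagine $h_r$ to be "x"):

$$S'_r = 2 (h_r - c_1) + 2 (h_r - c_2) + 2 (h_r - c_3) + \cdots + 2 (h_r - c_p)$$

$S_r$ is minimized when the derivative is zero:

$$2 (h_r - c_1) + 2 (h_r - c_2) + 2 (h_r - c_3) + \cdots + 2 (h_r - c_p) = 0$$

or:

$$p \times h_r = c_1 + c_2 + c_3 + \cdots + c_p$$

thus:

$$h_r = (c_1 + c_2 + c_3 + \cdots + c_p)/p$$

Now we replace back the clumsy notation:

$$h_r = (x_{s_r} + x_{s_r+1} + x_{s_r+2} + \cdots + x_{e_r})/p$$

So when $h_r$ is the average of the $p = e_r - s_r + 1$ items in bucket $b_r$, the error made in bucket $b_r$ is minimized.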
(All we know so far is that, given a certain bucket division of the histogram, we can find the optimal value $h_r$ for each bucket.)
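The claim that the average minimizes the squared error of a bucket can be checked numerically. The sketch below is only a sanity check on hypothetical data, not a proof: it compares the error at the mean against the error at a grid of nearby candidate values.

```python
def sq_error(values, h):
    """Squared error of representing every item in values by the single value h."""
    return sum((h - c) ** 2 for c in values)


values = [1, 3, 5, 11]            # hypothetical bucket contents
mean = sum(values) / len(values)  # 5.0

# Candidate bucket values from mean - 2 to mean + 2 in steps of 0.1
candidates = [mean + d / 10 for d in range(-20, 21)]
best = min(candidates, key=lambda h: sq_error(values, h))

print(mean, best, sq_error(values, mean))
```

On this data the candidate with the smallest error is the mean itself, and every other candidate on the grid gives a strictly larger error.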
During the process of finding the optimal partitioning of the histogram boundaries, we will need to compute the error value $S_r$ for each candidate bucket, and so we must find a way to compute it efficiently.
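One way to get the bucket error in constant time is to precompute running sums, as the `Sum` and `SqSum` arrays in Procedure V-Opt do. With $h_r$ equal to the bucket average, the error equals $\sum x_k^2 - (\sum x_k)^2 / p$. The sketch below assumes 1-based item numbers with `Sum[0] = 0`; the helper name `build_prefix_sums` and the sample data are made up for this example.

```python
from itertools import accumulate


def build_prefix_sums(xs):
    """Return (Sum, SqSum) where Sum[i] = x_1 + ... + x_i and Sum[0] = 0."""
    Sum = [0] + list(accumulate(xs))
    SqSum = [0] + list(accumulate(x * x for x in xs))
    return Sum, SqSum


def sq_error(Sum, SqSum, a, b):
    """Minimal squared error of one bucket over items a..b (1-based, inclusive)."""
    p = b - a + 1                  # number of items in the bucket
    s1 = Sum[b] - Sum[a - 1]       # x_a + ... + x_b
    s2 = SqSum[b] - SqSum[a - 1]   # x_a^2 + ... + x_b^2
    return s2 - s1 * s1 / p


xs = [1, 3, 5, 11]  # hypothetical data
Sum, SqSum = build_prefix_sums(xs)

# Items 1..3 are 1, 3, 5: average 3, error (2^2 + 0 + 2^2) = 8
print(sq_error(Sum, SqSum, 1, 3))
```

Building the two arrays takes one pass over the data; after that, every call to `sq_error` is a handful of arithmetic operations regardless of the bucket length.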
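Consider the optimal histogram for [a..b] using k buckets. That histogram must use some range [x..b] as its last bucket.
We don't know the range of the "best" last bucket, but if we try every possible range for the last bucket, we will surely have the bases covered.
So when we try every possible range for the last bucket, the remaining range [a..x-1] must be covered by an optimal histogram of k-1 buckets.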
$$\text{Optimal Histogram for } [a..b] \text{ using } k \text{ buckets} = \min_{x=a..b} \{\, \text{Optimal Histogram of } [a..x-1] \text{ using } k-1 \text{ buckets} + \text{last bucket is } [x..b] \,\}$$
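The recurrence can be transcribed almost literally as a memoized recursive function. This is a sketch under several assumptions that are not in the text: the left end is fixed at item 1, the last bucket starts at `x >= k` so that the first k-1 buckets are never empty, `1 <= B <= len(xs)`, and the bucket error is computed directly rather than with precomputed sums.

```python
from functools import lru_cache


def optimal_error(xs, B):
    """Smallest total squared error of a histogram with exactly B buckets."""
    n = len(xs)

    def sq_error(a, b):
        """Error of one bucket over items a..b (1-based, inclusive), using the average."""
        vals = xs[a - 1:b]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    @lru_cache(maxsize=None)
    def opt(b, k):
        """Best error for items 1..b using exactly k buckets."""
        if k == 1:
            return sq_error(1, b)
        # Try every possible last bucket [x..b]; items 1..x-1 get k-1 buckets.
        return min(opt(x - 1, k - 1) + sq_error(x, b) for x in range(k, b + 1))

    return opt(n, B)


xs = [1, 3, 5, 11, 12, 13]  # hypothetical data
print(optimal_error(xs, 2))  # best split is [1, 3, 5] | [11, 12, 13]
```

The memoization is what keeps this from recomputing the same subproblem many times; it plays the same role as the `BestError` table in the iterative procedure.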
[Worked example (tables not recoverable): the state of the computation after processing 1, 2, 3, and 4 items]
Procedure V-Opt

BestError[i][k] = best error of a histogram using k buckets on items (1..i)

// Squared error function: error of one bucket covering items [a..b],
// using the bucket average as the bucket value
double SqError(int a, int b)
{
   double s2 = SqSum[b] - SqSum[a-1];   // x[a]^2 + ... + x[b]^2
   double s1 = Sum[b] - Sum[a-1];       // x[a] + ... + x[b]

   return (s2 - s1*s1/(b-a+1));
}

// Prepare arrays to compute the error efficiently
Sum[0] = 0;
SqSum[0] = 0;

for (i = 1; i <= N; i++)
{
   Sum[i] = Sum[i-1] + x[i];
   SqSum[i] = SqSum[i-1] + x[i]*x[i];
}

// The dynamic programming algorithm to find the best histogram
//
// k = # buckets
// i = current item - items processed are: (1..i)
for (k = 1; k <= B; k++)
{
   // Find the optimal histograms that use k buckets
   for (i = 1; i <= N; i++)
   {
      if ( k == 1 )
         BestError[i][k] = SqError(1,i);   // Single bucket (easy)
      else
      {
         // Multiple buckets
         BestError[i][k] = INFINITE;       // Start value

         // Try every possible last bucket
         for (j = 1; j <= i-1; j++)        // Last bucket is [j+1..i]
         {
            if ( BestError[j][k-1] + SqError(j+1,i) < BestError[i][k] )
            {
               BestError[i][k] = BestError[j][k-1] + SqError(j+1,i);
            }
         }
      }
   }
}
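For experimenting with the procedure, here is a runnable Python sketch of the same dynamic program. It follows the pseudocode's table layout, but two things are additions that do not appear in Procedure V-Opt: the `split` table, which remembers where the last bucket starts so that the bucket ranges can be recovered, and the restriction of the loops to `i >= k` and `j >= k-1` so that no bucket is empty. It assumes `1 <= B <= len(xs)`.

```python
def v_opt(xs, B):
    """Return (best_error, buckets); buckets are 1-based inclusive (start, end) ranges."""
    n = len(xs)
    INF = float("inf")

    # Prefix sums, 1-based with a zero in slot 0
    Sum = [0.0] * (n + 1)
    SqSum = [0.0] * (n + 1)
    for i, x in enumerate(xs, start=1):
        Sum[i] = Sum[i - 1] + x
        SqSum[i] = SqSum[i - 1] + x * x

    def sq_error(a, b):
        s1 = Sum[b] - Sum[a - 1]
        s2 = SqSum[b] - SqSum[a - 1]
        return s2 - s1 * s1 / (b - a + 1)

    # best[i][k]: best error for items 1..i with k buckets
    # split[i][k]: last item of the first k-1 buckets in that best histogram
    best = [[INF] * (B + 1) for _ in range(n + 1)]
    split = [[0] * (B + 1) for _ in range(n + 1)]

    for k in range(1, B + 1):
        for i in range(k, n + 1):
            if k == 1:
                best[i][1] = sq_error(1, i)
                continue
            for j in range(k - 1, i):  # last bucket is [j+1..i]
                cand = best[j][k - 1] + sq_error(j + 1, i)
                if cand < best[i][k]:
                    best[i][k] = cand
                    split[i][k] = j

    # Walk the split table backwards to recover the bucket ranges
    buckets = []
    i = n
    for k in range(B, 0, -1):
        j = split[i][k]
        buckets.append((j + 1, i))
        i = j
    buckets.reverse()
    return best[n][B], buckets


error, buckets = v_opt([1, 3, 5, 11, 12, 13], 2)  # hypothetical data
print(error, buckets)
```

The three nested loops give a running time proportional to B × N², with each error lookup costing constant time thanks to the prefix sums.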