Using most-frequent values histograms to estimate result set size of join operations

Slideshow:

Estimating join result size using a most-frequent values histogram - Problem description

Given: R(a,b) and S(b,c)

And the following most-frequent value histogram:

Uniformity assumption on "other values"

Given: R(a,b) and S(b,c)

The other values in a most-frequent value histogram:

V(R,b) = 14: values of attribute "b" are: 0, 1, 5 + 11 other values

V(S,b) = 13: values of attribute "b" are: 0, 1, 2 + 10 other values

Uniformity assumption on "other values"

Given: R(a,b) and S(b,c)

The other values in a most-frequent value histogram:

T(R) = 1000: # tuples with: "b=0" = 150, "b=1" = 200, "b=5" = 100 + "b=11 other values" = 550

T(S) = 500: # tuples with: "b=0" = 100, "b=1" = 80, "b=5" = 70 + "b=10 other values" = 250

Uniformity assumption on "other values"

Given: R(a,b) and S(b,c)

The other values in a most-frequent value histogram:

"b=11 other values" = 550: 550/11 = 50 tuples per each "other" attribute value

"b=10 other values" = 250: 250/10 = 25 tuples per each "other" attribute value

Estimating the join result set size - Example

Statistics on the input relations:

Frequencies of attribute values: (summary from statistics)

Estimating the join result set size - Example

Statistics on the input relations:

Applying the containment of value set assumption, we have:

Estimating the join result set size - Example

Summary of the statistics:

The join results for the values b = 0, 1, 2, 5 are:

Estimating the join result set size - Example

Summary of the statistics:

The join results for the "other" values:

Estimating the join result set size - Example

Summary of the statistics:

The join results for the "other" values:

Estimating the join result set size - Example

Summary of the statistics:

Estimate for the # tuples in the join result using (most-frequent values) histogram:

Comparing with the "base line" estimation result

Estimate for the # tuples in the join result using (most-frequent values) histogram:

Baseline estimation:

Using most-frequent values histogram to estimate join result

Statistical data on the input relations:

Relation level statistics:

R = R(a,b) S = S(b,c) Statistics on the relations: R S ======================================= T(R) = 1000 T(S) = 500 V(R,b) = 14 V(S,b) = 13

Histogram information on the join attribute values:

R.b: S.b b freq(b) b freq(b) ------------- ------------- 0 150 0 100 1 200 1 80 5 100 2 70 others 550 others 250 ^ ^ | | 14-3=11 values 13-3=10 values

Uniformity assumption:

R.b: 550 tuples has 11 different values <==> 50 tuples per attribute value
S.b: 250 tuples has 10 different values <==> 25 tuples per attribute value

Problem:

Worked out example:

Input:

V(R,b) = 14 T(R) = 1000 V(S,b0 = 13 T(S) = 500 R.b = 0(150) 1(200) 5(100) + 11 other values(50 each) S.b = 0(100) 1(80) 2(70) + 10 other values(25 each)

Applying the containment of value sets assumption, we have:

V(R,b) = 14 T(R) = 1000 V(S,b0 = 13 T(S) = 500 R.b = 0(150) 1(200) 2(50) 5(100) + 10 other values(50 each) S.b = 0(100) 1(80) 2(70) 5(25) + 9 other values(25 each)

Therefore the join result of the value b = 0, 1, 2, 5 are:

V(R,b) = 14 T(R) = 1000 V(S,b0 = 13 T(S) = 500 R.b = 0(150) 1(200) 2(50) 5(100) + 10 other values(50 each) S.b = 0(100) 1(80) 2(70) 5(25) + 9 other values(25 each) | | | | V V V V 15,000 16,000 3,500 2,500 tuples (=150×100)

Computing the join result for the "other" values:

We continue to make the containment of value set assumption Because: V(S,b) ≤ V(R,b) ===> all values of S.b are assumed to be in R.b
Therefore: R.b = { 0, 1, 2, 5, + 10 other values (50 tuples each) } S.b = { 0, 1, 2, 5, + 9 other values (25 tuples each) } We have the following situation: 50 tuples with R.b = v_i | v R.b = { 0, 1, 2, 5, v₁, v₂, v₃, ...., v₉, v₁₀ } S.b = { 0, 1, 2, 5, v₁, v₂, v₃, ...., v₉ } ^ | 25 tuples with S.b = v_i Therefore: R(a,v₁) ⋈ S(v₁,c) results in ==> 50 × 25 tuples R(a,v₂) ⋈ S(v₂,c) results in ==> 50 × 25 tuples ... ... R(a,v₉) ⋈ S(v₉,c) results in ==> 50 × 25 tuples ==> Total = ~ 9 × 50 × 25 tuples

Therefore:

T(R⋈S) ~= 15,000 + 16,000 + 2,500 + 3,500 + 9×50×25 ~= 48,250 tuples

Compare: estimate without using histogram information

Statistics on the relations: R S ======================================= T(R) = 1000 T(S) = 500 V(R,b) = 14 V(S.b) = 13
T(R) × T(S) T(R⋈S) ~= ------------------- max(V(R,b), V(S,b)) 1000 × 500 ~= ------------- max(14, 13) = 500,000/14 ~= 35,714 48,250/35,714 = 1.35 (35% difference)