Min window size = 3

S = b a a a b a a b a b b a a a a b a
                          |         |
                          +---------+     frequency = 5/6
                    |               |
                    +---------------+     frequency = 6/9
Find the window with the highest possible frequency
NOTES:
Example: computing frequency of an itemset a
S = a b a a a b
            +-+        freq(a, last 2) = 1/2  (0.5)
          +---+        freq(a, last 3) = 2/3  (0.667)
        +-----+        freq(a, last 4) = 3/4  (0.75)    <--- max freq
      +-------+        freq(a, last 5) = 3/5  (0.6)
    +---------+        freq(a, last 6) = 4/6 = 2/3  (0.667)
Example: maximum window
    1 2 3 4 5 6
S = a b a a a b
        +-----+        freq(a, S[3,6]) = 3/4    <--- maximum
    +---------+        freq(a, S[1,6]) = 4/6 = 2/3

startmax = 3
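The two examples above can be checked with a brute-force scan over every window that ends at the current position. The sketch below is only an illustration: the name `max_frequency`, the encoding of the stream as a string of single-character items, and the tie-breaking rule (keep the shortest window when two windows have the same frequency) are assumptions made here, not part of the notes.

```python
from fractions import Fraction

def max_frequency(stream, item, min_window=1):
    """Brute force: try every window that ends at the last position.

    Returns (frequency, start) with a 1-based start position, or None
    when the stream is shorter than min_window.
    """
    t = len(stream)
    best = None
    count = 0
    # Grow the window leftwards: start = t, t-1, ..., 1
    for start in range(t, 0, -1):
        if stream[start - 1] == item:
            count += 1
        length = t - start + 1
        if length < min_window:
            continue
        freq = Fraction(count, length)
        # Strict '>' keeps the shortest window on ties (an assumption)
        if best is None or freq > best[0]:
            best = (freq, start)
    return best

print(max_frequency("abaaab", "a"))                 # frequency 3/4, start 3
print(max_frequency("baaabaababbaaaaba", "a", 3))   # frequency 5/6, start 12
```

This scan needs the whole stream in memory and costs time linear in the stream length per query, which is what the summary introduced below avoids.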
1. Buffer all item sets in memory (disk?)
2. Find the frequency of item set I in these windows:
|
|
Important Fact about the maximum frequency:
|
|
|
|
the algorithm is meaningless
|
|
|
|
|
|
and define:
|
Graphically:
S:  I_1 I_2 ... I_{p1-1} I_{p1} ... I_{p2-1} I_{p2} ... I_{p3-1} ... I_{pr} ... I_t
                         +-----------------+ +-----------------+     +------------+
                                 a_1                 a_2                   a_r

+-----+-----+-------+-----+
| p_1 | p_2 | ..... | p_r |
+-----+-----+-------+-----+
| a_1 | a_2 | ..... | a_r |
+-----+-----+-------+-----+
Example:
Suppose the border positions are:

         V                             V              V
S =   b  a  a  a  b  a  a  b  a  b  b  a  a  a  a  b  a
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Summary of the stream:
+----+----+----+
|  2 | 12 | 17 |
+----+----+----+
|  6 |  4 |  1 |
+----+----+----+

(There are 6 a's from pos 2 to 11)
(There are 4 a's from pos 12 to 16)
(There is 1 a from pos 17 onward)
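The counts in this summary can be recomputed directly from the stream and the border positions. The following is a minimal sketch under assumptions made here: the helper name `counts_between_borders` is hypothetical, the stream is a string of single-character items, and positions are 1-based as in the example.

```python
def counts_between_borders(stream, item, borders):
    """For each border p_i, count the occurrences of `item` from
    position p_i up to (but not including) the next border; the last
    block runs to the end of the stream. Positions are 1-based."""
    t = len(stream)
    ends = borders[1:] + [t + 1]          # exclusive end of each block
    return [(p, stream[p - 1:end - 1].count(item))
            for p, end in zip(borders, ends)]

S = "baaabaababbaaaaba"
print(counts_between_borders(S, "a", [2, 12, 17]))
# [(2, 6), (12, 4), (17, 1)]
```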
S:  I_1 ... I_{p1} ... I_{p2-1} I_{p2} ... I_{p3-1} ... I_{p(r-1)} ... I_{pr-1} I_{pr} ... I_t
            +-----------------+ +-----------------+     +---------------------+ +------------+
                    a_1                 a_2                     a_{r-1}              a_r

                                                                                +------------+
                                                        +------------------------------------+

Window S[p_r, t] (the last block only):

                           a_r
   freq(A, S[p_r, t]) = -----------
                        t - p_r + 1

Window S[p_{r-1}, t] (the last two blocks):

                               a_{r-1} + a_r
   freq(A, S[p_{r-1}, t]) = -----------------
                             t - p_{r-1} + 1
Example:
Suppose the border positions are:

         V                             V              V
S =   b  a  a  a  b  a  a  b  a  b  b  a  a  a  a  b  a
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Frequencies at the border locations:
+----+----+----+
|  2 | 12 | 17 |
+----+----+----+
|  6 |  4 |  1 |
+----+----+----+

t = 17

freq(a, S[17, 17]) = 1/1
freq(a, S[12, 17]) = (4+1)/6    = 5/6
freq(a, S[2, 17])  = (6+4+1)/16 = 11/16
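The three frequencies above need only the summary and the current time t, not the stream itself. A small sketch of that computation, where the function name `border_frequencies` and the representation of the summary as a list of (p, a) pairs are assumptions made for illustration:

```python
from fractions import Fraction

def border_frequencies(summary, t):
    """Frequency of the target in each window S[p_i, t], computed from
    the summary alone. Walks the borders from newest to oldest while
    accumulating the counts a_r, a_r + a_{r-1}, ..."""
    result = []
    total = 0
    for p, a in reversed(summary):
        total += a
        result.append((p, Fraction(total, t - p + 1)))
    return result

for p, f in border_frequencies([(2, 6), (12, 4), (17, 1)], 17):
    print(f"freq(a, S[{p}, 17]) = {f}")
```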
|
|
   a_1           a_2                   a_{r-1}            a_r
---------  <  ---------  <  ...  <  -------------  <  -----------
p_2 - p_1     p_3 - p_2             p_r - p_{r-1}     t - p_r + 1
and the fact that:
|
We can conclude that, when mwl = 1:

                          a_r
mfreq_{mwl=1}(A, S) = -----------
                      t - p_r + 1

startmax(A, S) = p_r
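As an illustration of the conclusion above, the sketch below checks the chain of inequalities on a summary and then reads the maximum frequency off the last entry. The function names and the list-of-(p, a)-pairs representation are assumptions made here; the code covers only the case mwl = 1.

```python
from fractions import Fraction

def has_border_property(summary, t):
    """True when the block ratios a_i / (block length) are strictly
    increasing from the oldest block to the newest one."""
    ends = [p for p, _ in summary[1:]] + [t + 1]   # exclusive block ends
    ratios = [Fraction(a, end - p) for (p, a), end in zip(summary, ends)]
    return all(x < y for x, y in zip(ratios, ratios[1:]))

def mfreq_from_summary(summary, t):
    """Maximum frequency and its start position for mwl = 1:
    only the last summary entry (p_r, a_r) is needed."""
    p_r, a_r = summary[-1]
    return Fraction(a_r, t - p_r + 1), p_r

summary = [(2, 6), (12, 4), (17, 1)]
print(has_border_property(summary, 17))   # ratios 6/10 < 4/5 < 1/1
print(mfreq_from_summary(summary, 17))    # frequency 1, start 17
```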
The algorithm will be generalized later
// A = the item for which the max. frequency is sought
// Input consists of item sets, hence the test A ⊆ I

St = ∅;

while ( not end of stream ) do
{
   I = next item set;          // I is item set number t+1

   // St   = current summary
   // St+1 = summary after processing the next item set

   St+1 = ∅;                   // New summary information

   if ( St == ∅ )
   {
      if ( A ⊆ I )
      {
         St+1 = [(t+1, 1)];    // One border
      }
   }
   else
   {
      if ( A ⊆ I )
      {
         /* -----------------------------------------
            An addition of A will either:

               1. extend the last border region
               2. create a new border

            It will not remove an existing border
            ----------------------------------------- */
         if ( a_r == t - p_r + 1 )
         {
            // Situation:  .... A A A A A
            St+1 = St;
            Update: a_r = a_r + 1;        // extend last border
         }
         else
         {
            // Situation:  .... A A A A O
            St+1 = St ⊕ [(t+1, 1)];       // create new border
         }
      }
      else
      {
         /* -----------------------------------------
            An addition of not-A may cause a border
            to disappear...
            ----------------------------------------- */
         St+1 = St;     // Original set of borders...

         // Remove borders that do not satisfy the border property.
         // The windows now end at position t+1 (the item set just read).
         i = r;
         while ( i > 1 )
         {
            if (        a_i                  a_{i-1} + a_i
                 -----------------   ≤   ---------------------  )
                 (t+1) - p_i + 1         (t+1) - p_{i-1} + 1
            {
               a_{i-1} = a_{i-1} + a_i;   // Merge the counts in 2 consecutive segments
               delete entry (p_i, a_i) from St+1;
               i = i - 1;                 // Test the new last border
            }
            else
            {
               break;                     // Older borders are not affected
            }
         }
      }
   }
}
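The pseudocode can be sketched in Python as follows. This is an illustrative translation, not the author's implementation: the names `update_summary` and `summarize_stream`, the summary as a list of (p, a) tuples, and the use of Python sets for item sets are assumptions, and so is the choice to stop the merge loop at the first border that passes the test. The windows in the not-A case are taken to end at the position of the item set just read (t + 1).

```python
from fractions import Fraction

def update_summary(summary, t, item_set, target):
    """Return the summary after reading item set number t + 1.

    summary  : list of (p, a) pairs, oldest border first
    t        : number of item sets processed so far
    item_set : the new item set (any iterable of items)
    target   : the item set A, as a Python set
    """
    new_t = t + 1
    result = list(summary)
    if target <= set(item_set):
        # Last block consists of A's only: extend it, else open a border
        if result and result[-1][1] == t - result[-1][0] + 1:
            p, a = result[-1]
            result[-1] = (p, a + 1)
        else:
            result.append((new_t, 1))
    else:
        # A miss can only invalidate borders starting from the last one
        while len(result) > 1:
            (p_prev, a_prev), (p_last, a_last) = result[-2], result[-1]
            short = Fraction(a_last, new_t - p_last + 1)
            merged = Fraction(a_prev + a_last, new_t - p_prev + 1)
            if short <= merged:
                result[-2:] = [(p_prev, a_prev + a_last)]   # merge blocks
            else:
                break
    return result

def summarize_stream(stream, target):
    """Feed a whole stream through update_summary, one item set at a time."""
    summary = []
    for t, item_set in enumerate(stream):
        summary = update_summary(summary, t, item_set, target)
    return summary

print(summarize_stream("baaabaababbaaaaba", {"a"}))
# [(2, 6), (12, 4), (17, 1)]
```

On the 17-item example stream this reproduces the summary with borders 2, 12 and 17, and on the prefix of length 8 it gives the single border (2, 5).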
If the new item set contains A, only the last ratio in the border property

   a_1           a_2                   a_{r-1}            a_r
---------  <  ---------  <  ...  <  -------------  <  -----------
p_2 - p_1     p_3 - p_2             p_r - p_{r-1}     t - p_r + 1

changes:

from:             to:
    a_r               a_r + 1
-----------       ---------------
t - p_r + 1       t - p_r + 1 + 1

And because:

    a_r               a_r + 1
-----------   ≤   ---------------
t - p_r + 1       t - p_r + 1 + 1

We will maintain the property:

   a_1           a_2                   a_{r-1}            a_r + 1
---------  <  ---------  <  ...  <  -------------  <  ---------------
p_2 - p_1     p_3 - p_2             p_r - p_{r-1}     t - p_r + 1 + 1

So, no old border will be deleted.
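The inequality used in this step follows from a one-line computation. Here n = t - p_r + 1 is shorthand (introduced only for this derivation) for the length of the last block, and a_r ≤ n because a block cannot contain more A's than item sets:

```latex
\[
\frac{a_r + 1}{n + 1} - \frac{a_r}{n}
  = \frac{n(a_r + 1) - a_r(n + 1)}{n(n + 1)}
  = \frac{n - a_r}{n(n + 1)} \;\ge\; 0,
  \qquad n = t - p_r + 1 .
\]
```

Equality holds exactly when a_r = n, that is, when the last block consists of A's only.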
If the new item set does not contain A, the last ratio changes:

from:             to:
    a_r               a_r + 0
-----------       ---------------
t - p_r + 1       t - p_r + 1 + 1

And because:

    a_r                 a_r
-----------   >   ---------------
t - p_r + 1       t - p_r + 1 + 1

We may violate the property:

   a_1           a_2                   a_{r-1}              a_r
---------  <  ---------  <  ...  <  -------------  ??  ---------------
p_2 - p_1     p_3 - p_2             p_r - p_{r-1}      t - p_r + 1 + 1

If the border property is indeed violated, the border position is removed:

          Delete this border
                  |
                  V
........ ........ ..........
         |      | |        |
         +------+ +--------+
          a_{i-1}     a_i
         |                 |
         +-----------------+
            a_{i-1} + a_i

That's the reason to set:   a_{i-1} = a_{i-1} + a_i
    1 2 3 4 5 6 7 8 9 10
S = b a a a b a a b a b b a a a a b a

Summary after each item set, written as (border position, count) pairs:

  t    item    summary
  1     b      []
  2     a      (2,1)
  3     a      (2,2)
  4     a      (2,3)
  5     b      (2,3)
  6     a      (2,3) (6,1)
  7     a      (2,3) (6,2)
  8     b      (2,3) (6,2)    <--- test the last border

Test at t = 8:

    2               2+3
---------   ?   ---------
8 - 6 + 1       8 - 2 + 1

 2       5
---  ?  ---
 3       7

 2       5
---  ≤  ---
 3       7

Delete border !

New summary:

+---+
| 2 |
+---+
| 5 |
+---+