Min window size = 3

S = b a a a b a a b a b b a a a a b a
                    |     |         |
                    |     +---------+
                    |     frequency = 5/6
                    |               |
                    +---------------+
                    frequency = 6/9
|
Find the window with the highest possible frequency
|
NOTES:
|
Example: computing frequency of an itemset a
S = a b a a a b
            | |
            +-+
    freq(a,last 2) = 1/2 (0.5)
          |   |
          +---+
    freq(a,last 3) = 2/3 (0.6667)
        |     |
        +-----+
    freq(a,last 4) = 3/4 (0.75)    <--------- Max freq !!
      |       |
      +-------+
    freq(a,last 5) = 3/5 (0.6)
    |         |
    +---------+
    freq(a,last 6) = 4/6 = 2/3 (0.6667)
|
|
Example: maximum window
    1 2 3 4 5 6
S = a b a a a b
        |     |
        +-----+
    freq(a, S[3,6]) = 3/4
    |         |
    +---------+
    freq(a, S[1,6]) = 4/6 = 2/3

startmax = 3   (start position of the window with the maximum frequency)
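
The two examples above can be checked with a brute-force scan over every window that ends at the last position. This is only a sketch for illustration: the function name `max_frequency` is hypothetical, the stream is modelled as a string of single items rather than item sets, and ties are resolved in favour of the shortest window, which the notes do not specify.

```python
from fractions import Fraction

def max_frequency(stream, item, mwl=1):
    """Scan all windows ending at the last position (brute force).

    Returns (frequency, start): the best frequency as an exact fraction
    and the 1-based start position of that window (startmax).
    Only windows of length >= mwl are considered.
    """
    t = len(stream)
    best_freq, best_start = None, None
    count = 0
    for length in range(1, t + 1):
        start = t - length + 1              # 1-based start of this window
        if stream[start - 1] == item:
            count += 1                      # occurrences of item in S[start, t]
        if length < mwl:
            continue                        # window too short to qualify
        freq = Fraction(count, length)
        if best_freq is None or freq > best_freq:
            best_freq, best_start = freq, start
    return best_freq, best_start

print(max_frequency("abaaab", "a"))                     # (Fraction(3, 4), 3)
print(max_frequency("baaabaababbaaaaba", "a", mwl=3))   # (Fraction(5, 6), 12)
```

The scan costs time and memory linear in the length of the stream for every query, which is what the summary described in the rest of these notes is meant to avoid.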
|
|
1. Buffer all itemsets in memory (disk ?)
2. Find the frequency of itemset I in these windows:
|
|
Important Fact about the maximum frequency:
|
|
|
|
the algorithm is meaningless
|
|
|
|
|
|
and define:
|
Graphically:
S:  I_1 I_2 ... I_{p1-1}  I_{p1} ... I_{p2-1}  I_{p2} ... I_{p3-1}  ....  I_{pr} ... I_t
                          |                 |  |                 |        |            |
                          +-----------------+  +-----------------+        +------------+
                                  a1                   a2                       ar

   +----+----+-------+----+
   | p1 | p2 | ..... | pr |
   +----+----+-------+----+
   | a1 | a2 | ..... | ar |
   +----+----+-------+----+
|
Example:
Suppose the border positions are:
        V                             V              V
S =  b  a  a  a  b  a  a  b  a  b  b  a  a  a  a  b  a
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

Summary of the stream:
   +----+----+----+
   |  2 | 12 | 17 |
   +----+----+----+
   |  6 |  4 |  1 |
   +----+----+----+
   (There are 6 a's from pos 2 to 11)
   (There are 4 a's from pos 12 to 16)
   (There is 1 a from pos 17 to ...)
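
The counts in the summary can be reproduced from the stream and the border positions. The sketch below assumes the borders are already known (the notes explain later how they are maintained); `summarize` is a hypothetical helper name and the stream is again modelled as a string of single items.

```python
def summarize(stream, item, borders):
    """Count occurrences of item in each segment delimited by the borders.

    borders holds 1-based border positions in increasing order. Segment k
    runs from borders[k] up to the position just before the next border
    (or up to the end of the stream for the last border).
    """
    t = len(stream)
    summary = []
    for k, p in enumerate(borders):
        end = borders[k + 1] - 1 if k + 1 < len(borders) else t
        segment = stream[p - 1:end]          # positions p..end, 1-based inclusive
        summary.append((p, segment.count(item)))
    return summary

S = "baaabaababbaaaaba"
print(summarize(S, "a", [2, 12, 17]))        # [(2, 6), (12, 4), (17, 1)]
```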
|
S:  I_1 I_2 ... I_{p1-1}  I_{p1} ... I_{p2-1}  I_{p2} ... I_{p3-1}  ....  I_{p(r-1)} ... I_{pr - 1}  I_{pr} ... I_t
                          |                 |  |                 |        |                       |  |            |
                          +-----------------+  +-----------------+        +-----------------------+  +------------+
                                  a1                   a2                          a(r-1)                  ar

Window S[pr, t] (the last segment):

                             ar
   freq(A, S[pr,t]) = ------------
                       t - pr + 1

Window S[p(r-1), t] (the last two segments):

                               a(r-1) + ar
   freq(A, S[p(r-1),t]) = ------------------
                            t - p(r-1) + 1
|
Example:
Suppose the border positions are:
        V                             V              V
S =  b  a  a  a  b  a  a  b  a  b  b  a  a  a  a  b  a
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
|
Frequencies at the border locations:
+----+----+----+
| 2 | 12 | 17 |
+----+----+----+
| 6 | 4 | 1 |
+----+----+----+
t = 17
freq(a, S[17,17]) = 1/1
freq(a, S[12, 17]) = (4+1)/6 = 5/6
freq(a, S[2, 17]) = (6+4+1)/16 = 11/16
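
The three frequencies above only need the summary and the current position t: walking the summary from the last border backwards, the counts accumulate while the window grows. A minimal sketch, assuming the summary is a list of (position, count) pairs; the name `border_frequencies` is hypothetical.

```python
from fractions import Fraction

def border_frequencies(summary, t):
    """Frequency of the item in S[p, t] for every border p, last border first."""
    result = []
    total = 0
    for p, a in reversed(summary):
        total += a                                   # occurrences in S[p, t]
        result.append((p, Fraction(total, t - p + 1)))
    return result

for p, f in border_frequencies([(2, 6), (12, 4), (17, 1)], t=17):
    print(p, f)
# 17 1
# 12 5/6
# 2 11/16
```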
|
|
|
   a1          a2                  a(r-1)            ar
--------- < --------- < .... < ------------- < ------------
 p2 - p1     p3 - p2            pr - p(r-1)     t - pr + 1
|
and the fact that:
|
We can conclude that, when mwl = 1:

                               ar
   mfreq_{mwl=1}(A, S) = ------------
                          t - pr + 1

   startmax(A, S) = pr
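
As a sanity check, the chain of inequalities and the mwl = 1 conclusion can be evaluated on a concrete summary. The sketch below is illustrative only: the helper names are hypothetical and the summary is assumed to be a non-empty list of (position, count) pairs.

```python
from fractions import Fraction

def segment_frequencies(summary, t):
    """a1/(p2-p1), a2/(p3-p2), ..., ar/(t-pr+1) as exact fractions."""
    freqs = []
    for k, (p, a) in enumerate(summary):
        # The last segment ends at t, so treat t + 1 as a virtual next border.
        nxt = summary[k + 1][0] if k + 1 < len(summary) else t + 1
        freqs.append(Fraction(a, nxt - p))
    return freqs

def has_border_property(summary, t):
    """True when the segment frequencies are strictly increasing."""
    f = segment_frequencies(summary, t)
    return all(x < y for x, y in zip(f, f[1:]))

def max_frequency_mwl1(summary, t):
    """(mfreq, startmax) for mwl = 1: determined by the last border alone."""
    p, a = summary[-1]
    return Fraction(a, t - p + 1), p

summary = [(2, 6), (12, 4), (17, 1)]
print(segment_frequencies(summary, 17))   # [Fraction(3, 5), Fraction(4, 5), Fraction(1, 1)]
print(has_border_property(summary, 17))   # True
print(max_frequency_mwl1(summary, 17))    # (Fraction(1, 1), 17)
```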
|
|
The algorithm will be generalized later
// A = the item for which the max. frequency is sought
// Input consists of item sets, hence the test A ⊆ I
// t = number of item sets processed so far (the next item set gets position t+1)

St = ∅;

while ( not end of stream ) do
{
   I = next item set;

   // St   = current summary:  [(p1, a1), (p2, a2), ..., (pr, ar)]
   // St+1 = summary after processing the next item set
   St+1 = ∅;                              // New summary information

   if ( St == ∅ )
   {
      if ( A ⊆ I )
      {
         St+1 = [(t+1, 1)];               // One border
      }
   }
   else
   {
      if ( A ⊆ I )
      {
         /* -----------------------------------------
            An addition of A will either:
               1. extend the last border region
               2. create a new border
            It will not remove an existing border
            ----------------------------------------- */
         if ( ar == t - pr + 1 )
         {
            // Situation:  .... A A A A A
            St+1 = St;
            Update: ar = ar + 1;          // extend last border
         }
         else
         {
            // Situation:  .... A A A A O
            St+1 = St ⊕ [(t+1, 1)];       // create new border
         }
      }
      else
      {
         /* -----------------------------------------
            An addition of not-A may cause a border
            to disappear...
            ----------------------------------------- */
         St+1 = St;                       // Original set of borders...

         // Remove borders that do not satisfy the border property.
         // The window now ends at position t+1, so every window length grows by 1.
         i = r;
         while ( i > 1 )
         {
                        ai                  a(i-1) + ai
            if ( ------------------  ≤  ---------------------- )
                  (t+1) - pi + 1         (t+1) - p(i-1) + 1
            {
               a(i-1) = a(i-1) + ai;      // Merge the counts in 2 consecutive segments
               delete entry (pi, ai) from St+1;
               i = i - 1;                 // the merged segment is now the last one
            }
            else
            {
               break;                     // border property holds again
            }
         }
      }
   }

   St = St+1;
   t = t + 1;
}
|
Recall the border property:

   a1          a2                  a(r-1)            ar
--------- < --------- < .... < ------------- < ------------
 p2 - p1     p3 - p2            pr - p(r-1)     t - pr + 1

When the next item set contains A, the frequency of the last segment changes

   from:                 to:
        ar                    ar + 1
   ------------         ----------------
    t - pr + 1           t - pr + 1 + 1

And because:
        ar                    ar + 1
   ------------    ≤    ----------------
    t - pr + 1           t - pr + 1 + 1

We will maintain the property:

   a1          a2                  a(r-1)            ar + 1
--------- < --------- < .... < ------------- < ----------------
 p2 - p1     p3 - p2            pr - p(r-1)     t - pr + 1 + 1

So, no old border will be deleted
When the next item set does not contain A, the frequency of the last segment changes

   from:                 to:
        ar                    ar + 0
   ------------         ----------------
    t - pr + 1           t - pr + 1 + 1

And because:
        ar                      ar
   ------------    >    ----------------
    t - pr + 1           t - pr + 1 + 1

We may violate the property:

   a1          a2                  a(r-1)              ar
--------- < --------- < .... < ------------- ?? ----------------
 p2 - p1     p3 - p2            pr - p(r-1)      t - pr + 1 + 1
If the border property is indeed violated, the border position is removed:

          Delete this border
                  |
                  V
   ........ ........ ..........
   |             ||           |
   +-------------++-----------+
       a(i-1)          ai
   |                          |
   +--------------------------+
           a(i-1) + ai

That's the reason to set:

   a(i-1) = a(i-1) + ai
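
A short derivation may help connect the merge test used in the algorithm (last segment versus last two segments) with the border property (adjacent segments compared with each other). The symbols x, y, m and n are shorthand introduced here only for this derivation.

```latex
\[
\text{Let } x = a_{i-1},\quad y = a_i,\quad m = p_i - p_{i-1} > 0,\quad n = t - p_i + 1 > 0,
\quad\text{so that } t - p_{i-1} + 1 = m + n .
\]
\[
\frac{y}{n} \le \frac{x + y}{m + n}
\iff y\,(m + n) \le n\,(x + y)
\iff y\,m \le n\,x
\iff \frac{y}{n} \le \frac{x}{m}
\]
```

So the test succeeds exactly when the last segment's frequency no longer exceeds that of the segment before it, which is a violation of the border property. After the merge, the combined segment has count x + y over length m + n.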
|
|
    1 2 3 4 5 6 7 8 9 10
S = b a a a b a a b a b b a a a a b a

Summary after each of the first 8 item sets (border position p, count a):

    t   item   summary [(p, a), ...]
   ---  ----   ---------------------
    1    b     []
    2    a     [(2, 1)]
    3    a     [(2, 2)]
    4    a     [(2, 3)]
    5    b     [(2, 3)]
    6    a     [(2, 3), (6, 1)]
    7    a     [(2, 3), (6, 2)]
    8    b     check the last border:

                 2           2+3
               -----   ?   -----
               8-6+1       8-2+1

                 2     5
                --- ? ---
                 3     7

                 2     5
                --- ≤ ---
                 3     7

               Delete border !

New summary:
   +---+
   | 2 |
   +---+
   | 5 |
   +---+
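
The trace above can be reproduced with a small runnable version of the mwl = 1 update step. This is a sketch under assumptions rather than the notes' own code: the names `update` and `summarize_stream` are hypothetical, the stream is a string of single items instead of item sets, and t denotes the 1-based position of the item being processed. The fraction test is cross-multiplied so that only integer arithmetic is needed.

```python
def update(summary, t, contains_a):
    """Process the item at 1-based position t and return the updated summary.

    summary is a list of [p, a] pairs (border position, count of A),
    modified in place.
    """
    if contains_a:
        # The last segment covers positions p..t-1; it is all A's when a == t - p.
        if summary and summary[-1][1] == t - summary[-1][0]:
            summary[-1][1] += 1                  # extend the last border region
        else:
            summary.append([t, 1])               # create a new border
    else:
        while len(summary) > 1:
            (p_prev, a_prev), (p_last, a_last) = summary[-2], summary[-1]
            # a_last/(t-p_last+1) <= (a_prev+a_last)/(t-p_prev+1), cross-multiplied
            if a_last * (t - p_prev + 1) <= (a_prev + a_last) * (t - p_last + 1):
                summary.pop()                    # delete the last border
                summary[-1][1] = a_prev + a_last # merge the two counts
            else:
                break                            # border property holds again
    return summary

def summarize_stream(stream, item):
    """Feed a whole stream through update() and return (p, a) tuples."""
    summary = []
    for t, x in enumerate(stream, start=1):
        update(summary, t, x == item)
    return [tuple(entry) for entry in summary]

S = "baaabaababbaaaaba"
print(summarize_stream(S[:7], "a"))   # [(2, 3), (6, 2)]
print(summarize_stream(S[:8], "a"))   # [(2, 5)]
print(summarize_stream(S, "a"))       # [(2, 6), (12, 4), (17, 1)]
```

The last line matches the summary with borders 2, 12 and 17 used in the earlier examples.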
|
|