The bag difference operator (−bag)
Comment: no search (hash) structure is used, because we do not have enough buffers!
The 2-pass TPMMS-based bag difference (−bag) algorithm - Pass 1
(Graphically explained in next slide)
Use M buffers to sort relations R and S into sorted chunks of M blocks each.
IO cost for pass 1 = 2 B(R) + 2 B(S)
(every block of R and S is read once and written once)
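Pass 1 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: a relation is modelled as an in-memory list of blocks, each block is a list of integer keys, and the name `pass1_sorted_chunks` is hypothetical. The function also counts block reads and writes, so the 2 B(R) + 2 B(S) figure can be checked by running it on each relation.

```python
from typing import List, Tuple

Block = List[int]  # assumption: a disk block is modelled as a list of integer keys


def pass1_sorted_chunks(blocks: List[Block], m: int) -> Tuple[List[List[Block]], int]:
    """Split a relation into sorted chunks of at most m blocks each.

    Returns (chunks, io_cost), where io_cost counts one I/O per block
    read and one I/O per block written.
    """
    chunks: List[List[Block]] = []
    io_cost = 0
    for start in range(0, len(blocks), m):
        group = blocks[start:start + m]      # read up to m blocks into the m buffers
        io_cost += len(group)
        sizes = [len(b) for b in group]
        tuples = sorted(t for b in group for t in b)  # sort the chunk in memory
        # Repack the sorted tuples into blocks of the original sizes.
        chunk: List[Block] = []
        pos = 0
        for size in sizes:
            chunk.append(tuples[pos:pos + size])
            pos += size
        chunks.append(chunk)
        io_cost += len(chunk)                # write the sorted chunk back to disk
    return chunks, io_cost
```

Calling it on a 5-block relation with m = 2 produces three sorted chunks (2, 2 and 1 blocks) at a cost of 10 I/Os, i.e. 2 B(R).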
The 2-pass TPMMS-based bag difference (−bag) algorithm - Pass 2
Pass 2:
(Graphically explained in next slide)
You only need a program variable to store 1 tuple for each relation (i.e., an entire buffer is not required for this).
The 2-pass TPMMS-based bag difference (−bag) algorithm - Example - Pass 1
Pass 1: read chunks of M blocks from relations R and S and sort each chunk.
Write the sorted chunks (each chunk is M blocks) to disk.
The 2-pass TPMMS-based bag difference (−bag) algorithm - Example - Pass 2
(R and S are bags!)
Pass 2: read 1 block from every (sorted) chunk and find the smallest tuple in each relation:
If R(smallest tuple value) < S(smallest tuple value), then output R's tuple and advance R.
(The same value can be output multiple times if it occurs multiple times in R.)
If R(smallest tuple value) = S(smallest tuple value), then discard R's tuple (subtract!) and advance both R and S by one tuple.
If R(smallest tuple value) > S(smallest tuple value), then discard S's tuple and advance S.
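The three cases above can be sketched in Python. This is a minimal illustration, not a definitive implementation: each sorted chunk is assumed to be an in-memory sorted list of integer keys (block boundaries are ignored), `heapq.merge` stands in for "find the smallest tuple among the chunks of one relation", and the function name `bag_difference_pass2` is hypothetical. The variables `r_cur` and `s_cur` play the role of the one-tuple program variables, one per relation.

```python
import heapq
from typing import Iterator, List

_END = object()  # sentinel marking an exhausted relation


def bag_difference_pass2(r_chunks: List[List[int]],
                         s_chunks: List[List[int]]) -> Iterator[int]:
    """Yield the tuples of R -bag S, given the sorted chunks of R and of S."""
    r = heapq.merge(*r_chunks)  # R's tuples, smallest first
    s = heapq.merge(*s_chunks)  # S's tuples, smallest first
    r_cur = next(r, _END)
    s_cur = next(s, _END)
    while r_cur is not _END:
        if s_cur is _END or r_cur < s_cur:
            yield r_cur                 # no matching S tuple left: output R's tuple
            r_cur = next(r, _END)
        elif r_cur == s_cur:
            r_cur = next(r, _END)       # subtract: one S copy cancels one R copy
            s_cur = next(s, _END)
        else:
            s_cur = next(s, _END)       # S's tuple has no partner in R: discard it
```

For example, with R chunks `[[1, 2, 2, 5], [2, 3, 7]]` and S chunks `[[2, 4], [2, 7, 8]]`, the value 2 occurs three times in R and twice in S, so exactly one copy survives and the result is `[1, 2, 3, 5]`.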
Cost analysis of the 2-pass TPMMS-based bag difference (−bag) algorithm
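One way to tally the total, as a sketch: the Pass 1 cost is the 2 B(R) + 2 B(S) given earlier, and Pass 2 reads every block of every sorted chunk exactly once. The tally assumes the result is passed on to the consumer and not written back to disk, so output I/O is not counted.

```latex
\begin{align*}
\text{Pass 1:}\quad & 2\,B(R) + 2\,B(S) \\
\text{Pass 2:}\quad & B(R) + B(S) \\
\text{Total:}\quad  & 3\,B(R) + 3\,B(S)
\end{align*}
```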
Buffer requirement of the 2-pass TPMMS-based bag difference (−bag) algorithm
Relations R and S together must have at most M chunks, because Pass 2 can use at most M buffers.
(We need 1 buffer to read each sorted chunk.)
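Since each chunk holds M blocks, R produces about B(R)/M chunks and S about B(S)/M, so the condition is roughly B(R) + B(S) ≤ M². A small Python check of this, as a sketch: the helper names are hypothetical, and it follows the bound stated here (one buffer per chunk, no buffer reserved for output).

```python
def chunk_count(num_blocks: int, m: int) -> int:
    """Number of sorted chunks Pass 1 produces for a relation of num_blocks blocks."""
    return (num_blocks + m - 1) // m  # ceiling division


def fits_two_pass(b_r: int, b_s: int, m: int) -> bool:
    """True if the chunks of R and S together need at most m buffers in Pass 2."""
    return chunk_count(b_r, m) + chunk_count(b_s, m) <= m
```

With M = 100 buffers, two relations of 5000 blocks each give 50 + 50 = 100 chunks and just fit; adding a single block to R produces a 101st chunk and the check fails.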