The bag difference (−_B) operator

Note: we need a search structure that contains (key, count) pairs, where count = # occurrences of the key!
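A minimal sketch of such a (key, count) search structure, assuming an in-memory hash table keyed on the whole tuple. The function name `build_count_index` and the sample data are hypothetical and used only for illustration.

```python
from collections import Counter

def build_count_index(tuples):
    """Map each distinct tuple (the search key) to its number of occurrences."""
    index = Counter()
    for t in tuples:
        index[t] += 1  # first occurrence starts at 1, later ones increment
    return index

# Hypothetical sample bag: ("a", 1) occurs twice, ("b", 2) once.
S = [("a", 1), ("b", 2), ("a", 1)]
H = build_count_index(S)
print(H[("a", 1)])  # 2
```

A `Counter` lookup on a missing key returns 0 without inserting the key, which matches the "t ∉ H" case.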

The bag difference (−_B) operator - assumption

The bag difference (−_B) operator - notable fact

Consequently: there are 2 algorithms for −_B, one for S −_B R and one for R −_B S (where S is the smaller relation)!

The one-pass bag difference S −_B R algorithm

S (smaller) −_B R:

We must build the index on S, the relation being subtracted from

The one-pass bag difference S −_B R algorithm - Example

Phase 1: build search structure on S containing (key, count)

Phase 2: scan R and update count in search structure

Phase 2: scan R and update count in search structure - completed

Phase 2: after completion, we can output the bag difference
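The two phases above can be sketched in Python as follows. This is an illustrative sketch, not the slides' own code: it assumes each relation is given as a list of blocks (lists of hashable tuples), and the name `bag_difference_s_minus_r` and the sample data are hypothetical.

```python
from collections import Counter

def bag_difference_s_minus_r(s_blocks, r_blocks):
    """One-pass S -_B R, where S is the smaller relation (indexed in memory)."""
    # Phase 1: scan S one block at a time and count each tuple.
    counts = Counter()
    for block in s_blocks:
        for t in block:
            counts[t] += 1

    # Phase 2: scan R; each tuple of R cancels at most one remaining copy in S.
    for block in r_blocks:
        for t in block:
            if counts.get(t, 0) > 0:
                counts[t] -= 1
            # otherwise: t is not in S (or all copies are gone), ignore it

    # Only now can the difference be output: count(t) copies of each t.
    result = []
    for t, c in counts.items():
        result.extend([t] * c)
    return result

# Hypothetical data: S = {a, b, b, b}, R = {b, c, b, a, a}
s_blocks = [["a", "b"], ["b", "b"]]
r_blocks = [["b", "c"], ["b", "a"], ["a"]]
print(bag_difference_s_minus_r(s_blocks, r_blocks))  # ['b']
```

Nothing can be output during the scan of R, because a later tuple of R may still cancel a copy counted in Phase 1.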

The one-pass bag difference R −_B S algorithm

R −_B S (smaller):

We must build the index on S, the relation being subtracted

The one-pass bag difference R −_B S algorithm - Example

Phase 1: build search structure on S containing (key, count)

Phase 2: scan R and update count in search structure - if count = 0 (or the tuple is not in the search structure), output tuple

Phase 2: scan R and update count in search structure - if count > 0, decrement count
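The R −_B S variant can be sketched the same way. Again this is only an illustrative sketch under the same assumptions (relations as lists of blocks of hashable tuples); the name `bag_difference_r_minus_s` and the sample data are hypothetical.

```python
from collections import Counter

def bag_difference_r_minus_s(r_blocks, s_blocks):
    """One-pass R -_B S, where S is the smaller relation (indexed in memory)."""
    # Phase 1: scan S and count each tuple (same as for S -_B R).
    counts = Counter()
    for block in s_blocks:
        for t in block:
            counts[t] += 1

    # Phase 2: scan R; a tuple is output unless a copy in S still cancels it.
    for block in r_blocks:
        for t in block:
            if counts.get(t, 0) > 0:
                counts[t] -= 1   # one copy of t in S cancels this tuple
            else:
                yield t          # tuple did not get subtracted

# Hypothetical data: S = {a, b, b}, R = {b, c, b, a, a, b}
s_blocks = [["a", "b"], ["b"]]
r_blocks = [["b", "c"], ["b", "a"], ["a", "b"]]
print(list(bag_difference_r_minus_s(r_blocks, s_blocks)))  # ['c', 'a', 'b']
```

Unlike S −_B R, this variant can output tuples while R is still being scanned, which is why it is written as a generator here.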

IO cost and buffer requirement for −_B

    initialize a search structure H on all attributes;

    /* ============================================================
       Phase 1: Use 1 buffer and scan the SMALLER relation first.
                Build a search structure on the SMALLER relation
                The search structure contains a count for the search key
       ============================================================ */
    while ( S has more data blocks )
    {
        read 1 data block in buffer b;

        for ( each tuple t ∈ b )
        {
            /* =====================================================
               We need a search structure H to implement
               the test t ∈ H efficiently !!!
               We can use hash table or some bin. search tree
               ===================================================== */
            if ( t ∉ H )
            {
                insert t in H;
                initialize count(t) = 1 for H;
            }
            else
            {
                update count(t) = count(t) + 1 for H;
            }
        }
    }

    /* ===================================================
       Now we know how many of each element is in S
       =================================================== */

    /* ============================================================
       Phase 2: output tuples in S
                Use 1 buffer and scan the other relation.
                Use the search structure to remove common elements,
                but at most count times !!
       ============================================================ */
    while ( R has more data blocks )
    {
        read 1 data block in buffer b;

        for ( each tuple t ∈ b )
        {
            /* =====================================================
               We use search structure H to implement
               the test t ∈ H efficiently !!!
               We can use hash table or some bin. search tree
               ===================================================== */
            if ( t ∈ H )
            {
                if ( count(t) > 0 )
                {
                    Update count(t)-- in H;   // We lost 1 copy of t
                }
            }
            else
            {
                // Ignore t, it's not in difference !
            }
        }
    }

    /* ===================================================
       ONLY now we can output the difference
       =================================================== */
    for ( every t ∈ H )
    {
        output count(t) number of tuples t;
    }

    initialize a search structure H on all attributes in tuples of S;

    /* ============================================================
       Phase 1: (same)
                Use 1 buffer and scan the SMALLER relation first.
                Build a search structure on the SMALLER relation
                The search structure contains a count for the search key
       ============================================================ */
    while ( S has more data blocks )
    {
        read 1 data block in buffer b;

        for ( each tuple t ∈ b )
        {
            /* =====================================================
               We need a search structure H to implement
               the test t ∈ H efficiently !!!
               We can use hash table or some bin. search tree
               ===================================================== */
            if ( t ∉ H )
            {
                insert t in H;
                initialize count(t) = 1 for H;
            }
            else
            {
                update count(t) = count(t) + 1 for H;
            }
        }
    }

    /* ===================================================
       Now we know how many of each element is in S
       =================================================== */

    /* ============================================================
       Phase 2: output tuples in R
                Use 1 buffer and scan the other relation.
                Use the search structure to "throttle output" of
                common elements, but for at most count times !!
       ============================================================ */
    while ( R has more data blocks )
    {
        read 1 data block in buffer b;

        for ( each tuple t ∈ b )
        {
            /* =====================================================
               We use search structure H to implement
               the test t ∈ H efficiently !!!
               We can use hash table or some bin. search tree
               ===================================================== */
            if ( t ∈ H )
            {
                /* ==============================================
                   Check if there is (still) a copy in S
                   ============================================== */
                if ( count(t) > 0 )
                {
                    Update count(t)-- in H;   // We lost 1 copy of t
                }
                else
                {
                    Output t;   // Tuple did not get subtracted !
                }
            }
            else
            {
                Output t;   // Tuple did not get subtracted !
            }
        }
    }

One-pass Algorithm for "−_B"