CS554 Syllabus & Progress

CS554 - Advanced Database Systems
Syllabus and Progress

Introduction

Course overview click here

Review of undergraduate database material (my CS377 class notes are here: click here )

Review of Relational Algebra:
- Relations: click here
- The Relational (Data) Model click here
- Overview of Relational Algebra
  - Intro: click here
  - The high school set operators (∪, ∩, −): click here
  - The operators σ, π, × and ⋈: click here
  - Examples of Relational Algebra: click here
- Set functions:
  - The Aggregate (= set) Functions: click here
  - Forming sub groups and applying set functions on sub groups: click here
  - Using groups and set functions in relational algebra queries: click here
- Bags, sets and the δ operator: click here
  
Homework 1: click here
Review of SQL:
- Accessing the SQL database software:
  - Accessing the MySQL server: click here
  - The model of the Company database used in these notes: click here
- Intro to SQL: click here
- Intro to the SELECT command
  - The basic select command syntax: click here
  - Some SELECT command examples: click here
  - DISTINCT and *: click here
- Qualifying attributes and aliasing: click here
- The tuple conditions that can be used in the where clause: click here
- Nested queries:
  - Sub-query: click here
  - Nested-query: click here
  - Examples of simple (non-correlated) nested queries: click here
- Correlated nested queries:
  - Intro: click here
  - Examples of Correlated nested queries: click here
  - Scoping rules in correlated nested queries: click here
- Useful trick to get other attributes using a set of foreign keys in a nested query: click here
- Set functions:
  - Using set functions in the SELECT clause: click here
  - Using set functions in the WHERE clause: click here
  - Example queries using set functions: click here
- Forming groups based on grouping attributes and conditions on a group:
  - Forming groups that have common attribute values:
    - The group by clause: click here
    - Examples of group by queries: click here
  - Conditions on a group:
    - The having clause: click here
    - Processing of an SQL query with all clauses: click here
    - Examples of having queries: click here
- Virtual table (relation) or view: click here
  
Homework 2: click here

Accessing data stored on disk

Secondary Storage (Disks) management

Storing data files on disks:
- Architecture of a hard disk: click here
- Storing data (files) on a hard disk: click here
- Access time of data stored on disks: click here
Techniques to speed up disk operations: Lecture slides ***
- Overview: click here
- Striping: click here
- Disk Scheduling: click here
- Pre-fetching: click here
Handling disk failures (RAID):
- Overview of disk failures: click here (+ slides)
- Handling intermittent failures: click here (+ slides)
- Handling write failures (due to power failure): click here (+ slides)
- Handling media decay...: click here (+ slides)
- Protection for disk crashes: click here (+ slides)
How to store (data) records on disks: handwritten notes
- Storing records in disk blocks: click here (+ slides)
- Alignment requirements on record fields: click here (+ slides)
- Clustered and unclustered (data) files: click here (+ slides)
- How to store fixed length fields records on disks:
  - Format of a record used to store "fixed length fields": click here (+ slides)
- How to store variable length (size) record fields and variable format records: handwritten notes
  - Storing variable sized records on disks: click here (+ slides)
  - Storing variable format records on disks: click here (+ slides)
How the DBMS locates records (on disks and in memory - pointer swizzling): handwritten notes
- Identifying blocks/records:
  - Intro: click here (+ slides)
  - Identifying blocks/records on disk:
    - Where do you use database addresses (that identify data stored on disks): click here (+ slides)
    - Physical database addresses: click here (+ slides)
    - Logical database addresses: click here (+ slides)
    - Identifying and referencing records: click here (+ slides)
  - Identifying blocks/records in memory: click here (+ slides)
- The block/record locating problem:
  - Intro (problem description) -- where is the block right now ?: click here (+ slides)
  - Translation table to locate data blocks/records: click here (+ slides)
- Speeding up block/records access --- pointer swizzling:
  - Intro to pointer swizzling: click here (+ slides)
  - Problems caused by pointer swizzling: click here (+ slides)
  - Unswizzling an address (replacing a memory address back with its database address): click here (+ slides)
  - Un-pin blocks efficiently: click here (+ slides)
Homework 3: click here

Indexing

Introduction (definitions):
- Intro:
  - What is an index: click here (+ slides)
  - Types of indexes: click here (+ slides)
  - Terminology (of indexes): click here (+ slides)
  - Multi-level indices: click here (+ slides)
B-tree and B⁺-tree:
- Structure of the B/B⁺-tree:
  - The case for multi-level indexes: click here (+ slides)
  - Intro to B⁺-trees: click here (+ slides)
  - Structure of an internal node in the B/B⁺-tree: click here (+ slides)
  - Structure of a leaf node in the B/B⁺-tree: click here (+ slides)
  - How many keys to store in a B⁺-tree node: click here (+ slides)
  - Difference between a B-tree and a B⁺-tree: click here (+ slides)
  - Search property of a B⁺-tree: click here (+ slides)
- Searching in a B-tree and B⁺-tree: click here
- Inserting a key into a B-tree and B⁺-tree:
  - Insert (key, recordPtr) in (a leaf node of) the B⁺-tree: click here
  - Insert (key, rightTree) in an internal node of the B⁺-tree: click here
  - Example insertions in the B⁺-tree: click here (+ slides)
  - How to split a full node (distribute keys): click here (+ slides)
- Deleting a key from a B-tree and B⁺-tree:
  - Deleting (key, recordPtr) from (a leaf node of) the B⁺-tree: click here
  - Deleting (key, rightTreePtr) from an internal node of the B⁺-tree: click here
  - Examples of deletions in the B⁺-tree: click here (+ slides)
Hashing-based indexes:
- Intro to Hash-index: click here (+ slides)
- Performance of Hash-index: click here (+ slides)
- Resizing a hash table: click here (+ slides)
- Prelude to dynamic hashing - changing how I represent the hash buckets: click here (+ slides)
- Extendible Hashing (the first Dynamic hashing technique):
  - Illustrating Extendible Hashing using a decimal number example: click here (+ slides)
  - Intro to Extendible hashing (overview): click here (+ slides)
  - The Extendible Hashing algorithm (implementation details): click here (+ slides)
  - Example Extendible Hashing: click here (+ slides)
  - Other issues of Extendible Hashing (deletion, doubling): click here (+ slides)
- Linear Hashing (improved Dynamic hashing):
  - Intro: click here (+ slides)
  - The Linear Hashing technique: click here (+ slides)
  - Example Linear Hashing: click here (+ slides)
  - Deletion in Linear Hashing: click here
Multi-dimensional indexes:
- Intro -- multi-dimensional information: click here (+ slides)
- Commonly used queries on multi-dimensional information: click here (+ slides)
- Motivation for Multi-dimensional indexes: click here (+ slides)
- Overview of multi-dimensional indexes: click here
- Table-based multi-dimensional indexes:
  1. Grid index Files:
    - Introduction: click here (+ slides)
    - Searching and inserting in a grid index: click here (+ slides)
    - Using a Grid index in multi-dim queries: click here (+ slides)
  2. Partitioning hash function:
    - Introduction: click here (+ slides)
    - Using a Partitioning hashing index in multi-dim queries: click here (+ slides)
- Tree-like (tree-based) multi-dimensional indexes:
  1. Multiple-key index:
    - Intro: click here (+ slides)
    - Using a Multiple-key index in multi-dim queries: click here (+ slides)
  2. kd (k-dimensional) tree:
    - Intro: click here (+ slides)
    - Adapting kd-tree for disk storage: click here (+ slides)
    - Lookup, Insert and Delete Operations on a kd-tree: click here (+ slides)
    - Using a kd-tree (for common multi-dim queries): click here (+ slides)
    - Storing a kd-tree on disk: click here (+ slides)
  3. The Quad-tree:
    - Intro: click here (+ slides)
    - Lookup, Insert and Delete Operations on a quad-tree: click here (+ slides)
    - Using a quad-tree (for common multi-dim queries): click here (+ slides)
  4. The Region (R) tree: (paper: click here)
    - Intro: click here (+ slides)
    - Lookup in an R-tree: click here (+ slides)
    - Insert into an R-tree: click here (+ slides)
    - Node partitioning algorithm: click here (+ slides)
Bitmap indexes:
- Intro: click here (+ slides)
- Using a bitmap index: click here (+ slides)
Homework 4: click here

Query processing

Cost and constraint on query execution (= processing a physical query plan)

Overview of query processing: click here (+ slides)
Processing (SQL) queries:
- The Physical Query Plan operators: click here (+ slides)
- The cost and constraint in query processing:
  - Costs and Resources used to process a query: click here (+ slides)
  - Important factors that affect the cost of an operator: click here (+ slides)
  - Parameters to express cost and constraint in query processing: click here (+ slides)
- The cost of the basic (relation) access operators (table-scan and index-scan): click here (+ slides)
- Iterators - commonly used technique to implement pipelining: click here (+ slides)
Categories of query processing algorithms: click here (+ slides)

One-pass Algorithms for Query execution (processing a physical query plan)

One-pass algorithms of Physical Operators:
- Unary operators:
  - Selection σ click here (+ slides)
  - Projection π click here (+ slides)
  - Duplicate elimination δ click here (+ slides)
  - Grouping γ click here (+ slides)
- Binary operators:
  - Union ∪
    - Bag union: click here (+ slides)
    - Set union: click here (+ slides)
  - Intersection ∩
    - Bag intersection: click here (+ slides)
    - Set intersection: click here (+ slides)
  - Difference −
    - Bag difference: click here (+ slides)
    - Set difference: click here (+ slides)
  - Product (cartesian product) ×
    - One-pass algorithm: click here (+ slides)
  - Join ⋈ click here (+ slides)
- Summary: click here

The nested-loop Algorithms for Cartesian Product and Join

Intro: click here (+ slides)
The block-based nested-loop cartesian product (×) algorithm: click here (+ slides)
The block-based nested-loop join (⋈) algorithm: click here (+ slides)
The tuple-based nested-loop join (⋈) algorithm (very suitable for iterator): click here (+ slides)

Homework 5: click here

2-pass Algorithms for Query execution that are based on hashing

Intro to two-pass algorithms: click here (+ slides)
Introduction to 2-pass hashing-based algorithms: click here (+ slides)
Unary operators:
- Selection σ: click here ***
- Projection π: click here ***
- Duplicate elimination δ: click here (+ slides) +++ 1st presentation
  - Partition R using hashing
  - Process each partition using the one-pass algorithm
- Grouping γ: click here (+ slides) +++ 2nd presentation
  - Partition R using hashing
  - Process each partition using the one-pass algorithm
Binary operators:
- Union ∪
  - Bag union: click here ***
  - Set union: click here (+ slides) +++ 1st presentation
    - Partition R and S using hashing
    - Process each partition using the one-pass algorithm
- Intersection ∩
  - Bag intersection: click here (+ slides) +++ 2nd presentation
    - Partition R and S using hashing
    - Process each partition using the one-pass algorithm
  - Set intersection: click here (+ slides) +++ 3rd presentation
    - Partition R and S using hashing
    - Process each partition using the one-pass algorithm
- Difference −
  - Bag difference: --- same procedure as in the +++ presentation
    - Partition R and S using hashing
    - Process each partition using the one-pass algorithm
  - Set difference: --- same procedure as in the +++ presentation
    - Partition R and S using hashing
    - Process each partition using the one-pass algorithm
- Product (cartesian product) × click here (+ slides) ***
- Join ⋈: click here (+ slides) --- same procedure as in the +++ presentation
  - Partition R and S using hashing
  - Process each partition using the one-pass algorithm
Summary: click here
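
The two steps listed for the hash-based operators above (partition using hashing, then process each partition using the one-pass algorithm) can be sketched for duplicate elimination δ as follows. This is only a minimal in-memory illustration under stated assumptions, not the course's implementation: Python lists stand in for disk blocks and partitions, and the names partition, one_pass_distinct, two_pass_hash_distinct and num_buckets are hypothetical.

```python
from typing import Hashable, Iterable, Iterator, List


def partition(relation: Iterable[Hashable], num_buckets: int) -> List[list]:
    """Pass 1: hash every tuple into one of num_buckets partitions.

    Equal tuples have equal hash values, so all copies of a tuple
    end up in the same partition.
    """
    buckets: List[list] = [[] for _ in range(num_buckets)]
    for t in relation:
        buckets[hash(t) % num_buckets].append(t)
    return buckets


def one_pass_distinct(tuples: Iterable[Hashable]) -> Iterator[Hashable]:
    """One-pass δ: remember the tuples seen so far, emit each new one once."""
    seen = set()
    for t in tuples:
        if t not in seen:
            seen.add(t)
            yield t


def two_pass_hash_distinct(relation: Iterable[Hashable], num_buckets: int) -> list:
    """Pass 2: run the one-pass δ on each partition independently."""
    result = []
    for bucket in partition(relation, num_buckets):
        result.extend(one_pass_distinct(bucket))
    return result


# Hypothetical sample relation R containing duplicates.
R = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
print(sorted(two_pass_hash_distinct(R, num_buckets=4)))
```

Because duplicates can only meet inside a single partition, each partition can be processed on its own; in a real system each partition would have to fit in the available memory buffers for the second pass to work. The binary operators in the list follow the same pattern, partitioning both R and S with the same hash function.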

2-pass Algorithms for Query execution that are based on (TPMMS) sorting

The two-pass multiway merge sort (TPMMS) algorithm
- The TPMMS algorithm: click here (+ slides)
- File size limit and performance cost of the TPMMS algorithm: click here (+ slides)
Unary operators:
- Selection σ --- no need: use a one-pass algorithm
- Projection π --- no need: use a one-pass algorithm
- Duplicate elimination δ: click here (+ slides)
- Grouping γ: click here (+ slides)
Binary operators:
- Union ∪
  - Bag union: --- no need: use a one-pass algorithm
  - Set union: click here (+ slides)
- Intersection ∩
  - Bag intersection: click here (+ slides)
  - Set intersection: click here (+ slides)
- Difference −
  - Bag difference: click here (+ slides)
  - Set difference: click here (+ slides)
- Product (cartesian product) × --- same comment as click here
- Join ⋈:
  - Caveat in the Join operation: click here (+ slides)
  - Version 1: for small join sets click here (+ slides)
  - Version 2: for large join sets click here (+ slides)
Summary: click here

Multi-pass algorithms

Multi-pass multiway sort: click here (+ slides)
Multi-pass hash: click here (+ slides)

Algorithms that are based on indexing:

Intro: click here
Clustering and Non-clustering index: click here (+ slides)
Index-based selection σ_A=c:
- Selection using a clustering index: click here (+ slides)
- Selection using a non-clustering index: click here (+ slides)
- Performance comparison of some selection algorithms: click here (+ slides)
The index-based join (index-join algorithm):
- Index-join using a clustering index: click here (+ slides)
- Index-join using a non-clustering index: click here (+ slides)
- Index-join vs. the one-pass join algorithm: click here (+ slides)
Join using an ordered index (the zig-zag join algorithm):
- Review --- the sort-join algorithm: click here (+ slides)
- Adapting the Sort-Join algorithm to use an ordered index to access the data file (zig-zag join): click here (+ slides)
- The zig-zag join algorithm using a non-clustering index: click here (+ slides)
- The zig-zag join algorithm using a clustering index: click here (+ slides)
- "Zig-zag" join using one ordered index: click here (+ slides)
Summary: click here
Homework 6: click here

Query optimization

Overview: query optimization: click here (+ slides)

Parsing and pre-processing

Parsing and pre-processing:
- Intro to parsing: click here (+ slides)
- Grammar and re-write rules: click here (+ slides)
- A simplified SQL grammar: click here (+ slides)
- The SQL pre-processor: click here (+ slides)

Converting a Parse Tree into an initial logical query plan (tree)

Converting an SQL command that does not contain a sub-query: click here (+ slides)
Converting an SQL command that contains a sub-query:
- Converting an SQL command with sub-query using 2-argument selection: click here (+ slides)
- Replacing a 2-argument selection in an uncorrelated sub-query: click here (+ slides)
- Replacing a 2-argument selection in a correlated sub-query: click here (+ slides)
Postscript: click here

Algebraic Laws used to transform/optimize logical query plans

Transforming query plans: click here (+ slides)
Commutative Algebraic laws: click here (+ slides)
Associative Algebraic laws: click here (+ slides)
Property of operators that are both commutative and associative: click here (+ slides)
Laws involving σ: click here (+ slides)
Laws involving ⋈ and ×: click here (+ slides)
Laws involving π: click here (+ slides)
Laws involving δ (duplicate elimination): click here (+ slides)
Laws involving γ_L (grouping): click here (+ slides)

Heuristic-based approach to finding the optimal logical query plan

Finding the best logical query plan - intro: click here (+ slides)
The cost of a logical query plan: click here (+ slides)
Heuristic for query optimization: click here (+ slides)
Examples of logical query plan optimization: click here (+ slides)
Further optimization --- the best join ordering click here (+ slides)

Prelude to cost-estimation-based approach to finding the optimal join ordering: cost estimation

Intro to finding the best join ordering: click here (+ slides)
Estimating the cost of a logical query plan: click here (+ slides)
Estimating the number of tuples output (produced) by relational algebra operations:
- Review of the basic database statistics: click here
- Estimating the result size of Selection σ: click here (+ slides)
- Estimating the result size of Join ⋈:
  - Intro (simplifying assumptions): click here (+ slides)
  - Estimating the result of R ⋈ S when joining on 1 attribute: click here (+ slides)
  - Estimating the result of R ⋈ S when joining on multiple attributes: click here (+ slides)
Using histogram information to estimate the result set of join operations:
- Intro: click here (+ slides)
- How to use most-frequent values histograms: click here (+ slides)
- How to use equi-width histograms: click here (+ slides)

Homework 6: click here

Cost-estimation-based approach to find the optimal join ordering

Intro: click here (+ slides)
The simplest case: choosing a join order for R⋈S (R⋈S or S⋈R ?): click here (+ slides)
Left-deep trees:
- Intro: click here (+ slides)
- One-pass join algorithm using a left-deep tree vs. other trees: click here (+ slides)
- Nested-loop join algorithm using a left-deep tree vs. other trees: click here (+ slides)
- Recommendation: click here
Finding the best left-deep join tree:
- The Dynamic Programming approach (= exhaustive search):
  - Dynamic Programming: click here (+ slides)
  - Data structure used by the Dyn. Prog. algorithm to find the best join ordering: click here (+ slides)
  - The Dyn. Prog. algorithm: click here (+ slides)
  - A worked out example: click here (+ slides)

A greedy heuristic to find the best left-deep join ordering:

Greedy heuristic to find optimal join ordering in left-deep trees: click here (+ slides)

The physical query plan

Intro: click here (+ slides)
Selecting an algorithm for σ: click here (+ slides)
Guidelines for selecting an algorithm for ⋈: click here (+ slides)
Buffer availability and query execution using pipelining
- Intro - review pipelining: click here (+ slides)
- Buffer utilization of pipelining of the unary operators (σ and π): click here (+ slides)
- Finding the minimal buffer requirement for pipeline execution of join operations: click here (+ slides)
- Algorithm selection for pipeline execution of join (⋈) operations for a given buffer availability M (=101):
  - Intro: click here (+ slides)
  - Example of pipelining ⋈ operators if (B(R⋈S) ≤ 49): click here (+ slides)
  - Example continues for (50 < B(R⋈S) ≤ 5000): click here (+ slides)
  - Example continues further for (B(R⋈S) > 5000): click here (+ slides)
Materialize vs. pipelining: click here (+ slides)

Homework 7: click here

Ensuring database consistency

Recoverability: protecting database from system failure (logging)

Ensuring database integrity against system failures ---- intro to logging click here (+ slides)

Correctness model in Database Systems

Modeling consistency of a database -- database state and transaction: click here (+ slides)
What causes an inconsistent database state: click here (+ slides)
Transactions: click here (+ slides)
Implementing Transactions: click here (+ slides)

Intro to logging: click here (+ slides)

Undo logging:

Intro to undo logging: click here (+ slides)
Recovery using undo logging: click here (+ slides)

System crash during recovery: click here (+ slides)

Checkpointing the undo log:

Intro to log checkpointing click here (+ slides)

Quiescent log checkpointing: click here (+ slides)

Nonquiescent checkpointing algorithm for the undo log:

Nonquiescent checkpointing procedure for the UNDO log: click here (+ slides)
Recovering using a checkpointed UNDO log: click here (+ slides)

Redo logging:

Intro to redo logging (redo-log write rule): click here (+ slides)

Recovery using redo logging:

Property of redo log (to understand recovery procedure in redo-logging): click here (+ slides)
Recovery procedure from system failures using redo log: click here (+ slides)

Nonquiescent Checkpointing for REDO log:

Performing a nonquiescent checkpoint on a REDO log: click here (+ slides)
Recovering using a checkpointed REDO log: click here (+ slides)

Undo/Redo logging:

Intro to Undo/redo logging: click here (+ slides)

Recovery using Undo/redo logging:

Property of undo-redo log (to understand recovery procedure in undo/redo-logging): click here (+ slides)
Recovery procedure from system failures using undo-redo log: click here (+ slides)

Nonquiescent Checkpointing for undo/redo log:

Performing a nonquiescent checkpoint on an UNDO/REDO log: click here (+ slides)
Recovering using a checkpointed UNDO/REDO log: click here (+ slides)

Homework 8: click here
Serializability: correctness of concurrent execution of transactions
- Intro to concurrency control: click here
- Serializability:
  - Serial schedules: click here
  - Serializable schedules: click here
  - Removing semantics of transaction in determining serializability: click here
- Conflict-serializability: a more practical type of serializability
  - Conflicting operations: click here
  - Conflict-serializable schedules (conflict-serializability): click here
  - Precedence graph test for conflict-serializability: click here
  - Proof of correctness of the precedence graph test: click here
- Exclusive locks:
  - Intro to locks: click here
  - Exclusive locks: click here
  - 2-phase locking (sufficient for enforcing conflict-serializability):
    - Intro: click here
    - Proof that 2-phase locking can guarantee conflict-serializability: click here
- Deadlocks: click here
- Shared/Exclusive locking:
  - Intro to Shared/Exclusive locks: click here
  - Lock compatibility matrix: click here
- Upgrading locks:
  - Read first and write later - upgrading a lock ?: click here
  - Upgrading locks: the new "update lock" locking mode: click here
- Increment/decrement locking:
  - Increment/decrement operation: click here
  - The "increment" lock: click here
- Implementation of locks:
  - Architecture of a lock scheduler: click here
  - Architecture of the lock table: click here
  - Handling lock requests: click here
Serializability and Recoverability
- Introduction:
  - Interaction between recoverability and serializability: click here
  - Cascading rollbacks:
    - Intro: click here
    - Rollback method for undo logging: click here
    - Rollback method for redo logging: click here
    - Rollback method for undo/redo logging: click here
- Recoverable schedules:
  - Intro: click here
  - Serializability and recoverability: click here
- Recoverable and serializable schedules:
  - Intro: click here
  - ACR schedules --- a subset of recoverable schedules:
    - Intro: click here
    - An ACR schedule is a recoverable schedule: click here
  - A summary of the schedules: click here
  - Strict 2-phase locking --- enforcing serializable and recoverable schedules: click here
Deadlock
- Intro: click here
- Deadlock detection:
  - using time out: click here
  - using wait-for graph: click here
- Deadlock prevention:
  - using ordered DB elements: click here
  - using wait-die timestamp scheme: click here
  - using wound-wait timestamp scheme: click here
  - Comparing wait-die and wound-wait schemes: click here
High-performance (parallel and distributed) database systems
Parallel Data Processing Algorithms
- Parallel computer architectures: click here
- Tuple storage to support/assist parallel algorithms: click here
- Unary operators:
  - Parallel algorithm for selection σ: click here
  - Parallel algorithm for duplicate elimination δ: click here
  - Parallel algorithm for projection π: click here
  - Cost simplification in parallel data processing: click here
  - Parallel algorithm for grouping γ_L: click here
- Binary operators:
  - Parallel algorithm for ∩: click here
  - Parallel algorithm for ∪: click here
  - Parallel algorithm for −: click here
  - Parallel algorithm for ⋈: click here
The Map-Reduce Parallelism framework
- MapReduce --- a specific parallel processing pattern: click here
- Classic (introductory) MapReduce algorithms:
  - Compute the inverted index: click here
  - Count the number of occurrences of words in documents: click here
- MapReduce Algorithm for Matrix Multiplication: click here
- MapReduce Algorithm for Equi Join: click here
Distributed Databases: query processing
- Introduction:
  - Characteristics of Distributed databases: click here
  - Data fragmentation and sharding in distributed database systems: click here
  - Issues caused by distributed data storage and processing (atomic commit, locks): click here
- Distributed query processing - distributed join:
  - Query processing in distributed system: click here
  - Cost simplification in distributed data processing: click here
  - Reducing the communication cost using semi-join (⋉): click here
  - Using a Bloom filter to eliminate dangling tuples: click here
  - Removing dangling tuples in join of many relations:
    - Full reducers: click here
    - Acyclic hypergraphs: pre-req to finding full reducers: click here
    - Constructing full reducers for acyclic hypergraphs: click here
Committing Distributed Transactions
- Introduction: click here
- The 2 phase distributed commit protocol: click here
- Recovery of distributed transactions: click here
"Big Data" systems (and NOSQL)
- Introduction to NOSQL systems: click here
- Characteristic/Emphasis of NOSQL systems (how to achieve high performance): click here
- Data model and query languages of NOSQL systems: click here
- Categories of NOSQL systems: click here
- The CAP Theorem: click here
- Key-value based NOSQL Systems:
  - Intro: click here
  - Amazon's DynamoDB: click here
  - Distributed storage in Amazon's DynamoDB (Consistent Hashing): click here
  - Supporting data replication with Consistent Hashing: click here
  - Adding nodes: click here
- Document based NOSQL Systems: click here
- Column based NOSQL Systems: click here
- Graph DB NOSQL Systems: click here
