DDIA: Chapter 3
Storage and Retrieval
- Storage engines!
- log-structured vs. page-oriented
Data Structures That Power Your Database
- simplest database ever => just append to a file and get the last matching entry
- costly to scan through a large number of records
- an index is metadata which helps locate data, derived from primary data
- maintaining additional data incurs overhead on write
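The "simplest database" idea can be sketched in a few lines of Python (file and function names are illustrative):

```python
DB_FILE = "simple_db.log"  # hypothetical file name

def db_set(key, value):
    # O(1): append a "key,value" line to the end of the file
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # O(n): scan the whole file; the last matching entry wins
    result = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result
```

Writes are cheap (one append), but every read scans the full file - which is why the index structures below exist.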
Hash Indexes
- e.g. Bitcask
- for an append-based storage, can keep a hash map that maps a key to a byte offset
- compaction - throw away duplicate keys in a log (keep only most recent);
- compaction and merging (of separate segments) can be done in a background thread while serving R/W
- considerations:
- file format: binary (encode length of a string) more efficient than CSV (parse)
- deleting records: append a deletion record (tombstone) - flag to discard
- crash recovery: have to read through files, can store snapshots of segment hash maps
- partially written records: checksums
- concurrency control: one writer thread, concurrent reads
- benefits:
- appending and segment merging are sequential writes - faster than random writes
- immutability = better concurrency and crash recovery
- merging avoids fragmentation issues
- limitations:
- hash table must fit in memory (an on-disk hash map performs poorly)
- range queries are inefficient (keys must be looked up individually)
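A minimal sketch of the Bitcask-style approach, assuming a single log file and a plain dict as the in-memory hash index (class and file names are hypothetical):

```python
import os

class HashIndexLog:
    """Append-only log with an in-memory hash index (Bitcask-style sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}          # key -> byte offset of the latest record
        open(path, "ab").close() # ensure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # point at the most recent record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)        # one disk seek instead of a full scan
            line = f.readline().decode()
        return line.rstrip("\n").partition(",")[2]
```

Note the limitations from above in miniature: `self.index` must hold every key in memory, and a range query over keys would have to call `get` once per key.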
SSTables and LSM-Trees
- SSTable: Sorted String Table - key-value pairs are sorted by key
- merging is efficient - mergesort and take value from most recent segment
- because of sorting, no need to index every key - a sparse index (e.g. one key every few kilobytes) is enough; scan forward from the nearest indexed key
- can group and compress records before writing to disk
- keep keys in a sorted in-memory tree (memtable); write it out as an SSTable segment when it grows too big; to look up a key, check the memtable first, then segments from newest to oldest; merge and compact segments in the background
- for crash recovery, also append recent writes to an unsorted log on disk (discarded once the memtable is flushed)
- slow for missing keys (traverse all data - can use Bloom filter)
- e.g. LevelDB, RocksDB, Cassandra, HBase, Bigtable, Lucene term dictionary
- also called LSM-Tree: Log-Structured Merge-Tree
- pros:
- higher write throughput (due to sequential)
- can be compressed better (smaller, less fragmented)
- cons:
- write amplification (repeated compaction) can interfere with ongoing R/W
- write bandwidth is shared with compaction
- compaction may not be able to keep up with incoming writes
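The memtable-plus-segments flow above can be sketched as a toy in-memory model (real engines use a balanced tree for the memtable and on-disk SSTable files; all names here are illustrative):

```python
class MiniLSM:
    """Toy LSM-tree: a memtable flushed into sorted, immutable segments."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # stand-in for a sorted in-memory tree
        self.segments = []   # sorted key->value maps, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # write keys out in sorted order -> an SSTable-like segment
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # check the memtable first, then segments from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        # mergesort-style compaction: the most recent value for each key wins
        merged = {}
        for segment in self.segments:  # oldest first, so newer overwrite older
            merged.update(segment)
        self.segments = [dict(sorted(merged.items()))]
```

A missing key is the worst case for `get` - it touches the memtable and every segment, which is exactly what a Bloom filter short-circuits in real engines.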
- B-trees also keep KV pairs sorted by key
- Fixed-size blocks/pages (~4 KB) rather than variable-size segments
- pages can refer to another: can construct a tree of pages that contains ranges of keys
- branching factor: number of references to child pages per page
- if a page doesn't have enough space for a new key, split it into two pages and update the parent's references
- n keys = O(log n) depth
- write-ahead-log (WAL or redo log) - append-only file where B-tree modifications are written before applied for crash recovery
- multithreading needs control (latches)
- optimizations:
- can use copy-on-write scheme (useful for concurrency)
- can abbreviate keys
- can try to lay out leaf pages sequentially on disk
- leaf pages can keep references to their left and right siblings (speeds up scanning keys in order)
- pros:
- each key exists in exactly one place - stronger transactional semantics
- cons:
- must write every piece of data at least twice: to the write-ahead log and to the tree page itself (plus again when pages split)
- data is more fragmented - takes more space
- LSM-trees are generally faster for writes, B-trees are generally faster for reads
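The page-based lookup described above can be sketched as follows, assuming interior pages hold child references and leaf pages hold values (a simplified in-memory model - real B-trees store pages on disk):

```python
import bisect

class Page:
    """Simplified B-tree page: interior pages hold children, leaves hold values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # len(keys) + 1 children for interior pages
        self.values = values      # parallel to keys for leaf pages

def search(page, key):
    # O(log n) descent: each page narrows the key range via binary search
    while page.children is not None:
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None
```

With a realistic branching factor of several hundred, a tree only three or four levels deep already covers a very large key space, which is why lookups need so few page reads.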
Other Indexing Structures
- KV indexes = primary key, can also have secondary indexes (e.g. for joins)
- heap file: rows stored in no particular order, referenced from the index; efficient to overwrite a value in place if the new value is no larger, but the extra hop can be a read penalty (alternative: clustered index - store the indexed row directly within the index)
- partial solution: covering index or index with included columns (stores some of a row's columns within the index)
- multi-column indexes:
- concatenated index - several fields => one key; works only from left to right for searching a particular key
- multi-dimensional index, e.g. latitude + longitude (B-tree or LSM-tree index can’t do this efficiently, but can translate)
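The left-to-right limitation of a concatenated index can be illustrated with sorted tuple keys (the data is made up):

```python
import bisect

# a concatenated index on (last_name, first_name): keys kept in sorted order
index = sorted([
    (("jones", "carol"), 3),
    (("smith", "alice"), 1),
    (("smith", "bob"), 2),
])
keys = [k for k, _ in index]

# searching by the leftmost column works: all ("smith", *) keys are
# contiguous, so a binary search finds the start of the range
lo = bisect.bisect_left(keys, ("smith",))  # ("smith",) sorts before ("smith", x)
smith_rows = []
i = lo
while i < len(keys) and keys[i][0] == "smith":
    smith_rows.append(index[i][1])
    i += 1
# searching by first_name alone can't use this sort order - it needs a full scan
```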
- full-text search and fuzzy indexes: allow searching for similar keys, e.g. words within some edit distance (as in Lucene)
- in-memory databases:
- some intended for caching, okay if lost: e.g. Memcached
- in-memory relational databases exist, e.g. VoltDB, MemSQL, Oracle TimesTen
- some offer durability, e.g. RAMCloud, Redis, Couchbase (write to disk asynchronously)
- faster mainly because they avoid the overhead of encoding in-memory data structures in a form that can be written to disk
- anti-caching: evict least recently used from memory to disk, load when needed
Transaction Processing or Analytics
- transaction: group of writes and reads that form a logical unit
- OLTP: online transaction processing
- small number of records per query fetched by key
- user-input, user-facing
- latest data
- limited by disk seek time
- OLAP: online analytic processing
- aggregate over large number of records
- bulk import (ETL) or event stream
- history of events/data
- SQL often a good fit (drill-down, slicing and dicing)
- limited by disk bandwidth
Data Warehousing
- data warehouse: separate database for querying that doesn’t affect OLTP operations
- ETL: Extract-Transform-Load
- star schema (dimensional modeling) - foreign key references across tables (fact tables star out to dimension tables)
- snowflake schema - similar to star schema but with subdimensions
Column-Oriented Storage
- don’t store all the values from one row together; store all the values from each column together => only read and parse columns used in a particular query
- column files keep rows in the same order, so the k-th entry of every column belongs to the same row (like indexing into parallel arrays)
- easy to compress, e.g. by bitmap encoding - number of distinct values can be small
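Bitmap encoding can be sketched like this (a naive list-of-bits model; real systems pack and run-length-encode the bitmaps):

```python
def bitmap_encode(column):
    """One bitmap per distinct value; bit i is set if row i holds that value."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[i] = 1
    return bitmaps

def rows_matching(bitmaps, *values):
    # "column IN (...)" becomes a bitwise OR of the matching bitmaps
    n = len(next(iter(bitmaps.values())))
    combined = [0] * n
    for v in values:
        combined = [a | b for a, b in zip(combined, bitmaps[v])]
    return [i for i, bit in enumerate(combined) if bit]
```

When the number of distinct values is small relative to the row count, each bitmap is mostly zeros and compresses extremely well.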
Sort Order in Column Storage
- rows are normally stored in insertion order, but a sort order can be imposed and used as an index (as with SSTables)
- can choose by which columns tables are sorted (e.g. date)
- can store multiple sort-orders
- writes are difficult - can't update compressed columns in place (solution: buffer writes in an in-memory store, LSM-style, and merge with the column files on disk)
Aggregation: Data Cubes and Materialized Views
- materialized aggregates: cache aggregates for common queries through materialized views
- needs to be updated when underlying data is updated
- data cube (OLAP cube) - precomputed but not as flexible as querying raw data
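A data cube can be sketched as precomputed aggregates over every combination of dimension values, here with two dimensions and None standing for "all" (the schema and names are made up):

```python
from collections import defaultdict

def build_cube(facts):
    """Precompute sums per (date, product) cell plus marginal totals."""
    cube = defaultdict(float)
    for date, product, amount in facts:
        cube[(date, product)] += amount  # individual cell
        cube[(date, None)] += amount     # total per date
        cube[(None, product)] += amount  # total per product
        cube[(None, None)] += amount     # grand total
    return cube
```

Common aggregate queries become dictionary lookups instead of scans, but anything not precomputed (e.g. filtering on an attribute that isn't a cube dimension) still requires going back to the raw data - the flexibility trade-off noted above.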