DDIA: Chapter 3

Storage and Retrieval

  • Storage engines!
  • log-structured vs. page-oriented

Data Structures That Power Your Database

  • simplest database ever => just append to a file and get the last matching entry
    • costly to scan through a large number of records
  • an index is an additional structure derived from the primary data that helps locate records
  • maintaining additional data incurs overhead on write
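The "simplest database" above can be sketched in a few lines of Python (names and file format are illustrative; the book uses a two-line bash version of the same idea):

```python
# Minimal append-only "database": writes append to a file,
# reads scan the whole file and keep the last matching entry.
import os

DB_FILE = "simple.db"  # illustrative filename

def db_set(key, value):
    # fast: just append a line to the end of the file
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # slow: O(n) scan; the last value seen for the key wins
    if not os.path.exists(DB_FILE):
        return None
    value = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v
    return value
```

Writes are cheap, but every read scans the entire file - which is exactly the cost an index exists to avoid.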

Hash Indexes

  • e.g. Bitcask
  • for append-only storage, keep an in-memory hash map that maps each key to a byte offset in the data file
  • compaction - throw away duplicate keys in a log, keeping only the most recent value for each key
  • compaction and merging (of separate segments) can be done in a background thread while serving R/W
  • considerations:
    • file format: binary (length-prefixed strings) is more efficient than CSV (no escaping or parsing)
    • deleting records: append a deletion record (tombstone) - flag to discard
    • crash recovery: rebuilding the hash maps means reading entire segment files; can store snapshots of segment hash maps on disk instead
    • partially written records: checksums
    • concurrency control: one writer thread, concurrent reads
  • benefits:
    • appending and segment merging are sequential writes - faster than random writes
    • immutability = better concurrency and crash recovery
    • merging avoids fragmentation issues
  • limitations:
    • the hash table must fit in memory (an on-disk hash map performs poorly)
    • range queries are inefficient (each key must be looked up individually)
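A hedged sketch of the Bitcask-style idea: an append-only log plus an in-memory dict mapping each key to the byte offset of its latest record (class name and record format are made up, not Bitcask's actual format):

```python
# Append-only log + in-memory hash index: key -> byte offset of latest value.
class HashIndexStore:
    def __init__(self, path):
        self.path = path
        self.index = {}          # key -> byte offset of most recent record
        open(path, "a").close()  # ensure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()    # append position = offset of the new record
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # newer writes simply shadow older ones

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)        # one seek, no scan through the log
            line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```

Note the limitation from the bullets above: `self.index` holds every key in memory, and a range query over keys would require one `get` per key.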

SSTables and LSM-Trees

  • SSTable: Sorted String Table - key-value pairs are sorted by key
    • merging is efficient - mergesort and take value from most recent segment
    • because of sorting, no need to index every key - a sparse in-memory index (e.g. one key every few kilobytes) suffices
    • can group records into blocks and compress them before writing to disk
    • keep incoming writes in a sorted in-memory tree (memtable); write it out as an SSTable segment when it grows too big; to read a key, check the memtable first, then segments from newest to oldest; merge and compact segments in the background
    • for crash recovery, also keep an append-only log of recent writes on disk (discarded once the memtable is written out)
    • lookups of missing keys are slow (must check the memtable and every segment - a Bloom filter helps)
    • e.g. LevelDB, RocksDB, Cassandra, HBase, Bigtable, Lucene term dictionary
    • also called LSM-Tree: Log-Structured Merge-Tree
    • pros:
      • higher write throughput (sequential writes)
      • can be compressed better (smaller, less fragmented)
    • cons:
      • write amplification (repeated compaction) can interfere with ongoing R/W
      • write bandwidth is shared with compaction
      • compaction may not be able to keep up with incoming writes
  • B-trees also keep KV pairs sorted by key
    • fixed-size blocks/pages (~4 KB) rather than variable-size segments
    • pages can refer to one another: can construct a tree of pages, each covering a range of keys
    • branching factor: number of references to child pages per page
    • when adding a key to a full page, split it into two pages and update the parent's references
    • n keys = O(log n) depth
    • write-ahead-log (WAL or redo log) - append-only file where B-tree modifications are written before applied for crash recovery
    • multithreading needs control (latches)
    • optimizations:
      • can use copy-on-write scheme (useful for concurrency)
      • can abbreviate keys
      • can try to lay out leaf pages sequentially on disk (LSM-trees keep this easier, since merging rewrites whole segments)
      • leaf pages can hold references to their left and right siblings (faster ordered scans)
    • pros:
      • each key exists in exactly one place - stronger transactional semantics
    • cons:
      • every write goes to disk at least twice: once to the write-ahead log and once to the tree page itself (plus possible page splits)
      • data is more fragmented - takes more space
  • LSM-trees are generally faster for writes, B-trees are generally faster for reads
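The memtable/SSTable flow above can be sketched as a toy LSM store (a plain dict stands in for the sorted tree, lists stand in for on-disk segments; real engines add a WAL, sparse indexes, Bloom filters, and background compaction):

```python
# Toy LSM store: in-memory memtable, flushed to immutable sorted segments.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}   # stand-in for a sorted in-memory tree
        self.segments = []   # sorted key/value lists, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # write the memtable out as a sorted, immutable "SSTable"
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # check memtable first
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment first
            for k, v in segment:          # a real SSTable would binary-search
                if k == key:
                    return v
        return None

    def compact(self):
        # merge all segments, keeping only the most recent value per key
        merged = {}
        for segment in self.segments:     # oldest to newest
            merged.update(dict(segment))
        self.segments = [sorted(merged.items())]
```

`compact()` shows why merging is cheap for SSTables: segments are already sorted, so a real implementation streams a mergesort instead of materializing a dict.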

Other Indexing Structures

  • KV indexes = primary key, can also have secondary indexes (e.g. for joins)
  • heap file: rows are stored in no particular order and referenced from the index - efficient for overwriting in place if the new value is no larger; the extra hop can be a read penalty (alternative: clustered index, which stores the indexed row inside the index itself)
  • partial solution: covering index, or index with included columns
  • multi-column indexes:
    • concatenated index - several fields combined into one key; only useful for queries on a leftmost prefix of the columns
    • multi-dimensional index, e.g. latitude + longitude (B-tree or LSM-tree index can’t do this efficiently, but can translate)
  • full-text search and fuzzy indexes:
    • edit distance, trie
  • in-memory
    • some intended for caching, okay if lost: e.g. Memcached
    • in-memory relational databases exist, e.g. VoltDB, MemSQL, Oracle TimesTen
    • some offer durability, e.g. RAMCloud, Redis, Couchbase (write to disk asynchronously)
    • faster mainly because they avoid the overhead of encoding in-memory data structures into a form that can be written to disk
    • anti-caching: evict least recently used from memory to disk, load when needed
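The leftmost-prefix limitation of a concatenated index can be shown with sorted tuple keys (data and field names are made up):

```python
# Concatenated index: (lastname, firstname) combined into one sorted key.
# A range scan can serve lastname queries, but not firstname alone.
import bisect

rows = [("smith", "alice"), ("jones", "bob"), ("smith", "carol"), ("adams", "dan")]
index = sorted(rows)  # sorted by (lastname, firstname)

def lookup_lastname(lastname):
    # leftmost prefix: binary search for the contiguous range of matches
    lo = bisect.bisect_left(index, (lastname,))
    hi = bisect.bisect_right(index, (lastname, "\uffff"))
    return index[lo:hi]
```

A query on `firstname` alone gets no help from this index: the matching entries are scattered throughout the sort order, so it degenerates to a full scan.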

Transaction Processing or Analytics

  • transaction: group of writes and reads that form a logical unit
  • OLTP: online transaction processing
    • small number of records per query fetched by key
    • user-input, user-facing
    • latest data
    • limited by disk seek time
  • OLAP: online analytic processing
    • aggregate over large number of records
    • bulk import (ETL) or event stream
    • history of events/data
    • SQL often a good fit (drill-down, slicing and dicing)
    • limited by disk bandwidth

Data Warehousing

  • data warehouse: separate database for querying that doesn’t affect OLTP operations
  • ETL: Extract-Transform-Load
  • star schema (dimensional modeling) - a central fact table with foreign-key references out to surrounding dimension tables
  • snowflake schema - similar to star schema but with subdimensions

Column-Oriented Storage

  • don’t store all the values from one row together; store all the values from each column together => only read and parse columns used in a particular query
  • column files must keep rows in the same order, so the nth entry of each column belongs to the same row
  • easy to compress, e.g. by bitmap encoding - the number of distinct values in a column is often small
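Bitmap encoding in miniature: one bitmap per distinct value, with a 1 at each row position holding that value (the column data is made up; real systems run-length encode these bitmaps for further compression):

```python
# Bitmap-encode a column with few distinct values.
def bitmap_encode(column):
    bitmaps = {}
    for i, value in enumerate(column):
        # one bitmap per distinct value; bit i set if row i holds the value
        bitmaps.setdefault(value, [0] * len(column))[i] = 1
    return bitmaps

column = ["red", "blue", "red", "green", "red"]
bitmaps = bitmap_encode(column)
# bitmaps["red"] == [1, 0, 1, 0, 1]
```

Predicates like `WHERE color IN ('red', 'blue')` then become bitwise ORs of the relevant bitmaps.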

Sort Order in Column Storage

  • rows are normally stored in insertion order, but a sort order can be imposed
  • can choose by which columns tables are sorted (e.g. date)
  • can store multiple sort-orders
  • writes are harder - compressed, sorted columns can't be updated in place (LSM-style approach: buffer writes in memory, merge into the column files)

Aggregation: Data Cubes and Materialized Views

  • materialized aggregates: cache aggregates for common queries through materialized views
  • must be updated when the underlying data changes
  • data cube (OLAP cube) - precomputed but not as flexible as querying raw data
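A tiny illustration of the materialized-aggregate idea: precompute sums grouped by (date, product) so common queries become lookups or small loops over cube cells (the facts and dimensions are made up):

```python
# Precomputed 2-D "data cube": cell (date, product) -> total amount.
from collections import defaultdict

facts = [
    ("2024-01-01", "widget", 5),
    ("2024-01-01", "gadget", 3),
    ("2024-01-02", "widget", 7),
]

cube = defaultdict(int)
for date, product, amount in facts:
    cube[(date, product)] += amount   # aggregate once, at load time

def total_for_date(date):
    # roll up along the product dimension - no scan of the raw facts
    return sum(v for (d, _), v in cube.items() if d == date)
```

This is the trade-off in the last bullet: the roll-up is fast, but any query not expressible over the precomputed dimensions has to fall back to the raw data.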