DDIA: Chapter 6
Partitioning
- For very large datasets or very high query throughput, replication is not enough
- partitions (sharding, region, tablet, vnode, vBucket): a piece of data belongs to exactly one partition (most of the time); each partition acts as a small database
- partitions serve the purpose of scalability
- query throughput can be scaled by adding more nodes
- complex queries can be parallelized across nodes
Partitioning and Replication
- partitioning is usually combined with replication (scalability + fault tolerance)
- a node may store more than one partition
Partitioning of Key-Value Data
- should spread data and query load as evenly across nodes as possible
- skewed: some partitions have more data or queries
- hot spot: a partition with disproportionately high load
- distributing records randomly spreads the data evenly, but makes reads hard: there's no way to know which node holds a given key, so all nodes have to be queried
Partitioning by Key Range
- key range partitioning: a partition owns all keys from a min to max; can sort on keys
- assign a continuous range of keys to each partition
- queries can be routed to appropriate ranges
- key ranges are not necessarily evenly spaced, because the data itself may be unevenly distributed
- boundaries can be chosen manually or automatically
- within a partition, keys can be kept in sorted order (efficient range scans)
- cons:
- access patterns can lead to hot spots, e.g. timeseries data partitioned by date => choose key carefully
Partitioning by Hash of Key
- hash partitioning: a partition owns a range of hashes; destroys the ordering of keys
- a good hash should make skewed data uniformly distributed
- need not be cryptographically strong
- consistent hashing: partition boundaries chosen pseudorandomly (doesn't work well for databases, so it's rarely used in practice)
- cons:
- lose ability to do range queries (keys are scattered)
- compound primary key: key is several columns, and only first part is hashed - rest is used for sorting data (Cassandra) - good for one-to-many relationships
Skewed Workloads and Relieving Hot Spots
- it's the application's responsibility to reduce skew (e.g. add a random number to the beginning or end of a hot key; the consequence is that reads have to do additional work, fetching and combining data from all the split keys)
Partitioning and Secondary Indexes
- secondary indexes don’t map neatly to partitions
Partitioning Secondary Indexes by Document
- document-partitioned indexes (local index): secondary indexes are stored in the same partition as the primary key and value
- only a single partition needs to be updated on write
- reading requires care: data may be scattered across partitions for a secondary index, requiring a read across all partitions
- scatter/gather: read across all partitions and combine results (potentially expensive, but commonly used, e.g. MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB)
- try to structure the partitioning scheme so that secondary index queries can be served from a single partition (not always possible)
Partitioning Secondary Indexes by Term
- term-partitioned indexes (global index): secondary indexes are partitioned separately using the indexed values (e.g. a range); covers data in all partitions
- can make reads more efficient: a client only requests the index partition that owns the term it wants, though the entries found there may point at records in any primary-key partition
- writes are slower and more complicated: a single document write may affect index entries on several different partitions
- updates to global secondary indexes are therefore often asynchronous, since keeping them synchronous would require a distributed transaction across all affected partitions
Rebalancing Partitions
- change can be due to query throughput, dataset size, machine failure, etc.
- rebalancing: moving load from one node to another
- after rebalancing, load should be shared fairly
- during rebalancing, database should continue to accept reads/writes
- minimize data moved
Strategies for rebalancing
- hash mod N: assign each key to node number hash(key) mod N (e.g. take the hash as a decimal number, mod 10 for 10 nodes) - see the contrast sketch after this list
- problematic: if the number of nodes N changes, most keys have to be moved from one node to another
- moves far more data than necessary, making rebalancing excessively expensive
- fixed number of partitions: create many more partitions than nodes and assign several partitions to each node (see the contrast sketch after this list)
- a new node added can steal partitions from every existing node
- only entire partitions are moved; old partition can be read during transfer
- can account for mismatched hardware by assigning more partitions to more powerful nodes
- hard to choose the right number of partitions if the size of the dataset is variable (partitions that are too large make rebalancing and recovery expensive; too small, and they add overhead)
- dynamic partitioning: when a partition exceeds a configured threshold, it is split into two, so the number of partitions adapts to the size of the dataset (split sketch after this list)
- each partition is assigned to one node, each node can handle multiple partitions
- after splitting, a partition can be moved to another node
- caveat: an empty database starts with a single partition, so until the split threshold is first hit, all writes are processed by a single node while the others sit idle
- pre-splitting: configure initial set of partitions on an empty database (e.g. on key distribution)
- can be used on key range-partitioned or hash-partitioned data
- partitioning proportionally to nodes: make the number of partitions proportional to the number of nodes (Cassandra)
- fixed number of partitions per node; each partition's size grows in proportion to the dataset while the number of nodes stays constant
- adding nodes will reduce sizes of partitions
- a new node randomly chooses a fixed number of existing partitions to split and takes half (may split unfairly)
- picking partition boundaries randomly requires hash-based partitioning
Operations: Automatic or Manual Rebalancing
- rebalancing is expensive and, if fully automatic, can overwhelm the network or the nodes; combined with automatic failure detection it can cause cascading failure: an overloaded node is perceived as failed, its data is moved off, and the load increases even further
- fully automated rebalancing can be convenient but unpredictable
Request Routing
- service discovery: how does a client know which node to send a request to, as partitions are rebalanced across nodes?
- strategies:
- allow a client to contact any node (e.g. via a round-robin load balancer); a node that doesn't own the partition forwards the request to the node that does
- send all requests to a routing tier which determines the appropriate node (partition-aware load balancer)
- require clients be aware of partitioning; a client connects directly
- how does a routing component learn about changes?
- separate coordination service (e.g. Zookeeper maintains authoritative mapping of partitions to nodes)
- gossip protocol: disseminate changes in cluster state, requests can be sent to any node and a node will forward to the appropriate node - puts more complexity in the database nodes but avoids external service (e.g. Cassandra, Riak)
- a routing tier that learns about routing changes from the cluster nodes themselves (e.g. Couchbase, which does not rebalance automatically)
- DNS is usually sufficient for resolving a node's IP address, since IPs change far less often than partition assignments
Parallel Query Execution
- simple queries only read/write a single key (or scatter/gather)
- massively parallel processing (MPP): breaks down a complex query into execution stages and partitions that can be executed in parallel on different nodes (e.g. join, filter, group, aggregate - often used for analytics)