DDIA: Chapter 6
Partitioning
- For very large datasets or very high query throughput, replication is not enough
- partitions (sharding, region, tablet, vnode, vBucket): a piece of data belongs to exactly one partition (most of the time); each partition acts as a small database
- partitions serve the purpose of scalability
- query throughput can be scaled by adding more nodes
- complex queries can be parallelized across nodes
Partitioning and Replication
- partitioning is usually combined with replication (scalability + fault tolerance)
- a node may store more than one partition
Partitioning of Key-Value Data
- should spread data and query load as evenly across nodes as possible
- skewed: some partitions have more data or queries
- hot spot: a partition with disproportionately high load
- distributing records randomly spreads the data evenly, but makes reads hard: there's no way to know which node holds a given key, so all nodes have to be queried
Partitioning by Key Range
- key range partitioning: a partition owns all keys from a min to max; can sort on keys
- assign a continuous range of keys to each partition
- queries can be routed to appropriate ranges
- key ranges are not necessarily evenly spaced, because the data itself may be unevenly distributed
- boundaries can be chosen manually or automatically
- within a partition, keys can be kept in sorted order (efficient range scans)
- cons:
- access patterns can lead to hot spots, e.g. timeseries data partitioned by date => choose key carefully
Partitioning by Hash of Key
- hash partitioning: a partition owns a range of hashes; destroys the ordering of keys
- a good hash should make skewed data uniformly distributed
- need not be cryptographically strong
- consistent hashing: partition boundaries chosen pseudorandomly (doesn't work well for databases, so it's rarely used in practice)
- cons:
- lose ability to do range queries (keys are scattered)
- compound primary key: key is several columns, and only first part is hashed - rest is used for sorting data (Cassandra) - good for one-to-many relationships
Skewed Workloads and Relieving Hot Spots
- it's the application's responsibility to reduce skew (e.g. add a random number to the beginning or end of a hot key; the consequence is that reads have to do additional work, fetching and combining data from all the split keys)
Partitioning and Secondary Indexes
- secondary indexes don’t map neatly to partitions
Partitioning Secondary Indexes by Document
- document-partitioned indexes (local index): secondary indexes are stored in the same partition as the primary key and value
- only a single partition needs to be updated on write
- reading requires care: data may be scattered across partitions for a secondary index, requiring a read across all partitions
- scatter/gather: read across all partitions and combine results (potentially expensive, but commonly used, e.g. MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB)
- try to structure the partitioning scheme so that secondary index queries can be served from a single partition (not always possible)
Partitioning Secondary Indexes by Term
- term-partitioned indexes (global index): secondary indexes are partitioned separately using the indexed values (e.g. a range); covers data in all partitions
- can make reads more efficient: a client only requests the index partition that owns the term it wants, though the entries found there may point at records in any primary-key partition
- writes are slower and more complicated: a single document write may affect index entries on several different partitions
- updates to global secondary indexes are therefore often asynchronous, since keeping them synchronous would require a distributed transaction across all affected partitions
Rebalancing Partitions
- change can be due to query throughput, dataset size, machine failure, etc.
- rebalancing: moving load from one node to another
- after rebalancing, load should be shared fairly
- during rebalancing, database should continue to accept reads/writes
- minimize data moved
Strategies for rebalancing
- hash mod N: assign each key to node number hash(key) mod N (e.g. take the hash as a decimal number, mod 10 for 10 nodes) - see the contrast sketch after this list
- problematic: if the number of nodes N changes, most keys have to be moved from one node to another
- moves far more data than necessary, making rebalancing excessively expensive
- fixed number of partitions: create many more partitions than nodes and assign several partitions to each node (see the contrast sketch after this list)
- a new node added can steal partitions from every existing node
- only entire partitions are moved; old partition can be read during transfer
- can account for mismatched hardware by assigning more partitions to more powerful nodes
- hard to choose the right number of partitions if the size of the dataset is variable (partitions that are too large make rebalancing and recovery expensive; too small, and they add overhead)
- dynamic partitioning: when a partition exceeds a configured threshold, it is split into two, so the number of partitions adapts to the size of the dataset (split sketch after this list)
- each partition is assigned to one node, each node can handle multiple partitions
- after splitting, a partition can be moved to another node
- caveat: an empty database starts with a single partition, so until the split threshold is first hit, all writes are processed by a single node while the others sit idle
- pre-splitting: configure initial set of partitions on an empty database (e.g. on key distribution)
- can be used on key range-partitioned or hash-partitioned data
- partitioning proportionally to nodes: make the number of partitions proportional to the number of nodes (Cassandra)
- fixed number of partitions per node; each partition's size grows in proportion to the dataset while the number of nodes stays constant
- adding nodes will reduce sizes of partitions
- a new node randomly chooses a fixed number of existing partitions to split and takes half (may split unfairly)
- picking partition boundaries randomly requires hash-based partitioning
Operations: Automatic or Manual Rebalancing
- rebalancing is expensive and, if fully automatic, can overwhelm the network or the nodes; combined with automatic failure detection it can cause cascading failure: an overloaded node is perceived as failed, its data is moved off, and the load increases even further
- fully automated rebalancing can be convenient but unpredictable
Request Routing
- service discovery: how does a client know which node to send a request to, as partitions are rebalanced across nodes?
- strategies:
- allow a client to contact any node (e.g. via a round-robin load balancer); a node that doesn't own the partition forwards the request to the node that does
- send all requests to a routing tier which determines the appropriate node (partition-aware load balancer)
- require clients be aware of partitioning; a client connects directly
- how does a routing component learn about changes?
- separate coordination service (e.g. Zookeeper maintains authoritative mapping of partitions to nodes)
- gossip protocol: disseminate changes in cluster state, requests can be sent to any node and a node will forward to the appropriate node - puts more complexity in the database nodes but avoids external service (e.g. Cassandra, Riak)
- a routing tier that learns about routing changes from the cluster nodes themselves (e.g. Couchbase, which does not rebalance automatically)
- DNS is usually sufficient for resolving a node's IP address, since IPs change far less often than partition assignments
Parallel Query Execution
- simple queries only read/write a single key (or scatter/gather)
- massively parallel processing (MPP): breaks down a complex query into execution stages and partitions that can be executed in parallel on different nodes (e.g. join, filter, group, aggregate - often used for analytics)