DDIA: Chapter 6

Partitioning

  • For very large datasets or very high query throughput, replication is not enough
  • partitions (sharding, region, tablet, vnode, vBucket): a piece of data belongs to exactly one partition (most of the time); each partition acts as a small database
  • partitions serve the purpose of scalability
  • query throughput can be scaled by adding more nodes
  • complex queries can be parallelized across nodes

Partitioning and Replication

  • partitioning is usually combined with replication (scalability + fault tolerance)
  • a node may store more than one partition

Partitioning of Key-Value Data

  • should spread data and query load as evenly across nodes as possible
  • skewed: some partitions have more data or queries
  • hot spot: a partition with disproportionately high load
  • distributing records randomly spreads data evenly across nodes, but makes reads hard: there is no way to know which node holds a given key, so every node must be queried

Partitioning by Key Range

  • key range partitioning: a partition owns all keys from a min to max; can sort on keys
  • assign a continuous range of keys to each partition
  • queries can be routed to appropriate ranges
  • the ranges of keys are not necessarily evenly spaced, because the data itself may be unevenly distributed
  • boundaries can be chosen manually or automatically
  • within a partition, can keep keys in sorted order
  • cons:
    • access patterns can lead to hot spots, e.g. timeseries data partitioned by date => choose key carefully
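
A minimal sketch of key-range routing, assuming hypothetical partition boundaries (the boundary values and keys below are made up):

  import bisect

  # Hypothetical boundaries: partition 0 holds keys < "g", partition 1 holds
  # ["g", "n"), partition 2 holds ["n", "t"), partition 3 holds keys >= "t".
  # Keys stay in sorted order inside each partition.
  boundaries = ["g", "n", "t"]

  def partition_for(key: str) -> int:
      # Route a query to the partition whose range contains the key.
      return bisect.bisect_right(boundaries, key)

  assert partition_for("apple") == 0
  assert partition_for("grape") == 1
  assert partition_for("zebra") == 3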

Partitioning by Hash of Key

  • hash partitioning: a partition owns a range of hashes; destroys the ordering of keys
  • a good hash should make skewed data uniformly distributed
  • need not be cryptographically strong
  • consistent hashing: partition boundaries chosen pseudorandomly (doesn’t actually work well for databases, so the term is best avoided here)
  • cons:
    • lose ability to do range queries (keys are scattered)
  • compound primary key: the key consists of several columns; only the first part is hashed to determine the partition, and the rest is used for sorting the data within it (Cassandra) - good for one-to-many relationships (e.g. (user_id, update_timestamp) keeps one user’s updates together, sorted by time)
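
A rough sketch of partitioning by hash of key, including a Cassandra-style compound key where only the first column is hashed and the remaining column sorts rows within the partition; the hash-range boundaries and column names are assumptions:

  import bisect
  import hashlib

  # Each partition owns a range of the 32-bit hash space (boundaries made up).
  hash_boundaries = [2**30, 2**31, 3 * 2**30]   # 4 partitions

  def key_hash(key: str) -> int:
      # Need not be cryptographically strong, just well distributed.
      return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

  def partition_for_hash(key: str) -> int:
      return bisect.bisect_right(hash_boundaries, key_hash(key))

  # Compound primary key (user_id, timestamp): hash only user_id to pick the
  # partition; timestamp keeps that user's rows sorted within it.
  def place(user_id: str, timestamp: int) -> tuple[int, int]:
      return partition_for_hash(user_id), timestamp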

Skewed Workloads and Relieving Hot Spots

  • reducing skew is the responsibility of the application (e.g. can add a random number to the beginning or end of a hot key to spread writes => reads then have to do additional work, fetching all the resulting keys and combining the data)
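
A sketch of that application-level workaround: split a known hot key into a fixed number of sub-keys by appending a random number on writes, then read and merge all sub-keys (the fan-out of 10 is an arbitrary choice):

  import random

  FANOUT = 10  # how many sub-keys a hot key is split into

  def write_key(hot_key: str) -> str:
      # Writes are spread across FANOUT different keys (and thus partitions).
      return f"{hot_key}_{random.randrange(FANOUT)}"

  def read_keys(hot_key: str) -> list[str]:
      # Reads must fetch every sub-key and combine the results.
      return [f"{hot_key}_{i}" for i in range(FANOUT)]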

Partitioning and Secondary Indexes

  • secondary indexes don’t map neatly to partitions

Partitioning Secondary Indexes by Document

  • document-partitioned indexes (local index): secondary indexes are stored in the same partition as the primary key and value
  • only a single partition needs to be updated on write
  • reading requires care: data may be scattered across partitions for a secondary index, requiring a read across all partitions
  • scatter/gather: read across all partitions and combine results (potentially expensive, but commonly used, e.g. MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB)
  • try to structure the partitioning scheme so that secondary-index queries can be served from a single partition
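
A toy scatter/gather read over document-partitioned (local) indexes; each partition answers from its own secondary index and the client merges the partial results (the in-memory structures stand in for real nodes):

  # Each partition maintains a local secondary index: color -> list of doc ids.
  partitions = [
      {"red": [101, 103]},
      {"red": [207], "blue": [210]},
      {"blue": [305]},
  ]

  def find_by_color(color: str) -> list[int]:
      results = []
      for local_index in partitions:          # scatter: query every partition
          results.extend(local_index.get(color, []))
      return sorted(results)                  # gather: combine partial results

  assert find_by_color("red") == [101, 103, 207]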

Partitioning Secondary Indexes by Term

  • term-partitioned indexes (global index): secondary indexes are partitioned separately using the indexed values (e.g. a range); covers data in all partitions
  • reads are more efficient: a client only needs to query the partition that owns the wanted term, and that index entry may point to records on any of the primary-key partitions
  • writes are slower since data from a single document may now be on multiple partitions
  • updates to global secondary indexes are often asynchronous, since keeping them synchronous would require a distributed transaction across all affected partitions
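
A sketch of a term-partitioned (global) index: index entries are partitioned by the indexed term itself, so a read hits only the partition that owns the term, while indexing one document may touch several index partitions (the term-routing rule by first letter is an assumption):

  NUM_INDEX_PARTITIONS = 4

  def index_partition(term: str) -> int:
      # Route a term to an index partition (here by first letter; a hash of
      # the term would work too).
      return (ord(term[0].lower()) - ord("a")) % NUM_INDEX_PARTITIONS

  global_index = [dict() for _ in range(NUM_INDEX_PARTITIONS)]

  def index_document(doc_id: int, terms: list[str]) -> None:
      # A single document write may fan out to several index partitions; real
      # systems often apply these updates asynchronously.
      for term in terms:
          global_index[index_partition(term)].setdefault(term, []).append(doc_id)

  def lookup(term: str) -> list[int]:
      # A read goes to exactly one index partition, whichever owns the term.
      return global_index[index_partition(term)].get(term, [])

  index_document(101, ["red", "convertible"])
  index_document(207, ["red"])
  assert lookup("red") == [101, 207]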

Rebalancing Partitions

  • change can be due to query throughput, dataset size, machine failure, etc.
  • rebalancing: moving load from one node to another
  • after rebalancing, load should be shared fairly
  • during rebalancing, database should continue to accept reads/writes
  • minimize data moved

Strategies for Rebalancing

  • hash mod N: assign each key to node number hash(key) mod N, e.g. mod 10 with 10 nodes (treating the hash as a decimal number)
    • problematic: if the number of nodes N changes, most keys have to be moved from one node to another (see the sketch after this list)
    • this makes rebalancing excessively expensive
  • fixed number of partitions: create many more partitions than nodes and assign several partitions to each node
    • a new node added can steal partitions from every existing node
    • only entire partitions are moved; old partition can be read during transfer
    • can balance towards uneven hardware resources
    • hard to choose the right number of partitions if the size of the dataset is variable
  • dynamic partitioning: when a partition exceeds a threshold, it splits into two partitions (proportional to the size of the dataset)
    • each partition is assigned to one node, each node can handle multiple partitions
    • after splitting, a partition can be moved to another node
    • caveat: an empty database starts with a single partition, so until the first split threshold is reached all writes are processed by a single node while the others sit idle
    • pre-splitting: configure initial set of partitions on an empty database (e.g. on key distribution)
    • can be used on key range-partitioned or hash-partitioned data
  • partitioning proportionally to nodes: make the number of partitions proportional to the number of nodes (Cassandra)
    • a fixed number of partitions per node; the size of each partition grows in proportion to the dataset while the number of nodes is unchanged
    • adding nodes will reduce sizes of partitions
    • a new node randomly chooses a fixed number of existing partitions to split, then takes ownership of one half of each split partition (can produce unfair splits)
    • picking partition boundaries randomly requires hash-based partitioning
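
A small sketch contrasting hash mod N with a fixed number of partitions: changing N under mod N relocates most keys, while with many fixed partitions only whole partitions move to the new node and the key-to-partition mapping never changes (the node and partition counts here are arbitrary):

  import hashlib

  def h(key: str) -> int:
      return int(hashlib.md5(key.encode()).hexdigest(), 16)

  keys = [f"user:{i}" for i in range(10_000)]

  # hash mod N: growing from 10 to 11 nodes changes the node for most keys.
  moved_keys = sum(1 for k in keys if h(k) % 10 != h(k) % 11)
  print(f"mod N: {moved_keys / len(keys):.0%} of keys move")        # roughly 90%

  # Fixed number of partitions: keys map to one of 1,000 partitions forever;
  # the new node just steals roughly 1/11 of the partitions, moved whole.
  NUM_PARTITIONS = 1_000
  assignment = {p: p % 10 for p in range(NUM_PARTITIONS)}           # 10 nodes
  stolen = list(range(NUM_PARTITIONS))[::11]                        # ~91 partitions
  for p in stolen:
      assignment[p] = 10                                            # new node 10
  print(f"fixed partitions: {len(stolen) / NUM_PARTITIONS:.0%} of partitions move")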

Operations: Automatic or Manual Rebalancing

  • rebalancing is expensive and, if fully automated alongside automatic failure detection, may overwhelm resources or result in cascading failure: an overloaded node can be perceived as failing, triggering its data to be moved off, which increases the load even further
  • fully automated rebalancing can be convenient but unpredictable

Request Routing

  • service discovery: how does a client know which node to send its request to, given that partition assignments change?
  • strategies:
    • allow a client to contact any node (e.g. via a round-robin load balancer); a node that doesn’t own the partition forwards the request to the node that does
    • send all requests to a routing tier which determines the appropriate node (partition-aware load balancer)
    • require that clients be aware of the partitioning and the assignment of partitions to nodes; a client connects directly to the appropriate node
  • how does a routing component learn about changes?
    • separate coordination service (e.g. Zookeeper maintains authoritative mapping of partitions to nodes)
    • gossip protocol: disseminate changes in cluster state, requests can be sent to any node and a node will forward to the appropriate node - puts more complexity in the database nodes but avoids external service (e.g. Cassandra, Riak)
    • the routing tier learns about routing changes from the cluster nodes themselves (e.g. Couchbase)
  • often sufficient to use DNS to look up a node’s IP address, since IP addresses change far less often than partition assignments
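
A minimal sketch of a partition-aware routing tier: it keeps a partition-to-node mapping (in practice learned from a coordination service such as ZooKeeper) and forwards each request to the node that currently owns the key’s partition; the node names and partitioning function are placeholders:

  import hashlib

  # Authoritative mapping of partitions to nodes, kept up to date by watching
  # the coordination service for assignment changes.
  partition_assignment = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-a"}
  NUM_PARTITIONS = len(partition_assignment)

  def partition_for(key: str) -> int:
      # Placeholder partitioning function (hash partitioning).
      return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

  def route(key: str) -> str:
      # Forward the request to whichever node owns the key's partition.
      return partition_assignment[partition_for(key)]

  def on_assignment_change(partition: int, new_node: str) -> None:
      # Called when rebalancing moves a partition to a different node.
      partition_assignment[partition] = new_node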

Parallel Query Execution

  • simple queries only read/write a single key (or scatter/gather)
  • massively parallel processing (MPP): breaks down a complex query into execution stages and partitions that can be executed in parallel on different nodes (e.g. join, filter, group, aggregate - often used for analytics)
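
A toy illustration of the MPP idea: an aggregation is broken into a per-partition stage that runs in parallel and a final merge stage (the data and thread-based parallelism are just for illustration):

  from concurrent.futures import ThreadPoolExecutor

  # Hypothetical partitioned table: each partition holds (product, amount) rows.
  partitions = [
      [("a", 10), ("b", 5)],
      [("a", 7), ("c", 2)],
      [("b", 1), ("c", 9)],
  ]

  def partial_sum(rows):
      # Stage 1: each partition filters/groups/aggregates its own rows in parallel.
      out = {}
      for product, amount in rows:
          out[product] = out.get(product, 0) + amount
      return out

  def merge(partials):
      # Stage 2: combine the per-partition aggregates into the final result.
      total = {}
      for partial in partials:
          for product, amount in partial.items():
              total[product] = total.get(product, 0) + amount
      return total

  with ThreadPoolExecutor() as pool:
      print(merge(pool.map(partial_sum, partitions)))   # {'a': 17, 'b': 6, 'c': 11}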