DDIA: Chapter 5
Part II Interlude
Distributed Data
- scalability: spread data volume and/or read/write load to more than a single machine
- fault tolerance/high availability: multiple machines provide redundancy
- latency: servers at various locations for geographically distributed access
Scaling to Higher Load
- vertical scaling (scaling up): buy a more powerful machine
- shared-memory architecture: fast interconnect allows any CPU to access any part of memory/disk
- cons: cost grows faster than linearly; limited fault tolerance
- shared-disk architecture: independent CPUs/RAM but data is shared between machines via fast network
- cons: contention and overhead of locking limit scalability
- shared-nothing architecture (horizontal scaling, scaling out): each machine is an independent node; coordination between nodes is done in software over a conventional network, no special hardware required
- replication: a copy of the same data is on several nodes - provides redundancy, can increase performance
- partitioning (sharding): splitting data across nodes
Chapter 5
Replication
- replication: keeping a copy of the same data on multiple machines
- to keep data geographically close
- to increase availability in case of failure
- to scale out serving read-based requests
Leaders and Followers
- replica: each node that stores a copy of the database
- leader-based replication (active/passive, master-slave replication):
- most common
- one replica is designated the leader (master, primary): clients must send write requests to the leader, which writes locally
- other replicas are followers (read replicas, slaves, secondaries): data changes are sent from the leader as part of a replication log (change stream) and applied in the same order
- built-in feature of many DBs (PostgreSQL, MySQL, MongoDB, etc.) and some message brokers (Kafka, RabbitMQ)
Synchronous vs. Asynchronous Replication
- synchronous: leader waits until each follower reports the write before making the write visible to client
- follower is guaranteed to have up-to-date data
- huge disadvantage if a follower doesn't respond or the write cannot be processed, as it is blocking
- semi-synchronous: if the synchronous follower is unavailable, an asynchronous follower is made synchronous (see the sketch after this list)
- asynchronous: leader sends message but doesn't wait for response
- weakens durability but is widely used
- if leader fails, non-replicated writes are lost
- can continue processing writes even if followers have fallen behind
- research into performant synchronous replication (e.g. chain replication)
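A minimal sketch of the semi-synchronous pattern described above, with hypothetical Follower objects (the names and replicate() helper are illustrative, not a real API): the leader blocks on one synchronous follower and ships the write to the rest in the background, so the write is durable on at least two nodes without blocking on every follower.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical follower handle; replicate() blocks until the follower
# has applied the write and acknowledges it.
class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []

    def replicate(self, entry):
        self.log.append(entry)  # apply the write locally
        return True             # acknowledgement

pool = ThreadPoolExecutor()

def leader_write(entry, sync_follower, async_followers):
    # The leader has already applied the write to its own storage (elided).
    # Block on the single synchronous follower: the write is now durable
    # on at least two nodes before the client sees success.
    if not sync_follower.replicate(entry):
        raise RuntimeError("sync follower failed; promote an async one")
    # Remaining followers get the write in the background; the client
    # does not wait for them.
    for f in async_followers:
        pool.submit(f.replicate, entry)
    return "ok"

followers = [Follower("A"), Follower("B"), Follower("C")]
print(leader_write({"key": "x", "value": 1}, followers[0], followers[1:]))
```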
Setting Up New Followers
- simply copying data files gives an inconsistent snapshot; locking the database for a consistent copy sacrifices high availability
- instead, take a consistent snapshot of the leader, copy it to the follower, then apply the replication log from the snapshot's position onward
Handling Node Outages
- follower failure - if a follower crashes, it can catch up from the leader based on its log
- leader failure - failover: a follower must be promoted, clients need to be rerouted for writes, followers need to be rerouted for data changes
- can happen manually
- problematic if an old leader rejoins (what to do with out-of-sync writes? discarding them may violate clients' durability expectations)
- split brain: two nodes both believe they are the leader
- timeout should be chosen with care
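A sketch of the timeout-based failure detection behind failover, with made-up names: nodes exchange heartbeats, and a node silent for longer than the timeout is suspected dead. The timeout is the care point: too short triggers unnecessary failovers under load spikes, too long delays recovery.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; tune with care (see note above)

last_heartbeat = {}  # node name -> monotonic time of last heartbeat

def on_heartbeat(node):
    last_heartbeat[node] = time.monotonic()

def suspect_dead(node):
    seen = last_heartbeat.get(node)
    return seen is None or time.monotonic() - seen > HEARTBEAT_TIMEOUT

def pick_new_leader(followers, log_position):
    # Promote the surviving follower with the most up-to-date replication
    # log position, to minimize lost writes.
    alive = [f for f in followers if not suspect_dead(f)]
    return max(alive, key=log_position)
```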
Implementation of Replication Logs
- statement-based replication: leader can log every write request (statement) => need care around nondeterministic functions (now()), dependencies on existing data/ordering, statements that have side effects
- write-ahead log (WAL) shipping: log can be used to build replicas => closely coupled to storage engine/format, making changes difficult
- logical (row-based) log replication: use different log formats for replication and storage engine (logical log, rather than physical data representation)
- better for backwards compatibility, parsing in external applications
- trigger-based replication: can register custom application code that is automatically executed (triggers, stored procedures) - has greater overhead
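A sketch of what logical (row-based) log records might look like, with invented field names: each record describes a change to one row, identified by primary key, independent of the storage engine's on-disk layout, which is why external consumers can parse it.

```python
# One record per modified row, identified by primary key; the format is
# independent of the storage engine's internal page layout.
logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 42, "name": "Ada"}},
    {"op": "update", "table": "users", "key": 42,
     "changes": {"name": "Ada Lovelace"}},
    {"op": "delete", "table": "users", "key": 42},
]

# A follower (or an external change-data-capture consumer) replays the
# records in order against its own storage.
def apply(store, rec):
    table = store.setdefault(rec["table"], {})
    if rec["op"] == "insert":
        table[rec["row"]["id"]] = dict(rec["row"])
    elif rec["op"] == "update":
        table[rec["key"]].update(rec["changes"])
    elif rec["op"] == "delete":
        del table[rec["key"]]

store = {}
for rec in logical_log:
    apply(store, rec)
print(store)  # {'users': {}} - the row was inserted, updated, then deleted
```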
Problems with Replication Lag
- read-scaling architecture: add followers to serve read-only requests (only realistic with asynchronous replication)
- eventual consistency: followers eventually catch up to consistency with the leader
Reading Your Own Writes
- read-after-write consistency (read-your-writes consistency): users see updates they submitted themselves
- can mitigate by reading the user's own profile from the leader, but not good when many things are editable
- can approach with timestamp-based methods: remember the timestamp of the user's last write and only read from replicas caught up to it (see the sketch after this list)
- problematic across datacenters
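The timestamp-based approach from above, sketched with hypothetical Replica objects: the client remembers the timestamp (or log position) of its last write and only reads from replicas that have caught up to it.

```python
class Replica:
    def __init__(self):
        self.applied_up_to = 0.0  # timestamp of the latest applied write
        self.data = {}

def read_your_writes(last_write_ts, followers, key, leader):
    # Prefer a follower that has already applied the client's last write...
    for r in followers:
        if r.applied_up_to >= last_write_ts:
            return r.data.get(key)
    # ...otherwise fall back to the leader, which is always up to date.
    return leader.data.get(key)

leader, follower = Replica(), Replica()
leader.data["bio"], leader.applied_up_to = "new bio", 2.0
follower.data["bio"], follower.applied_up_to = "old bio", 1.0
# The client wrote at t=2.0, so the lagging follower is skipped:
print(read_your_writes(2.0, [follower], "bio", leader))  # -> new bio
```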
Monotonic Reads
- with asynchronous replication, a user can see data moving backwards in time when successive reads hit replicas where data has and hasn't propagated
- can mitigate by ensuring a user always reads from the same replica (have to think about failure; see the sketch below)
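One way to pin a user to a replica, sketched here: hash the user ID rather than picking a replica at random (on replica failure the user must be rerouted, which momentarily breaks the guarantee).

```python
import hashlib

def replica_for(user_id, replicas):
    # Deterministically map each user to one replica, so successive reads
    # never jump to a more-lagged copy; on replica failure the user must
    # be rerouted to another replica.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]

replicas = ["replica-1", "replica-2", "replica-3"]
print(replica_for("user-123", replicas))  # same answer on every call
```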
Consistent Prefix Reads
- causal dependencies between writes from multiple clients can appear violated when the writes go to different partitions that replicate at different speeds
- can mitigate by writing causal events to the same partition, but efficiency cost
Solutions for Replication Lag
- transactions can provide stronger guarantees
Multi-Leader Replication
- multi-leader configuration (master-master, active/active replication): each leader is a follower to other leaders (e.g. multi-datacenter, clients with offline operation, collaborative editing)
- performance: writes can be processed in local datacenters
- tolerance of datacenter outages
- tolerance of network problems
- disadvantage: the same data may be concurrently modified in different places; conflict handling is often a retrofitted solution
Handling Write Conflicts
- conflict resolution required when concurrent writes are allowed
- can make synchronous but would lose multi-leader advantage
- can ensure all writes for a record go through a particular leader
- can give each write a unique ID (last write wins) or each replica a unique ID (higher-ID replica wins) - both risk data loss
- can try to merge values or use a data structure that preserves all information (resolve conflict later)
- can resolve conflict on write or on read
- can use special data structures/algorithms (conflict-free replicated datatypes (CRDTs) (automatic set/map/list/etc. resolution), mergeable persistent data structures (similar to Git), operational transformation for collaborative editing)
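As an example of automatic resolution, a grow-only counter (G-Counter), one of the simplest CRDTs, sketched below: each replica increments only its own slot, and merge takes the element-wise max, so replicas converge regardless of merge order.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter how merges are ordered.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()  # two concurrent writes at replica A
b.increment()                 # one concurrent write at replica B
a.merge(b); b.merge(a)        # exchange state in either order
assert a.value() == b.value() == 3
```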
Multi-Leader Replication Topologies
- circular topology: each node receives writes from one node and forwards them (plus its own writes) to the next
- star topology: one designated root node forwards writes to all other nodes
- all-to-all topology: all leaders communicate with each other
- circular and star are fragile in case of node failure (replication flow is interrupted until rerouted)
- timestamps are insufficient for conflict resolution (clocks are not sufficiently in sync)
Leaderless Replication
- e.g. Dynamo, Riak, Cassandra, Voldemort
- client or coordinator node sends writes to several replicas
- reads are also sent to several nodes in parallel (quorum reads)
Writing to the Database When a Node Is Down
- just write to available nodes
- reads query a quorum of nodes; version numbers determine the up-to-date value
- eventual consistency:
- read repair: detect stale responses during read and update
- anti-entropy process: background process actively looks for differences in data between replicas
- quorums can be tuned:
- n replicas
- w nodes where writes are confirmed
- r nodes queried per read
- can set w = r = (n + 1) / 2 with odd n; as long as w + r > n, the read set overlaps the write set (see the sketch after this list)
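A sketch combining quorum writes, quorum reads, and read repair, assuming each value carries a version number (node selection and failure handling are elided): with w + r > n the read set overlaps the write set, and stale replicas seen during a read are fixed in place.

```python
class Node:
    def __init__(self):
        self.store = {}  # key -> (version, value)

N, W, R = 3, 2, 2
assert W + R > N  # read and write sets must overlap in >= 1 node

def quorum_write(nodes, key, version, value):
    acks = 0
    for node in nodes:  # sent to all n nodes; in reality some calls fail
        node.store[key] = (version, value)
        acks += 1
    return acks >= W    # success once w nodes have confirmed

def quorum_read(nodes, key):
    # Query r nodes (here simply the first r) and keep the newest version.
    replies = [(n, n.store.get(key, (0, None))) for n in nodes[:R]]
    version, value = max((v for _, v in replies), key=lambda vv: vv[0])
    for node, (seen, _) in replies:
        if seen < version:                      # read repair:
            node.store[key] = (version, value)  # fix the stale replica
    return value

nodes = [Node() for _ in range(N)]
quorum_write(nodes, "x", version=1, value="hello")
nodes[0].store["x"] = (0, "stale")  # pretend this node missed the write
print(quorum_read(nodes, "x"))      # -> hello, and nodes[0] is repaired
print(nodes[0].store["x"])          # -> (1, 'hello')
```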
Limitations of Quorum Consistency
- quorums are not necessarily majorities => the set of nodes used by read and write operations need to overlap in at least one node
- smaller r and w means stale values are more likely
- monitoring staleness is difficult
Sloppy Quorums and Hinted Handoff
- sloppy quorum: accept writes to nodes that are reachable even if they aren’t among the n nodes on which the value usually lives (e.g. due to network interruptions)
- hinted handoff: writes that an available node accepted are sent to appropriate “home” nodes
- multi-datacenter operation: each write from a client is sent to all replicas regardless of datacenter, higher-latency writes to other datacenters can be configured to be asynchronous
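A sketch of sloppy-quorum writes with hinted handoff, using invented Node objects: a write whose home node is unreachable goes to a stand-in, which holds a hint and forwards the write once the home node is back.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.hints = []  # (home_node, key, value) held for unreachable nodes

def sloppy_write(key, value, home_nodes, reachable):
    # Write to each home node if reachable; otherwise hand the write to
    # another reachable node, tagged with a hint naming the real home.
    for home in home_nodes:
        if home in reachable:
            home.store[key] = value
        else:
            stand_in = next(n for n in reachable if n not in home_nodes)
            stand_in.hints.append((home, key, value))

def hinted_handoff(stand_in):
    # Once the home nodes are reachable again, forward the held writes.
    for home, key, value in stand_in.hints:
        home.store[key] = value
    stand_in.hints.clear()

a, b, c, d = (Node(x) for x in "abcd")
sloppy_write("k", 1, home_nodes=[a, b, c], reachable=[b, c, d])  # a is down
hinted_handoff(d)  # a recovers; the stand-in forwards the hinted write
print(a.store)     # {'k': 1}
```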
Detecting Concurrent Writes
- concurrent writes can lead to inconsistencies in quorum
- overwrites would make nodes permanently inconsistent
- last write wins: discarding older writes, costs durability => safe only when a key is written once and immutable (e.g. use UUID as key)
- concurrency is difficult to define => when two operations are unaware of each other
- can read, version, merge (see grocery cart single replica example p. 189)
- version vector: version number per replica in addition to per key
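A sketch of how version vectors (here dicts of per-replica counters) distinguish overwrites from concurrency: if one vector dominates the other, the later write can safely overwrite; if neither dominates, the writes are concurrent and must be kept as siblings.

```python
def dominates(a, b):
    # a dominates b if a has seen at least everything b has seen.
    return all(a.get(k, 0) >= b.get(k, 0) for k in set(a) | set(b))

def relation(a, b):
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a overwrites b"
    if dominates(b, a):
        return "b overwrites a"
    return "concurrent: keep both values as siblings and merge later"

print(relation({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))  # a overwrites b
print(relation({"r1": 2}, {"r2": 1}))                    # concurrent
```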