DDIA: Chapter 1

Part I

There are fundamental ideas that apply to all data systems (single-machine or distributed)
- reliability, scalability, maintainability
- data models and query languages
- storage engines
- data encoding (serialization)

an application programming interface (API) usually hides implementation details from clients, but behind it the application combines smaller, general-purpose components into a special-purpose data system
reliability: the system should continue to work correctly even when faults or errors occur
scalability: dealing with growth in volume/load (e.g. data, traffic) or complexity
maintainability: maintaining current behavior and adding new behavior should be productive

fault-tolerant (resilient): able to anticipate and cope with faults (things that can go wrong)
failures: different from faults - a fault is one component deviating from its spec; a failure is when the entire system stops providing service
faults can arise from hardware, software, human error
human error:
- design systems that minimize opportunities for error
- decouple places where people make mistakes from where failure can happen
- test thoroughly at all levels
- allow quick and easy recovery
- set up detailed and clear monitoring
- implement good management practices

a system that works reliably today may not do so under increased load
load parameters: how we describe load (e.g. requests per second, read/write ratio)
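
A minimal sketch of deriving the two example load parameters from a request log. The log format (a timestamp in seconds plus a 'read'/'write' kind) and the name load_parameters are assumptions made for illustration, not something from the book.

```python
from collections import Counter

def load_parameters(events):
    """events: list of (timestamp_seconds, kind), kind is 'read' or 'write' (hypothetical format)."""
    if not events:
        raise ValueError("need at least one event")
    timestamps = [t for t, _ in events]
    duration = max(timestamps) - min(timestamps)
    counts = Counter(kind for _, kind in events)
    # Requests per second over the observed window; guard against a zero-length window.
    rps = len(events) / duration if duration > 0 else float(len(events))
    # Read/write ratio; treated as infinite if no writes were observed.
    ratio = counts["read"] / counts["write"] if counts["write"] else float("inf")
    return {"requests_per_second": rps, "read_write_ratio": ratio}

# Made-up sample: 5 requests over a 2-second window, 4 reads and 1 write.
events = [(0.0, "read"), (0.5, "read"), (1.0, "write"), (1.5, "read"), (2.0, "read")]
params = load_parameters(events)
print(params)
```

Which parameters matter depends on the system; these two are only the examples named in the notes.
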
describing performance:
- throughput: number of records processed per second, or the total time it takes to run a job on a fixed-size dataset
- response time: time delta between client request and receiving response
response times form a distribution of values, not a single guaranteed number
- arithmetic mean doesn’t tell you how many users experience a delay
- percentiles (e.g. p50, which is the median, or p99) are better ("how long do users typically have to wait?")
- tail latencies: high percentiles of response times
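
A small sketch of why the mean misleads and percentiles help. The response times are made-up sample data, and percentile is a hypothetical helper using the nearest-rank definition (other definitions interpolate and give slightly different values).

```python
import math
import statistics

def percentile(sorted_times, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not sorted_times:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_times)))
    return sorted_times[rank - 1]

# Made-up response times in milliseconds: mostly fast, with one slow outlier.
response_times_ms = sorted([20, 22, 25, 25, 27, 30, 31, 35, 40, 1500])

mean = statistics.mean(response_times_ms)
p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)
print(mean, p50, p99)
```

Here the mean is 175.5 ms, slower than what nine of the ten requests experienced, while the median (27 ms) reflects the typical wait and p99 (1500 ms) exposes the tail.
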
service level objectives (SLOs)/service level agreements (SLAs): contracts that define expected performance/availability
head-of-line blocking: a few slow requests in a queue can hold up the processing of the requests behind them
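
A toy illustration of head-of-line blocking, assuming a single worker that serves requests strictly in FIFO order and all requests arriving at time 0. The service times are made up and response_times is a hypothetical helper.

```python
def response_times(service_times):
    """Response time per request for one FIFO worker when all requests arrive at t=0."""
    clock = 0
    finished_at = []
    for service_time in service_times:
        clock += service_time        # the worker is busy until this request completes
        finished_at.append(clock)    # response time = queueing delay + own service time
    return finished_at

slow_first = response_times([100, 1, 1, 1])  # one slow request at the head of the queue
fast_only = response_times([1, 1, 1])        # the same fast requests without it
print(slow_first, fast_only)
```

The three fast requests each need only 1 ms of work, yet they finish at 101, 102 and 103 ms when queued behind the slow one, versus 1, 2 and 3 ms otherwise. This is one reason to measure response times on the client side.
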
tail latency amplification: when serving one end-user request requires multiple backend calls, a single slow call slows down the whole request, so a higher percentage of end-user requests end up slow
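
A back-of-the-envelope sketch of tail latency amplification. It assumes each backend call is slow independently with the same probability, which is a simplification, and fraction_slow is a hypothetical helper name.

```python
def fraction_slow(n_backend_calls, p_slow_per_call):
    """Probability that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_per_call) ** n_backend_calls

# With 1% of individual calls being slow (i.e. slower than p99):
for n in (1, 10, 100):
    print(n, round(fraction_slow(n, 0.01), 3))
```

Under these assumptions, about 1% of end-user requests are slow with 1 backend call, roughly 10% with 10 calls, and roughly 63% with 100 calls.
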
scaling up (vertical scaling): moving to a more powerful machine, vs. scaling out (horizontal scaling, shared-nothing): distributing load across multiple smaller machines
elastic: automatically adds computing resources when a load increase is detected
architecture for scale tends to be highly specific to the application

operability: make it easy for operations teams to keep a system running smoothly
- monitor health of system
- keep software up to date (including security)
- perform complex maintenance tasks
- avoid dependency on individual machines
- self-healing, but manual control when needed
- etc.
simplicity: make it easy for new engineers to understand the system
- remove accidental complexity
evolvability: make it easy for engineers to make changes to the system in the future