DDIA: Chapter 1
Part I
- There are fundamental ideas that apply to all data systems (single-machine or distributed)
- reliability, scalability, maintainability
- data models and query languages
- storage engines
- data encoding (serialization)
Chapter 1
Reliable, Scalable, and Maintainable Applications
- many applications are data-intensive rather than compute-intensive
- data systems are used everywhere:
- storing data (databases)
- saving the result of an expensive operation (caches)
- searching data (search indexes)
- sending messages to other processes, to be handled asynchronously (stream processing)
- periodically processing a large amount of accumulated data (batch processing)
Thinking About Data Systems
- an application programming interface (API) usually hides implementation details from clients; behind it, a special-purpose data system is built from smaller, general-purpose components
- reliability: the system should continue to work correctly even when faults or errors occur
- scalability: dealing with growth in volume/load (e.g. data, traffic) or complexity
- maintainability: people should be able to work on the system productively, both maintaining current behavior and adding new behavior
Reliability
- fault-tolerant (resilient): able to anticipate and cope with faults (things that can go wrong)
- failures: different from faults; a fault is one component deviating from its spec, a failure is the entire system ceasing to provide service
- faults can arise from hardware, software, human error
- human error:
- design systems that minimize opportunities for error
- decouple places where people make mistakes from where failure can happen
- test thoroughly at all levels
- allow quick and easy recovery
- set up detailed and clear monitoring
- implement good management practices
Scalability
- a system that works reliably today may not do so under increased load
- load parameters: how we describe load (e.g. requests per second, read/write ratio)
- describing performance:
- throughput: number of records processed per second, or the total time it takes to run a job on a dataset of a given size (the usual measure for batch processing)
- response time: time delta between client request and receiving response
- response times form a distribution of values, not a single guaranteed number
- the arithmetic mean doesn’t tell you how many users actually experienced a given delay
- percentiles (e.g. p50, which is the median, and p99) are better; the median answers “how long do users typically have to wait?”
- tail latencies: high percentiles of response times
- service level objectives (SLOs)/service level agreements (SLAs): contracts that define the expected performance/availability of a service
- head-of-line blocking: a few slow requests at the front of a queue hold up the requests waiting behind them
- tail latency amplification: when one end-user request requires multiple backend calls, a single slow call slows the whole request, so a higher proportion of end-user requests end up slow
- scaling up (vertical scaling): moving to a more powerful machine, vs. scaling out (horizontal scaling, shared-nothing): distributing load across multiple smaller machines
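- elastic: the system automatically adds computing resources when it detects an increase in load
- an architecture that scales well tends to be highly specific to the application and its load parameters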
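
A minimal sketch of the percentile and tail latency amplification notes above, in Python. The nearest-rank percentile definition, the sample data, and the function names (percentile, summarize, slow_request_probability) are illustrative assumptions, not taken from the book. The amplification formula also assumes that backend calls are slow independently of one another, which real systems may not satisfy.

```python
import math
import statistics


def percentile(sorted_times, p):
    """Nearest-rank percentile (one of several common definitions).

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it. Expects an ascending list.
    """
    if not sorted_times:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    rank = max(1, math.ceil(p * len(sorted_times) / 100))
    return sorted_times[rank - 1]


def summarize(times_ms):
    """Summarize response times (in ms) by mean and selected percentiles."""
    ordered = sorted(times_ms)
    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(ordered, 50),  # median: the "typical" wait
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),  # tail latency
    }


def slow_request_probability(p_slow_call, n_calls):
    """Chance that at least one of n_calls backend calls is slow.

    Assumption: each call is slow independently with probability
    p_slow_call, so the request is fast only if every call is fast.
    """
    return 1 - (1 - p_slow_call) ** n_calls


if __name__ == "__main__":
    # Hypothetical sample: 98 fast requests and 2 slow outliers.
    sample = [10.0] * 98 + [1000.0, 2000.0]
    print(summarize(sample))

    # Tail latency amplification for different fan-out sizes.
    for n in (1, 10, 100):
        print(n, round(slow_request_probability(0.01, n), 3))
```

With this sample the mean is 39.8 ms even though 98 of the 100 requests took 10 ms; the median (10 ms) describes the typical wait, and p99 (1000 ms) exposes the slow tail. Under the independence assumption, if 1% of backend calls are slow and one request fans out to 100 calls, roughly 63% of requests hit at least one slow call.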
Maintainability
- operability: make it easy for operations teams to keep a system running smoothly
- monitor health of system
- keep software up to date (including security)
- perform complex maintenance tasks
- avoid dependency on individual machines
- self-healing, but manual control when needed
- etc.
- simplicity: make it easy for new engineers to understand the system
- remove accidental complexity (complexity that comes from the implementation rather than from the problem itself)
- evolvability: make it easy for engineers to make changes to the system in the future