DDIA: Chapter 1

Part I

  • There are fundamental ideas that apply to all data systems (single-node or distributed)
    • reliability, scalability, maintainability
    • data models and query languages
    • storage engines
    • data encoding (serialization)

Chapter 1

Reliable, Scalable, and Maintainable Applications

  • many applications are data-intensive rather than compute-intensive
  • data systems are used everywhere:
    • storing data (databases)
    • saving the result of an expensive operation (caches)
    • searching data (search indexes)
    • sending messages to other processes asynchronously (stream processing)
    • periodically processing a large amount of accumulated data (batch processing)

Thinking About Data Systems

  • an application programming interface (API) usually hides implementation details from clients; behind it, a special-purpose data system is often built from smaller, general-purpose components
  • reliability: the system should work correctly in case of fault or error
  • scalability: dealing with growth in volume/load (e.g. data, traffic) or complexity
  • maintainability: people should be able to work on the system productively, both maintaining current behavior and adding new behavior
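
The first point above (an API hiding a composition of general-purpose components) could be sketched roughly as follows. This is only an illustration: `ProfileService`, `load_from_db`, and `db_reads` are hypothetical names, a plain dict stands in for the cache, and any callable stands in for the database.

```python
from typing import Callable, Dict, Optional


class ProfileService:
    """Hypothetical special-purpose data system composed of a cache and a database."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load_from_db = load_from_db  # stand-in for a real database query
        self._cache: Dict[str, str] = {}   # stand-in for a real cache
        self.db_reads = 0                  # counter kept only for this demo

    def get(self, user_id: str) -> Optional[str]:
        # Clients only see get(); whether the value came from the cache
        # or from the database is an implementation detail.
        if user_id in self._cache:
            return self._cache[user_id]
        self.db_reads += 1
        value = self._load_from_db(user_id)
        if value is not None:
            self._cache[user_id] = value
        return value


fake_db = {"u1": "Ada"}               # made-up data for the demo
service = ProfileService(fake_db.get)
first = service.get("u1")             # goes to the "database"
second = service.get("u1")            # served from the cache
```

After the two calls, `service.db_reads` is 1: the second read never reached the backing store, and the caller could not tell the difference.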

Reliability

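  • fault-tolerant (resilient): able to anticipate and cope with things that can go wrong (faults)
  • failures: different from faults. A fault is one component deviating from its spec; a failure is when the system as a whole stops providing the required service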
  • faults can arise from hardware, software, human error
  • human error:
    • design systems that minimize opportunities for error
    • decouple places where people make mistakes from where failure can happen
    • test thoroughly at all levels
    • allow quick and easy recovery
    • set up detailed and clear monitoring
    • implement good management practices

Scalability

  • a system that works reliably today may not do so under increased load
  • load parameters: how we describe load (e.g. requests per second, read/write ratio)
  • describing performance:
    • throughput: number of records processed per second, or the total time it takes to run a job on a dataset of a fixed size
    • response time: time between a client sending a request and receiving the response
  • response time is a distribution of values, not a single number
    • the arithmetic mean doesn’t tell you how many users experienced a delay
    • percentiles are better: the median (p50) answers "how long do users typically have to wait?", and p99 is the threshold that 99% of requests are faster than
    • tail latencies: high percentiles of response times
  • service level objectives (SLOs)/service level agreements (SLAs): contracts that define the expected performance/availability of a service
  • head-of-line blocking: a few slow requests in a queue hold up the processing of the requests behind them
  • tail latency amplification: when serving one end-user request requires multiple backend calls, a single slow call makes the whole request slow, so a higher proportion of end-user requests end up slow
  • scaling up (vertical scaling): moving to a more powerful machine, vs. scaling out (horizontal scaling, shared-nothing): distributing load across multiple smaller machines
  • elastic: automatically add computing resources when a load increase is detected
  • architectures that operate at large scale tend to be highly specific to the application
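
The points above about percentiles and tail latency amplification can be made concrete with a small sketch. Everything here is an assumption for illustration: the sample response times are made up, `percentile` uses the nearest-rank definition (one of several in common use), and `fraction_slow` assumes the backend calls are slow independently of each other, which real systems often violate.

```python
import math
from typing import Sequence


def percentile(response_times_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(response_times_ms)
    # round() guards against float noise such as 7.000000000000001 before ceil().
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))  # 1-based rank
    return ordered[rank - 1]


def fraction_slow(per_call_slow_probability: float, backend_calls: int) -> float:
    """Chance that at least one of n backend calls is slow, assuming independence."""
    return 1 - (1 - per_call_slow_probability) ** backend_calls


samples = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]  # made-up response times in ms
mean = sum(samples) / len(samples)   # 125.6, pulled up by two outliers
p50 = percentile(samples, 50)        # 13, the typical wait
p99 = percentile(samples, 99)        # 900, the tail
amplified = fraction_slow(0.01, 100) # roughly 0.63
```

With these numbers the mean (125.6 ms) describes no actual request, while p50 and p99 show the typical and worst-case experience. Under the independence assumption, if 1% of backend calls are slow and a request fans out to 100 of them, about 63% of end-user requests hit at least one slow call.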

Maintainability

  • operability: make it easy for operations teams to keep a system running smoothly
    • monitor health of system
    • keep software up to date (including security)
    • perform complex maintenance tasks
    • avoid dependency on individual machines
    • self-healing, but manual control when needed
    • etc.
  • simplicity: make it easy for new engineers to understand the system
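    • remove accidental complexity (complexity that comes from the implementation rather than from the problem being solved), e.g. through good abstractions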
  • evolvability: make it easy for engineers to make changes to the system in the future