DDIA: Chapter 4

Encoding and Evolution

  • How can we change schemas in an application?
  • rolling upgrade (staged rollout): gradually deploying and checking => can be done server-side, avoids downtime
  • client-side requires user action
  • backward compatibility: newer code can read data written by older code
  • forward compatibility: older code can read data written by newer code

Formats for encoding Data

  • in memory, data is kept in data structures optimized for efficient access/manipulation by the CPU
  • on disk or over network, data is encoded
  • encoding (serialization, marshalling): in-memory => byte
  • decoding (parsing, deserialization, unmarshalling): byte => in-memory
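
A minimal sketch of the encode/decode round trip in Python, using the standard json module (the record here is just an illustrative example; any encoding library follows the same shape):

    import json

    # in-memory representation: dicts, lists, strings - optimized for CPU access
    record = {"userName": "Martin", "favoriteNumber": 1337,
              "interests": ["daydreaming", "hacking"]}

    # encoding (serialization): in-memory structure -> self-contained byte sequence
    encoded = json.dumps(record).encode("utf-8")

    # decoding (deserialization): byte sequence -> in-memory structure
    decoded = json.loads(encoded.decode("utf-8"))
    assert decoded == record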

Language-Specific Formats

  • languages often have built-in encoding support
    • often language-specific
    • decoding needs to be able to instantiate arbitrary classes, a security risk
    • versioning, compatibility, efficiency are often not priorities
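
A sketch of Python's built-in pickle as an example of a language-specific format (illustrative only): convenient, but the bytes are tied to Python, and unpickling can instantiate arbitrary classes, so it must never be fed untrusted input:

    import pickle

    record = {"userName": "Martin", "favoriteNumber": 1337}

    # one line of convenience: any Python object graph -> bytes (Python-specific)
    data = pickle.dumps(record)

    # decoding can execute arbitrary code via crafted input - a security risk
    restored = pickle.loads(data)
    assert restored == record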

JSON, XML, and Binary Variants

  • textual, somewhat human readable
  • ambiguity around encoding numbers (XML/CSV can't distinguish a number from a string of digits; JSON doesn't distinguish integers from floats or specify precision - large integers lose exactness in languages that use IEEE 754 doubles)
  • Unicode text is supported, but binary strings are not (workaround: base64, which grows the data by about a third)
  • optional schema support for JSON/XML, CSV doesn’t have schema
  • binary encoding
    • e.g. MessagePack for JSON
    • need to include object field names
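
A sketch comparing the textual JSON encoding with MessagePack's binary encoding (assumes the third-party msgpack package is installed); the binary form is smaller but still has to embed the object field names:

    import json
    import msgpack  # pip install msgpack

    record = {"userName": "Martin", "favoriteNumber": 1337,
              "interests": ["daydreaming", "hacking"]}

    as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
    as_msgpack = msgpack.packb(record)

    # roughly 81 vs 66 bytes for this record - only a modest saving, because
    # "userName", "favoriteNumber", "interests" appear in both encodings
    print(len(as_json), len(as_msgpack))
    assert msgpack.unpackb(as_msgpack) == record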

Thrift and Protocol Buffers

  • binary encoding libraries
  • require a schema
  • IDL: interface definition language
  • come with code generation tool that produces classes that implement the schema
  • Thrift BinaryProtocol: field tags instead of field names encoded
  • Thrift CompactProtocol: packs the field type and tag number into a single byte and uses variable-length integers - the top bit of each byte indicates whether more bytes follow (see the varint sketch after this list)
  • Protocol Buffers: similar to Thrift CompactProtocol
  • fields can be marked required or optional, but this makes no difference to the encoding - required only enables a runtime check
  • schema evolution
    • field names can be changed in the schema (since reliance is on tags)
    • can add new fields through new tag numbers (old code can ignore) => forward compatibility
    • backward compatibility if fields have unique tag numbers (however new fields cannot be required, must be optional or default)
    • removing a field is similar: can only remove an optional field (cannot use the same tag number again), and the old tag number will be ignored
    • datatypes can also be changed with caveats of size
  • Protocol Buffers have a repeated marker (instead of an array or list datatype) where the same field tag simply appears multiple times; it's okay to change an optional (single-valued) field into a repeated one (old code reading new data sees only the last element)
  • Thrift has a list datatype - cannot go from single-valued to multi-valued, but supports nested lists
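
A sketch of the field-tag and varint ideas in Python (Protocol Buffers style; the helper functions are illustrative, not a real protobuf library): each byte of a varint carries 7 bits of the number and its top bit says whether more bytes follow, and a field is identified only by its tag number packed with a wire type - never by its name:

    def encode_varint(n: int) -> bytes:
        """Encode a non-negative integer as a variable-length integer."""
        out = bytearray()
        while True:
            byte = n & 0x7F              # low 7 bits of the remaining value
            n >>= 7
            if n:
                out.append(byte | 0x80)  # top bit set: more bytes follow
            else:
                out.append(byte)         # top bit clear: this is the last byte
                return bytes(out)

    def encode_field_key(tag: int, wire_type: int) -> bytes:
        """A field is keyed by (tag << 3 | wire_type), not by its field name."""
        return encode_varint((tag << 3) | wire_type)

    # favoriteNumber = 1337 encoded as field tag 2 with wire type 0 (varint):
    encoded = encode_field_key(2, 0) + encode_varint(1337)
    print(encoded.hex())  # '10b90a' - three bytes, and no field name in sight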

Avro

  • subproject of Hadoop
  • two schema languages: Avro IDL for human editing; another based on JSON for machine-reading
  • encoding is just the values concatenated together (very compact): nothing identifies fields or datatypes, e.g. a string is just a length prefix followed by UTF-8 bytes
  • decoding requires the exact schema the data was written with (fields are read in the order they appear in the schema)
  • writer’s schema and reader’s schema don’t have to be the same, just compatible - Avro can resolve differences
  • forward compatibility: new version of schema as writer; old version of schema as reader
  • backward compatibility: new version of schema as reader; old version of schema as writer
  • can only add or remove a field that has a default value (see the schema-resolution sketch after this list)
  • must use a union type to allow a field to be null
  • changing a datatype of a field is possible if the type can be converted
  • changing the name of a field/adding a branch to a union type is backward compatible but not forward compatible
  • large file with lots of records: can include writer’s schema at beginning of file
  • database with individually written records: include a version number per record, and keep a list of schema versions in the database
  • sending records over a network connection: can negotiate the schema version on connection setup
  • friendlier to dynamically generated schemas - no field tags; new schemas can be generated
  • can be used with or without code generation (the file is self-describing)
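
A sketch of writer's/reader's schema resolution (assumes the third-party fastavro package; the schemas and field names here are illustrative). The old writer's schema lacks favoriteColor; the new reader's schema adds it with a default, so data written with the old schema is still readable - backward compatibility:

    import io
    from fastavro import schemaless_reader, schemaless_writer

    writer_schema = {  # old version of the schema, used when the data was written
        "type": "record", "name": "Person",
        "fields": [{"name": "userName", "type": "string"}],
    }
    reader_schema = {  # new version: adds a field, with a default value
        "type": "record", "name": "Person",
        "fields": [
            {"name": "userName", "type": "string"},
            {"name": "favoriteColor", "type": ["null", "string"], "default": None},
        ],
    }

    buf = io.BytesIO()
    schemaless_writer(buf, writer_schema, {"userName": "Martin"})
    buf.seek(0)

    # Avro resolves writer's and reader's schema: the missing field gets its default
    decoded = schemaless_reader(buf, writer_schema, reader_schema)
    print(decoded)  # {'userName': 'Martin', 'favoriteColor': None}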

The Merits of Schemas

  • binary encodings based on schemas:
    • can be much more compact by omitting field names
    • schema = up-to-date documentation
    • can check forward and backward compatibility
    • can generate code in statically typed languages

Dataflow Through Databases

  • may be a single process accessing the database
  • backward compatibility is necessary (code must be able to decode previously written encodings)
  • forward compatibility is often required too (a value in the database may be written by a newer version of the code and later read by older code that is still running)
  • need to be careful with unknown fields (old code reads a record containing a new field, updates it, and writes it back - the new field should be preserved even though the old code can't interpret it; see the sketch after this list)
  • migrating to a new schema is possible but expensive
  • formats like Avro are good for data dumps (one go, thereafter immutable), or column-oriented format (e.g. Parquet - for analytics)
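
A sketch of the unknown-field hazard (plain dicts stand in for decoded database records; the photoUrl field and values are hypothetical): old code that copies only the fields it knows about will silently drop a field added by newer code when it writes the record back:

    # record written by NEW code, which added the "photoUrl" field
    stored = {"userName": "Martin", "favoriteNumber": 1337,
              "photoUrl": "https://example.com/martin.jpg"}

    # careless OLD code: rebuilds the record from only the fields it understands
    lossy = {"userName": stored["userName"],
             "favoriteNumber": stored["favoriteNumber"] + 1}
    assert "photoUrl" not in lossy  # the new field would be lost on write-back

    # careful OLD code: carries unknown fields through untouched
    safe = {**stored, "favoriteNumber": stored["favoriteNumber"] + 1}
    assert safe["photoUrl"] == stored["photoUrl"]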

Dataflow Through Services

  • clients and servers over a network
  • services are similar to databases in that they expose an application-specific API that allows only the inputs and outputs permitted by business logic
  • each service is owned by one team and updated independently => old and new versions of servers and clients run at the same time, so the data encoding must be compatible across versions of the service API
  • web service:
    • HTTP protocol
    • REST: not a protocol - emphasizes simple data formats, URLs for identifying resources, using HTTP features for cache control/auth/content-type negotiation, less code generation and more automated tooling (see the sketch after this list)
    • SOAP: XML-based protocol that avoids using most HTTP features; comes with a sprawling set of related standards (the WS-* framework), and the API is described with the Web Services Description Language (WSDL); messages are not designed to be human-readable
  • remote procedure call (RPC): request to a remote network service looks like calling a function or method within the same process (location transparency)
    • local function calls are predictable and either succeed or fail; network requests are not
    • local function calls return a result, throw an exception, or never return; network requests may return with no result at all (timeout)
    • retried network requests might actually be getting through with only the responses lost (bad without idempotence)
    • network requests are slower
    • larger objects are harder to pass over a network
    • RPC may have to translate datatypes across programming languages between client and server
    • RPC is supported on top of many encodings (Thrift, Avro, gRPC for protobufs, etc.)
      • often make remote requests more explicit (futures/promises)
      • gRPC supports streams (a call is not just one request followed by one response)
      • service discovery: client can find which IP/port for a service
    • performant, but harder to experiment/debug
    • more common on requests between services under the same organization/datacenter
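
A sketch of the REST style using only the Python standard library (the URL and response fields are hypothetical): the resource is identified by its URL, ordinary HTTP features handle content-type negotiation, and the call can fail in network-specific ways such as a timeout:

    import json
    import urllib.request

    # hypothetical service endpoint - the user resource is identified by a URL
    req = urllib.request.Request(
        "https://api.example.com/users/1337",
        headers={"Accept": "application/json"},  # content-type negotiation via HTTP
    )

    # unlike a local function call, this may hang, time out, or fail mid-flight
    with urllib.request.urlopen(req, timeout=5) as resp:
        user = json.loads(resp.read().decode("utf-8"))
    print(user["userName"])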

Message-Passing Dataflow

  • asynchronous message-passing systems: between RPC and databases
  • message: like RPC, a client's request (message) is delivered to another process with low latency; like databases, it is not sent via a direct network connection
  • message broker (message queue, message-oriented middleware): an intermediary that stores the message temporarily (see the sketch at the end of this section)
  • can act as a buffer
  • can automatically redeliver messages to a crashed process
  • sender does not need to know IP/port
  • one message can be sent to multiple sinks
  • sender is decoupled from recipient (publish model)
  • often one-way (the sender gets no response from the receiving process unless it is sent on a separate channel)
  • asynchronous: send and forget
  • e.g. RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka
  • messages are sent to named queues (topics) - one-way
  • broker ensures messages are delivered to consumers (subscribers) - can be many for the same topic
  • data models are usually not enforced (any encoding format can be used)
  • actor model: concurrency in a single process
    • avoids problems with threads (race conditions, locking, deadlock)
    • each actor is one client/entity, may have local state, communicates with other actors by sending and receiving async messages
    • message delivery is not guaranteed (messages can be lost in some error scenarios)
  • distributed actor framework: scales across multiple nodes through message-passing
  • still have to worry about forward/backward compatibility
  • e.g. Akka, Orleans, Erlang OTP
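
A sketch of send-and-forget message passing, with an in-process queue.Queue standing in for a real broker such as RabbitMQ or Kafka (a real broker runs as a separate server, persists messages, and supports many consumers per topic):

    import json
    import queue
    import threading

    topic = queue.Queue()  # stand-in for a named queue (topic) on a message broker

    def consumer():
        while True:
            raw = topic.get()        # the broker delivers messages to consumers
            if raw is None:          # sentinel used only to end this sketch
                return
            event = json.loads(raw)  # the broker does not enforce any data model
            print("processed:", event)

    t = threading.Thread(target=consumer)
    t.start()

    # producer: encode, send, and forget - no reply expected, recipient unknown to sender
    topic.put(json.dumps({"userName": "Martin", "action": "signup"}))
    topic.put(None)
    t.join()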