DDIA: Chapter 4
Encoding and Evolution
- How can we change schemas in an application?
- rolling upgrade (staged rollout): gradually deploying and checking => can be done server-side, avoids downtime
- client-side requires user action
- backward compatibility: newer code can read data written by older code
- forward compatibility: older code can read data written by newer code
- in memory, data is kept in data structures optimized for efficient access/manipulation by the CPU
- on disk or over network, data is encoded
- encoding (serialization, marshalling): in-memory => byte
- decoding (parsing, deserialization, unmarshalling): byte => in-memory
- languages often have built-in encoding support
- often language-specific
- decoding needs to be able to instantiate arbitrary classes, a security risk (see the pickle sketch below)
- versioning, compatibility, efficiency are often not priorities
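A minimal sketch of a language-specific encoding using Python's built-in pickle module (the Point class is just an illustration): the encoded bytes embed the Python module and class name, so they are only readable from Python, and decoding has to instantiate whatever class the bytes name.

```python
import pickle

class Point:
    """Plain in-memory object; pickle records its module and class name."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

# encoding (serialization): in-memory object -> bytes
data = pickle.dumps(Point(1, 2))
print(data)  # the bytes mention '__main__' and 'Point' - a Python-only format

# decoding (deserialization): bytes -> in-memory object
# pickle instantiates whatever class the bytes name, which is why
# unpickling data from an untrusted source is a security risk
p = pickle.loads(data)
print(p.x, p.y)  # 1 2
```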
JSON, XML, and Binary Variants
- textual, somewhat human readable
- ambiguity around encoding numbers (e.g. integer vs. string of digits)
- Unicode is supported, binary is not (can hack with base64)
- optional schema support for JSON/XML, CSV doesn’t have schema
- binary encoding
- e.g. MessagePack for JSON
- still need to include the object field names in the encoded data (see the sketch below)
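A small stdlib-only sketch of the points above, using the chapter's example record: JSON is human-readable, but every record repeats the field names, binary data has to be smuggled in as Base64 text, and the distinction between an integer and a string of digits rests entirely on the quoting.

```python
import base64
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,          # integer - would become a string if quoted
    "interests": ["daydreaming", "hacking"],
    # JSON has no binary type: raw bytes must be Base64-encoded (~33% larger)
    "avatar": base64.b64encode(b"\x89PNG\r\n...").decode("ascii"),
}

encoded = json.dumps(record)
print(encoded)
print(len(encoded), "bytes - field names are repeated in every record")

decoded = json.loads(encoded)
avatar_bytes = base64.b64decode(decoded["avatar"])
```

A binary variant like MessagePack saves a little space, but it still has to carry the field-name strings, which is why the schema-based encodings below are much more compact.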
Thrift and Protocol Buffers
- binary encoding libraries
- require a schema
- IDL: interface definition language
- come with code generation tool that produces classes that implement the schema
- Thrift BinaryProtocol: field tags instead of field names encoded
- Thrift CompactProtocol: field type and tag number packed into a single byte, variable-length integers - top bit of each byte indicates whether more bytes follow (see the varint sketch below)
- Protocol Buffers: similar to Thrift CompactProtocol
- fields can be marked as required or optional
- schema evolution
- field names can be changed in the schema (since reliance is on tags)
- can add new fields through new tag numbers (old code can ignore) => forward compatibility
- backward compatibility if fields have unique tag numbers (however new fields cannot be required - they must be optional or have a default value)
- removing a field is similar: can only remove an optional field (cannot use the same tag number again) and the old tag number will be ignored
- datatypes can also be changed with caveats of size
- Protocol Buffers have a repeated marker (instead of array or list datatypes) where the same field tag simply appears multiple times; okay to change an optional field into a repeated one (old code reading new data will see only the last element of the list)
- Thrift has a dedicated list datatype - cannot go from single-valued to multi-valued, but it supports nested lists
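A toy sketch (not a real Thrift or Protocol Buffers implementation) of the two ideas above: integers use a variable-length encoding where the top bit of each byte says whether more bytes follow, and fields are identified on the wire by their tag number plus a type marker, never by name.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned variable-length int: 7 bits per byte, least-significant group
    first; the top bit is set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(tag: int, value: int) -> bytes:
    """Protocol Buffers-style key byte: (field tag << 3) | wire type
    (0 = varint). Only the tag number goes on the wire, not the field name,
    which is why tags can never be reused but names can change freely."""
    key = (tag << 3) | 0
    return encode_varint(key) + encode_varint(value)

# field tag 2 (say, "favoriteNumber" in the schema), value 1337
print(encode_int_field(2, 1337).hex())   # '10b90a' - three bytes in total
```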
Avro
- subproject of Hadoop
- two schema languages: Avro IDL for human editing; another based on JSON for machine-reading
- encoding is just the values concatenated together (very compact) - e.g. a string is a length prefix followed by UTF-8 bytes; nothing in the encoding identifies fields or datatypes (see the sketch below)
- data can only be decoded correctly with the exact schema it was written with (fields are identified only by their position in the schema)
- writer’s schema and reader’s schema don’t have to be the same, just compatible - Avro can resolve differences
- forward compatibility: new version of schema as writer; old version of schema as reader
- backward compatibility: new version of schema as reader; old version of schema as writer
- can only add or remove a field that has a default value
- must use a union type to allow a field to be null
- changing a datatype of a field is possible if the type can be converted
- changing the name of a field/adding a branch to a union type is backward compatible but not forward compatible
- large file with lots of records: can include writer’s schema at beginning of file
- database with individually written records: include a version number per record, and keep a list of schema versions in the database
- sending records over a network connection: can negotiate the schema version on connection setup
- friendlier to dynamically generated schemas - no field tags to assign by hand, so a new schema can simply be generated whenever e.g. a database schema changes
- can be used with or without code generation (the file is self-describing)
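A toy sketch (not the real Avro library) of why the reader must know the writer's schema: the encoding is just the field values in schema order, so the reader walks the writer's schema to interpret the bytes. Avro encodes longs with a zigzag variable-length scheme and strings as a length prefix followed by UTF-8 bytes.

```python
def zigzag_varint(n: int) -> bytes:
    """Avro long: zigzag-map signed -> unsigned (0,-1,1,-2 -> 0,1,2,3),
    then 7 bits per byte with a continuation bit."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_record(writer_schema: list, record: dict) -> bytes:
    """Concatenate the values in the order the writer's schema lists them -
    no field names, no tags, no type markers."""
    out = b""
    for name, typ in writer_schema:
        if typ == "long":
            out += zigzag_varint(record[name])
        elif typ == "string":
            data = record[name].encode("utf-8")
            out += zigzag_varint(len(data)) + data   # length prefix, then bytes
    return out

# made-up two-field schema: field order is the only thing identifying fields
schema = [("userName", "string"), ("favoriteNumber", "long")]
print(encode_record(schema, {"userName": "Martin", "favoriteNumber": 1337}).hex())
# '0c4d617274696ef214' - a reader using a different schema would misparse this
```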
The Merits of Schemas
- binary encodings based on schemas:
- can be much more compact by omitting field names
- schema = up-to-date documentation
- can check forward and backward compatibility
- can generate code in statically typed languages
Dataflow Through Databases
- may be a single process accessing the database
- backward compatibility is necessary (must decode previously written encodings)
- forward compatibility often required (a value in the database may be written by a newer version of the code and then read by a still-running older version)
- need to be careful with unknown fields (old version of code reads a record with a new field, updates it, and writes it back - the unknown field must be preserved; see the sketch below)
- migrating to a new schema is possible but expensive
- formats like Avro are good for data dumps (written in one go, thereafter immutable); column-oriented formats (e.g. Parquet) are good for analytics
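A small sketch of the unknown-fields pitfall mentioned above (the record and field names are made up): if older code loads a record that newer code wrote, changes one field, and writes the whole record back, any field it does not know about must be copied through untouched or it is silently lost.

```python
# record as stored by newer code, which added a "phoneNumber" field
stored = {"userName": "Martin", "favoriteNumber": 1337, "phoneNumber": "+44 1234"}

def update_favorite_number_lossy(record, n):
    # BAD: older code rebuilds the record from only the fields it knows about,
    # so the unknown "phoneNumber" field is dropped on write-back
    return {"userName": record["userName"], "favoriteNumber": n}

def update_favorite_number_safe(record, n):
    # GOOD: copy everything, including fields this version doesn't understand,
    # and change only the field actually being updated
    updated = dict(record)
    updated["favoriteNumber"] = n
    return updated

print(update_favorite_number_lossy(stored, 42))   # phoneNumber silently lost
print(update_favorite_number_safe(stored, 42))    # phoneNumber preserved
```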
Dataflow Through Services
- clients and servers over a network
- services are similar to databases that expose application-specific APIs to allow only the I/O permitted by business logic
- each service owned by one team => data encoding used by servers/clients is compatible across service API versions
- web service:
- HTTP protocol
- REST: not a protocol - emphasizes simple data formats, URLs for identifying resources, use HTTP features for cache control/auth/content type negotiation, less code generation and more automated tooling
- SOAP: XML-based protocol, avoids using most HTTP features (part of a larger web services framework of standards; APIs are described with the Web Services Description Language, WSDL), messages are not designed to be human-readable
- remote procedure call (RPC): request to a remote network service looks like calling a function or method within the same process (location transparency)
- local function calls are predictable and either succeed or fail, unlike network requests
- local function calls return a result, throw an exception, or never return; network requests may also return with no result (timeout)
- network request retries might actually be getting through, with only the responses lost (bad without idempotency)
- network requests are slower
- larger objects are harder to pass over a network
- RPC may have to translate datatypes across programming languages between client and server
- RPC is supported on top of many encodings (Thrift, Avro, gRPC for protobufs, etc.)
- often make remote requests more explicit (futures/promises; see the sketch below)
- gRPC supports streams (a call is not just one request/one response)
- service discovery: client can find which IP/port for a service
- performant, but harder to experiment/debug
- more common on requests between services under the same organization/datacenter
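A stdlib-only sketch of two of the points above (the URL and response shape are made up): a REST-style request is just JSON over HTTP with the content type negotiated via headers, and wrapping it in a future keeps the remote call explicitly asynchronous and fallible instead of disguising it as a local function call.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def fetch_user(user_id: int) -> dict:
    # hypothetical REST endpoint: the resource is identified by its URL
    url = f"https://api.example.com/users/{user_id}"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=2) as resp:   # may be slow, time out, or fail
        return json.loads(resp.read())

# a future makes the network round-trip explicit, unlike an RPC facade that
# pretends the remote service is an ordinary in-process function
future = executor.submit(fetch_user, 1337)
try:
    user = future.result(timeout=5)
except Exception as exc:            # timeouts, connection errors, bad responses...
    user = None
    print("remote call failed:", exc)
```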
Message-Passing Dataflow
- asynchronous message-passing systems: between RPC and databases
- like RPC: a client's request (message) is delivered to another process with low latency; like databases: the message goes through an intermediary rather than a direct network connection
- message broker: (message queue, message-oriented middleware) intermediary that stores a message temporarily
- can act as a buffer
- can automatically redeliver messages to a crashed process
- sender does not need to know IP/port
- one message can be sent to multiple sinks
- sender is decoupled from recipient (publish/subscribe model)
- often one-way (no response from the receiving process unless it is sent on a separate channel)
- asynchronous: send and forget
- e.g. RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka
- messages are sent to named queues (topics) - one-way
- broker ensures messages are delivered to consumers (subscribers) - can be many for the same topic
- data models usually not enforced (use any encoding format)
- actor model: concurrency within a single process (see the toy actor sketch below)
- avoids problems with threads (race conditions, locking, deadlock)
- each actor is one client/entity, may have local state, communicates with other actors by sending and receiving async messages
- message delivery is not guaranteed
- distributed actor framework: scales across multiple nodes through message-passing
- still have to worry about forward/backward compatibility
- e.g. Akka, Orleans, Erlang OTP
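A toy single-process sketch of the actor model using only the standard library: each actor owns its state, drains a mailbox one message at a time, and is reachable only by sending it asynchronous messages, so no locks are needed. Distributed actor frameworks (Akka, Orleans, Erlang OTP) run the same model across nodes, which is where the message encoding and its forward/backward compatibility come in.

```python
import queue
import threading
import time

class Counter:
    """Toy actor: private state, a mailbox, and one thread draining it."""
    def __init__(self):
        self.count = 0                       # local state, touched only by the actor's thread
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)            # asynchronous: send and forget, no reply

    def _run(self):
        while True:
            message = self.mailbox.get()     # messages are handled one at a time
            if message == "increment":
                self.count += 1

counter = Counter()
for _ in range(3):
    counter.send("increment")
time.sleep(0.1)                              # crude: give the mailbox time to drain
print(counter.count)                         # 3
```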