DDIA: Chapter 4
Encoding and Evolution
- How can we change schemas in an application?
- rolling upgrade (staged rollout): gradually deploying and checking => can be done server-side, avoids downtime
- client-side requires user action
- backward compatibility: newer code can read data written by older code
- forward compatibility: older code can read data written by newer code
- in memory, data is kept in data structures optimized for efficient access/manipulation by the CPU
- on disk or over network, data is encoded
- encoding (serialization, marshalling): in-memory => byte
- decoding (parsing, deserialization, unmarshalling): byte => in-memory
- languages often have built-in encoding support
- often language-specific
- decoding needs to be able to instantiate arbitrary classes, a security risk (see the pickle sketch below)
- versioning, compatibility, efficiency are often not priorities
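A minimal sketch of a language-specific encoding using Python's built-in pickle module (the Point class is just an illustration): the encoded bytes embed the Python module and class name, so they are only readable from Python, and decoding has to instantiate whatever class the bytes name.

```python
import pickle

class Point:
    """Plain in-memory object; pickle records its module and class name."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

# encoding (serialization): in-memory object -> bytes
data = pickle.dumps(Point(1, 2))
print(data)  # the bytes mention '__main__' and 'Point' - a Python-only format

# decoding (deserialization): bytes -> in-memory object
# pickle instantiates whatever class the bytes name, which is why
# unpickling data from an untrusted source is a security risk
p = pickle.loads(data)
print(p.x, p.y)  # 1 2
```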
JSON, XML, and Binary Variants
- textual, somewhat human readable
- ambiguity around encoding numbers (e.g. integer vs. string of digits)
- Unicode is supported, binary is not (can hack with base64)
- optional schema support for JSON/XML, CSV doesn’t have schema
- binary encoding
- e.g. MessagePack for JSON
- still need to include the object field names in the encoded data (see the sketch below)
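A small stdlib-only sketch of the points above, using the chapter's example record: JSON is human-readable, but every record repeats the field names, binary data has to be smuggled in as Base64 text, and the distinction between an integer and a string of digits rests entirely on the quoting.

```python
import base64
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,          # integer - would become a string if quoted
    "interests": ["daydreaming", "hacking"],
    # JSON has no binary type: raw bytes must be Base64-encoded (~33% larger)
    "avatar": base64.b64encode(b"\x89PNG\r\n...").decode("ascii"),
}

encoded = json.dumps(record)
print(encoded)
print(len(encoded), "bytes - field names are repeated in every record")

decoded = json.loads(encoded)
avatar_bytes = base64.b64decode(decoded["avatar"])
```

A binary variant like MessagePack saves a little space, but it still has to carry the field-name strings, which is why the schema-based encodings below are much more compact.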
Thrift and Protocol Buffers
- binary encoding libraries
- require a schema
- IDL: interface definition language
- come with code generation tool that produces classes that implement the schema
- Thrift BinaryProtocol: field tags instead of field names encoded
- Thrift CompactProtocol: field type and tag number packed into a single byte, variable-length integers - top bit of each byte indicates whether more bytes follow (see the varint sketch below)
- Protocol Buffers: similar to Thrift CompactProtocol
- fields can be marked as required or optional
- schema evolution
- field names can be changed in the schema (since reliance is on tags)
- can add new fields through new tag numbers (old code can ignore) => forward compatibility
- backward compatibility if fields have unique tag numbers (however new fields cannot be required - they must be optional or have a default value)
- removing a field is similar: can only remove an optional field (cannot use the same tag number again) and the old tag number will be ignored
- datatypes can also be changed with caveats of size
- Protocol Buffers have a repeated marker (instead of array or list datatypes) where the same field tag simply appears multiple times; okay to change an optional field into a repeated one (old code reading new data will see only the last element of the list)
- Thrift has a dedicated list datatype - cannot go from single-valued to multi-valued, but it supports nested lists
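A toy sketch (not a real Thrift or Protocol Buffers implementation) of the two ideas above: integers use a variable-length encoding where the top bit of each byte says whether more bytes follow, and fields are identified on the wire by their tag number plus a type marker, never by name.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned variable-length int: 7 bits per byte, least-significant group
    first; the top bit is set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(tag: int, value: int) -> bytes:
    """Protocol Buffers-style key byte: (field tag << 3) | wire type
    (0 = varint). Only the tag number goes on the wire, not the field name,
    which is why tags can never be reused but names can change freely."""
    key = (tag << 3) | 0
    return encode_varint(key) + encode_varint(value)

# field tag 2 (say, "favoriteNumber" in the schema), value 1337
print(encode_int_field(2, 1337).hex())   # '10b90a' - three bytes in total
```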
Avro
- subproject of Hadoop
- two schema languages: Avro IDL for human editing; another based on JSON for machine-reading
- encoding is just the values concatenated together (very compact) - e.g. a string is a length prefix followed by UTF-8 bytes; nothing in the encoding identifies fields or datatypes (see the sketch below)
- data can only be decoded correctly with the exact schema it was written with (fields are identified only by their position in the schema)
- writer’s schema and reader’s schema don’t have to be the same, just compatible - Avro can resolve differences
- forward compatibility: new version of schema as writer; old version of schema as reader
- backward compatibility: new version of schema as reader; old version of schema as writer
- can only add or remove a field that has a default value
- must use a union type to allow a field to be null
- changing a datatype of a field is possible if the type can be converted
- changing the name of a field/adding a branch to a union type is backward compatible but not forward compatible
- large file with lots of records: can include writer’s schema at beginning of file
- database with individually written records: include a version number per record, and keep a list of schema versions in the database
- sending records over a network connection: can negotiate the schema version on connection setup
- friendlier to dynamically generated schemas - no field tags to assign by hand, so a new schema can simply be generated whenever e.g. a database schema changes
- can be used with or without code generation (the file is self-describing)
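A toy sketch (not the real Avro library) of why the reader must know the writer's schema: the encoding is just the field values in schema order, so the reader walks the writer's schema to interpret the bytes. Avro encodes longs with a zigzag variable-length scheme and strings as a length prefix followed by UTF-8 bytes.

```python
def zigzag_varint(n: int) -> bytes:
    """Avro long: zigzag-map signed -> unsigned (0,-1,1,-2 -> 0,1,2,3),
    then 7 bits per byte with a continuation bit."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_record(writer_schema: list, record: dict) -> bytes:
    """Concatenate the values in the order the writer's schema lists them -
    no field names, no tags, no type markers."""
    out = b""
    for name, typ in writer_schema:
        if typ == "long":
            out += zigzag_varint(record[name])
        elif typ == "string":
            data = record[name].encode("utf-8")
            out += zigzag_varint(len(data)) + data   # length prefix, then bytes
    return out

# made-up two-field schema: field order is the only thing identifying fields
schema = [("userName", "string"), ("favoriteNumber", "long")]
print(encode_record(schema, {"userName": "Martin", "favoriteNumber": 1337}).hex())
# '0c4d617274696ef214' - a reader using a different schema would misparse this
```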
The Merits of Schemas
- binary encodings based on schemas:
- can be much more compact by omitting field names
- schema = up-to-date documentation
- can check forward and backward compatibility
- can generate code in statically typed languages
Dataflow Through Databases
- may be a single process accessing the database
- backward compatibility is necessary (must decode previously written encodings)
- forward compatibility often required (a value in the database may be written by a newer version of the code and then read by a still-running older version)
- need to be careful with unknown fields (old version of code reads a record with a new field, updates it, and writes it back - the unknown field must be preserved; see the sketch below)
- migrating to a new schema is possible but expensive
- formats like Avro are good for data dumps (written in one go, thereafter immutable); column-oriented formats (e.g. Parquet) are good for analytics
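A small sketch of the unknown-fields pitfall mentioned above (the record and field names are made up): if older code loads a record that newer code wrote, changes one field, and writes the whole record back, any field it does not know about must be copied through untouched or it is silently lost.

```python
# record as stored by newer code, which added a "phoneNumber" field
stored = {"userName": "Martin", "favoriteNumber": 1337, "phoneNumber": "+44 1234"}

def update_favorite_number_lossy(record, n):
    # BAD: older code rebuilds the record from only the fields it knows about,
    # so the unknown "phoneNumber" field is dropped on write-back
    return {"userName": record["userName"], "favoriteNumber": n}

def update_favorite_number_safe(record, n):
    # GOOD: copy everything, including fields this version doesn't understand,
    # and change only the field actually being updated
    updated = dict(record)
    updated["favoriteNumber"] = n
    return updated

print(update_favorite_number_lossy(stored, 42))   # phoneNumber silently lost
print(update_favorite_number_safe(stored, 42))    # phoneNumber preserved
```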
Dataflow Through Services
- clients and servers over a network
- services are similar to databases that expose application-specific APIs to allow only the I/O permitted by business logic
- each service owned by one team => data encoding used by servers/clients is compatible across service API versions
- web service:
- HTTP protocol
- REST: not a protocol - emphasizes simple data formats, URLs for identifying resources, use HTTP features for cache control/auth/content type negotiation, less code generation and more automated tooling
- SOAP: XML-based protocol, avoids using most HTTP features (part of a larger web services framework of standards; APIs are described with the Web Services Description Language, WSDL), messages are not designed to be human-readable
- remote procedure call (RPC): request to a remote network service looks like calling a function or method within the same process (location transparency)
- local function calls are predictable and either succeed or fail, unlike network requests
- local function calls return a result, throw an exception, or never return; network requests may also return with no result (timeout)
- network request retries might actually be getting through, with only the responses lost (bad without idempotency)
- network requests are slower
- larger objects are harder to pass over a network
- RPC may have to translate datatypes across programming languages between client and server
- RPC is supported on top of many encodings (Thrift, Avro, gRPC for protobufs, etc.)
- often make remote requests more explicit (futures/promises; see the sketch below)
- gRPC supports streams (a call is not just one request/one response)
- service discovery: client can find which IP/port for a service
- performant, but harder to experiment/debug
- more common on requests between services under the same organization/datacenter
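A stdlib-only sketch of two of the points above (the URL and response shape are made up): a REST-style request is just JSON over HTTP with the content type negotiated via headers, and wrapping it in a future keeps the remote call explicitly asynchronous and fallible instead of disguising it as a local function call.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def fetch_user(user_id: int) -> dict:
    # hypothetical REST endpoint: the resource is identified by its URL
    url = f"https://api.example.com/users/{user_id}"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=2) as resp:   # may be slow, time out, or fail
        return json.loads(resp.read())

# a future makes the network round-trip explicit, unlike an RPC facade that
# pretends the remote service is an ordinary in-process function
future = executor.submit(fetch_user, 1337)
try:
    user = future.result(timeout=5)
except Exception as exc:            # timeouts, connection errors, bad responses...
    user = None
    print("remote call failed:", exc)
```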
Message-Passing Dataflow
- asynchronous message-passing systems: between RPC and databases
- like RPC: a client's request (message) is delivered to another process with low latency; like databases: the message goes through an intermediary rather than a direct network connection
- message broker: (message queue, message-oriented middleware) intermediary that stores a message temporarily
- can act as a buffer
- can automatically redeliver messages to a crashed process
- sender does not need to know IP/port
- one message can be sent to multiple sinks
- sender is decoupled from recipient (publish/subscribe model)
- often one-way (no response from the receiving process unless it is sent on a separate channel)
- asynchronous: send and forget
- e.g. RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka
- messages are sent to named queues (topics) - one-way
- broker ensures messages are delivered to consumers (subscribers) - can be many for the same topic
- data models usually not enforced (use any encoding format)
- actor model: concurrency within a single process (see the toy actor sketch below)
- avoids problems with threads (race conditions, locking, deadlock)
- each actor is one client/entity, may have local state, communicates with other actors by sending and receiving async messages
- message delivery is not guaranteed
- distributed actor framework: scales across multiple nodes through message-passing
- still have to worry about forward/backward compatibility
- e.g. Akka, Orleans, Erlang OTP
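A toy single-process sketch of the actor model using only the standard library: each actor owns its state, drains a mailbox one message at a time, and is reachable only by sending it asynchronous messages, so no locks are needed. Distributed actor frameworks (Akka, Orleans, Erlang OTP) run the same model across nodes, which is where the message encoding and its forward/backward compatibility come in.

```python
import queue
import threading
import time

class Counter:
    """Toy actor: private state, a mailbox, and one thread draining it."""
    def __init__(self):
        self.count = 0                       # local state, touched only by the actor's thread
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)            # asynchronous: send and forget, no reply

    def _run(self):
        while True:
            message = self.mailbox.get()     # messages are handled one at a time
            if message == "increment":
                self.count += 1

counter = Counter()
for _ in range(3):
    counter.send("increment")
time.sleep(0.1)                              # crude: give the mailbox time to drain
print(counter.count)                         # 3
```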