DDIA: Chapter 2

DDIA: Chapter 2

Data Models and Query Languages

  • Data models are important!

Relational Model Versus Document Model

  • Relational Model
    • e.g. SQL
    • data is organized into relations (unordered tuples = rows)
    • translation is needed between application objects and DB tables
    • some support XML/JSON/etc. in cells, but, difficult to query
  • NoSQL
    • “Not Only SQL” - anything non-relational (e.g. document, graph)
    • need for scale (large datasets, high write throughput), special query operations
    • document models like JSON have better locality (one query vs. multiple), and store nested records
      • e.g. MongoDB
    • document use case: data is in self-contained documents, relationships between documents are rare
    • graph use case: anything is potentially related to everything

Many-to-One and Many-to-Many Relationships

  • IDs vs. plain-text strings helps with consistency and ease of updating/localization/search
  • many-to-one is less suited to the document model (since joins are required)
    • e.g. lots of people may live in one location - finding who lives in one location when the document model object is based on a person is difficult

Relational Versus Document Databases Today

  • Which data model leads to simpler application code?
    • tree of one-to-many relationships, entire tree is loaded at once => document
    • many-to-many => relational, graph
    • depends on the kinds of relationships that exist between data items
  • most document databases do not enforce schema
    • schema-on-read: structure is implicit, interpreted on read
    • schema-on-write: like relational, schema is explicit, written data conforms
  • schema changes/updates can be slow (every row needs to be rewritten)
  • documents usually have storage locality (performance advantage) but only when updating large parts of the document (entire object needs to be rewritten)

Query Languages for Data

  • declarative languages (in data) tend to be easier to write/interpret since properties of data are being specified
  • declarative languages (in data) are more amenable to parallelization

MapReduce Querying

  • in-between declarative and imperative
  • based on map (collect) and reduce (fold or inject) type operations
  • must be pure functions (can only use input data, no DB additional queries)
  • amenable to distributed execution

Graph-Like Data Models

  • good for many-to-many relationships

Property Graphs

  • vertex:
    • unique identifier
    • incoming/outgoing edges
    • collection of properties (KV pairs)
  • edge:
    • unique identifier
    • vertex start (tail vertex), vertex end (head vertex)
    • label describing relationship
    • collection of properties (KV pairs)
  • efficient to traverse
  • amenable to chains of relationships (e.g. living in Country -> State -> City), recursion

Graph Queries in SQL

  • graph can be mapped to a relational database, but have to know what queries to perform beforehand

Triple-Stores and SPARQL

  • modeled as (subject, predicate, object)