Data Modeling & Database Architecture

Overview

Data modeling is the process of defining how data is stored, connected, and accessed. The right model depends on the access patterns, consistency requirements, and scale characteristics of each bounded context. This skill covers modeling techniques across all major database paradigms, drawn from Martin Kleppmann's Designing Data-Intensive Applications (Chapters 2-3) and industry-proven schema design patterns.

Relational Modeling

Normalization (1NF through 3NF)

Normalization eliminates data redundancy and update anomalies by organizing data into well-structured tables.

Normal Form	Rule	Example Violation
1NF	Atomic values only; no repeating groups	A `phones` column containing comma-separated numbers
2NF	1NF + no partial dependencies on composite keys	`order_items(order_id, product_id, product_name)` -- `product_name` depends only on `product_id`
3NF	2NF + no transitive dependencies	`employees(id, dept_id, dept_name)` -- `dept_name` depends on `dept_id`, not `id`

Process: Start normalized (3NF). This gives you the cleanest write path and prevents update anomalies.

Denormalization for Read Performance

When read-heavy workloads suffer from excessive joins, selectively denormalize:

sql

-- Normalized: requires JOIN on every read
SELECT o.id, c.name, c.email
FROM orders o JOIN customers c ON o.customer_id = c.id;

-- Denormalized: customer_name stored directly on orders
SELECT id, customer_name, customer_email FROM orders;

Trade-off: Faster reads, but writes must update data in multiple places. Use denormalization when:

•Read frequency vastly exceeds write frequency for the denormalized data.
•Join complexity is measurably hurting query performance.
•You accept the cost of maintaining consistency across copies.

For visual schema documentation, cross-reference specs/diagramming/erd.

Document Modeling (MongoDB-Style)

Embedding vs. Referencing

Strategy	When to Use	Example
Embed	Data is read together, 1:1 or 1:few relationship, child is owned by parent	`{ order: { items: [...] } }`
Reference	Data is read independently, 1:many or many:many, child is shared	`{ order: { item_ids: [...] } }`

Rule of thumb: Embed what you read together. Reference what you update independently.

Document Schema Design Patterns

Pattern	Problem It Solves	Technique
Subset	Documents too large; only part is needed per query	Store frequently accessed fields in main doc; archive rest in separate collection
Computed	Expensive aggregation on every read	Pre-compute and store derived values; update on write or via background job
Extended Reference	Frequent joins to get a few fields from related docs	Copy the most-accessed fields from the referenced document into the referencing one
Bucket	Too many small documents (e.g., IoT readings)	Group related data into time-bucketed documents (e.g., one doc per hour of readings)
Outlier	A few documents have vastly more sub-items than typical	Flag outlier docs and overflow extra data into separate documents

javascript

// Bucket pattern example -- sensor readings grouped by hour
{
  sensor_id: "temp-42",
  bucket_start: ISODate("2024-01-15T14:00:00Z"),
  readings: [
    { ts: ISODate("2024-01-15T14:00:12Z"), value: 22.5 },
    { ts: ISODate("2024-01-15T14:01:45Z"), value: 22.7 },
    // ... up to ~200 readings per bucket
  ],
  count: 2
}

Graph Modeling (Neo4j-Style)

Graph databases model data as nodes (entities), relationships (edges with direction and type), and properties (key-value pairs on both).

Core Concepts

code

(:Person {name: "Alice"})
   -[:WORKS_AT {since: 2020}]->
(:Company {name: "Acme"})
   -[:LOCATED_IN]->
(:City {name: "Seattle"})

•Nodes represent entities (labeled for type).
•Relationships have a type, direction, and optional properties.
•Properties are key-value pairs on nodes or relationships.

Cypher Basics

cypher

// Find friends of friends who work at the same company
MATCH (me:Person {name: "Alice"})-[:FRIEND]->(friend)-[:FRIEND]->(fof),
      (me)-[:WORKS_AT]->(company)<-[:WORKS_AT]-(fof)
WHERE me <> fof
RETURN fof.name, company.name

When to Use Graph Modeling

•Social networks (friends, followers, connections)
•Recommendation engines (users who bought X also bought Y)
•Fraud detection (detecting rings of related accounts)
•Knowledge graphs and ontologies
•Network/infrastructure mapping

Key-Value and Wide-Column

Redis Data Structures

Structure	Use Case	Example
String	Cache, counters, simple values	`SET user:42:name "Alice"`
Hash	Object storage, partial updates	`HSET user:42 name "Alice" email "a@b.com"`
List	Queues, activity feeds, recent items	`LPUSH notifications:42 "New message"`
Set	Tags, unique collections, intersections	`SADD user:42:skills "go" "rust"`
Sorted Set	Leaderboards, ranked feeds, priority queues	`ZADD leaderboard 9500 "player:7"`
Stream	Event log, message queue	`XADD events * type "order" id "123"`

DynamoDB Single-Table Design

Single-table design places multiple entity types in one table using carefully designed partition keys (PK) and sort keys (SK):

code

PK                  SK                    Data
USER#alice          PROFILE               { name, email, ... }
USER#alice          ORDER#2024-001        { total, status, ... }
USER#alice          ORDER#2024-002        { total, status, ... }
ORG#acme            METADATA              { plan, created, ... }
ORG#acme            MEMBER#alice          { role, joined, ... }

Advantages: Eliminates joins; supports multiple access patterns via GSIs. Trade-off: Complex to design; requires knowing access patterns upfront.

Cassandra Partition Keys

•Partition key determines which node stores the data -- must distribute evenly.
•Clustering columns determine sort order within a partition.
•Design tables around queries, not entities. One table per query pattern is common.

cql

CREATE TABLE sensor_readings (
    sensor_id TEXT,
    reading_date DATE,
    reading_time TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((sensor_id, reading_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

Time-Series Data

InfluxDB / TimescaleDB Concepts

Concept	InfluxDB Term	TimescaleDB Term	Purpose
Identifier	Measurement	Hypertable	Logical table / collection
Indexed metadata	Tags	Indexed columns	Fast filtering (e.g., `sensor_id`, `region`)
Measured values	Fields	Data columns	The actual readings (e.g., `temperature`, `cpu_pct`)
Time	Timestamp	Time column	Required; basis for partitioning and queries

Retention Policies & Downsampling

code

Raw data (1-second granularity) → keep 7 days
Downsampled to 1-minute averages → keep 90 days
Downsampled to 1-hour averages  → keep 2 years

Design retention policies early: time-series data grows fast. Continuous queries or materialized views automate downsampling.

Microservices Data Architecture

Database per Service vs. Shared Database

Approach	Pros	Cons
Database per service	Independent deployment, technology freedom, fault isolation	Cross-service queries require APIs; distributed transactions are hard
Shared database	Easy joins, familiar tooling, simple transactions	Tight coupling, schema changes require coordination, single point of failure

Recommendation: Default to database-per-service for microservices. Use the Saga pattern or event-driven architecture for cross-service consistency. Reserve shared databases for small teams or closely related services where the coordination cost of separation outweighs the coupling cost.

Schema Migration Strategies

Forward-Only Migrations

Every migration moves the schema forward. Rollback is achieved by deploying a new forward migration that reverses the change. Simplest approach; works well for most teams.

Expand-Contract (Parallel Change)

For zero-downtime deployments when changing column types, renaming fields, or splitting tables:

code

Phase 1 — EXPAND
  Add new column (keep old column)
  Deploy code that writes to BOTH columns
  Backfill new column from old column

Phase 2 — MIGRATE
  Deploy code that reads from new column
  Verify all consumers use new column

Phase 3 — CONTRACT
  Deploy code that stops writing to old column
  Drop old column

This pattern prevents breaking changes during rolling deployments where old and new code versions run simultaneously.

Canonical Reference

•Designing Data-Intensive Applications by Martin Kleppmann, Chapters 2-3 -- data models, query languages, storage engines, and the trade-offs between relational, document, and graph models.

Best Practices

•Model around access patterns, not just entity relationships. The best schema depends on how you query, not just what you store.
•Start normalized (3NF) for relational databases; denormalize only when you have measured evidence of read-path bottlenecks.
•In document databases, embed what you read together and reference what you update independently.
•For DynamoDB and Cassandra, know your access patterns before designing the table schema -- retrofitting is painful.
•Use expand-contract migrations for any schema change in a system that requires zero-downtime deployments.
•Time-series data needs retention policies from day one. Without them, storage costs grow unbounded.
•In microservices, prefer database-per-service for independence, and accept the complexity of cross-service data coordination.