
JormunDB Architecture

This document explains the internal architecture of JormunDB, including design decisions, storage formats, and the arena-per-request memory management pattern.

Overview

JormunDB is a DynamoDB-compatible database server that speaks the DynamoDB wire protocol. It uses RocksDB for persistent storage and is written in Odin, whose context allocator system keeps request-scoped memory management simple.

Key Design Goals

  1. Zero allocation ceremony - No explicit defer free() or error handling for every allocation
  2. Binary storage - Efficient TLV encoding instead of JSON
  3. API compatibility - Drop-in replacement for DynamoDB
  4. Performance - RocksDB-backed with efficient key encoding

Why Odin?

The original implementation in Zig suffered from explicit allocator threading:

// Zig version - explicit allocator everywhere
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) !Response {
    const parsed = try parseJson(allocator, request);
    defer parsed.deinit(allocator);

    const item = try storage.getItem(allocator, parsed.table_name, parsed.key);
    defer if (item) |i| freeItem(allocator, i);

    const response = try serializeResponse(allocator, item);
    defer allocator.free(response);

    return response; // Wait, we deferred the free!
}

Odin's context allocator system eliminates this:

// Odin version - implicit context allocator
handle_request :: proc(request: []byte) -> Response {
    // All allocations use context.allocator automatically
    parsed := parse_json(request)
    item := storage_get_item(parsed.table_name, parsed.key)
    response := serialize_response(item)

    return response
    // Everything freed when arena is destroyed
}

Memory Management

JormunDB uses a two-allocator strategy:

1. Arena Allocator (Request-Scoped)

Every HTTP request gets its own arena:

handle_connection :: proc(conn: net.TCP_Socket) {
    // Create arena for this request (4MB)
    arena: mem.Arena
    mem.arena_init(&arena, make([]byte, mem.Megabyte * 4))
    defer mem.arena_destroy(&arena)

    // Set context allocator
    context.allocator = mem.arena_allocator(&arena)

    // All downstream code uses context.allocator
    request := parse_http_request(conn)    // uses arena
    response := handle_request(request)     // uses arena
    send_response(conn, response)           // uses arena

    // Arena is freed here - everything cleaned up automatically
}

Benefits:

  • No individual free() calls needed
  • No errdefer cleanup
  • No use-after-free bugs
  • No memory leaks from forgotten frees
  • Predictable performance (no GC pauses)

2. Default Allocator (Long-Lived Data)

The default allocator (typically context.allocator at program start) is used for:

  • Table metadata
  • Table locks (sync.RW_Mutex)
  • Engine state
  • Items returned from storage layer (copied to request arena when needed)

Storage Format

Binary Keys (Varint-Prefixed Segments)

All keys use varint length prefixes for space efficiency:

Meta key:  [0x01][len][table_name]
Data key:  [0x02][len][table_name][len][pk_value][len][sk_value]?
GSI key:   [0x03][len][table_name][len][index_name][len][gsi_pk][len][gsi_sk]?
LSI key:   [0x04][len][table_name][len][index_name][len][pk][len][lsi_sk]

Example Data Key:

Table: "Users"
PK: "user:123"
SK: "profile"

Encoded:
[0x02]          // Entity type (Data)
[0x05]          // Table name length (5)
Users           // Table name bytes
[0x08]          // PK length (8)
user:123        // PK bytes
[0x07]          // SK length (7)
profile         // SK bytes
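For illustration, the same encoding can be sketched in Python. This is not JormunDB code; the varint and data_key helpers are illustrative stand-ins that reproduce the worked example above (single-byte varints suffice for such short segments):

```python
def varint(n: int) -> bytes:
    """LEB128-style varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def data_key(table: str, pk: str, sk: str = None) -> bytes:
    """[0x02][len][table][len][pk][len][sk]? with varint length prefixes."""
    key = bytearray([0x02])  # entity type tag (Data)
    segments = [table, pk] + ([sk] if sk is not None else [])
    for segment in segments:
        raw = segment.encode()
        key += varint(len(raw)) + raw
    return bytes(key)

# Matches the worked example: 0x02, 0x05 "Users", 0x08 "user:123", 0x07 "profile"
assert data_key("Users", "user:123", "profile") == \
    b"\x02\x05Users\x08user:123\x07profile"
```

The optional sort-key segment is simply omitted when absent, so a partition-key-only item encodes as [0x02][len][table][len][pk].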

Item Encoding (TLV Format)

Items use Tag-Length-Value encoding for space efficiency:

Format:
[attr_count:varint]
  [name_len:varint][name:bytes][type_tag:u8][value_len:varint][value:bytes]...

Type Tags:
  String  = 0x01    Number = 0x02    Binary = 0x03
  Bool    = 0x04    Null   = 0x05
  SS      = 0x10    NS     = 0x11    BS     = 0x12
  List    = 0x20    Map    = 0x21

Example Item:

{
  "id": {"S": "user123"},
  "age": {"N": "30"}
}

Encoded as:

[0x02]              // 2 attributes
  [0x02]            // name length (2)
  id                // name bytes
  [0x01]            // type tag (String)
  [0x07]            // value length (7)
  user123           // value bytes

  [0x03]            // name length (3)
  age               // name bytes
  [0x02]            // type tag (Number)
  [0x02]            // value length (2)
  30                // value bytes (stored as string)
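The TLV layout above can likewise be sketched in Python. Again this is an illustrative model, not JormunDB's implementation; encode_item is a hypothetical helper that reproduces the encoded bytes of the example item:

```python
STRING, NUMBER = 0x01, 0x02  # type tags from the table above

def varint(n: int) -> bytes:
    """LEB128-style varint, as used for all length prefixes."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_item(attrs) -> bytes:
    """attrs: list of (name, type_tag, value_bytes) tuples."""
    buf = bytearray(varint(len(attrs)))          # attribute count
    for name, tag, value in attrs:
        raw = name.encode()
        buf += varint(len(raw)) + raw            # [name_len][name]
        buf.append(tag)                          # [type_tag]
        buf += varint(len(value)) + value        # [value_len][value]
    return bytes(buf)

# Matches the worked example: numbers are stored as their string form
encoded = encode_item([("id", STRING, b"user123"), ("age", NUMBER, b"30")])
assert encoded == b"\x02\x02id\x01\x07user123\x03age\x02\x0230"
```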

Request Flow

1. HTTP POST / arrives
   ↓
2. Create arena allocator (4MB)
   Set context.allocator = arena_allocator
   ↓
3. Parse HTTP headers
   Extract X-Amz-Target → Operation
   ↓
4. Parse JSON body
   Convert DynamoDB JSON → internal types
   ↓
5. Route to handler (e.g., handle_put_item)
   ↓
6. Storage engine operation
   - Build binary key
   - Encode item to TLV
   - RocksDB put/get/delete
   ↓
7. Build response
   - Serialize item to DynamoDB JSON
   - Format HTTP response
   ↓
8. Send response
   ↓
9. Destroy arena
   All request memory freed automatically

Concurrency Model

Table-Level RW Locks

Each table has a reader-writer lock:

Storage_Engine :: struct {
    db:                 rocksdb.DB,
    table_locks:        map[string]^sync.RW_Mutex,
    table_locks_mutex:  sync.Mutex,
}

Read Operations (GetItem, Query, Scan):

  • Acquire shared lock
  • Multiple readers can run concurrently
  • Writers are blocked

Write Operations (PutItem, DeleteItem, UpdateItem):

  • Acquire exclusive lock
  • Only one writer at a time
  • All readers are blocked
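The lazy creation of per-table locks, with the lock map itself guarded by table_locks_mutex, can be sketched in Python. The class and method names are hypothetical, chosen only to mirror the Storage_Engine struct above:

```python
import threading

class StorageEngine:
    """Sketch of the per-table lock registry (mirrors Storage_Engine)."""
    def __init__(self):
        self.table_locks = {}                       # map[string]^sync.RW_Mutex
        self.table_locks_mutex = threading.Lock()   # guards the map itself

    def lock_for(self, table: str):
        # Lazily create the table's lock; the map is protected by its
        # own mutex so concurrent first-touches of a table are safe.
        with self.table_locks_mutex:
            if table not in self.table_locks:
                # Python's stdlib has no RW lock; a plain Lock stands in
                # for sync.RW_Mutex in this sketch.
                self.table_locks[table] = threading.Lock()
            return self.table_locks[table]

engine = StorageEngine()
lock = engine.lock_for("Users")
assert engine.lock_for("Users") is lock  # same lock on every lookup
```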

Thread Safety

  • RocksDB database handles are safe for concurrent use from multiple threads
  • Table metadata is protected by locks
  • Request arenas are thread-local (no sharing)

Error Handling

Errors in Odin are explicit return values; or_return propagates them up the call stack without exceptions:

// Odin error handling
parse_json :: proc(data: []byte) -> (Item, bool) {
    parsed := json.parse(data) or_return
    item := json_to_item(parsed) or_return
    return item, true
}

// Usage
item, ok := parse_json(request.body)
if !ok {
    return error_response(.ValidationException, "Invalid JSON")
}

No exceptions, no panic-recover patterns. Every error path is explicit.

DynamoDB Wire Protocol

Request Format

POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.PutItem
Content-Type: application/x-amz-json-1.0

{
  "TableName": "Users",
  "Item": {
    "id": {"S": "user123"},
    "name": {"S": "Alice"}
  }
}
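For reference, the headers and body above can be assembled with only the standard library; build_put_item_request is a hypothetical helper for illustration, not part of JormunDB:

```python
import json

def build_put_item_request(table, item):
    """Build the headers and JSON body for a DynamoDB-style PutItem call."""
    headers = {
        "X-Amz-Target": "DynamoDB_20120810.PutItem",
        "Content-Type": "application/x-amz-json-1.0",
    }
    body = json.dumps({"TableName": table, "Item": item})
    return headers, body

headers, body = build_put_item_request(
    "Users", {"id": {"S": "user123"}, "name": {"S": "Alice"}})
assert headers["X-Amz-Target"] == "DynamoDB_20120810.PutItem"
assert json.loads(body)["TableName"] == "Users"
```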

Response Format

HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.0
x-amzn-RequestId: local-request-id

{}

Error Format

{
  "__type": "com.amazonaws.dynamodb.v20120810#ResourceNotFoundException",
  "message": "Table not found"
}

Performance Characteristics

Time Complexity

Operation    Complexity      Notes
PutItem      O(log n)        RocksDB LSM tree insert
GetItem      O(log n)        RocksDB point lookup
DeleteItem   O(log n)        RocksDB deletion
Query        O(log n + m)    n = items in table, m = result set size
Scan         O(n)            Full table scan

Space Complexity

  • Binary keys: ~20-100 bytes (vs 50-200 bytes JSON)
  • Binary items: ~30% smaller than JSON
  • Varint encoding saves space on small integers

Benchmarks (Expected)

Based on Zig version performance:

Operation          Throughput      Latency (p50)
PutItem            ~5,000/sec      ~0.2ms
GetItem            ~7,000/sec      ~0.14ms
Query (1 item)     ~8,000/sec      ~0.12ms
Scan (1000 items)  ~20/sec         ~50ms

Future Enhancements

Planned Features

  1. UpdateExpression - SET/REMOVE/ADD/DELETE operations
  2. FilterExpression - Post-query filtering
  3. ProjectionExpression - Return subset of attributes
  4. Global Secondary Indexes - Query by non-key attributes
  5. Local Secondary Indexes - Alternate sort keys
  6. BatchWriteItem - Batch mutations
  7. BatchGetItem - Batch reads
  8. Transactions - ACID multi-item operations

Optimization Opportunities

  1. Connection pooling - Reuse HTTP connections
  2. Bloom filters - Faster negative lookups
  3. Compression - LZ4/Zstd on large items
  4. Caching layer - Hot item cache
  5. Parallel scan - Segment-based scanning

Debugging

Enable Verbose Logging

make run VERBOSE=1

Inspect RocksDB

# Use ldb tool to inspect database
ldb --db=./data scan
ldb --db=./data get <key_hex>

Memory Profiling

Odin's tracking allocator can detect leaks:

when ODIN_DEBUG {
    track: mem.Tracking_Allocator
    mem.tracking_allocator_init(&track, context.allocator)
    context.allocator = mem.tracking_allocator(&track)

    defer {
        for _, entry in track.allocation_map {
            fmt.printfln("Leaked %v bytes at %v", entry.size, entry.location)
        }
        mem.tracking_allocator_destroy(&track)
    }
}

Migration from Zig Version

The Zig version (ZynamoDB) used the same binary storage format, so existing RocksDB databases can be read by JormunDB without migration.

Compatibility

  • Binary key format (byte-compatible)
  • Binary item format (byte-compatible)
  • Table metadata (JSON, compatible)
  • HTTP wire protocol (identical)

Breaking Changes

None - JormunDB can open ZynamoDB databases directly.


Contributing

When contributing to JormunDB:

  1. Use the context allocator - All request-scoped allocations should use context.allocator
  2. Avoid manual frees - Let the arena handle it
  3. Long-lived data - Use the default allocator explicitly
  4. Test thoroughly - Run make test before committing
  5. Format code - Run make fmt before committing
