
JormunDB Architecture

This document explains the internal architecture of JormunDB, including design decisions, storage formats, and the arena-per-request memory management pattern.

Overview

JormunDB is a DynamoDB-compatible database server that speaks the DynamoDB wire protocol. It uses RocksDB for persistent storage and is written in Odin, whose context allocator system keeps request-scoped memory management simple.

Key Design Goals

  1. Zero allocation ceremony - No explicit defer free() or error handling for every allocation
  2. Binary storage - Efficient TLV encoding instead of JSON
  3. API compatibility - Drop-in replacement for DynamoDB
  4. Performance - RocksDB-backed with efficient key encoding

Why Odin?

The original implementation in Zig suffered from explicit allocator threading:

// Zig version - explicit allocator everywhere
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) !Response {
    const parsed = try parseJson(allocator, request);
    defer parsed.deinit(allocator);

    const item = try storage.getItem(allocator, parsed.table_name, parsed.key);
    defer if (item) |i| freeItem(allocator, i);

    const response = try serializeResponse(allocator, item);
    defer allocator.free(response);

    return response; // Wait, we deferred the free!
}

Odin's context allocator system eliminates this:

// Odin version - implicit context allocator
handle_request :: proc(request: []byte) -> Response {
    // All allocations use context.allocator automatically
    parsed := parse_json(request)
    item := storage_get_item(parsed.table_name, parsed.key)
    response := serialize_response(item)

    return response
    // Everything freed when arena is destroyed
}

Memory Management

JormunDB uses a two-allocator strategy:

1. Arena Allocator (Request-Scoped)

Every HTTP request gets its own arena:

handle_connection :: proc(conn: net.TCP_Socket) {
    // Create arena for this request (4MB)
    arena: mem.Arena
    mem.arena_init(&arena, make([]byte, mem.Megabyte * 4))
    defer mem.arena_destroy(&arena)

    // Set context allocator
    context.allocator = mem.arena_allocator(&arena)

    // All downstream code uses context.allocator
    request := parse_http_request(conn)    // uses arena
    response := handle_request(request)     // uses arena
    send_response(conn, response)           // uses arena

    // Arena is freed here - everything cleaned up automatically
}

Benefits:

  • No individual free() calls needed
  • No errdefer cleanup
  • No use-after-free bugs
  • No memory leaks from forgotten frees
  • Predictable performance (no GC pauses)

2. Default Allocator (Long-Lived Data)

The default allocator (typically context.allocator at program start) is used for:

  • Table metadata
  • Table locks (sync.RW_Mutex)
  • Engine state
  • Items returned from storage layer (copied to request arena when needed)

Storage Format

Binary Keys (Varint-Prefixed Segments)

All keys use varint length prefixes for space efficiency:

Meta key:  [0x01][len][table_name]
Data key:  [0x02][len][table_name][len][pk_value][len][sk_value]?
GSI key:   [0x03][len][table_name][len][index_name][len][gsi_pk][len][gsi_sk]?
LSI key:   [0x04][len][table_name][len][index_name][len][pk][len][lsi_sk]

Example Data Key:

Table: "Users"
PK: "user:123"
SK: "profile"

Encoded:
[0x02]          // Entity type (Data)
[0x05]          // Table name length (5)
Users           // Table name bytes
[0x08]          // PK length (8)
user:123        // PK bytes
[0x07]          // SK length (7)
profile         // SK bytes
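For illustration, the same encoding can be sketched in Python. This is not JormunDB code; the varint and data_key helpers are illustrative stand-ins that reproduce the worked example above (single-byte varints suffice for such short segments):

```python
def varint(n: int) -> bytes:
    """LEB128-style varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def data_key(table: str, pk: str, sk: str = None) -> bytes:
    """[0x02][len][table][len][pk][len][sk]? with varint length prefixes."""
    key = bytearray([0x02])  # entity type tag (Data)
    segments = [table, pk] + ([sk] if sk is not None else [])
    for segment in segments:
        raw = segment.encode()
        key += varint(len(raw)) + raw
    return bytes(key)

# Matches the worked example: 0x02, 0x05 "Users", 0x08 "user:123", 0x07 "profile"
assert data_key("Users", "user:123", "profile") == \
    b"\x02\x05Users\x08user:123\x07profile"
```

The optional sort-key segment is simply omitted when absent, so a partition-key-only item encodes as [0x02][len][table][len][pk].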

Item Encoding (TLV Format)

Items use Tag-Length-Value encoding for space efficiency:

Format:
[attr_count:varint]
  [name_len:varint][name:bytes][type_tag:u8][value_len:varint][value:bytes]...

Type Tags:
  String  = 0x01    Number = 0x02    Binary = 0x03
  Bool    = 0x04    Null   = 0x05
  SS      = 0x10    NS     = 0x11    BS     = 0x12
  List    = 0x20    Map    = 0x21

Example Item:

{
  "id": {"S": "user123"},
  "age": {"N": "30"}
}

Encoded as:

[0x02]              // 2 attributes
  [0x02]            // name length (2)
  id                // name bytes
  [0x01]            // type tag (String)
  [0x07]            // value length (7)
  user123           // value bytes

  [0x03]            // name length (3)
  age               // name bytes
  [0x02]            // type tag (Number)
  [0x02]            // value length (2)
  30                // value bytes (stored as string)
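The TLV layout above can likewise be sketched in Python. Again this is an illustrative model, not JormunDB's implementation; encode_item is a hypothetical helper that reproduces the encoded bytes of the example item:

```python
STRING, NUMBER = 0x01, 0x02  # type tags from the table above

def varint(n: int) -> bytes:
    """LEB128-style varint, as used for all length prefixes."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_item(attrs) -> bytes:
    """attrs: list of (name, type_tag, value_bytes) tuples."""
    buf = bytearray(varint(len(attrs)))          # attribute count
    for name, tag, value in attrs:
        raw = name.encode()
        buf += varint(len(raw)) + raw            # [name_len][name]
        buf.append(tag)                          # [type_tag]
        buf += varint(len(value)) + value        # [value_len][value]
    return bytes(buf)

# Matches the worked example: numbers are stored as their string form
encoded = encode_item([("id", STRING, b"user123"), ("age", NUMBER, b"30")])
assert encoded == b"\x02\x02id\x01\x07user123\x03age\x02\x0230"
```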

Request Flow

1. HTTP POST / arrives
   ↓
2. Create arena allocator (4MB)
   Set context.allocator = arena_allocator
   ↓
3. Parse HTTP headers
   Extract X-Amz-Target → Operation
   ↓
4. Parse JSON body
   Convert DynamoDB JSON → internal types
   ↓
5. Route to handler (e.g., handle_put_item)
   ↓
6. Storage engine operation
   - Build binary key
   - Encode item to TLV
   - RocksDB put/get/delete
   ↓
7. Build response
   - Serialize item to DynamoDB JSON
   - Format HTTP response
   ↓
8. Send response
   ↓
9. Destroy arena
   All request memory freed automatically

Concurrency Model

Table-Level RW Locks

Each table has a reader-writer lock:

Storage_Engine :: struct {
    db:                 rocksdb.DB,
    table_locks:        map[string]^sync.RW_Mutex,
    table_locks_mutex:  sync.Mutex,
}

Read Operations (GetItem, Query, Scan):

  • Acquire shared lock
  • Multiple readers can run concurrently
  • Writers are blocked

Write Operations (PutItem, DeleteItem, UpdateItem):

  • Acquire exclusive lock
  • Only one writer at a time
  • All readers are blocked
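The lazy creation of per-table locks, with the lock map itself guarded by table_locks_mutex, can be sketched in Python. The class and method names are hypothetical, chosen only to mirror the Storage_Engine struct above:

```python
import threading

class StorageEngine:
    """Sketch of the per-table lock registry (mirrors Storage_Engine)."""
    def __init__(self):
        self.table_locks = {}                       # map[string]^sync.RW_Mutex
        self.table_locks_mutex = threading.Lock()   # guards the map itself

    def lock_for(self, table: str):
        # Lazily create the table's lock; the map is protected by its
        # own mutex so concurrent first-touches of a table are safe.
        with self.table_locks_mutex:
            if table not in self.table_locks:
                # Python's stdlib has no RW lock; a plain Lock stands in
                # for sync.RW_Mutex in this sketch.
                self.table_locks[table] = threading.Lock()
            return self.table_locks[table]

engine = StorageEngine()
lock = engine.lock_for("Users")
assert engine.lock_for("Users") is lock  # same lock on every lookup
```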

Thread Safety

  • RocksDB database handles are safe for concurrent use from multiple threads
  • Table metadata is protected by locks
  • Request arenas are thread-local (no sharing)

Error Handling

Errors in Odin are explicit return values; or_return propagates them up the call stack without exceptions:

// Odin error handling
parse_json :: proc(data: []byte) -> (Item, bool) {
    parsed := json.parse(data) or_return
    item := json_to_item(parsed) or_return
    return item, true
}

// Usage
item, ok := parse_json(request.body)
if !ok {
    return error_response(.ValidationException, "Invalid JSON")
}

No exceptions, no panic-recover patterns. Every error path is explicit.

DynamoDB Wire Protocol

Request Format

POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.PutItem
Content-Type: application/x-amz-json-1.0

{
  "TableName": "Users",
  "Item": {
    "id": {"S": "user123"},
    "name": {"S": "Alice"}
  }
}
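For reference, the headers and body above can be assembled with only the standard library; build_put_item_request is a hypothetical helper for illustration, not part of JormunDB:

```python
import json

def build_put_item_request(table, item):
    """Build the headers and JSON body for a DynamoDB-style PutItem call."""
    headers = {
        "X-Amz-Target": "DynamoDB_20120810.PutItem",
        "Content-Type": "application/x-amz-json-1.0",
    }
    body = json.dumps({"TableName": table, "Item": item})
    return headers, body

headers, body = build_put_item_request(
    "Users", {"id": {"S": "user123"}, "name": {"S": "Alice"}})
assert headers["X-Amz-Target"] == "DynamoDB_20120810.PutItem"
assert json.loads(body)["TableName"] == "Users"
```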

Response Format

HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.0
x-amzn-RequestId: local-request-id

{}

Error Format

{
  "__type": "com.amazonaws.dynamodb.v20120810#ResourceNotFoundException",
  "message": "Table not found"
}

Performance Characteristics

Time Complexity

Operation    Complexity      Notes
PutItem      O(log n)        RocksDB LSM tree insert
GetItem      O(log n)        RocksDB point lookup
DeleteItem   O(log n)        RocksDB deletion
Query        O(log n + m)    n = items in table, m = result set size
Scan         O(n)            Full table scan

Space Complexity

  • Binary keys: ~20-100 bytes (vs 50-200 bytes JSON)
  • Binary items: ~30% smaller than JSON
  • Varint encoding saves space on small integers

Benchmarks (Expected)

Based on Zig version performance:

Operation          Throughput      Latency (p50)
PutItem            ~5,000/sec      ~0.2ms
GetItem            ~7,000/sec      ~0.14ms
Query (1 item)     ~8,000/sec      ~0.12ms
Scan (1000 items)  ~20/sec         ~50ms

Future Enhancements

Planned Features

  1. UpdateExpression - SET/REMOVE/ADD/DELETE operations
  2. FilterExpression - Post-query filtering
  3. ProjectionExpression - Return subset of attributes
  4. Global Secondary Indexes - Query by non-key attributes
  5. Local Secondary Indexes - Alternate sort keys
  6. BatchWriteItem - Batch mutations
  7. BatchGetItem - Batch reads
  8. Transactions - ACID multi-item operations

Optimization Opportunities

  1. Connection pooling - Reuse HTTP connections
  2. Bloom filters - Faster negative lookups
  3. Compression - LZ4/Zstd on large items
  4. Caching layer - Hot item cache
  5. Parallel scan - Segment-based scanning

Debugging

Enable Verbose Logging

make run VERBOSE=1

Inspect RocksDB

# Use ldb tool to inspect database
ldb --db=./data scan
ldb --db=./data get <key_hex>

Memory Profiling

Odin's tracking allocator can detect leaks:

when ODIN_DEBUG {
    track: mem.Tracking_Allocator
    mem.tracking_allocator_init(&track, context.allocator)
    context.allocator = mem.tracking_allocator(&track)

    defer {
        for _, entry in track.allocation_map {
            fmt.printfln("Leaked %v bytes at %v", entry.size, entry.location)
        }
        mem.tracking_allocator_destroy(&track)
    }
}

Migration from Zig Version

The Zig version (ZynamoDB) used the same binary storage format, so existing RocksDB databases can be read by JormunDB without migration.

Compatibility

  • Binary key format (byte-compatible)
  • Binary item format (byte-compatible)
  • Table metadata (JSON, compatible)
  • HTTP wire protocol (identical)

Breaking Changes

None - JormunDB can open ZynamoDB databases directly.


Contributing

When contributing to JormunDB:

  1. Use the context allocator - All request-scoped allocations should use context.allocator
  2. Avoid manual frees - Let the arena handle it
  3. Long-lived data - Use the default allocator explicitly
  4. Test thoroughly - Run make test before committing
  5. Format code - Run make fmt before committing
