# JormunDB Architecture

> **Warning:** This document is no longer entirely accurate. Ignore it, or update it with accurate information.

This document explains the internal architecture of JormunDB, including design decisions, storage formats, and the arena-per-request memory management pattern.

## Table of Contents

- [Overview](#overview)
- [Why Odin?](#why-odin)
- [Memory Management](#memory-management)
- [Storage Format](#storage-format)
- [Request Flow](#request-flow)
- [Concurrency Model](#concurrency-model)
- [Error Handling](#error-handling)
- [DynamoDB Wire Protocol](#dynamodb-wire-protocol)
- [Performance Characteristics](#performance-characteristics)
- [Future Enhancements](#future-enhancements)
- [Debugging](#debugging)
- [Migration from Zig Version](#migration-from-zig-version)
- [Contributing](#contributing)

## Overview

JormunDB is a DynamoDB-compatible database server that speaks the DynamoDB wire protocol. It uses RocksDB for persistent storage and is written in Odin, whose context-allocator system keeps memory management simple.

### Key Design Goals

1. **Zero allocation ceremony** - No explicit `defer free()` or error handling for every allocation
2. **Binary storage** - Efficient TLV encoding instead of JSON
3. **API compatibility** - Drop-in replacement for DynamoDB
4. **Performance** - RocksDB-backed with efficient key encoding

## Why Odin?

The original implementation in Zig suffered from explicit allocator threading:

```zig
// Zig version - explicit allocator everywhere
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) !Response {
    const parsed = try parseJson(allocator, request);
    defer parsed.deinit(allocator);

    const item = try storage.getItem(allocator, parsed.table_name, parsed.key);
    defer if (item) |i| freeItem(allocator, i);

    const response = try serializeResponse(allocator, item);
    defer allocator.free(response);

    return response; // Wait, we deferred the free!
}
```

Odin's context allocator system eliminates this:

```odin
// Odin version - implicit context allocator
handle_request :: proc(request: []byte) -> Response {
    // All allocations use context.allocator automatically
    parsed := parse_json(request)
    item := storage_get_item(parsed.table_name, parsed.key)
    response := serialize_response(item)
    return response
    // Everything freed when arena is destroyed
}
```

## Memory Management

JormunDB uses a two-allocator strategy:

### 1. Arena Allocator (Request-Scoped)

Every HTTP request gets its own arena:

```odin
handle_connection :: proc(conn: net.TCP_Socket) {
    // Create arena for this request (4MB)
    arena: mem.Arena
    mem.arena_init(&arena, make([]byte, mem.Megabyte * 4))
    defer mem.arena_destroy(&arena)

    // Set context allocator
    context.allocator = mem.arena_allocator(&arena)

    // All downstream code uses context.allocator
    request := parse_http_request(conn)  // uses arena
    response := handle_request(request)  // uses arena
    send_response(conn, response)        // uses arena

    // Arena is freed here - everything cleaned up automatically
}
```

**Benefits:**

- No individual `free()` calls needed
- No `errdefer` cleanup
- No use-after-free bugs
- No memory leaks from forgotten frees
- Predictable performance (no GC pauses)

### 2. Default Allocator (Long-Lived Data)

The default allocator (typically `context.allocator` at program start) is used for:

- Table metadata
- Table locks (`sync.RW_Mutex`)
- Engine state
- Items returned from the storage layer (copied to the request arena when needed)

## Storage Format

### Binary Keys (Varint-Prefixed Segments)

All keys use varint length prefixes for space efficiency:

```
Meta key: [0x01][len][table_name]
Data key: [0x02][len][table_name][len][pk_value][len][sk_value]?
GSI key:  [0x03][len][table_name][len][index_name][len][gsi_pk][len][gsi_sk]?
LSI key:  [0x04][len][table_name][len][index_name][len][pk][len][lsi_sk]
```

**Example Data Key:**

```
Table: "Users"
PK:    "user:123"
SK:    "profile"

Encoded:
[0x02]     // Entity type (Data)
[0x05]     // Table name length (5)
Users      // Table name bytes
[0x08]     // PK length (8)
user:123   // PK bytes
[0x07]     // SK length (7)
profile    // SK bytes
```

### Item Encoding (TLV Format)

Items use Tag-Length-Value encoding for space efficiency:

```
Format: [attr_count:varint] [name_len:varint][name:bytes][type_tag:u8][value_len:varint][value:bytes]...

Type Tags:
String = 0x01
Number = 0x02
Binary = 0x03
Bool   = 0x04
Null   = 0x05
SS     = 0x10
NS     = 0x11
BS     = 0x12
List   = 0x20
Map    = 0x21
```

**Example Item:**

```json
{
  "id": {"S": "user123"},
  "age": {"N": "30"}
}
```

Encoded as:

```
[0x02]    // 2 attributes
[0x02]    // name length (2)
id        // name bytes
[0x01]    // type tag (String)
[0x07]    // value length (7)
user123   // value bytes
[0x03]    // name length (3)
age       // name bytes
[0x02]    // type tag (Number)
[0x02]    // value length (2)
30        // value bytes (stored as string)
```

## Request Flow

```
1. HTTP POST / arrives
       ↓
2. Create arena allocator (4MB)
   Set context.allocator = arena_allocator
       ↓
3. Parse HTTP headers
   Extract X-Amz-Target → Operation
       ↓
4. Parse JSON body
   Convert DynamoDB JSON → internal types
       ↓
5. Route to handler (e.g., handle_put_item)
       ↓
6. Storage engine operation
   - Build binary key
   - Encode item to TLV
   - RocksDB put/get/delete
       ↓
7. Build response
   - Serialize item to DynamoDB JSON
   - Format HTTP response
       ↓
8. Send response
       ↓
9.
   Destroy arena
   All request memory freed automatically
```

## Concurrency Model

### Table-Level RW Locks

Each table has a reader-writer lock:

```odin
Storage_Engine :: struct {
    db:                rocksdb.DB,
    table_locks:       map[string]^sync.RW_Mutex,
    table_locks_mutex: sync.Mutex,
}
```

**Read Operations** (GetItem, Query, Scan):

- Acquire shared lock
- Multiple readers can run concurrently
- Writers are blocked

**Write Operations** (PutItem, DeleteItem, UpdateItem):

- Acquire exclusive lock
- Only one writer at a time
- All readers are blocked

### Thread Safety

- RocksDB handles are thread-safe (column family-based)
- Table metadata is protected by locks
- Request arenas are thread-local (no sharing)

## Error Handling

Odin uses explicit error returns via `or_return`:

```odin
// Odin error handling
parse_json :: proc(data: []byte) -> (Item, bool) {
    parsed := json.parse(data) or_return
    item := json_to_item(parsed) or_return
    return item, true
}

// Usage
item, ok := parse_json(request.body)
if !ok {
    return error_response(.ValidationException, "Invalid JSON")
}
```

No exceptions, no panic-recover patterns. Every error path is explicit.
## DynamoDB Wire Protocol

### Request Format

```
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.PutItem
Content-Type: application/x-amz-json-1.0

{
  "TableName": "Users",
  "Item": {
    "id": {"S": "user123"},
    "name": {"S": "Alice"}
  }
}
```

### Response Format

```
HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.0
x-amzn-RequestId: local-request-id

{}
```

### Error Format

```json
{
  "__type": "com.amazonaws.dynamodb.v20120810#ResourceNotFoundException",
  "message": "Table not found"
}
```

## Performance Characteristics

### Time Complexity

| Operation | Complexity | Notes |
|-----------|------------|-------|
| PutItem | O(log n) | RocksDB LSM tree insert |
| GetItem | O(log n) | RocksDB point lookup |
| DeleteItem | O(log n) | RocksDB deletion |
| Query | O(log n + m) | n = items in table, m = result set |
| Scan | O(n) | Full table scan |

### Space Complexity

- Binary keys: ~20-100 bytes (vs 50-200 bytes JSON)
- Binary items: ~30% smaller than JSON
- Varint encoding saves space on small integers

### Benchmarks (Expected)

Based on Zig version performance:

```
Operation           Throughput    Latency (p50)
PutItem             ~5,000/sec    ~0.2ms
GetItem             ~7,000/sec    ~0.14ms
Query (1 item)      ~8,000/sec    ~0.12ms
Scan (1000 items)   ~20/sec       ~50ms
```

## Future Enhancements

### Planned Features

1. **UpdateExpression** - SET/REMOVE/ADD/DELETE operations
2. **FilterExpression** - Post-query filtering
3. **ProjectionExpression** - Return a subset of attributes
4. **Global Secondary Indexes** - Query by non-key attributes
5. **Local Secondary Indexes** - Alternate sort keys
6. **BatchWriteItem** - Batch mutations
7. **BatchGetItem** - Batch reads
8. **Transactions** - ACID multi-item operations

### Optimization Opportunities

1. **Connection pooling** - Reuse HTTP connections
2. **Bloom filters** - Faster negative lookups
3. **Compression** - LZ4/Zstd on large items
4. **Caching layer** - Hot item cache
5.
   **Parallel scan** - Segment-based scanning

## Debugging

### Enable Verbose Logging

```bash
make run VERBOSE=1
```

### Inspect RocksDB

```bash
# Use the ldb tool to inspect the database
ldb --db=./data scan
ldb --db=./data get <key>
```

### Memory Profiling

Odin's tracking allocator can detect leaks:

```odin
when ODIN_DEBUG {
    track: mem.Tracking_Allocator
    mem.tracking_allocator_init(&track, context.allocator)
    context.allocator = mem.tracking_allocator(&track)

    defer {
        for _, leak in track.allocation_map {
            fmt.printfln("Leaked %v bytes at %v", leak.size, leak.location)
        }
    }
}
```

## Migration from Zig Version

The Zig version (ZynamoDB) used the same binary storage format, so existing RocksDB databases can be read by JormunDB without migration.

### Compatibility

- ✅ Binary key format (byte-compatible)
- ✅ Binary item format (byte-compatible)
- ✅ Table metadata (JSON, compatible)
- ✅ HTTP wire protocol (identical)

### Breaking Changes

None - JormunDB can open ZynamoDB databases directly.

---

## Contributing

When contributing to JormunDB:

1. **Use the context allocator** - All request-scoped allocations should use `context.allocator`
2. **Avoid manual frees** - Let the arena handle it
3. **Long-lived data** - Use the default allocator explicitly
4. **Test thoroughly** - Run `make test` before committing
5. **Format code** - Run `make fmt` before committing

## References

- [Odin Language](https://odin-lang.org/)
- [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [DynamoDB API Reference](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/)
- [Varint Encoding](https://developers.google.com/protocol-buffers/docs/encoding#varints)