2026-02-15 08:55:22 -05:00
## JormunDB Architecture
This document explains the internal architecture of JormunDB, including design decisions, storage formats, and the arena-per-request memory management pattern.
## Table of Contents
- [Overview](#overview)
- [Why Odin?](#why-odin)
- [Memory Management](#memory-management)
- [Storage Format](#storage-format)
- [Module Structure](#module-structure)
- [Request Flow](#request-flow)
- [Concurrency Model](#concurrency-model)
- [Error Handling](#error-handling)
- [DynamoDB Wire Protocol](#dynamodb-wire-protocol)
- [Performance Characteristics](#performance-characteristics)
- [Future Enhancements](#future-enhancements)
- [Debugging](#debugging)
- [Migration from Zig Version](#migration-from-zig-version)
- [Contributing](#contributing)
- [References](#references)
## Overview
JormunDB is a DynamoDB-compatible database server that speaks the DynamoDB wire protocol. It uses RocksDB for persistent storage and is written in Odin, whose implicit context allocator keeps request-scoped memory management simple.
### Key Design Goals
1. **Zero allocation ceremony** - No explicit `defer free()` or error handling for every allocation
2. **Binary storage** - Efficient TLV encoding instead of JSON
3. **API compatibility** - Drop-in replacement for DynamoDB Local
4. **Performance** - RocksDB-backed with efficient key encoding
## Why Odin?
The original implementation in Zig suffered from explicit allocator threading:
```zig
// Zig version - explicit allocator everywhere
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) !Response {
    const parsed = try parseJson(allocator, request);
    defer parsed.deinit(allocator);
    const item = try storage.getItem(allocator, parsed.table_name, parsed.key);
    defer if (item) |i| freeItem(allocator, i);
    const response = try serializeResponse(allocator, item);
    defer allocator.free(response);
    return response; // Wait, we deferred the free!
}
```
Odin's context allocator system eliminates this:
```odin
// Odin version - implicit context allocator
handle_request :: proc(request: []byte) -> Response {
    // All allocations use context.allocator automatically
    parsed := parse_json(request)
    item := storage_get_item(parsed.table_name, parsed.key)
    response := serialize_response(item)
    return response
    // Everything freed when arena is destroyed
}
```
## Memory Management
JormunDB uses a two-allocator strategy:
### 1. Arena Allocator (Request-Scoped)
Every HTTP request gets its own arena:
```odin
handle_connection :: proc(conn: net.TCP_Socket) {
    // Back the arena with a 4MB buffer from the long-lived allocator.
    // mem.Arena does not own its buffer, so the buffer is freed explicitly.
    parent_allocator := context.allocator
    backing := make([]byte, 4 * mem.Megabyte, parent_allocator)
    defer delete(backing, parent_allocator)

    // Create arena for this request
    arena: mem.Arena
    mem.arena_init(&arena, backing)

    // Set context allocator
    context.allocator = mem.arena_allocator(&arena)

    // All downstream code uses context.allocator
    request := parse_http_request(conn) // uses arena
    response := handle_request(request) // uses arena
    send_response(conn, response)       // uses arena

    // The backing buffer is freed here - everything cleaned up automatically
}
```
**Benefits:**
- No individual `free()` calls needed
- No `errdefer` cleanup
- No use-after-free bugs within a request, as long as no pointer into the arena outlives it
- No memory leaks from forgotten frees
- Predictable performance (no GC pauses)
### 2. Default Allocator (Long-Lived Data)
The default allocator (typically `context.allocator` at program start) is used for:
- Table metadata
- Table locks (sync.RW_Mutex)
- Engine state
- Items returned from storage layer (copied to request arena when needed)
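
The split between the two allocators can be illustrated with a minimal sketch. It assumes a hypothetical `register_table` helper and a small stack buffer standing in for the 4MB request arena; neither is taken from the JormunDB source.

```odin
package main

import "core:fmt"
import "core:mem"
import "core:strings"

// Hypothetical helper: store a table name so that it outlives the request.
// The name is cloned with the long-lived allocator, not context.allocator.
register_table :: proc(names: ^[dynamic]string, name: string, long_lived: mem.Allocator) {
    append(names, strings.clone(name, long_lived))
}

main :: proc() {
    long_lived := context.allocator // default allocator at program start

    names := make([dynamic]string, 0, 4, long_lived)
    defer {
        for n in names do delete(n, long_lived)
        delete(names)
    }

    {
        // Request scope: a small stack buffer stands in for the 4MB arena.
        buf: [1024]byte
        arena: mem.Arena
        mem.arena_init(&arena, buf[:])
        context.allocator = mem.arena_allocator(&arena)

        // Request-scoped string, allocated from the arena.
        requested := strings.concatenate({"Us", "ers"})
        register_table(&names, requested, long_lived)
    } // the arena and everything allocated from it end here

    // The clone lives in the long-lived allocator, so it is still valid.
    assert(names[0] == "Users")
    fmt.println(names[0])
}
```

The key point is that anything kept beyond the request has to be copied out of the arena before the request ends.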
## Storage Format
### Binary Keys (Varint-Prefixed Segments)
All keys use varint length prefixes for space efficiency:
```
Meta key: [0x01][len][table_name]
Data key: [0x02][len][table_name][len][pk_value][len][sk_value]?
GSI key: [0x03][len][table_name][len][index_name][len][gsi_pk][len][gsi_sk]?
LSI key: [0x04][len][table_name][len][index_name][len][pk][len][lsi_sk]
```
**Example Data Key:**
```
Table: "Users"
PK:    "user:123"
SK:    "profile"

Encoded:
[0x02]      // Entity type (Data)
[0x05]      // Table name length (5)
Users       // Table name bytes
[0x08]      // PK length (8)
user:123    // PK bytes
[0x07]      // SK length (7)
profile     // SK bytes
```
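
As a rough sketch of how such a key could be assembled: the code below assumes the varint is the unsigned LEB128 form used by Protocol Buffers, and `append_varint` and `build_key` are hypothetical names rather than the actual `key_codec` API.

```odin
package main

import "core:fmt"

// Append an unsigned LEB128 varint: 7 bits per byte, high bit set
// while more bytes follow. (Assumed encoding, see lead-in.)
append_varint :: proc(buf: ^[dynamic]byte, value: u64) {
    v := value
    for v >= 0x80 {
        append(buf, byte(v & 0x7f) | 0x80)
        v >>= 7
    }
    append(buf, byte(v))
}

// Hypothetical helper: entity tag followed by length-prefixed segments.
build_key :: proc(entity_tag: byte, segments: ..string) -> []byte {
    buf := make([dynamic]byte)
    append(&buf, entity_tag)
    for segment in segments {
        append_varint(&buf, u64(len(segment)))
        append(&buf, ..transmute([]byte)segment)
    }
    return buf[:]
}

main :: proc() {
    // Same inputs as the example data key above.
    key := build_key(0x02, "Users", "user:123", "profile")
    defer delete(key)

    // 1 tag byte + (1+5) + (1+8) + (1+7) = 24 bytes
    assert(len(key) == 24)
    assert(key[0] == 0x02 && key[1] == 0x05 && key[7] == 0x08 && key[16] == 0x07)
    fmt.println(key)
}
```

Because the table name comes first, all keys of one table share a common byte prefix.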
### Item Encoding (TLV Format)
Items use Tag-Length-Value encoding for space efficiency:
```
Format:
  [attr_count:varint]
  [name_len:varint][name:bytes][type_tag:u8][value_len:varint][value:bytes]...

Type Tags:
  String = 0x01   Number = 0x02   Binary = 0x03
  Bool   = 0x04   Null   = 0x05
  SS     = 0x10   NS     = 0x11   BS     = 0x12
  List   = 0x20   Map    = 0x21
```
**Example Item:**
```json
{
  "id": {"S": "user123"},
  "age": {"N": "30"}
}
```
Encoded as:
```
[0x02]      // 2 attributes
[0x02]      // name length (2)
id          // name bytes
[0x01]      // type tag (String)
[0x07]      // value length (7)
user123     // value bytes
[0x03]      // name length (3)
age         // name bytes
[0x02]      // type tag (Number)
[0x02]      // value length (2)
30          // value bytes (stored as string)
```
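
The layout above can be reproduced with a short sketch. It covers only the String and Number tags, assumes LEB128 varints, and uses hypothetical names (`Attribute`, `encode_item`) that are not the real `item_codec` API.

```odin
package main

import "core:fmt"

// Subset of the type tags listed above.
Type_Tag :: enum u8 {
    String = 0x01,
    Number = 0x02,
}

// Hypothetical flat attribute; sets, lists and maps are left out.
Attribute :: struct {
    name:  string,
    tag:   Type_Tag,
    value: string, // numbers are stored as their string form
}

append_varint :: proc(buf: ^[dynamic]byte, value: u64) {
    v := value
    for v >= 0x80 {
        append(buf, byte(v & 0x7f) | 0x80)
        v >>= 7
    }
    append(buf, byte(v))
}

// [attr_count] then, per attribute: [name_len][name][type_tag][value_len][value]
encode_item :: proc(attrs: []Attribute) -> []byte {
    buf := make([dynamic]byte)
    append_varint(&buf, u64(len(attrs)))
    for attr in attrs {
        append_varint(&buf, u64(len(attr.name)))
        append(&buf, ..transmute([]byte)attr.name)
        append(&buf, u8(attr.tag))
        append_varint(&buf, u64(len(attr.value)))
        append(&buf, ..transmute([]byte)attr.value)
    }
    return buf[:]
}

main :: proc() {
    item := []Attribute{
        {"id", .String, "user123"},
        {"age", .Number, "30"},
    }
    encoded := encode_item(item)
    defer delete(encoded)

    // 1 count byte + 12 bytes for "id" + 8 bytes for "age" = 21 bytes
    assert(len(encoded) == 21)
    assert(encoded[4] == 0x01 && encoded[17] == 0x02)
    fmt.println(encoded)
}
```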
## Module Structure
```
jormundb/
├── main.odin              # Entry point, HTTP server
├── rocksdb/               # RocksDB C FFI bindings
│   └── rocksdb.odin       # db_open, db_put, db_get, etc.
├── dynamodb/              # DynamoDB protocol implementation
│   ├── types.odin         # Core types (Attribute_Value, Item, Key, etc.)
│   ├── json.odin          # DynamoDB JSON parsing/serialization
│   ├── storage.odin       # Storage engine (CRUD, scan, query)
│   └── handler.odin       # HTTP request handlers
├── key_codec/             # Binary key encoding
│   └── key_codec.odin     # build_data_key, decode_data_key, etc.
└── item_codec/            # Binary TLV item encoding
    └── item_codec.odin    # encode, decode
```
## Request Flow
```
1. HTTP POST / arrives
2. Create arena allocator (4MB)
   Set context.allocator = arena_allocator
3. Parse HTTP headers
   Extract X-Amz-Target → Operation
4. Parse JSON body
   Convert DynamoDB JSON → internal types
5. Route to handler (e.g., handle_put_item)
6. Storage engine operation
   - Build binary key
   - Encode item to TLV
   - RocksDB put/get/delete
7. Build response
   - Serialize item to DynamoDB JSON
   - Format HTTP response
8. Send response
9. Destroy arena
   All request memory freed automatically
```
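
Steps 3 and 5 could look roughly like the sketch below. `operation_from_target` and the `handle_get_item` route are hypothetical names, and the real handler may parse the header differently.

```odin
package main

import "core:fmt"
import "core:strings"

// Service prefix used in the X-Amz-Target header.
TARGET_PREFIX :: "DynamoDB_20120810."

// Hypothetical helper: "DynamoDB_20120810.PutItem" -> "PutItem".
operation_from_target :: proc(target: string) -> (operation: string, ok: bool) {
    if !strings.has_prefix(target, TARGET_PREFIX) {
        return "", false
    }
    return target[len(TARGET_PREFIX):], true
}

main :: proc() {
    op, ok := operation_from_target("DynamoDB_20120810.PutItem")
    assert(ok && op == "PutItem")

    _, other_ok := operation_from_target("SomethingElse.PutItem")
    assert(!other_ok)

    // Routing on the operation name.
    switch op {
    case "PutItem":
        fmt.println("route to handle_put_item")
    case "GetItem":
        fmt.println("route to handle_get_item")
    case:
        fmt.println("unknown operation")
    }
}
```

The returned operation is a slice of the header string, so no allocation is needed for this step.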
## Concurrency Model
### Table-Level RW Locks
Each table has a reader-writer lock:
```odin
Storage_Engine :: struct {
    db:                rocksdb.DB,
    table_locks:       map[string]^sync.RW_Mutex,
    table_locks_mutex: sync.Mutex,
}
```
**Read Operations** (GetItem, Query, Scan):
- Acquire shared lock
- Multiple readers can run concurrently
- Writers are blocked
**Write Operations** (PutItem, DeleteItem, UpdateItem):
- Acquire exclusive lock
- Only one writer at a time
- All readers are blocked
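
A minimal sketch of the locking pattern follows. It omits the `db` field so that it stands alone, `get_table_lock` is a hypothetical helper, and everything runs on a single thread purely to show the call sequence.

```odin
package main

import "core:fmt"
import "core:strings"
import "core:sync"

// Reduced version of the engine struct (no RocksDB handle).
Storage_Engine :: struct {
    table_locks:       map[string]^sync.RW_Mutex,
    table_locks_mutex: sync.Mutex,
}

// Hypothetical helper: find or create the lock for a table.
// The map itself is guarded by table_locks_mutex.
get_table_lock :: proc(engine: ^Storage_Engine, table: string) -> ^sync.RW_Mutex {
    sync.mutex_lock(&engine.table_locks_mutex)
    defer sync.mutex_unlock(&engine.table_locks_mutex)

    if existing, found := engine.table_locks[table]; found {
        return existing
    }
    // Long-lived data: allocated with the default allocator in this sketch.
    lock := new(sync.RW_Mutex)
    engine.table_locks[strings.clone(table)] = lock
    return lock
}

main :: proc() {
    engine: Storage_Engine
    engine.table_locks = make(map[string]^sync.RW_Mutex)

    lock := get_table_lock(&engine, "Users")

    // Read path (GetItem, Query, Scan): shared lock
    sync.rw_mutex_shared_lock(lock)
    // ... read from storage ...
    sync.rw_mutex_shared_unlock(lock)

    // Write path (PutItem, DeleteItem, UpdateItem): exclusive lock
    sync.rw_mutex_lock(lock)
    // ... write to storage ...
    sync.rw_mutex_unlock(lock)

    // The same table always maps to the same lock.
    assert(get_table_lock(&engine, "Users") == lock)
    assert(get_table_lock(&engine, "Orders") != lock)
    fmt.println("tables with locks:", len(engine.table_locks))
}
```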
### Thread Safety
- The RocksDB database handle can be shared across threads; RocksDB synchronizes concurrent reads and writes internally
- Table metadata is protected by locks
- Request arenas are thread-local (no sharing)
## Error Handling
Odin treats errors as ordinary return values, and `or_return` propagates a failure to the caller:
```odin
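// Odin error handling
// or_return requires named results when a procedure returns several values
parse_json :: proc(data: []byte) -> (item: Item, ok: bool) {
    parsed, err := json.parse(data)
    if err != .None {
        return // item is zero-valued, ok is false
    }
    item = json_to_item(parsed) or_return
    return item, true
}

// Usage
item, ok := parse_json(request.body)
if !ok {
    return error_response(.ValidationException, "Invalid JSON")
}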
```
No exceptions, no panic-recover patterns. Every error path is explicit.
## DynamoDB Wire Protocol
### Request Format
```
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.PutItem
Content-Type: application/x-amz-json-1.0

{
  "TableName": "Users",
  "Item": {
    "id": {"S": "user123"},
    "name": {"S": "Alice"}
  }
}
```
### Response Format
```
HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.0
x-amzn-RequestId: local-request-id

{}
```
### Error Format
```json
{
  "__type": "com.amazonaws.dynamodb.v20120810#ResourceNotFoundException",
  "message": "Table not found"
}
```
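
An error body of this shape could be built along the following lines. This is a sketch under assumptions: `Error_Type`, `error_name` and `error_body` are hypothetical names, and the message is not JSON-escaped, which a real implementation would need to do.

```odin
package main

import "core:fmt"
import "core:strings"

// Hypothetical subset of the error types the server can return.
Error_Type :: enum {
    ValidationException,
    ResourceNotFoundException,
}

error_name :: proc(kind: Error_Type) -> string {
    switch kind {
    case .ValidationException:
        return "ValidationException"
    case .ResourceNotFoundException:
        return "ResourceNotFoundException"
    }
    return "InternalServerError"
}

// Build the JSON error body. The message is inserted as-is,
// so it must not contain quotes or control characters.
error_body :: proc(kind: Error_Type, message: string) -> string {
    return strings.concatenate({
        `{"__type":"com.amazonaws.dynamodb.v20120810#`,
        error_name(kind),
        `","message":"`,
        message,
        `"}`,
    })
}

main :: proc() {
    body := error_body(.ResourceNotFoundException, "Table not found")
    defer delete(body)

    assert(body == `{"__type":"com.amazonaws.dynamodb.v20120810#ResourceNotFoundException","message":"Table not found"}`)
    fmt.println(body)
}
```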
## Performance Characteristics
### Time Complexity
| Operation | Complexity | Notes |
|-----------|-----------|-------|
| PutItem | O(log n) | RocksDB LSM tree insert |
| GetItem | O(log n) | RocksDB point lookup |
| DeleteItem | O(log n) | RocksDB deletion |
| Query | O(log n + m) | n = items in table, m = result set |
| Scan | O(n) | Full table scan |
### Space Complexity
- Binary keys: ~20-100 bytes (vs 50-200 bytes JSON)
- Binary items: ~30% smaller than JSON
- Varint encoding saves space on small integers
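
To make the varint saving concrete, the sketch below counts the bytes a length prefix needs, assuming the LEB128-style varint described in the reference linked at the end of this document. `varint_size` is a hypothetical helper.

```odin
package main

import "core:fmt"

// Number of bytes an unsigned LEB128 varint needs for a value:
// one byte per started group of 7 bits.
varint_size :: proc(value: u64) -> int {
    n := 1
    v := value
    for v >= 0x80 {
        v >>= 7
        n += 1
    }
    return n
}

main :: proc() {
    // Lengths below 128 fit in a single byte, which covers
    // typical table names and key values.
    assert(varint_size(0) == 1)
    assert(varint_size(127) == 1)
    assert(varint_size(128) == 2)
    assert(varint_size(16383) == 2)
    assert(varint_size(16384) == 3)

    fmt.println("prefix bytes for a 300-byte value:", varint_size(300))
}
```

A fixed 4-byte length prefix would cost three more bytes per segment for any value shorter than 128 bytes.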
### Benchmarks (Expected)
These figures are estimates based on the performance of the Zig version, not measurements of JormunDB:
```
Operation Throughput Latency (p50)
PutItem ~5,000/sec ~0.2ms
GetItem ~7,000/sec ~0.14ms
Query (1 item) ~8,000/sec ~0.12ms
Scan (1000 items) ~20/sec ~50ms
```
## Future Enhancements
### Planned Features
1. **UpdateExpression** - SET/REMOVE/ADD/DELETE operations
2. **FilterExpression** - Post-query filtering
3. **ProjectionExpression** - Return subset of attributes
4. **Global Secondary Indexes** - Query by non-key attributes
5. **Local Secondary Indexes** - Alternate sort keys
6. **BatchWriteItem** - Batch mutations
7. **BatchGetItem** - Batch reads
8. **Transactions** - ACID multi-item operations
### Optimization Opportunities
1. **Connection pooling** - Reuse HTTP connections
2. **Bloom filters** - Faster negative lookups
3. **Compression** - LZ4/Zstd on large items
4. **Caching layer** - Hot item cache
5. **Parallel scan** - Segment-based scanning
## Debugging
### Enable Verbose Logging
```bash
make run VERBOSE=1
```
### Inspect RocksDB
```bash
# Use ldb tool to inspect database
ldb --db=./data scan
ldb --db=./data get <key_hex>
```
### Memory Profiling
Odin's tracking allocator can detect leaks:
```odin
when ODIN_DEBUG {
    track: mem.Tracking_Allocator
    mem.tracking_allocator_init(&track, context.allocator)
    context.allocator = mem.tracking_allocator(&track)
    defer {
        // leak.location is a source location, so print it with %v
        for _, leak in track.allocation_map {
            fmt.printfln("Leaked %d bytes at %v", leak.size, leak.location)
        }
        mem.tracking_allocator_destroy(&track)
    }
}
```
## Migration from Zig Version
The Zig version (ZynamoDB) used the same binary storage format, so existing RocksDB databases can be read by JormunDB without migration.
### Compatibility
- ✅ Binary key format (byte-compatible)
- ✅ Binary item format (byte-compatible)
- ✅ Table metadata (JSON, compatible)
- ✅ HTTP wire protocol (identical)
### Breaking Changes
None - JormunDB can open ZynamoDB databases directly.
---
## Contributing
When contributing to JormunDB:
1. **Use the context allocator** - All request-scoped allocations should use `context.allocator`
2. **Avoid manual frees** - Let the arena handle it
3. **Long-lived data** - Use the default allocator explicitly
4. **Test thoroughly** - Run `make test` before committing
5. **Format code** - Run `make fmt` before committing
## References
- [Odin Language](https://odin-lang.org/)
- [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [DynamoDB API Reference](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/)
- [Varint Encoding](https://developers.google.com/protocol-buffers/docs/encoding#varints)