## JormunDB Architecture

This document explains the internal architecture of JormunDB, including design decisions, storage formats, and the arena-per-request memory management pattern.

## Table of Contents

- [Overview](#overview)
- [Why Odin?](#why-odin)
- [Memory Management](#memory-management)
- [Storage Format](#storage-format)
- [Module Structure](#module-structure)
- [Request Flow](#request-flow)
- [Concurrency Model](#concurrency-model)

## Overview

JormunDB is a DynamoDB-compatible database server that speaks the DynamoDB wire protocol. It uses RocksDB for persistent storage and is written in Odin, whose implicit context allocator keeps per-request memory management simple.

### Key Design Goals

1. **Zero allocation ceremony** - No explicit `defer free()` or per-allocation error handling
2. **Binary storage** - Compact TLV encoding instead of JSON
3. **API compatibility** - Drop-in replacement for DynamoDB Local
4. **Performance** - RocksDB-backed with efficient key encoding

## Why Odin?

The original implementation in Zig suffered from explicit allocator threading:

```zig
// Zig version - explicit allocator everywhere
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) !Response {
    const parsed = try parseJson(allocator, request);
    defer parsed.deinit(allocator);

    const item = try storage.getItem(allocator, parsed.table_name, parsed.key);
    defer if (item) |i| freeItem(allocator, i);

    const response = try serializeResponse(allocator, item);
    defer allocator.free(response);

    return response; // Wait, we deferred the free!
}
```

Odin's context allocator system eliminates this:

```odin
// Odin version - implicit context allocator
handle_request :: proc(request: []byte) -> Response {
    // All allocations use context.allocator automatically
    parsed := parse_json(request)
    item := storage_get_item(parsed.table_name, parsed.key)
    response := serialize_response(item)

    return response
    // Everything freed when arena is destroyed
}
```

## Memory Management

JormunDB uses a two-allocator strategy:

### 1. Arena Allocator (Request-Scoped)

Every HTTP request gets its own arena:

```odin
handle_connection :: proc(conn: net.TCP_Socket) {
    // Remember the long-lived allocator before overriding the context
    heap := context.allocator

    // Create a 4MB backing buffer and an arena for this request
    backing := make([]byte, 4 * mem.Megabyte, heap)
    defer delete(backing, heap)

    arena: mem.Arena
    mem.arena_init(&arena, backing)

    // Set context allocator
    context.allocator = mem.arena_allocator(&arena)

    // All downstream code uses context.allocator
    request := parse_http_request(conn) // uses arena
    response := handle_request(request) // uses arena
    send_response(conn, response)       // uses arena

    // The backing buffer is freed here - everything cleaned up at once
}
```

**Benefits:**
- No individual `free()` calls needed
- No `errdefer` cleanup
- No use-after-free within a request, provided no pointer into the arena outlives it
- No memory leaks from forgotten frees
- Predictable performance (cheap bump allocation, one bulk free, no GC pauses)

### 2. Default Allocator (Long-Lived Data)

The default allocator (typically `context.allocator` at program start) is used for:

- Table metadata
- Table locks (`sync.RW_Mutex`)
- Engine state
- Items returned from storage layer (copied to request arena when needed)
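
The split between the two allocators can be sketched as follows. This is a minimal illustration, not JormunDB's actual code: `Engine`, `engine_init`, and `engine_register_table` are hypothetical names, and the 256-byte arena only simulates a request scope. The point it shows is that anything that must outlive a request is cloned with the long-lived allocator before the arena goes away.

```odin
package main

import "core:mem"
import "core:strings"

// Hypothetical engine state that must outlive any single request.
Engine :: struct {
    allocator:   mem.Allocator,   // long-lived allocator, captured at startup
    table_names: [dynamic]string, // backed by `allocator`
}

engine_init :: proc(e: ^Engine, allocator := context.allocator) {
    e.allocator = allocator
    e.table_names = make([dynamic]string, allocator)
}

// Clone the name with the long-lived allocator so it survives the request arena.
engine_register_table :: proc(e: ^Engine, name: string) {
    owned := strings.clone(name, e.allocator)
    append(&e.table_names, owned)
}

main :: proc() {
    engine: Engine
    engine_init(&engine) // context.allocator is still the default allocator here

    {
        // Simulated request scope: a small arena replaces context.allocator.
        backing: [256]byte
        arena: mem.Arena
        mem.arena_init(&arena, backing[:])
        context.allocator = mem.arena_allocator(&arena)

        name := strings.concatenate([]string{"Us", "ers"}) // lives in the arena
        engine_register_table(&engine, name)
    } // the context change and the arena end with this block

    assert(engine.table_names[0] == "Users")
}
```

Because `context` is scoped, the override inside the block does not leak into the rest of `main`.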

## Storage Format

### Binary Keys (Varint-Prefixed Segments)

All keys use varint length prefixes for space efficiency. A trailing `?` marks an optional segment, such as the sort key of a table that has none:

```
Meta key: [0x01][len][table_name]
Data key: [0x02][len][table_name][len][pk_value][len][sk_value]?
GSI key:  [0x03][len][table_name][len][index_name][len][gsi_pk][len][gsi_sk]?
LSI key:  [0x04][len][table_name][len][index_name][len][pk][len][lsi_sk]
```

**Example Data Key:**
```
Table: "Users"
PK: "user:123"
SK: "profile"

Encoded:
[0x02]      // Entity type (Data)
[0x05]      // Table name length (5)
Users       // Table name bytes
[0x08]      // PK length (8)
user:123    // PK bytes
[0x07]      // SK length (7)
profile     // SK bytes
```
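
The layout above can be reproduced with a few lines of Odin. This is a sketch under stated assumptions: it assumes the length prefix is an unsigned LEB128-style varint (the scheme described in the Varint Encoding reference), and `append_varint` and `build_key` are illustrative names rather than the actual `key_codec` API.

```odin
package main

import "core:fmt"

// Append n as an unsigned LEB128-style varint: 7 payload bits per byte,
// with the high bit set on every byte except the last.
append_varint :: proc(buf: ^[dynamic]byte, n: int) {
    v := u64(n)
    for v >= 0x80 {
        append(buf, byte(v & 0x7f) | 0x80)
        v >>= 7
    }
    append(buf, byte(v))
}

// Build [entity][len][segment][len][segment]... from the given segments.
build_key :: proc(entity: byte, segments: ..string) -> []byte {
    buf := make([dynamic]byte)
    append(&buf, entity)
    for s in segments {
        data := transmute([]byte)s
        append_varint(&buf, len(data))
        append(&buf, ..data)
    }
    return buf[:]
}

main :: proc() {
    // Data key for table "Users", PK "user:123", SK "profile".
    key := build_key(0x02, "Users", "user:123", "profile")
    defer delete(key)

    for b in key {
        fmt.printf("%02x ", b)
    }
    fmt.println()
}
```

Lengths below 128 fit in a single byte, which is why every prefix in the example is one byte; a 300-byte segment would be prefixed by the two bytes `0xac 0x02`.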

### Item Encoding (TLV Format)

Items use Tag-Length-Value encoding for space efficiency:

```
Format:
[attr_count:varint]
[name_len:varint][name:bytes][type_tag:u8][value_len:varint][value:bytes]...

Type Tags:
  String = 0x01    Number = 0x02    Binary = 0x03
  Bool   = 0x04    Null   = 0x05
  SS     = 0x10    NS     = 0x11    BS     = 0x12
  List   = 0x20    Map    = 0x21
```

**Example Item:**
```json
{
  "id": {"S": "user123"},
  "age": {"N": "30"}
}
```

Encoded as:
```
[0x02]      // 2 attributes
[0x02]      // name length (2)
id          // name bytes
[0x01]      // type tag (String)
[0x07]      // value length (7)
user123     // value bytes

[0x03]      // name length (3)
age         // name bytes
[0x02]      // type tag (Number)
[0x02]      // value length (2)
30          // value bytes (stored as string)
```
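
A flat encoder for this format might look like the sketch below. It handles only the String and Number tags, assumes the same LEB128-style varint as the key format, and uses hypothetical names (`Attr`, `encode_item`) that are not taken from `item_codec.odin`.

```odin
package main

import "core:fmt"

// Subset of the type tags listed above.
Type_Tag :: enum u8 {
    String = 0x01,
    Number = 0x02,
}

// Hypothetical flat attribute; numbers keep their string form, as in the example.
Attr :: struct {
    name:  string,
    tag:   Type_Tag,
    value: string,
}

// Unsigned LEB128-style varint (assumed to match the key format).
append_varint :: proc(buf: ^[dynamic]byte, n: int) {
    v := u64(n)
    for v >= 0x80 {
        append(buf, byte(v & 0x7f) | 0x80)
        v >>= 7
    }
    append(buf, byte(v))
}

// Append [len:varint][bytes].
append_bytes :: proc(buf: ^[dynamic]byte, s: string) {
    data := transmute([]byte)s
    append_varint(buf, len(data))
    append(buf, ..data)
}

// [attr_count] then, per attribute, [name_len][name][type_tag][value_len][value].
encode_item :: proc(attrs: []Attr) -> []byte {
    buf := make([dynamic]byte)
    append_varint(&buf, len(attrs))
    for a in attrs {
        append_bytes(&buf, a.name)
        append(&buf, u8(a.tag))
        append_bytes(&buf, a.value)
    }
    return buf[:]
}

main :: proc() {
    attrs := []Attr{
        {name = "id", tag = .String, value = "user123"},
        {name = "age", tag = .Number, value = "30"},
    }
    encoded := encode_item(attrs)
    defer delete(encoded)
    fmt.println(len(encoded), "bytes")
}
```

Nested types (List, Map, and the set types) would need a recursive value encoder, which this sketch leaves out.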

## Module Structure

```
jormundb/
├── main.odin              # Entry point, HTTP server
├── rocksdb/               # RocksDB C FFI bindings
│   └── rocksdb.odin       # db_open, db_put, db_get, etc.
├── dynamodb/              # DynamoDB protocol implementation
│   ├── types.odin         # Core types (Attribute_Value, Item, Key, etc.)
│   ├── json.odin          # DynamoDB JSON parsing/serialization
│   ├── storage.odin       # Storage engine (CRUD, scan, query)
│   └── handler.odin       # HTTP request handlers
├── key_codec/             # Binary key encoding
│   └── key_codec.odin     # build_data_key, decode_data_key, etc.
└── item_codec/            # Binary TLV item encoding
    └── item_codec.odin    # encode, decode
```

## Request Flow

```
1. HTTP POST / arrives
   ↓
2. Create arena allocator (4MB)
   Set context.allocator = arena_allocator
   ↓
3. Parse HTTP headers
   Extract X-Amz-Target → Operation
   ↓
4. Parse JSON body
   Convert DynamoDB JSON → internal types
   ↓
5. Route to handler (e.g., handle_put_item)
   ↓
6. Storage engine operation
   - Build binary key
   - Encode item to TLV
   - RocksDB put/get/delete
   ↓
7. Build response
   - Serialize item to DynamoDB JSON
   - Format HTTP response
   ↓
8. Send response
   ↓
9. Destroy arena
   All request memory freed automatically
```
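
Step 3 of the flow, mapping the `X-Amz-Target` header to an operation, can be sketched like this. The `Operation` enum and `operation_from_target` are hypothetical names for illustration; the only detail taken from the protocol is the `DynamoDB_20120810.` prefix in front of the operation name.

```odin
package main

import "core:fmt"
import "core:strings"

// Hypothetical set of operations a router could dispatch on.
Operation :: enum {
    Unknown,
    Create_Table,
    Put_Item,
    Get_Item,
    Delete_Item,
    Query,
    Scan,
}

TARGET_PREFIX :: "DynamoDB_20120810."

// Map an X-Amz-Target header value to an Operation.
operation_from_target :: proc(target: string) -> Operation {
    if !strings.has_prefix(target, TARGET_PREFIX) {
        return .Unknown
    }
    switch target[len(TARGET_PREFIX):] {
    case "CreateTable": return .Create_Table
    case "PutItem":     return .Put_Item
    case "GetItem":     return .Get_Item
    case "DeleteItem":  return .Delete_Item
    case "Query":       return .Query
    case "Scan":        return .Scan
    }
    return .Unknown
}

main :: proc() {
    op := operation_from_target("DynamoDB_20120810.PutItem")
    fmt.println(op)
}
```

Returning `.Unknown` rather than failing lets the caller decide which DynamoDB error to send back for unsupported targets.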

## Concurrency Model

### Table-Level RW Locks

Each table has a reader-writer lock:

```odin
Storage_Engine :: struct {
    db:                rocksdb.DB,
    table_locks:       map[string]^sync.RW_Mutex,
    table_locks_mutex: sync.Mutex,
}
```

**Read Operations** (GetItem, Query, Scan):
- Acquire shared lock
- Multiple readers can run concurrently
- Writers are blocked

**Write Operations** (PutItem, DeleteItem, UpdateItem):
- Acquire exclusive lock
- Only one writer at a time
- All readers are blocked
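
The locking scheme above can be sketched with `core:sync`. This is an illustration under assumptions, not the engine's code: `Lock_Registry` and `get_table_lock` are hypothetical, the RocksDB handle is left out, and the map key is stored without cloning, which is only safe here because the sketch uses string literals.

```odin
package main

import "core:sync"

// Hypothetical registry holding only the lock-related fields of the engine.
Lock_Registry :: struct {
    table_locks:       map[string]^sync.RW_Mutex,
    table_locks_mutex: sync.Mutex,
}

// Return the lock for `table`, creating it on first use.
// The plain mutex guards the map itself, not the table data.
get_table_lock :: proc(r: ^Lock_Registry, table: string) -> ^sync.RW_Mutex {
    sync.mutex_lock(&r.table_locks_mutex)
    defer sync.mutex_unlock(&r.table_locks_mutex)

    if existing, found := r.table_locks[table]; found {
        return existing
    }
    // Real code would allocate the lock and clone the key with the long-lived allocator.
    created := new(sync.RW_Mutex)
    r.table_locks[table] = created
    return created
}

main :: proc() {
    registry: Lock_Registry
    lock := get_table_lock(&registry, "Users")

    // Read path (GetItem, Query, Scan): shared lock, many readers at once.
    sync.rw_mutex_shared_lock(lock)
    // ... read from storage ...
    sync.rw_mutex_shared_unlock(lock)

    // Write path (PutItem, DeleteItem, UpdateItem): exclusive lock.
    sync.rw_mutex_lock(lock)
    // ... write to storage ...
    sync.rw_mutex_unlock(lock)

    // The same table always maps to the same lock.
    assert(get_table_lock(&registry, "Users") == lock)
}
```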

### Thread Safety

- RocksDB handles are thread-safe (the database handle and any column family handles can be shared across threads)
- Table metadata is protected by locks
- Request arenas are thread-local (no sharing)

## Error Handling

Odin reports errors through explicit return values; `or_return` propagates a failure to the caller:

```odin
// Odin error handling (assumes core:encoding/json is imported as json)
parse_json :: proc(data: []byte) -> (item: Item, ok: bool) {
    // json.parse reports failure through an Error value, not a bool
    parsed, err := json.parse(data)
    if err != .None {
        return
    }
    // or_return needs named results when a proc returns more than one value
    item = json_to_item(parsed) or_return
    return item, true
}

// Usage
item, ok := parse_json(request.body)
if !ok {
    return error_response(.ValidationException, "Invalid JSON")
}
```

No exceptions, no panic-recover patterns. Every error path is explicit.

## DynamoDB Wire Protocol

### Request Format

```
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.PutItem
Content-Type: application/x-amz-json-1.0

{
  "TableName": "Users",
  "Item": {
    "id": {"S": "user123"},
    "name": {"S": "Alice"}
  }
}
```

### Response Format

```
HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.0
x-amzn-RequestId: local-request-id

{}
```

### Error Format

```json
{
  "__type": "com.amazonaws.dynamodb.v20120810#ResourceNotFoundException",
  "message": "Table not found"
}
```
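
Producing an error body in this shape takes only string concatenation. The sketch below is illustrative: `format_error_body` is a hypothetical helper, and it assumes the message contains no characters that need JSON escaping (a real implementation would escape quotes and control characters).

```odin
package main

import "core:fmt"
import "core:strings"

ERROR_TYPE_PREFIX :: "com.amazonaws.dynamodb.v20120810#"

// Build the JSON error body; allocated with context.allocator.
// Assumption: error_type and message need no JSON escaping.
format_error_body :: proc(error_type, message: string) -> string {
    body := strings.concatenate([]string{
        `{"__type":"`, ERROR_TYPE_PREFIX, error_type,
        `","message":"`, message, `"}`,
    })
    return body
}

main :: proc() {
    body := format_error_body("ResourceNotFoundException", "Table not found")
    defer delete(body)
    fmt.println(body)
}
```

Inside a request handler the result would land in the request arena, so the explicit `delete` in `main` is only needed because this sketch runs on the default allocator.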

## Performance Characteristics

### Time Complexity

| Operation | Complexity | Notes |
|-----------|-----------|-------|
| PutItem | O(log n) | RocksDB LSM tree insert |
| GetItem | O(log n) | RocksDB point lookup |
| DeleteItem | O(log n) | RocksDB deletion |
| Query | O(log n + m) | n = items in table, m = result set |
| Scan | O(n) | Full table scan |

### Space Complexity

- Binary keys: ~20-100 bytes (vs 50-200 bytes JSON)
- Binary items: ~30% smaller than JSON
- Varint encoding saves space on small integers

### Benchmarks (Expected)

Estimates based on the Zig version's performance, not measurements of JormunDB:

```
Operation           Throughput    Latency (p50)
PutItem             ~5,000/sec    ~0.2ms
GetItem             ~7,000/sec    ~0.14ms
Query (1 item)      ~8,000/sec    ~0.12ms
Scan (1000 items)   ~20/sec       ~50ms
```

## Future Enhancements

### Planned Features

1. **UpdateExpression** - SET/REMOVE/ADD/DELETE operations
2. **FilterExpression** - Post-query filtering
3. **ProjectionExpression** - Return subset of attributes
4. **Global Secondary Indexes** - Query by non-key attributes
5. **Local Secondary Indexes** - Alternate sort keys
6. **BatchWriteItem** - Batch mutations
7. **BatchGetItem** - Batch reads
8. **Transactions** - ACID multi-item operations

### Optimization Opportunities

1. **Connection pooling** - Reuse HTTP connections
2. **Bloom filters** - Faster negative lookups
3. **Compression** - LZ4/Zstd on large items
4. **Caching layer** - Hot item cache
5. **Parallel scan** - Segment-based scanning

## Debugging

### Enable Verbose Logging

```bash
make run VERBOSE=1
```

### Inspect RocksDB

```bash
# Use ldb tool to inspect database
ldb --db=./data scan
ldb --db=./data get <key_hex>
```

### Memory Profiling

Odin's tracking allocator can detect leaks:

```odin
when ODIN_DEBUG {
    track: mem.Tracking_Allocator
    mem.tracking_allocator_init(&track, context.allocator)
    context.allocator = mem.tracking_allocator(&track)

    defer {
        for _, leak in track.allocation_map {
            // leak.location is a source location, so print it with %v
            fmt.printfln("Leaked %v bytes at %v", leak.size, leak.location)
        }
        mem.tracking_allocator_destroy(&track)
    }
}
```

## Migration from Zig Version

The Zig version (ZynamoDB) used the same binary storage format, so existing RocksDB databases can be read by JormunDB without migration.

### Compatibility

- ✅ Binary key format (byte-compatible)
- ✅ Binary item format (byte-compatible)
- ✅ Table metadata (JSON, compatible)
- ✅ HTTP wire protocol (identical)

### Breaking Changes

None - JormunDB can open ZynamoDB databases directly.

---

## Contributing

When contributing to JormunDB:

1. **Use the context allocator** - All request-scoped allocations should use `context.allocator`
2. **Avoid manual frees** - Let the arena handle it
3. **Long-lived data** - Use the default allocator explicitly
4. **Test thoroughly** - Run `make test` before committing
5. **Format code** - Run `make fmt` before committing

## References

- [Odin Language](https://odin-lang.org/)
- [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [DynamoDB API Reference](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/)
- [Varint Encoding](https://developers.google.com/protocol-buffers/docs/encoding#varints)