## JormunDB Architecture

This document explains the internal architecture of JormunDB, including design decisions, storage formats, and the arena-per-request memory management pattern.

## Table of Contents

- [Overview](#overview)
- [Why Odin?](#why-odin)
- [Memory Management](#memory-management)
- [Storage Format](#storage-format)
- [Module Structure](#module-structure)
- [Request Flow](#request-flow)
- [Concurrency Model](#concurrency-model)
- [Error Handling](#error-handling)
- [DynamoDB Wire Protocol](#dynamodb-wire-protocol)
- [Performance Characteristics](#performance-characteristics)
- [Future Enhancements](#future-enhancements)
- [Debugging](#debugging)
- [Migration from Zig Version](#migration-from-zig-version)
- [Contributing](#contributing)
- [References](#references)
## Overview

JormunDB is a DynamoDB-compatible database server that speaks the DynamoDB wire protocol. It uses RocksDB for persistent storage and is written in Odin, whose implicit context allocator keeps memory management out of the request-handling code.

### Key Design Goals

1. **Zero allocation ceremony** - No explicit `defer free()` or error handling for every allocation
2. **Binary storage** - Efficient TLV encoding instead of JSON
3. **API compatibility** - Drop-in replacement for DynamoDB Local
4. **Performance** - RocksDB-backed with efficient key encoding

## Why Odin?

The original implementation in Zig suffered from explicit allocator threading:

```zig
// Zig version - explicit allocator everywhere
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) !Response {
    const parsed = try parseJson(allocator, request);
    defer parsed.deinit(allocator);

    const item = try storage.getItem(allocator, parsed.table_name, parsed.key);
    defer if (item) |i| freeItem(allocator, i);

    const response = try serializeResponse(allocator, item);
    defer allocator.free(response);

    return response; // Wait, we deferred the free!
}
```

Odin's context allocator system eliminates this:

```odin
// Odin version - implicit context allocator
handle_request :: proc(request: []byte) -> Response {
    // All allocations use context.allocator automatically
    parsed := parse_json(request)
    item := storage_get_item(parsed.table_name, parsed.key)
    response := serialize_response(item)

    return response
    // Everything freed when arena is destroyed
}
```
## Memory Management

JormunDB uses a two-allocator strategy:

### 1. Arena Allocator (Request-Scoped)

Every HTTP request gets its own arena:

```odin
handle_connection :: proc(conn: net.TCP_Socket) {
    // Create arena for this request (4MB)
    arena: mem.Arena
    mem.arena_init(&arena, make([]byte, mem.Megabyte * 4))
    defer mem.arena_destroy(&arena)

    // Set context allocator
    context.allocator = mem.arena_allocator(&arena)

    // All downstream code uses context.allocator
    request := parse_http_request(conn)  // uses arena
    response := handle_request(request)  // uses arena
    send_response(conn, response)        // uses arena

    // Arena is freed here - everything cleaned up automatically
}
```

**Benefits:**
- No individual `free()` calls needed
- No `errdefer` cleanup
- No use-after-free bugs
- No memory leaks from forgotten frees
- Predictable performance (no GC pauses)

### 2. Default Allocator (Long-Lived Data)

The default allocator (typically `context.allocator` at program start) is used for:

- Table metadata
- Table locks (`sync.RW_Mutex`)
- Engine state
- Items returned from the storage layer (copied to the request arena when needed)
## Storage Format

### Binary Keys (Varint-Prefixed Segments)

All keys use varint length prefixes for space efficiency:

```
Meta key:  [0x01][len][table_name]
Data key:  [0x02][len][table_name][len][pk_value][len][sk_value]?
GSI key:   [0x03][len][table_name][len][index_name][len][gsi_pk][len][gsi_sk]?
LSI key:   [0x04][len][table_name][len][index_name][len][pk][len][lsi_sk]
```

**Example Data Key:**
```
Table: "Users"
PK:    "user:123"
SK:    "profile"

Encoded:
[0x02]     // Entity type (Data)
[0x05]     // Table name length (5)
Users      // Table name bytes
[0x08]     // PK length (8)
user:123   // PK bytes
[0x07]     // SK length (7)
profile    // SK bytes
```
### Item Encoding (TLV Format)

Items use Tag-Length-Value encoding for space efficiency:

```
Format:
[attr_count:varint]
[name_len:varint][name:bytes][type_tag:u8][value_len:varint][value:bytes]...

Type Tags:
String = 0x01   Number = 0x02   Binary = 0x03
Bool   = 0x04   Null   = 0x05
SS     = 0x10   NS     = 0x11   BS     = 0x12
List   = 0x20   Map    = 0x21
```

**Example Item:**
```json
{
  "id": {"S": "user123"},
  "age": {"N": "30"}
}
```

Encoded as:
```
[0x02]    // 2 attributes
[0x02]    // name length (2)
id        // name bytes
[0x01]    // type tag (String)
[0x07]    // value length (7)
user123   // value bytes

[0x03]    // name length (3)
age       // name bytes
[0x02]    // type tag (Number)
[0x02]    // value length (2)
30        // value bytes (stored as string)
```
## Module Structure

```
jormundb/
├── main.odin            # Entry point, HTTP server
├── rocksdb/             # RocksDB C FFI bindings
│   └── rocksdb.odin     # db_open, db_put, db_get, etc.
├── dynamodb/            # DynamoDB protocol implementation
│   ├── types.odin       # Core types (Attribute_Value, Item, Key, etc.)
│   ├── json.odin        # DynamoDB JSON parsing/serialization
│   ├── storage.odin     # Storage engine (CRUD, scan, query)
│   └── handler.odin     # HTTP request handlers
├── key_codec/           # Binary key encoding
│   └── key_codec.odin   # build_data_key, decode_data_key, etc.
└── item_codec/          # Binary TLV item encoding
    └── item_codec.odin  # encode, decode
```
## Request Flow

```
1. HTTP POST / arrives
       ↓
2. Create arena allocator (4MB)
   Set context.allocator = arena_allocator
       ↓
3. Parse HTTP headers
   Extract X-Amz-Target → Operation
       ↓
4. Parse JSON body
   Convert DynamoDB JSON → internal types
       ↓
5. Route to handler (e.g., handle_put_item)
       ↓
6. Storage engine operation
   - Build binary key
   - Encode item to TLV
   - RocksDB put/get/delete
       ↓
7. Build response
   - Serialize item to DynamoDB JSON
   - Format HTTP response
       ↓
8. Send response
       ↓
9. Destroy arena
   All request memory freed automatically
```
## Concurrency Model

### Table-Level RW Locks

Each table has a reader-writer lock:

```odin
Storage_Engine :: struct {
    db:                rocksdb.DB,
    table_locks:       map[string]^sync.RW_Mutex,
    table_locks_mutex: sync.Mutex,
}
```

**Read Operations** (GetItem, Query, Scan):
- Acquire shared lock
- Multiple readers can run concurrently
- Writers are blocked

**Write Operations** (PutItem, DeleteItem, UpdateItem):
- Acquire exclusive lock
- Only one writer at a time
- All readers are blocked
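
The role of `table_locks_mutex` is to guard lazy creation of the per-table locks, so that two threads racing on a new table never create two different locks for it. A Python sketch of that pattern (names hypothetical; Python's stdlib has no reader-writer lock, so a plain `Lock` stands in for `sync.RW_Mutex`):

```python
import threading

class LockTable:
    """Per-table locks, created lazily under a guard mutex."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # plays the role of table_locks_mutex
        self._locks: dict[str, threading.Lock] = {}  # plays the role of table_locks

    def lock_for(self, table: str) -> threading.Lock:
        # Look up (or create) the table's lock while holding the guard,
        # so concurrent first accesses to a table agree on one lock object.
        with self._guard:
            if table not in self._locks:
                self._locks[table] = threading.Lock()
            return self._locks[table]

locks = LockTable()
assert locks.lock_for("Users") is locks.lock_for("Users")    # one lock per table
assert locks.lock_for("Users") is not locks.lock_for("Orders")
```

The guard is held only for the dictionary lookup, never across the actual storage operation, so lock-map maintenance adds negligible contention.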
### Thread Safety

- RocksDB handles are thread-safe (column family-based)
- Table metadata is protected by locks
- Request arenas are thread-local (no sharing)
## Error Handling

Odin uses explicit error returns via `or_return`:

```odin
// Odin error handling
parse_json :: proc(data: []byte) -> (Item, bool) {
    parsed := json.parse(data) or_return
    item := json_to_item(parsed) or_return
    return item, true
}

// Usage
item, ok := parse_json(request.body)
if !ok {
    return error_response(.ValidationException, "Invalid JSON")
}
```

No exceptions, no panic-recover patterns. Every error path is explicit.
## DynamoDB Wire Protocol

### Request Format

```
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.PutItem
Content-Type: application/x-amz-json-1.0

{
  "TableName": "Users",
  "Item": {
    "id": {"S": "user123"},
    "name": {"S": "Alice"}
  }
}
```

### Response Format

```
HTTP/1.1 200 OK
Content-Type: application/x-amz-json-1.0
x-amzn-RequestId: local-request-id

{}
```

### Error Format

```json
{
  "__type": "com.amazonaws.dynamodb.v20120810#ResourceNotFoundException",
  "message": "Table not found"
}
```
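
Routing hinges entirely on the `X-Amz-Target` header, which carries the API version and the operation name joined by a dot. A minimal sketch of the split (Python for illustration; the server's actual routing lives in `handler.odin`):

```python
def parse_target(header: str) -> tuple[str, str]:
    """Split 'DynamoDB_20120810.PutItem' into (api_version, operation)."""
    prefix, _, operation = header.partition(".")
    if prefix != "DynamoDB_20120810" or not operation:
        raise ValueError(f"unsupported X-Amz-Target: {header!r}")
    return prefix, operation

assert parse_target("DynamoDB_20120810.PutItem") == ("DynamoDB_20120810", "PutItem")
```

The operation name then selects the handler, e.g. `PutItem` → `handle_put_item`.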
## Performance Characteristics

### Time Complexity

| Operation | Complexity | Notes |
|-----------|------------|-------|
| PutItem | O(log n) | RocksDB LSM tree insert |
| GetItem | O(log n) | RocksDB point lookup |
| DeleteItem | O(log n) | RocksDB deletion |
| Query | O(log n + m) | n = items in table, m = result set size |
| Scan | O(n) | Full table scan |

### Space Complexity

- Binary keys: ~20-100 bytes (vs. 50-200 bytes as JSON)
- Binary items: ~30% smaller than JSON
- Varint encoding saves space on small integers

### Benchmarks (Expected)

Based on the Zig version's performance:

```
Operation          Throughput    Latency (p50)
PutItem            ~5,000/sec    ~0.2ms
GetItem            ~7,000/sec    ~0.14ms
Query (1 item)     ~8,000/sec    ~0.12ms
Scan (1000 items)  ~20/sec       ~50ms
```
## Future Enhancements

### Planned Features

1. **UpdateExpression** - SET/REMOVE/ADD/DELETE operations
2. **FilterExpression** - Post-query filtering
3. **ProjectionExpression** - Return a subset of attributes
4. **Global Secondary Indexes** - Query by non-key attributes
5. **Local Secondary Indexes** - Alternate sort keys
6. **BatchWriteItem** - Batch mutations
7. **BatchGetItem** - Batch reads
8. **Transactions** - ACID multi-item operations

### Optimization Opportunities

1. **Connection pooling** - Reuse HTTP connections
2. **Bloom filters** - Faster negative lookups
3. **Compression** - LZ4/Zstd on large items
4. **Caching layer** - Hot item cache
5. **Parallel scan** - Segment-based scanning
## Debugging

### Enable Verbose Logging

```bash
make run VERBOSE=1
```

### Inspect RocksDB

```bash
# Use the ldb tool to inspect the database
ldb --db=./data scan
ldb --db=./data get <key_hex>
```

### Memory Profiling

Odin's tracking allocator can detect leaks:

```odin
when ODIN_DEBUG {
    track: mem.Tracking_Allocator
    mem.tracking_allocator_init(&track, context.allocator)
    context.allocator = mem.tracking_allocator(&track)

    defer {
        for _, leak in track.allocation_map {
            fmt.printfln("Leaked %v bytes at %v", leak.size, leak.location)
        }
    }
}
```
## Migration from Zig Version

The Zig version (ZynamoDB) used the same binary storage format, so existing RocksDB databases can be read by JormunDB without migration.

### Compatibility

- ✅ Binary key format (byte-compatible)
- ✅ Binary item format (byte-compatible)
- ✅ Table metadata (JSON, compatible)
- ✅ HTTP wire protocol (identical)

### Breaking Changes

None - JormunDB can open ZynamoDB databases directly.

---
## Contributing

When contributing to JormunDB:

1. **Use the context allocator** - All request-scoped allocations should use `context.allocator`
2. **Avoid manual frees** - Let the arena handle it
3. **Long-lived data** - Allocate with the default allocator explicitly
4. **Test thoroughly** - Run `make test` before committing
5. **Format code** - Run `make fmt` before committing

## References

- [Odin Language](https://odin-lang.org/)
- [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [DynamoDB API Reference](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/)
- [Varint Encoding](https://developers.google.com/protocol-buffers/docs/encoding#varints)