# M3DB on Vultr Kubernetes Engine

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.

## Prerequisites

- **kubectl** — for applying manifests
- **helm** — for installing the Traefik Ingress Controller

```bash
# Install helm (macOS/Linux with Homebrew)
brew install helm
```

## Architecture

```
External Prometheus ──remote_write──▶  Traefik Ingress (LoadBalancer)
External Grafana ────PromQL query───▶  TLS termination, basic auth
                                        │
                                        ▼
In-cluster Prometheus ─remote_write─▶  M3 Coordinator (Deployment, 2 replicas)
In-cluster Grafana ───PromQL query──▶   │
                                        ▼
                                       M3DB Nodes (StatefulSet, 3 replicas)
                                       Vultr Block Storage (100Gi NVMe per node)
                                        │
                                        ▼
                                       etcd cluster (StatefulSet, 3 replicas)
```

All components above run inside the Vultr VKE cluster.

**External access flow:**
```
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
```

## Retention Tiers

All namespaces use a **1h block size** — the sweet spot for M3DB. Smaller blocks mean faster queries, faster flushes, and less memory pressure during compaction. See [Why Backfill Doesn't Work](#why-backfill-doesnt-work) for why larger blocks were a disaster.

| Namespace      | Resolution | Retention | Block Size | Use Case                  |
|----------------|------------|-----------|------------|---------------------------|
| `default`      | raw        | 48h       | 1h         | Real-time queries         |
| `agg_10s_30d`  | 10s        | 30 days   | 1h         | Recent dashboards         |
| `agg_1m_1y`    | 1m         | 1 year    | 1h         | Long-term trends/capacity |

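As a sanity check on the table, the number of on-disk blocks a namespace keeps is roughly retention divided by block size. A quick sketch (plain arithmetic, no M3DB client involved):

```python
from datetime import timedelta

def blocks_for(retention: timedelta, block_size: timedelta) -> int:
    # Approximate count of time-partitioned blocks a namespace keeps on disk.
    return int(retention / block_size)

print(blocks_for(timedelta(hours=48), timedelta(hours=1)))   # default: 48
print(blocks_for(timedelta(days=30), timedelta(hours=1)))    # agg_10s_30d: 720
print(blocks_for(timedelta(days=365), timedelta(hours=1)))   # agg_1m_1y: 8760
```

The 720 and 8760 figures correspond to the `retentionPeriodDuration` values (`720h`, `8760h`) used when creating the namespaces in the bootstrap section.
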
## Deployment

```bash
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
  --namespace traefik --create-namespace \
  --set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
  --set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
  --set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'

# Note: ACME requires a single Traefik replica. For HA, use external cert
# management or Traefik Enterprise with distributed ACME storage.

# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) at this IP

# 3. Apply the M3DB manifests
kubectl apply -k .

# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
```

## Bootstrap M3DB Cluster

The init job waits for coordinator health, which requires M3DB to be bootstrapped.
Bootstrap directly via m3dbnode's embedded coordinator:

```bash
# Initialize the placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'

# Create the namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'

# Wait for bootstrap to complete (the health output should report "bootstrapped": true)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
```

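The instance list in the placement payload is mechanical: one entry per StatefulSet pod. If you change the replica count, a small helper (hypothetical, matching the naming conventions used in these manifests) can emit it:

```python
import json

def placement_instances(replicas: int, namespace: str = "m3db",
                        statefulset: str = "m3dbnode", port: int = 9000) -> list[dict]:
    # One placement entry per StatefulSet pod, each in its own isolation group
    # so the replicas of every shard land on different nodes.
    zones = "abcdefghijklmnopqrstuvwxyz"
    return [
        {
            "id": f"{statefulset}-{i}",
            "isolation_group": f"zone-{zones[i]}",
            "zone": "embedded",
            "weight": 100,
            "endpoint": f"{statefulset}-{i}.{statefulset}.{namespace}.svc.cluster.local:{port}",
            "hostname": f"{statefulset}-{i}",
            "port": port,
        }
        for i in range(replicas)
    ]

body = {"num_shards": 64, "replication_factor": 3,
        "instances": placement_instances(3)}
print(json.dumps(body, indent=2))
```

With `replicas=3` this should reproduce the three entries in the curl payload above.
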
## Authentication

External access is protected by HTTP basic auth. Update the password in `08-basic-auth-middleware.yaml`:

```bash
# Generate a new htpasswd entry (add -B for bcrypt)
htpasswd -nb <username> <password>

# Update the secret's stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
```

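If `htpasswd` isn't installed, an equivalent entry can be generated in Python. This sketch emits the `{SHA}` variant, one of the formats Traefik's BasicAuth middleware accepts (MD5, SHA-1, or bcrypt); prefer bcrypt via `htpasswd -B` for anything sensitive, since SHA-1 is weak:

```python
import base64
import hashlib

def htpasswd_sha1(username: str, password: str) -> str:
    # Same format as htpasswd -s: user:{SHA}base64(sha1(password))
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{username}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"

print(htpasswd_sha1("example", "example"))
```
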
## Testing

**Quick connectivity test:**
```bash
# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
```

**Full read/write test (Python):**
```bash
pip install requests python-snappy

# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
```

## Prometheus Configuration (Replacing Mimir)

Update your Prometheus config to point at the M3 Coordinator.

**In-cluster (same VKE cluster):**
```yaml
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
```

**External (cross-region/cross-cluster):**
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
```

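To size `queue_config` for your own fleet, start from the ingest rate. A back-of-envelope sketch (the series count and scrape interval are hypothetical placeholders, substitute your own):

```python
# Hypothetical fleet: 500k active series scraped every 15s.
active_series = 500_000
scrape_interval_s = 15

ingest_rate = active_series / scrape_interval_s   # samples/s into remote_write
per_shard = ingest_rate / 30                      # spread over max_shards: 30
batches_per_shard_s = per_shard / 5000            # with max_samples_per_send: 5000

print(round(ingest_rate), round(per_shard), round(batches_per_shard_s, 2))
```

At these rates each shard sends well under one full batch per second, so the 5s `batch_send_deadline` mainly bounds delivery latency rather than throughput.
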
## Grafana Datasource

Add a **Prometheus** datasource in Grafana pointing to:

- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `https://m3db.vultrlabs.dev` (with basic auth)

All existing PromQL dashboards will work without modification.

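For declarative setups, the same datasource can be provisioned from a file. A sketch in Grafana's standard datasource-provisioning format (the datasource name `M3DB` and the file path are arbitrary):

```yaml
# grafana/provisioning/datasources/m3db.yaml
apiVersion: 1
datasources:
  - name: M3DB
    type: prometheus
    access: proxy
    url: https://m3db.vultrlabs.dev
    basicAuth: true
    basicAuthUser: example
    secureJsonData:
      basicAuthPassword: example
```

Using `secureJsonData` keeps the password out of Grafana's plain datasource JSON.
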
## Migration from Mimir

1. **Dual-write phase**: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
2. **Validation**: Compare query results between Mimir and M3DB for the same time ranges.
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission Mimir components.

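Step 1 amounts to two remote_write targets side by side. A sketch (the Mimir URL and tenant header are placeholders for whatever your existing setup uses):

```yaml
# prometheus.yml during the dual-write phase
remote_write:
  # Existing Mimir target (keep until cutover)
  - url: "https://mimir.example.com/api/v1/push"
    headers:
      X-Scope-OrgID: "your-tenant"
  # New M3DB target
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
```

Each target has its own queue, so a slow or failing M3DB endpoint does not block delivery to Mimir.
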
## Multi-Tenancy (Label-Based)

M3DB uses Prometheus-style labels for tenant isolation. Add labels like `tenant`, `service`, and `env` to your metrics to differentiate between sources.

**Write metrics with tenant labels:**
```python
# In your Prometheus remote_write client
labels = {
    "tenant": "acme-corp",
    "service": "api-gateway",
    "env": "prod",
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
```

**Query by tenant:**
```bash
# All metrics from a specific tenant (use -G with --data-urlencode so the
# braces and quotes in the PromQL selector are properly URL-encoded)
curl -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{tenant="acme-corp"}'

# Filter by service within a tenant
curl -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{tenant="acme-corp",service="api-gateway"}'

# Filter by environment
curl -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{env="prod"}'
```

**Prometheus configuration with labels:**
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
```

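The same query URLs can also be built programmatically. A sketch using only the standard library (server URL and credentials as in the curl examples):

```python
from urllib.parse import urlencode

def query_url(base: str, promql: str) -> str:
    # Percent-encode the PromQL expression so braces and quotes in the
    # label selector are safe to place in a URL query string.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

url = query_url("https://m3db.vultrlabs.dev",
                'http_requests_total{tenant="acme-corp",env="prod"}')
print(url)
```

This produces the same request the curl examples send, with `{` encoded as `%7B` and `"` as `%22`.
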
## Tuning for Vultr

- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the `volumeClaimTemplates` based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry. We recommend nodes with at least 8 GB RAM on Vultr; the manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.

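When weighing shard counts, note that each node carries `num_shards × replication_factor / nodes` shard replicas. A quick sketch with the values above (the 6-node layout is a hypothetical scale-out):

```python
def shard_replicas_per_node(num_shards: int, replication_factor: int, nodes: int) -> float:
    # Total shard replicas in the cluster, spread evenly across nodes.
    return num_shards * replication_factor / nodes

print(shard_replicas_per_node(64, 3, 3))    # current layout: 64.0 per node
print(shard_replicas_per_node(256, 3, 6))   # hypothetical larger cluster: 128.0
```

More shards per node means finer rebalancing granularity but more per-shard overhead, so increase the count together with node count, not in isolation.
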
## Why Backfill Doesn't Work

**TL;DR: M3DB is not designed for historical data import. Don't try it.**

M3DB is a time-series database optimized for real-time ingestion and sequential writes. Backfilling — writing data with timestamps in the past — fights the fundamental architecture at every turn.

### The Problems

1. **`bufferPast` is a hard gate.** M3DB rejects writes whose timestamps fall outside the `bufferPast` window (default: 10m). To write data from 3 weeks ago, you need `bufferPast=504h` (21 days). This setting is **immutable** on existing namespaces — you have to create entirely new namespaces just for backfill, doubling your operational complexity.

2. **Massive block sizes were required.** To make the backfill namespaces work with `bufferPast=504h`, block sizes had to be enormous (30+ day blocks). This defeated the entire point of M3DB's time-partitioned storage — blocks that large cause extreme memory pressure, slow compaction, and bloated index lookups.

3. **The downsample pipeline ignores historical data.** M3DB's downsample coordinator only processes new writes in real time. Backfilled data written to `default_backfill` namespaces never gets downsampled into aggregated namespaces, so your long-term retention tiers have gaps.

4. **No transaction boundaries.** Each backfill write is an individual operation. Writing 12M+ samples means 12M+ individual writes with no batching semantics. If one fails, there's no rollback and no retry from a checkpoint — you get partial data with no easy way to detect or fix gaps.

5. **Compaction and flush chaos.** M3DB expects data to flow sequentially through commitlog → flush → compact. Backfill dumps data out of order, causing the background compaction to thrash, consuming CPU and I/O for blocks that may never be queried again.

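For a sense of scale on problem 4, a rough sample-count estimate (the series count and scrape interval are hypothetical; even a tiny fleet reaches millions of samples over three weeks):

```python
# Rough count of individual writes a backfill window implies.
series = 100             # hypothetical: a very small fleet
scrape_interval_s = 15
window_days = 21         # the 3-week window that forced bufferPast=504h

samples = series * (window_days * 24 * 3600 // scrape_interval_s)
print(f"{samples:,} samples")  # 12,096,000 samples
```
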

### What We Tried

- Created `default_backfill`, `agg_10s_backfill`, and `agg_1m_backfill` namespaces with `bufferPast=504h`
- Increased block sizes to 24h–30d to accommodate the large bufferPast
- Wrote 12M+ samples from Mimir to M3DB over multiple runs
- Result: the data landed, but the operational cost was catastrophic — huge blocks, no downsampling, and an unstable cluster

### What To Do Instead

- **Start fresh.** Configure M3DB with sane block sizes (1h) from day one and let it accumulate data naturally via Prometheus remote_write.
- **Accept the gap.** Historical data lives in Mimir (or wherever it was before). Query Mimir for old data, M3DB for new data.
- **Dual-write during migration.** Write to both systems simultaneously until M3DB's retention catches up.
- **If you absolutely need old data in M3DB**, accept that you're doing a one-time migration and build tooling around the constraints — but know that it's a project, not a script.

---

## Useful Commands

```bash
# Get the Traefik LoadBalancer IP
kubectl -n traefik get svc traefik

# Check cluster health (from inside the cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health

# Check the placement (from inside the cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq

# Check m3dbnode bootstrap status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

# Query via PromQL (external, with auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"

# Delete and re-run the init job (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml

# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik
```
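The placement check above returns one entry per instance with per-shard states, and counting how many shards are not yet AVAILABLE is a convenient bootstrap progress check. A sketch that parses a placement document offline (the sample JSON is abbreviated and illustrative):

```python
import json
from collections import Counter

def shard_states(placement_json: str) -> Counter:
    # Tally shard states across all instances in a placement document.
    placement = json.loads(placement_json)["placement"]
    states = Counter()
    for instance in placement["instances"].values():
        for shard in instance["shards"]:
            states[shard["state"]] += 1
    return states

# Abbreviated sample of the shape the placement endpoint returns.
sample = json.dumps({"placement": {"instances": {
    "m3dbnode-0": {"shards": [{"id": 0, "state": "AVAILABLE"},
                              {"id": 1, "state": "INITIALIZING"}]},
    "m3dbnode-1": {"shards": [{"id": 0, "state": "AVAILABLE"}]},
}}})
print(shard_states(sample))
```

The cluster is fully bootstrapped once every shard reports AVAILABLE.
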