biondizzle 7ade5ecac8 Clean slate: 1h block sizes, remove backfill artifacts
- Changed all namespace block sizes to 1h (was 2h/12h/24h in manifests,
  30d+ in the live cluster due to backfill-era bufferPast hacks)
- Deleted entire backfill/ directory (scripts, pods, runbooks)
- Removed stale 05-m3coordinator.yaml (had backfill namespaces)
- Added 05-m3coordinator-deployment.yaml to kustomization
- Fixed init job health check (/health instead of /api/v1/services/m3db/health)
- Updated .env.example (removed Mimir credentials)
- Added 'Why Backfill Doesn't Work' section to README
2026-04-09 19:00:08 +00:00

# M3DB on Vultr Kubernetes Engine
Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.
## Prerequisites
- **kubectl** — for applying manifests
- **helm** — for installing Traefik Ingress Controller
```bash
# Install helm (macOS/Linux with Homebrew)
brew install helm
```
## Architecture
```
(All components below run inside the Vultr VKE cluster.)

External Prometheus ───remote_write──▶ ┌───────────────────────────────────────────┐
External Grafana ──────PromQL query──▶ │ Traefik Ingress (LoadBalancer)            │
                                       │ TLS termination, basic auth               │
                                       └─────────────────────┬─────────────────────┘
                                                             │
In-cluster Prometheus ─remote_write──▶ ┌─────────────────────┴─────────────────────┐
In-cluster Grafana ────PromQL query──▶ │ M3 Coordinator (Deployment, 2 replicas)   │
                                       └─────────────────────┬─────────────────────┘
                                                             │
                                       ┌─────────────────────┴─────────────────────┐
                                       │ M3DB Nodes (StatefulSet, 3 replicas)      │
                                       │ Vultr Block Storage (100Gi NVMe per node) │
                                       └─────────────────────┬─────────────────────┘
                                                             │
                                       ┌─────────────────────┴─────────────────────┐
                                       │ etcd cluster (StatefulSet, 3 replicas)    │
                                       └───────────────────────────────────────────┘
```
**External access flow:**
```
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
```
## Retention Tiers
All namespaces use **1h block size** — the sweet spot for M3DB. Smaller blocks mean faster queries, faster flushes, and less memory pressure during compaction. See [Why Backfill Doesn't Work](#why-backfill-doesnt-work) for why larger blocks were a disaster.
| Namespace | Resolution | Retention | Block Size | Use Case |
|----------------|-----------|-----------|------------|---------------------------|
| `default` | raw | 48h | 1h | Real-time queries |
| `agg_10s_30d` | 10s | 30 days | 1h | Recent dashboards |
| `agg_1m_1y` | 1m | 1 year | 1h | Long-term trends/capacity |
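As a quick sanity check on the table, the number of data blocks a namespace keeps on disk is simply retention divided by block size:

```shell
# Blocks retained per namespace = retention / block size (all tiers use 1h blocks)
default_blocks=$(( 48 / 1 ))       # default: 48h retention -> 48 blocks
agg_10s_blocks=$(( 720 / 1 ))      # agg_10s_30d: 30d = 720h -> 720 blocks
agg_1m_blocks=$(( 8760 / 1 ))      # agg_1m_1y: 1y = 8760h -> 8760 blocks
echo "$default_blocks $agg_10s_blocks $agg_1m_blocks"
```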
## Deployment
```bash
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
  --namespace traefik --create-namespace \
  --set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
  --set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
  --set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
# Note: ACME requires single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.
# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP
# 3. Apply M3DB manifests
kubectl apply -k .
# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
```
## Bootstrap M3DB Cluster
The init job waits for coordinator health, which requires m3db to be bootstrapped.
Bootstrap directly via m3dbnode's embedded coordinator:
```bash
# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'
# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"}}}'
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'
# Wait for bootstrapping to complete: the node health endpoint reports
# "bootstrapped": true once all placement shards reach state AVAILABLE
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
```
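The wait can be scripted. Against a live cluster you would pipe the placement endpoint through jq, e.g. `curl -s localhost:7201/api/v1/services/m3db/placement | jq '[.placement.instances[].shards[] | select(.state != "AVAILABLE")] | length'`. The portable sketch below runs the same check on a sample payload whose shape matches the placement API response:

```shell
# Sample placement fragment (shape matches the M3 placement API response)
placement='{"placement":{"instances":{"m3dbnode-0":{"shards":[{"id":0,"state":"AVAILABLE"},{"id":1,"state":"INITIALIZING"}]}}}}'

# Count shards not yet AVAILABLE; bootstrap is complete when this reaches 0
pending=$(printf '%s' "$placement" | grep -o '"state":"[A-Z]*"' | grep -vc '"AVAILABLE"')
echo "pending shards: $pending"
```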
## Authentication
External access is protected by HTTP basic auth. Update the password in `08-basic-auth-middleware.yaml`:
```bash
# Generate new htpasswd entry
htpasswd -nb <username> <password>
# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
```
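If `htpasswd` (from apache2-utils) isn't installed, `openssl` can generate a compatible APR1-MD5 entry; the username and password below are placeholders:

```shell
# Generate a basic-auth entry without htpasswd (APR1-MD5, understood by Traefik)
entry="example:$(openssl passwd -apr1 examplepassword)"
echo "$entry"
# Paste the result into the secret's stringData.users field
```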
## Testing
**Quick connectivity test:**
```bash
# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example
# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
```
**Full read/write test (Python):**
```bash
pip install requests python-snappy
# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example
# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
```
## Prometheus Configuration (Replacing Mimir)
Update your Prometheus config to point at M3 Coordinator.
**In-cluster (same VKE cluster):**
```yaml
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s
remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
```
**External (cross-region/cross-cluster):**
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s
remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
```
## Grafana Datasource
Add a **Prometheus** datasource in Grafana pointing to:
- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `https://m3db.vultrlabs.dev` (with basic auth)
All existing PromQL dashboards will work without modification.
## Migration from Mimir
1. **Dual-write phase**: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
2. **Validation**: Compare query results between Mimir and M3DB for the same time ranges.
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission Mimir components.
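Step 2 (validation) can be scripted. The sketch below compares sample values from the two backends' instant-query responses after stripping the (slightly skewed) scrape timestamps; the sample payloads stand in for real `curl` output, and the Mimir hostname and credentials are placeholders:

```shell
# In practice, fill these from the live endpoints, e.g.:
#   mimir=$(curl -s -u user:pass "https://mimir.example.com/prometheus/api/v1/query?query=up")
#   m3db=$(curl -s -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up")
mimir='{"status":"success","data":{"result":[{"metric":{"instance":"a"},"value":[1700000000,"1"]}]}}'
m3db='{"status":"success","data":{"result":[{"metric":{"instance":"a"},"value":[1700000005,"1"]}]}}'

# Drop the per-backend timestamps, then compare labels + sample values
strip_ts() { sed -E 's/\[[0-9]+(\.[0-9]+)?,/[/g'; }
a=$(echo "$mimir" | strip_ts)
b=$(echo "$m3db" | strip_ts)
if [ "$a" = "$b" ]; then result="values match"; else result="MISMATCH"; fi
echo "$result"
```

This exact-string comparison assumes both backends return series in the same order with identical label sets; for real validation, a jq-based per-series comparison is more robust.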
## Multi-Tenancy (Label-Based)
M3DB uses Prometheus-style labels for tenant isolation. Add labels like `tenant`, `service`, `env` to your metrics to differentiate between sources.
**Write metrics with tenant labels:**
```python
# In your Prometheus remote_write client
labels = {
    "tenant": "acme-corp",
    "service": "api-gateway",
    "env": "prod",
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
```
**Query by tenant:**
```bash
# All metrics from a specific tenant (-g stops curl from glob-expanding the {} braces)
curl -g -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\"}"
# Filter by service within tenant
curl -g -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\",service=\"api-gateway\"}"
# Filter by environment
curl -g -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{env=\"prod\"}"
```
**Prometheus configuration with labels:**
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
```
## Tuning for Vultr
- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry. We recommend Vultr nodes with at least 8GB RAM; the manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.
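A rough way to reason about the shards bullet: each node holds `num_shards * replication_factor / nodes` shard replicas, so raising `num_shards` on a fixed 3-node, RF=3 cluster scales per-node shard count linearly:

```shell
# Shard replicas per node = num_shards * replication_factor / nodes
per_node_64=$(( 64 * 3 / 3 ))     # current init job: 64 shards per node
per_node_256=$(( 256 * 3 / 3 ))   # after raising num_shards to 256
echo "$per_node_64 $per_node_256"
```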
## Why Backfill Doesn't Work
**TL;DR: M3DB is not designed for historical data import. Don't try it.**
M3DB is a time-series database optimized for real-time ingestion and sequential writes. Backfilling — writing data with timestamps in the past — fights the fundamental architecture at every turn:
### The Problems
1. **`bufferPast` is a hard gate.** M3DB rejects writes whose timestamps fall outside the `bufferPast` window (default: 10m). To write data from 3 weeks ago, you need `bufferPast=504h` (21 days). This setting is **immutable** on existing namespaces — you have to create entirely new namespaces just for backfill, doubling your operational complexity.
2. **Massive block sizes were required.** To make the backfill namespaces work with `bufferPast=504h`, block sizes had to be enormous (30+ day blocks). This defeated the entire point of M3DB's time-partitioned storage — blocks that large cause extreme memory pressure, slow compaction, and bloated index lookups.
3. **Downsample pipeline ignores historical data.** M3DB's downsample coordinator only processes new writes in real-time. Backfilled data written to `default_backfill` namespaces never gets downsampled into aggregated namespaces, so your long-term retention tiers have gaps.
4. **No transaction boundaries.** Each backfill write is an individual operation. Writing 12M+ samples means 12M+ individual writes with no batching semantics. If one fails, there's no rollback, no retry from a checkpoint — you get partial data with no easy way to detect or fix gaps.
5. **Compaction and flush chaos.** M3DB expects data to flow sequentially through commitlog → flush → compact. Backfill dumps data out of order, causing the background compaction to thrash, consuming CPU and I/O for blocks that may never be queried again.
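The arithmetic behind point 1 can be sketched directly: M3DB accepts a write only if its timestamp falls within `[now - bufferPast, now + bufferFuture]`, so a 3-week-old sample checked against the default 10m window is rejected outright:

```shell
# bufferPast gate: is the sample's timestamp inside the writable window?
now=$(date +%s)
buffer_past=$(( 10 * 60 ))         # default bufferPast: 10m, in seconds
ts=$(( now - 21 * 24 * 3600 ))     # a sample from 3 weeks ago
if [ "$ts" -lt $(( now - buffer_past )) ]; then
  verdict="rejected: timestamp outside bufferPast window"
else
  verdict="accepted"
fi
echo "$verdict"
```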
### What We Tried
- Created `default_backfill`, `agg_10s_backfill`, `agg_1m_backfill` namespaces with `bufferPast=504h`
- Increased block sizes to 24h–30d to accommodate the large bufferPast
- Wrote 12M+ samples from Mimir to M3DB over multiple runs
- Result: Data landed, but the operational cost was catastrophic — huge blocks, no downsampling, and the cluster was unstable
### What To Do Instead
- **Start fresh.** Configure M3DB with sane block sizes (1h) from day one and let it accumulate data naturally via Prometheus remote_write.
- **Accept the gap.** Historical data lives in Mimir (or wherever it was before). Query Mimir for old data, M3DB for new data.
- **Dual-write during migration.** Write to both systems simultaneously until M3DB's retention catches up.
- **If you absolutely need old data in M3DB**, accept that you're doing a one-time migration and build tooling around the constraints — but know that it's a project, not a script.
---
## Useful Commands
```bash
# Get Traefik LoadBalancer IP
kubectl -n traefik get svc traefik
# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health
# Check placement (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq
# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
# Query via PromQL (external with auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"
# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml
# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik
```