# M3DB on Vultr Kubernetes Engine

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with the Vultr Block Storage CSI driver.

## Prerequisites

- **kubectl** — for applying manifests
- **helm** — for installing the Traefik Ingress Controller

```bash
# Install helm (macOS/Linux with Homebrew)
brew install helm
```

## Architecture

```
                       ┌──────────────────────────────────────────────┐
                       │              Vultr VKE Cluster               │
                       │                                              │
External Prometheus ───┼──remote_write──▶ Traefik Ingress (LoadBalancer)
External Grafana ──────┼──PromQL query──▶ TLS termination, basic auth │
                       │                         │                    │
                       │                  ┌──────┴──────┐             │
In-cluster Prometheus ─┼──remote_write──▶ │     M3      │ (Deployment,│
In-cluster Grafana ────┼──PromQL query──▶ │ Coordinator │  2 replicas)│
                       │                  └──────┬──────┘             │
                       │                         │                    │
                       │                    ┌────┴────┐               │
                       │      Vultr Block ──│  M3DB   │ (StatefulSet, │
                       │      Storage       │  Nodes  │  3 replicas,  │
                       │                    └────┬────┘  100Gi NVMe   │
                       │                         │       per node)    │
                       │    etcd cluster (StatefulSet, 3 replicas)    │
                       └──────────────────────────────────────────────┘
```

**External access flow:**

```
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
```

## Retention Tiers

All namespaces use **1h block size** — the sweet spot for M3DB. Smaller blocks mean faster queries, faster flushes, and less memory pressure during compaction. See [Why Backfill Doesn't Work](#why-backfill-doesnt-work) for why larger blocks were a disaster.

| Namespace     | Resolution | Retention | Block Size | Use Case                  |
|---------------|------------|-----------|------------|---------------------------|
| `default`     | raw        | 48h       | 1h         | Real-time queries         |
| `agg_10s_30d` | 10s        | 30 days   | 1h         | Recent dashboards         |
| `agg_1m_1y`   | 1m         | 1 year    | 1h         | Long-term trends/capacity |

## Deployment

```bash
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
  --namespace traefik --create-namespace \
  --set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
  --set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
  --set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
# Note: ACME requires a single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.

# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP

# 3. Apply M3DB manifests
kubectl apply -k .

# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
```

## Bootstrap M3DB Cluster

The init job waits for coordinator health, which requires M3DB to be bootstrapped.
Bootstrap directly via m3dbnode's embedded coordinator:

```bash
# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'

# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" \
  -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" \
  -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" \
  -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'

# Wait for bootstrapping to complete (check shard state = AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
```

## Authentication

External access is protected by HTTP basic auth. Update the password in `08-basic-auth-middleware.yaml`:

```bash
# Generate a new htpasswd entry
htpasswd -nb <username> <password>

# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
```

## Testing

**Quick connectivity test:**

```bash
# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
```

**Full read/write test (Python):**

```bash
pip install requests python-snappy

# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
```

## Prometheus Configuration (Replacing Mimir)

Update your Prometheus config to point at the M3 Coordinator.
**In-cluster (same VKE cluster):**

```yaml
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
```

**External (cross-region/cross-cluster):**

```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
```

## Grafana Datasource

Add a **Prometheus** datasource in Grafana pointing to:

- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `https://m3db.vultrlabs.dev` (with basic auth)

All existing PromQL dashboards will work without modification.

## Migration from Mimir

1. **Dual-write phase**: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
2. **Validation**: Compare query results between Mimir and M3DB for the same time ranges.
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission the Mimir components.

## Multi-Tenancy (Label-Based)

M3DB uses Prometheus-style labels for tenant isolation. Add labels like `tenant`, `service`, `env` to your metrics to differentiate between sources.
**Write metrics with tenant labels:**

```python
# In your Prometheus remote_write client
labels = {
    "tenant": "acme-corp",
    "service": "api-gateway",
    "env": "prod",
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
```

**Query by tenant:**

```bash
# All metrics from a specific tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\"}"

# Filter by service within tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\",service=\"api-gateway\"}"

# Filter by environment
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{env=\"prod\"}"
```

**Prometheus configuration with labels:**

```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
```

## Tuning for Vultr

- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry. Use nodes with at least 8GB RAM on Vultr; the manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.

## Why Backfill Doesn't Work

**TL;DR: M3DB is not designed for historical data import. Don't try it.**

M3DB is a time-series database optimized for real-time ingestion and sequential writes. Backfilling — writing data with timestamps in the past — fights the fundamental architecture at every turn:

### The Problems

1. **`bufferPast` is a hard gate.** M3DB rejects writes whose timestamps fall outside the `bufferPast` window (default: 10m). To write data from 3 weeks ago, you need `bufferPast=504h` (21 days). This setting is **immutable** on existing namespaces — you have to create entirely new namespaces just for backfill, doubling your operational complexity.
2. **Massive block sizes were required.** To make the backfill namespaces work with `bufferPast=504h`, block sizes had to be enormous (30+ day blocks). This defeated the entire point of M3DB's time-partitioned storage — blocks that large cause extreme memory pressure, slow compaction, and bloated index lookups.
3. **The downsample pipeline ignores historical data.** M3DB's downsample coordinator only processes new writes in real time. Backfilled data written to `default_backfill` namespaces never gets downsampled into aggregated namespaces, so your long-term retention tiers have gaps.
4. **No transaction boundaries.** Each backfill write is an individual operation. Writing 12M+ samples means 12M+ individual writes with no batching semantics. If one fails, there's no rollback, no retry from a checkpoint — you get partial data with no easy way to detect or fix gaps.
5. **Compaction and flush chaos.** M3DB expects data to flow sequentially through commitlog → flush → compact. Backfill dumps data out of order, causing background compaction to thrash, consuming CPU and I/O for blocks that may never be queried again.
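To make the `bufferPast` gate concrete, here is a minimal Python sketch of the acceptance check M3DB effectively performs on every write. The function name and structure are illustrative, not M3DB's actual code:

```python
from datetime import datetime, timedelta, timezone

def within_write_window(sample_ts: datetime,
                        now: datetime,
                        buffer_past: timedelta = timedelta(minutes=10),
                        buffer_future: timedelta = timedelta(minutes=10)) -> bool:
    """Return True if a sample timestamp falls inside the accepted
    write window: [now - bufferPast, now + bufferFuture]."""
    return (now - buffer_past) <= sample_ts <= (now + buffer_future)

now = datetime(2024, 1, 21, 12, 0, tzinfo=timezone.utc)

# A fresh sample is accepted with the default 10m window...
print(within_write_window(now - timedelta(minutes=5), now))   # True

# ...but a backfill sample from 3 weeks ago is rejected unless the
# namespace was created with bufferPast >= 504h (immutable after creation).
print(within_write_window(now - timedelta(weeks=3), now))     # False
print(within_write_window(now - timedelta(weeks=3), now,
                          buffer_past=timedelta(hours=504)))  # True
```

This is why backfill forces the enormous `bufferPast` values (and, consequently, the oversized blocks) described above.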
### What We Tried

- Created `default_backfill`, `agg_10s_backfill`, `agg_1m_backfill` namespaces with `bufferPast=504h`
- Increased block sizes to 24h–30d to accommodate the large bufferPast
- Wrote 12M+ samples from Mimir to M3DB over multiple runs
- Result: Data landed, but the operational cost was catastrophic — huge blocks, no downsampling, and an unstable cluster

### What To Do Instead

- **Start fresh.** Configure M3DB with sane block sizes (1h) from day one and let it accumulate data naturally via Prometheus remote_write.
- **Accept the gap.** Historical data lives in Mimir (or wherever it was before). Query Mimir for old data, M3DB for new data.
- **Dual-write during migration.** Write to both systems simultaneously until M3DB's retention catches up.
- **If you absolutely need old data in M3DB**, accept that you're doing a one-time migration and build tooling around the constraints — but know that it's a project, not a script.

---

## Useful Commands

```bash
# Get Traefik LoadBalancer IP
kubectl -n traefik get svc traefik

# Check cluster health (from inside the cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health

# Check placement (from inside the cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq

# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

# Query via PromQL (external with auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"

# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml

# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik
```
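The placement check above can be turned into a pass/fail gate for scripts (e.g. waiting before running the init job's namespace-ready step). This is a minimal sketch that assumes the placement endpoint's JSON shape — instances keyed by ID, each carrying a `shards` list with a `state` string — which you should verify against your M3 version:

```python
import json

def all_shards_available(placement_json: str) -> bool:
    """Return True once every shard on every instance in the placement
    response is AVAILABLE, i.e. initial bootstrapping has finished."""
    placement = json.loads(placement_json)["placement"]
    return all(
        shard["state"] == "AVAILABLE"
        for instance in placement["instances"].values()
        for shard in instance["shards"]
    )

# Example: feed it the output of the placement command above, e.g.
#   kubectl -n m3db exec m3dbnode-0 -- curl -s \
#     http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement
sample = json.dumps({"placement": {"instances": {
    "m3dbnode-0": {"shards": [{"id": 0, "state": "AVAILABLE"}]},
    "m3dbnode-1": {"shards": [{"id": 0, "state": "INITIALIZING"}]},
}}})
print(all_shards_available(sample))  # False: m3dbnode-1 is still bootstrapping
```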