M3DB on Vultr Kubernetes Engine

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.

Prerequisites

  • kubectl — for applying manifests
  • helm — for installing Traefik Ingress Controller
# Install helm (macOS/Linux with Homebrew)
brew install helm

Architecture

                     ┌─────────────────────────────────────────────────────┐
                     │                  Vultr VKE Cluster                  │
                     │                                                     │
External Prometheus ─┼─remote_write──▶ Traefik Ingress (LoadBalancer)      │
External Grafana ────┼─PromQL query──▶  (TLS termination, basic auth)      │
                     │                 │                                   │
In-cluster Prometheus┼─remote_write──▶ │                                   │
In-cluster Grafana ──┼─PromQL query──▶ │                                   │
                     │                 │                                   │
                     │         ┌───────┴───────┐                           │
                     │         │M3 Coordinator │ (Deployment, 2 replicas)  │
                     │         └───────┬───────┘                           │
                     │                 │                                   │
                     │         ┌───────┴───────┐                           │
                     │         │  M3DB Nodes   │ (StatefulSet, 3 replicas) │
                     │         │ Vultr Block   │ (100Gi NVMe per node)     │
                     │         │   Storage     │                           │
                     │         └───────┬───────┘                           │
                     │                 │                                   │
                     │          etcd cluster (StatefulSet, 3 replicas)     │
                     └─────────────────────────────────────────────────────┘

External access flow:

Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201

Retention Tiers

All namespaces use 1h block size — the sweet spot for M3DB. Smaller blocks mean faster queries, faster flushes, and less memory pressure during compaction. See Why Backfill Doesn't Work for why larger blocks were a disaster.

Namespace     Resolution   Retention   Block Size   Use Case
default       raw          48h         1h           Real-time queries
agg_10s_30d   10s          30 days     1h           Recent dashboards
agg_1m_1y     1m           1 year      1h           Long-term trends/capacity
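The arithmetic behind these tiers is simple: with a fixed 1h block size, the number of blocks a namespace retains is just retention divided by block size. A quick sketch (tier durations taken from the table above):

```python
# Blocks kept per namespace ≈ retention / block size (1h blocks throughout).
HOURS_PER_DAY = 24

def blocks_kept(retention_hours: int, block_size_hours: int = 1) -> int:
    """Approximate number of data blocks a namespace retains."""
    return retention_hours // block_size_hours

tiers = {
    "default": 48,                      # 48h raw
    "agg_10s_30d": 30 * HOURS_PER_DAY,  # 720h
    "agg_1m_1y": 365 * HOURS_PER_DAY,   # 8760h
}

for name, hours in tiers.items():
    print(f"{name}: {blocks_kept(hours)} blocks of 1h")
```

Many small blocks is the intended shape; the "Why Backfill Doesn't Work" section below covers what happens when blocks get huge instead.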

Deployment

# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
  --namespace traefik --create-namespace \
  --set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
  --set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
  --set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'

# Note: ACME requires single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.

# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP

# 3. Apply M3DB manifests
kubectl apply -k .

# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w

Bootstrap M3DB Cluster

The init job waits for coordinator health, which in turn requires M3DB to be bootstrapped. Bootstrap the cluster directly via m3dbnode's embedded coordinator:

# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'

# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'

# Wait for bootstrapping to complete (the node health endpoint reports "bootstrapped": true)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

Authentication

External access is protected by HTTP basic auth. Update the password in 08-basic-auth-middleware.yaml:

# Generate new htpasswd entry
htpasswd -nb <username> <password>

# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
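For orientation, here is a minimal sketch of what a Traefik basic-auth Middleware plus Secret pair can look like. The resource names and apiVersion are assumptions, not a copy of this repo's file — check the actual 08-basic-auth-middleware.yaml:

```yaml
# Sketch only — names and apiVersion are assumptions; see the real
# 08-basic-auth-middleware.yaml in this repo for the authoritative version.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: m3db-basic-auth
  namespace: m3db
spec:
  basicAuth:
    secret: m3db-basic-auth
---
apiVersion: v1
kind: Secret
metadata:
  name: m3db-basic-auth
  namespace: m3db
stringData:
  users: |
    <paste the htpasswd -nb output here>
```

Traefik reads htpasswd-format entries from the referenced Secret's users field, which is why regenerating a password is just htpasswd plus kubectl apply.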

Testing

Quick connectivity test:

# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201

Full read/write test (Python):

pip install requests python-snappy

# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201

Prometheus Configuration (Replacing Mimir)

Update your Prometheus config to point at M3 Coordinator.

In-cluster (same VKE cluster):

# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true

External (cross-region/cross-cluster):

# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true

Grafana Datasource

Add a Prometheus datasource in Grafana pointing to:

  • In-cluster: http://m3coordinator.m3db.svc.cluster.local:7201
  • External: https://m3db.vultrlabs.dev (with basic auth)

All existing PromQL dashboards will work without modification.
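If you provision Grafana datasources from files rather than the UI, a minimal sketch (file path and datasource name are arbitrary choices, not from this repo):

```yaml
# /etc/grafana/provisioning/datasources/m3db.yaml (sketch)
apiVersion: 1
datasources:
  - name: M3DB
    type: prometheus
    access: proxy
    url: http://m3coordinator.m3db.svc.cluster.local:7201
```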

Migration from Mimir

  1. Dual-write phase: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
  2. Validation: Compare query results between Mimir and M3DB for the same time ranges.
  3. Cutover: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
  4. Cleanup: Decommission Mimir components.
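The dual-write phase in step 1 is just two remote_write targets in the same Prometheus config. The Mimir URL below is a placeholder for your existing endpoint:

```yaml
# prometheus.yml — dual-write sketch; the Mimir URL is a placeholder.
remote_write:
  - url: "http://your-mimir-endpoint/api/v1/push"   # existing Mimir target
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
```

Once M3DB's retention window covers everything you query, drop the first target (step 3).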

Multi-Tenancy (Label-Based)

M3DB uses Prometheus-style labels for tenant isolation. Add labels like tenant, service, env to your metrics to differentiate between sources.

Write metrics with tenant labels:

# In your Prometheus remote_write client
labels = {
    "tenant": "acme-corp",
    "service": "api-gateway", 
    "env": "prod"
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}

Query by tenant:

# All metrics from a specific tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\"}"

# Filter by service within tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\",service=\"api-gateway\"}"

# Filter by environment
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{env=\"prod\"}"

Prometheus configuration with labels:

# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod

Tuning for Vultr

  • Storage: The vultr-block-storage-m3db StorageClass uses disk_type: nvme (NVMe SSD). Adjust storage in the VolumeClaimTemplates based on your cardinality and retention.
  • Node sizing: M3DB is memory-hungry. We recommend Vultr nodes with at least 8 GB of RAM; the manifest requests 4Gi per m3dbnode pod.
  • Shards: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
  • Volume expansion: The StorageClass has allowVolumeExpansion: true — you can resize PVCs online via kubectl edit pvc.
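As a back-of-the-envelope check on the shard setting: total shard replicas are num_shards × replication_factor, spread across the nodes. A quick sketch:

```python
# Back-of-the-envelope shard math for the placement created by the init job.
def shards_per_node(num_shards: int, replication_factor: int, nodes: int) -> float:
    """Average number of shard replicas each m3dbnode owns."""
    return num_shards * replication_factor / nodes

# Current setup: 64 shards, RF=3, 3 nodes -> 64 shard replicas per node.
print(shards_per_node(64, 3, 3))   # 64.0
# Doubling to 128 shards doubles the per-node shard count at the same node count.
print(shards_per_node(128, 3, 3))  # 128.0
```

More shards means finer-grained data distribution (useful at high cardinality) at the cost of more per-shard overhead on each node.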

Why Backfill Doesn't Work

TL;DR: M3DB is not designed for historical data import. Don't try it.

M3DB is a time-series database optimized for real-time ingestion and sequential writes. Backfilling — writing data with timestamps in the past — fights the fundamental architecture at every turn:

The Problems

  1. bufferPast is a hard gate. M3DB rejects writes whose timestamps fall outside the bufferPast window (default: 10m). To write data from 3 weeks ago, you need bufferPast=504h (21 days). This setting is immutable on existing namespaces — you have to create entirely new namespaces just for backfill, doubling your operational complexity.

  2. Massive block sizes were required. To make the backfill namespaces work with bufferPast=504h, block sizes had to be enormous (30+ day blocks). This defeated the entire point of M3DB's time-partitioned storage — blocks that large cause extreme memory pressure, slow compaction, and bloated index lookups.

  3. Downsample pipeline ignores historical data. M3DB's downsample coordinator only processes new writes in real-time. Backfilled data written to default_backfill namespaces never gets downsampled into aggregated namespaces, so your long-term retention tiers have gaps.

  4. No transaction boundaries. Each backfill write is an individual operation. Writing 12M+ samples means 12M+ individual writes with no batching semantics. If one fails, there's no rollback, no retry from a checkpoint — you get partial data with no easy way to detect or fix gaps.

  5. Compaction and flush chaos. M3DB expects data to flow sequentially through commitlog → flush → compact. Backfill dumps data out of order, causing the background compaction to thrash, consuming CPU and I/O for blocks that may never be queried again.
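Problem 1, the bufferPast gate, can be illustrated with a toy acceptance check. This is a sketch of the write-acceptance window, not M3DB's actual code:

```python
from datetime import datetime, timedelta, timezone

# Toy model of the write-acceptance window (NOT M3DB's real implementation):
# a sample is accepted only if now - bufferPast <= timestamp <= now + bufferFuture.
def write_accepted(ts: datetime, now: datetime,
                   buffer_past: timedelta = timedelta(minutes=10),
                   buffer_future: timedelta = timedelta(minutes=10)) -> bool:
    return now - buffer_past <= ts <= now + buffer_future

now = datetime(2026, 4, 9, 19, 0, tzinfo=timezone.utc)
print(write_accepted(now - timedelta(minutes=5), now))    # True: inside the window
print(write_accepted(now - timedelta(days=21), now))      # False: 3 weeks old
# Accepting 3-week-old samples requires bufferPast >= 504h (21 days):
print(write_accepted(now - timedelta(days=21), now,
                     buffer_past=timedelta(hours=504)))   # True
```

And since bufferPast is immutable on an existing namespace, that 504h setting can only live on a new namespace — which is exactly the trap described above.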

What We Tried

  • Created default_backfill, agg_10s_backfill, agg_1m_backfill namespaces with bufferPast=504h
  • Increased block sizes to 24h-30d to accommodate the large bufferPast
  • Wrote 12M+ samples from Mimir to M3DB over multiple runs
  • Result: Data landed, but the operational cost was catastrophic — huge blocks, no downsampling, and the cluster was unstable

What To Do Instead

  • Start fresh. Configure M3DB with sane block sizes (1h) from day one and let it accumulate data naturally via Prometheus remote_write.
  • Accept the gap. Historical data lives in Mimir (or wherever it was before). Query Mimir for old data, M3DB for new data.
  • Dual-write during migration. Write to both systems simultaneously until M3DB's retention catches up.
  • If you absolutely need old data in M3DB, accept that you're doing a one-time migration and build tooling around the constraints — but know that it's a project, not a script.

Useful Commands

# Get Traefik LoadBalancer IP
kubectl -n traefik get svc traefik

# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health

# Check placement (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq

# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

# Query via PromQL (external with auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"

# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml

# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik