M3DB on Vultr Kubernetes Engine
Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.
Prerequisites
- kubectl — for applying manifests
- helm — for installing Traefik Ingress Controller
# Install helm (macOS/Linux with Homebrew)
brew install helm
Architecture
                                        ┌─ Vultr VKE Cluster ──────────────────────────┐
External Prometheus  ──remote_write──▶  │  Traefik Ingress (LoadBalancer)              │
External Grafana     ──PromQL query──▶  │  TLS termination, basic auth                 │
                                        │                 │                            │
In-cluster Prometheus ──remote_write──▶ │  M3 Coordinator (Deployment, 2 replicas)     │
In-cluster Grafana    ──PromQL query──▶ │                 │                            │
                                        │  M3DB Nodes (StatefulSet, 3 replicas)        │
                                        │  Vultr Block Storage (100Gi NVMe per node)   │
                                        │                 │                            │
                                        │  etcd cluster (StatefulSet, 3 replicas)      │
                                        └──────────────────────────────────────────────┘
External access flow:
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
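For ad-hoc access that bypasses Traefik entirely, you can also port-forward the coordinator Service and talk to it on localhost. A minimal sketch, assuming the Service is named `m3coordinator` (matching the in-cluster DNS name above):
# Forward the coordinator port to your workstation
kubectl -n m3db port-forward svc/m3coordinator 7201:7201
# In another shell: the coordinator health endpoint should return OK
curl -s http://localhost:7201/health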
Retention Tiers
All namespaces use 1h block size — the sweet spot for M3DB. Smaller blocks mean faster queries, faster flushes, and less memory pressure during compaction. See Why Backfill Doesn't Work for why larger blocks were a disaster.
| Namespace     | Resolution | Retention | Block Size | Use Case                   |
|---------------|------------|-----------|------------|----------------------------|
| `default`     | raw        | 48h       | 1h         | Real-time queries          |
| `agg_10s_30d` | 10s        | 30 days   | 1h         | Recent dashboards          |
| `agg_1m_1y`   | 1m         | 1 year    | 1h         | Long-term trends/capacity  |
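Once the cluster is bootstrapped (see below), you can confirm the live namespaces match this table via the coordinator's namespace API; all three should report a 1h block size. `jq` on your workstation is assumed, purely for pretty-printing:
# List configured namespaces and their retention/block-size options
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:7201/api/v1/services/m3db/namespace | jq .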
Deployment
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
--namespace traefik --create-namespace \
--set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
--set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
--set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
# Note: ACME requires single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.
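# (Optional) Confirm the chart rolled out before continuing; this assumes the
# default release/Deployment name "traefik"
kubectl -n traefik rollout status deployment/traefik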
# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP
# 3. Apply M3DB manifests
kubectl apply -k .
# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
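If you'd rather block until everything is ready than watch the pod list, `kubectl wait` does the same job; a small sketch (adjust the timeout to your cluster):
# Block until all pods in the m3db namespace report Ready
kubectl -n m3db wait --for=condition=Ready pod --all --timeout=600s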
Bootstrap M3DB Cluster
The init job waits for the coordinator's health endpoint, which only reports healthy once M3DB is bootstrapped. Bootstrap the cluster directly via m3dbnode's embedded coordinator:
# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'
# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"}}}'
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"1h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"1h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'
# Wait for bootstrapping to complete (each node's health endpoint reports bootstrapped,
# and shards in the placement move to AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
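To block until all three nodes report bootstrapped, a small polling loop works. This is a sketch; it assumes the health response is JSON containing a `bootstrapped` field:
for i in 0 1 2; do
  until kubectl -n m3db exec m3dbnode-$i -- curl -s http://localhost:9002/health | grep -q '"bootstrapped": *true'; do
    echo "m3dbnode-$i still bootstrapping..."; sleep 10
  done
done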
Authentication
External access is protected by HTTP basic auth. Update the password in 08-basic-auth-middleware.yaml:
# Generate new htpasswd entry
htpasswd -nb <username> <password>
# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
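A quick way to confirm the new credentials took effect (placeholders throughout; substitute your own domain and user):
# Unauthenticated request should be rejected by Traefik
curl -s -o /dev/null -w '%{http_code}\n' https://m3db.vultrlabs.dev/health   # expect 401
# Authenticated request should reach the coordinator
curl -s -o /dev/null -w '%{http_code}\n' -u <username>:<password> https://m3db.vultrlabs.dev/health   # expect 200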
Testing
Quick connectivity test:
# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example
# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
Full read/write test (Python):
pip install requests python-snappy
# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example
# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
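Beyond the bundled scripts, the coordinator also serves the standard Prometheus HTTP query API, so you can spot-check data directly. A sketch of a range query over the last hour:
END=$(date +%s); START=$((END - 3600))
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query_range?query=up&start=${START}&end=${END}&step=60"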
Prometheus Configuration (Replacing Mimir)
Update your Prometheus config to point at M3 Coordinator.
In-cluster (same VKE cluster):
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
External (cross-region/cross-cluster):
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
Grafana Datasource
Add a Prometheus datasource in Grafana pointing to:
- In-cluster: `http://m3coordinator.m3db.svc.cluster.local:7201`
- External: `https://m3db.vultrlabs.dev` (with basic auth)
All existing PromQL dashboards will work without modification.
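If you provision Grafana datasources from files instead of the UI, here is a minimal sketch in the standard Grafana datasource provisioning format; the datasource name and credentials are placeholders:
# e.g. /etc/grafana/provisioning/datasources/m3db.yaml
apiVersion: 1
datasources:
  - name: M3DB
    type: prometheus
    access: proxy
    url: https://m3db.vultrlabs.dev
    basicAuth: true
    basicAuthUser: example
    secureJsonData:
      basicAuthPassword: example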
Migration from Mimir
- Dual-write phase: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously (see the sketch after this list).
- Validation: Compare query results between Mimir and M3DB for the same time ranges.
- Cutover: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
- Cleanup: Decommission Mimir components.
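A sketch of the dual-write phase (step 1): a single Prometheus shipping to both backends. The Mimir URL is a placeholder for whatever distributor or gateway endpoint you use today:
# prometheus.yml (dual-write)
remote_write:
  # existing Mimir target (placeholder URL; keep whatever you use today)
  - url: "http://mimir-gateway.mimir.svc.cluster.local/api/v1/push"
  # new M3DB target
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example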
Multi-Tenancy (Label-Based)
M3DB uses Prometheus-style labels for tenant isolation. Add labels like tenant, service, env to your metrics to differentiate between sources.
Write metrics with tenant labels:
# In your Prometheus remote_write client
labels = {
    "tenant": "acme-corp",
    "service": "api-gateway",
    "env": "prod"
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
Query by tenant:
# All metrics from a specific tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\"}"
# Filter by service within tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\",service=\"api-gateway\"}"
# Filter by environment
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{env=\"prod\"}"
Prometheus configuration with labels:
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
Tuning for Vultr
- Storage: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the volumeClaimTemplates based on your cardinality and retention.
- Node sizing: M3DB is memory-hungry. Use nodes with at least 8GB RAM on Vultr; the manifest requests 4Gi per m3dbnode pod.
- Shards: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- Volume expansion: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc` (see the sketch after this list).
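A sketch of an online resize using `kubectl patch` instead of an interactive edit; PVC names depend on the StatefulSet's volumeClaimTemplate, so list them first and substitute the real name:
# List the data PVCs created by the StatefulSet
kubectl -n m3db get pvc
# Bump the requested size on one of them (name is a placeholder)
kubectl -n m3db patch pvc <pvc-name> \
  --type merge -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'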
Why Backfill Doesn't Work
TL;DR: M3DB is not designed for historical data import. Don't try it.
M3DB is a time-series database optimized for real-time ingestion and sequential writes. Backfilling — writing data with timestamps in the past — fights the fundamental architecture at every turn:
The Problems
- `bufferPast` is a hard gate. M3DB rejects writes whose timestamps fall outside the `bufferPast` window (default: 10m). To write data from 3 weeks ago, you need `bufferPast=504h` (21 days). This setting is immutable on existing namespaces — you have to create entirely new namespaces just for backfill, doubling your operational complexity.
- Massive block sizes were required. To make the backfill namespaces work with `bufferPast=504h`, block sizes had to be enormous (30+ day blocks). This defeated the entire point of M3DB's time-partitioned storage — blocks that large cause extreme memory pressure, slow compaction, and bloated index lookups.
- Downsample pipeline ignores historical data. M3DB's downsample coordinator only processes new writes in real-time. Backfilled data written to `default_backfill` namespaces never gets downsampled into aggregated namespaces, so your long-term retention tiers have gaps.
- No transaction boundaries. Each backfill write is an individual operation. Writing 12M+ samples means 12M+ individual writes with no batching semantics. If one fails, there's no rollback and no retry from a checkpoint — you get partial data with no easy way to detect or fix gaps.
- Compaction and flush chaos. M3DB expects data to flow sequentially through commitlog → flush → compact. Backfill dumps data out of order, causing the background compaction to thrash, consuming CPU and I/O for blocks that may never be queried again.
What We Tried
- Created `default_backfill`, `agg_10s_backfill`, and `agg_1m_backfill` namespaces with `bufferPast=504h`
- Increased block sizes to 24h–30d to accommodate the large `bufferPast`
- Wrote 12M+ samples from Mimir to M3DB over multiple runs
- Result: Data landed, but the operational cost was catastrophic — huge blocks, no downsampling, and the cluster was unstable
What To Do Instead
- Start fresh. Configure M3DB with sane block sizes (1h) from day one and let it accumulate data naturally via Prometheus remote_write.
- Accept the gap. Historical data lives in Mimir (or wherever it was before). Query Mimir for old data, M3DB for new data.
- Dual-write during migration. Write to both systems simultaneously until M3DB's retention catches up.
- If you absolutely need old data in M3DB, accept that you're doing a one-time migration and build tooling around the constraints — but know that it's a project, not a script.
Useful Commands
# Get Traefik LoadBalancer IP
kubectl -n traefik get svc traefik
# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health
# Check placement (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq
# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
# Query via PromQL (external with auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"
# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml
# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik