
M3DB on Vultr Kubernetes Engine

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.

Architecture

Prometheus ──remote_write──▶ M3 Coordinator (Deployment, 2 replicas)
Grafana   ──PromQL query──▶       │
                                  │
                          ┌───────┴───────┐
                          │   M3DB Nodes  │  (StatefulSet, 3 replicas)
                          │  Vultr Block  │  (100Gi SSD per node)
                          │   Storage     │
                          └───────┬───────┘
                                  │
                            etcd cluster   (StatefulSet, 3 replicas)

Retention Tiers

Namespace     Resolution   Retention   Use Case
default       raw          48h         Real-time queries
agg_10s_30d   10s          30 days     Recent dashboards
agg_1m_1y     1m           1 year      Long-term trends/capacity
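The cluster-init job creates these namespaces through the coordinator API. As an illustrative sketch of what that involves, an aggregated namespace is added with a payload along these lines, POSTed to `/api/v1/services/m3db/namespace` — field and option names follow the M3 namespace API but can vary between M3 versions, so treat the exact keys as assumptions and check them against your M3 release:

```json
{
  "name": "agg_10s_30d",
  "options": {
    "bootstrapEnabled": true,
    "flushEnabled": true,
    "snapshotEnabled": true,
    "repairEnabled": false,
    "retentionOptions": {
      "retentionPeriodDuration": "720h",
      "blockSizeDuration": "4h"
    },
    "aggregationOptions": {
      "aggregations": [
        { "aggregated": true, "attributes": { "resolutionDuration": "10s" } }
      ]
    }
  }
}
```

The init job issues one such request per tier in the table above; the `default` namespace omits the aggregation options.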

Deployment

# 1. Apply everything (the init job will keep failing until the pods are up)
kubectl apply -k .

# 2. Wait for all pods to be Ready
kubectl -n m3db get pods -w

# 3. Once all m3dbnode and m3coordinator pods are Running, the init job
#    will bootstrap the cluster (placement + namespaces).
#    Monitor it:
kubectl -n m3db logs -f job/m3db-cluster-init

# 4. Verify cluster health
kubectl -n m3db port-forward svc/m3coordinator 7201:7201
curl http://localhost:7201/api/v1/services/m3db/placement
curl http://localhost:7201/api/v1/services/m3db/namespace

Prometheus Configuration (Replacing Mimir)

Update your Prometheus config to point at M3 Coordinator instead of Mimir:

# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true

Grafana Datasource

Add a Prometheus datasource in Grafana pointing to:

http://m3coordinator.m3db.svc.cluster.local:7201

All existing PromQL dashboards will work without modification.
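If you provision Grafana declaratively, the same datasource can be captured in a provisioning file instead of being added by hand. This uses Grafana's standard datasource provisioning format; the file path and datasource name are arbitrary choices:

```yaml
# grafana/provisioning/datasources/m3.yaml
apiVersion: 1
datasources:
  - name: M3DB            # display name — pick anything
    type: prometheus      # M3 Coordinator speaks the Prometheus query API
    access: proxy
    url: http://m3coordinator.m3db.svc.cluster.local:7201
    isDefault: true
```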

Migration from Mimir

  1. Dual-write phase: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
  2. Validation: Compare query results between Mimir and M3DB for the same time ranges.
  3. Cutover: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
  4. Cleanup: Decommission Mimir components.
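For the dual-write phase, the Prometheus `remote_write` list simply carries both targets. The Mimir URL below is a placeholder — substitute whatever endpoint your Prometheus currently pushes to:

```yaml
# prometheus.yml — dual-write during migration
remote_write:
  # existing Mimir target (placeholder URL — keep your current one)
  - url: "http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push"
  # new M3DB target
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
```

Once query results match and M3DB's retention covers your lookback window, drop the first entry (step 3 above).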

Tuning for Vultr

  • Storage: The vultr-block-storage-m3db StorageClass uses high_perf (NVMe SSD). Adjust storage in the VolumeClaimTemplates based on your cardinality and retention.
  • Node sizing: M3DB is memory-hungry; use nodes with at least 8 GB RAM on Vultr. The manifests request 4Gi per m3dbnode pod.
  • Shards: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
  • Volume expansion: The StorageClass has allowVolumeExpansion: true — you can resize PVCs online via kubectl edit pvc.
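Changing the shard count means re-initialising the placement. As a hedged sketch, the request the init job sends to `/api/v1/services/m3db/placement/init` looks roughly like this — the instance ID, endpoint, and isolation group shown are illustrative and must match your StatefulSet pod names and the failure-domain spread you want (one entry per m3dbnode replica):

```json
{
  "num_shards": 64,
  "replication_factor": 3,
  "instances": [
    {
      "id": "m3dbnode-0",
      "isolation_group": "group-a",
      "zone": "embedded",
      "weight": 100,
      "endpoint": "m3dbnode-0.m3dbnode:9000",
      "hostname": "m3dbnode-0.m3dbnode",
      "port": 9000
    }
  ]
}
```

Shards are spread across isolation groups so that no two replicas of a shard land in the same group; with 3 replicas you want at least 3 groups.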

Useful Commands

# Check placement
curl http://localhost:7201/api/v1/services/m3db/placement | jq

# Check namespace readiness
curl http://localhost:7201/api/v1/services/m3db/namespace/ready \
  -d '{"name":"default"}'

# Write a test metric — the remote-write endpoint expects snappy-compressed
# protobuf, so plain curl can't use it; use the coordinator's JSON write
# endpoint instead
curl -X POST http://localhost:7201/api/v1/json/write -d '{
  "tags": {"__name__": "test_metric", "source": "curl"},
  "timestamp": "'"$(date +%s)"'",
  "value": 42.0
}'

# Query via PromQL
curl "http://localhost:7201/api/v1/query?query=up"

# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml