# M3DB on Vultr Kubernetes Engine

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.

## Prerequisites

- **kubectl** — for applying manifests
- **helm** — for installing Traefik Ingress Controller

```bash
# Install helm (macOS/Linux with Homebrew)
brew install helm
```

## Architecture

```
                                      [ Vultr VKE Cluster ]
External Prometheus ──remote_write──▶ ┌─────────────────────────────────────┐
External Grafana ────PromQL query───▶ │ Traefik Ingress (LoadBalancer)      │
                                      │ TLS termination, basic auth         │
                                      └──────────────────┬──────────────────┘
                                                         │
In-cluster Prometheus ─remote_write─▶ ┌──────────────────┴──────────────────┐
In-cluster Grafana ───PromQL query──▶ │ M3 Coordinator                      │
                                      │ (Deployment, 2 replicas)            │
                                      └──────────────────┬──────────────────┘
                                                         │
                                      ┌──────────────────┴──────────────────┐
                                      │ M3DB Nodes                          │
                                      │ (StatefulSet, 3 replicas)           │
                                      │ Vultr Block Storage, 100Gi NVMe/node│
                                      └──────────────────┬──────────────────┘
                                                         │
                                      ┌──────────────────┴──────────────────┐
                                      │ etcd cluster                        │
                                      │ (StatefulSet, 3 replicas)           │
                                      └─────────────────────────────────────┘
```

**External access flow:**

```
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
```

## Retention Tiers

| Namespace      | Resolution | Retention | Use Case                  |
|----------------|------------|-----------|---------------------------|
| `default`      | raw        | 48h       | Real-time queries         |
| `agg_10s_30d`  | 10s        | 30 days   | Recent dashboards         |
| `agg_1m_1y`    | 1m         | 1 year    | Long-term trends/capacity |

## Deployment

```bash
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
  --namespace traefik --create-namespace \
  --set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
  --set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
  --set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
# Note: ACME requires a single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.

# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP

# 3. Apply M3DB manifests
kubectl apply -k .

# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
```

## Bootstrap M3DB Cluster

The init job waits for coordinator health, which in turn requires M3DB to be bootstrapped.
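You can watch that health gate yourself with a small poll loop. A minimal sketch, assuming you have port-forwarded the coordinator locally first (e.g. `kubectl -n m3db port-forward svc/m3coordinator 7201`, an invocation that is not part of the manifests):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout_s: float = 300, interval_s: float = 5) -> bool:
    """Poll `url` until it answers HTTP 200, or give up after `timeout_s`."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet (connection refused, non-2xx, timeout, ...)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Blocks until the coordinator reports healthy:
# wait_for_health("http://localhost:7201/health")
```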
Bootstrap directly via m3dbnode's embedded coordinator:

```bash
# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'

# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" \
  -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"2h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"2h"}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" \
  -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"12h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"12h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" \
  -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"24h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"24h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'

# Wait for bootstrapping to complete (check shard state = AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
```

## Authentication

External access is protected by HTTP basic auth. Update the password in `08-basic-auth-middleware.yaml`:

```bash
# Generate a new htpasswd entry (substitute your own values)
htpasswd -nb <username> '<password>'

# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
```

## Testing

**Quick connectivity test:**

```bash
# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
```

**Full read/write test (Python):**

```bash
pip install requests python-snappy

# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
```

## Prometheus Configuration (Replacing Mimir)

Update your Prometheus config to point at M3 Coordinator.
**In-cluster (same VKE cluster):**

```yaml
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
```

**External (cross-region/cross-cluster):**

```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
```

## Grafana Datasource

Add a **Prometheus** datasource in Grafana pointing to:

- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `https://m3db.vultrlabs.dev` (with basic auth)

All existing PromQL dashboards will work without modification.

## Migration from Mimir

1. **Dual-write phase**: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
2. **Validation**: Compare query results between Mimir and M3DB for the same time ranges.
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission Mimir components.

## Multi-Tenancy (Label-Based)

M3DB uses Prometheus-style labels for tenant isolation. Add labels like `tenant`, `service`, `env` to your metrics to differentiate between sources.
**Write metrics with tenant labels:**

```python
# In your Prometheus remote_write client
labels = {
    "tenant": "acme-corp",
    "service": "api-gateway",
    "env": "prod"
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
```

**Query by tenant:**

```bash
# All metrics from a specific tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\"}"

# Filter by service within tenant
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{tenant=\"acme-corp\",service=\"api-gateway\"}"

# Filter by environment
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=http_requests_total{env=\"prod\"}"
```

**Prometheus configuration with labels:**

```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
```

## Tuning for Vultr

- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry; use nodes with at least 8GB RAM on Vultr. The manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.
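The storage advice above can be ballparked numerically. The sketch below is back-of-envelope only: the 2 bytes per compressed sample and the series counts are illustrative assumptions, not measured M3DB figures.

```python
# Rough disk-capacity estimator for sizing VolumeClaimTemplates.
# All constants are illustrative assumptions, not measured M3DB figures.

def estimate_disk_gib(active_series: int,
                      scrape_interval_s: float,
                      retention_hours: float,
                      bytes_per_sample: float = 2.0,
                      replication_factor: int = 3) -> float:
    """Total on-disk footprint across all replicas, in GiB."""
    samples_per_series = retention_hours * 3600 / scrape_interval_s
    total_bytes = active_series * samples_per_series * bytes_per_sample * replication_factor
    return total_bytes / 2**30

# 1M active series at 10s resolution for 30 days (the agg_10s_30d tier):
cluster_gib = estimate_disk_gib(1_000_000, 10, 720)
per_node_gib = cluster_gib / 3  # spread over the 3 m3dbnode replicas
# Under these assumptions: roughly 1.4 TiB cluster-wide, ~480 GiB per node,
# i.e. well above a 100Gi PVC -- hence the volume-expansion advice above.
```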
## Useful Commands

```bash
# Get Traefik LoadBalancer IP
kubectl -n traefik get svc traefik

# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health

# Check placement (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq

# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

# Query via PromQL (external with auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"

# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml

# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik
```