# M3DB on Vultr Kubernetes Engine
Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.
## Architecture
```
Prometheus ──remote_write──▶ ┌───────────────────┐ ◀──PromQL query── Grafana
                             │  M3 Coordinator   │
                             │ (Deployment, x2)  │
                             └─────────┬─────────┘
                                       │
                             ┌─────────┴─────────┐
                             │    M3DB Nodes     │  (StatefulSet, 3 replicas)
                             │   Vultr Block     │  (100Gi SSD per node)
                             │     Storage       │
                             └─────────┬─────────┘
                                       │
                              etcd cluster (StatefulSet, 3 replicas)
```
## Retention Tiers
| Namespace | Resolution | Retention | Use Case |
|----------------|-----------|-----------|---------------------------|
| `default` | raw | 48h | Real-time queries |
| `agg_10s_30d` | 10s | 30 days | Recent dashboards |
| `agg_1m_1y` | 1m | 1 year | Long-term trends/capacity |
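To get a feel for what each tier costs per series, here is a back-of-the-envelope count of datapoints stored per time series, assuming a 15s scrape interval for the raw tier (the interval is an assumption, not something the manifests pin down):

```shell
# Approximate datapoints stored per time series in each tier.
raw=$((48 * 3600 / 15))        # default: 48h of raw samples at a 15s scrape
mid=$((30 * 24 * 3600 / 10))   # agg_10s_30d: 30 days at 10s resolution
long=$((365 * 24 * 3600 / 60)) # agg_1m_1y: 1 year at 1m resolution
echo "raw=$raw mid=$mid long=$long"
```

Note that the 1-year tier holds roughly 45x more points per series than the raw tier, which is why it is downsampled to 1m.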
## Deployment
```bash
# 1. Apply all manifests (the init job will not succeed until the pods are up)
kubectl apply -k .
# 2. Wait for all pods to be Ready
kubectl -n m3db get pods -w
# 3. Once all m3dbnode and m3coordinator pods are Running, the init job
# will bootstrap the cluster (placement + namespaces).
# Monitor it:
kubectl -n m3db logs -f job/m3db-cluster-init
# 4. Verify cluster health
kubectl -n m3db port-forward svc/m3coordinator 7201:7201
curl http://localhost:7201/api/v1/services/m3db/placement
curl http://localhost:7201/api/v1/services/m3db/namespace
```
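A cluster is healthy when every instance in the placement reports all of its shards as `AVAILABLE`. The `jq` filter below checks that; the inlined JSON is a trimmed, hypothetical placement response used here so the example is self-contained — in practice, pipe the `curl .../placement` output into the same filter:

```shell
# Hypothetical, trimmed placement response (real responses carry more fields).
placement='{
  "placement": {
    "instances": {
      "m3dbnode-0": {"shards": [{"id": 0, "state": "AVAILABLE"}]},
      "m3dbnode-1": {"shards": [{"id": 0, "state": "AVAILABLE"}]}
    }
  }
}'
# Prints "true" only if every shard on every instance is AVAILABLE.
echo "$placement" | jq '[.placement.instances[].shards[].state]
                        | all(. == "AVAILABLE")'
```

Shards sit in `INITIALIZING` while a node bootstraps, so expect `false` for a while after the init job runs.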
## Prometheus Configuration (Replacing Mimir)
Update your Prometheus config to point at M3 Coordinator instead of Mimir:
```yaml
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
```
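The queue settings above put a ceiling on remote-write throughput. A quick sanity check, assuming the worst case where every shard sends a full batch once per deadline:

```shell
max_shards=30
max_samples_per_send=5000
batch_send_deadline=5  # seconds

# Theoretical ceiling in samples/sec; size these knobs so the result
# comfortably exceeds your ingest rate.
echo $(( max_shards * max_samples_per_send / batch_send_deadline ))  # 30000
```

If Prometheus logs dropped or retried remote-write samples, raise `max_shards` or `max_samples_per_send` until this ceiling clears your ingest rate.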
## Grafana Datasource
Add a **Prometheus** datasource in Grafana pointing to:
```
http://m3coordinator.m3db.svc.cluster.local:7201
```
All existing PromQL dashboards will work without modification.
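If you provision Grafana from files, the datasource can be declared like this — a sketch using Grafana's file-based provisioning format; the file path and datasource name are illustrative:

```yaml
# grafana/provisioning/datasources/m3.yaml (illustrative path)
apiVersion: 1
datasources:
  - name: M3 (Prometheus)
    type: prometheus
    access: proxy
    url: http://m3coordinator.m3db.svc.cluster.local:7201
    isDefault: true
```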
## Migration from Mimir
1. **Dual-write phase**: Configure Prometheus to remote_write to both Mimir and M3DB simultaneously.
2. **Validation**: Compare query results between Mimir and M3DB for the same time ranges.
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission Mimir components.
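During the dual-write phase (step 1), Prometheus simply lists both targets; the Mimir URL below is illustrative — substitute your actual Mimir push endpoint:

```yaml
# prometheus.yml — dual-write phase
remote_write:
  - url: "http://mimir-nginx.mimir.svc.cluster.local/api/v1/push"   # existing Mimir target (illustrative)
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"
```

At cutover (step 3), delete the first entry and reload Prometheus.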
## Tuning for Vultr
- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `high_perf` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry. Recommend at least 8GB RAM nodes on Vultr. The manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.
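On shard counts: assuming a replication factor of 3 (a common choice, not stated in the manifests), every one of the 3 nodes holds a replica of all 64 shards; per-node shard ownership only drops as you add nodes:

```shell
shards=64; rf=3  # rf=3 is an assumption
for nodes in 3 6; do
  # Shard replicas each node is responsible for at this cluster size.
  echo "nodes=$nodes shard_replicas_per_node=$(( shards * rf / nodes ))"
done
```

This is why increasing the shard count helps at higher cardinality: more, smaller shards rebalance more evenly when nodes are added.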
## Useful Commands
```bash
# Check placement
curl http://localhost:7201/api/v1/services/m3db/placement | jq
# Check namespace readiness
curl http://localhost:7201/api/v1/services/m3db/namespace/ready \
-d '{"name":"default"}'
# Write a test metric — note the endpoint expects a snappy-compressed
# Prometheus remote-write protobuf payload, so a bare curl POST will not
# work; use a remote-write client (or Prometheus itself) to send data:
#   curl -X POST http://localhost:7201/api/v1/prom/remote/write \
#     -H "Content-Type: application/x-protobuf" \
#     -H "Content-Encoding: snappy" \
#     --data-binary @payload.pb
# Query via PromQL
curl "http://localhost:7201/api/v1/query?query=up"
# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml
```