Fix m3dbnode port conflict, update README, fix test script

- Remove duplicate db.metrics section (port 7203 conflict)
- Fix coordinator health endpoint (/health not /api/v1/services/m3db/health)
- Update README: remove NodePort references, always use LoadBalancer
- Add bootstrap instructions (workaround for init job chicken-and-egg)
- Fix test-metrics.sh: correct health endpoint and JSON parsing
2026-03-31 15:49:59 +00:00
parent ac13c30905
commit a8469f79d7
10 changed files with 488 additions and 79 deletions

139
README.md

@@ -5,16 +5,23 @@ Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, d
## Architecture
```
Prometheus ──remote_write──▶ M3 Coordinator (Deployment, 2 replicas)
Grafana    ──PromQL query──▶        │
                            ┌───────┴───────┐
                            │  M3DB Nodes   │ (StatefulSet, 3 replicas)
                            │  Vultr Block  │ (100Gi SSD per node)
                            │  Storage      │
                            └───────┬───────┘
                              etcd cluster (StatefulSet, 3 replicas)
                     ┌─────────────────────────────────────────────────────┐
                     │  Vultr VKE Cluster
External Prometheus ─┼──remote_write──▶ Vultr LoadBalancer (m3coordinator-lb)
External Grafana    ─┼──PromQL query──▶        │ (managed, provisioned by CCM)
In-cluster Prometheus┼──remote_write──▶ M3 Coordinator (Deployment, 2 replicas)
In-cluster Grafana   ┼──PromQL query──▶        │
                                       ┌───────┴───────┐
                     │                 │  M3DB Nodes   │ (StatefulSet, 3 replicas)
                     │                 │  Vultr Block  │ (100Gi NVMe per node)
                     │                 │  Storage      │
                     │                 └───────┬───────┘
                     │                         │
                     │                   etcd cluster (StatefulSet, 3 replicas)
                     └─────────────────────────────────────────────────────┘
```
## Retention Tiers
@@ -28,27 +35,68 @@ Grafana ──PromQL query──▶ │
## Deployment
```bash
# 1. Apply everything (except the init job won't succeed until pods are up)
# 1. Apply everything
kubectl apply -k .
# 2. Wait for all pods to be Ready
# 2. Wait for all pods to be Running
kubectl -n m3db get pods -w
# 3. Once all m3dbnode and m3coordinator pods are Running, the init job
# will bootstrap the cluster (placement + namespaces).
# Monitor it:
kubectl -n m3db logs -f job/m3db-cluster-init
# 3. Bootstrap the cluster (placement + namespaces)
# The init job waits for coordinator health, which requires m3db to be bootstrapped.
# Bootstrap directly via m3dbnode's embedded coordinator:
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
-H "Content-Type: application/json" -d '{
"num_shards": 64,
"replication_factor": 3,
"instances": [
{"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
{"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
{"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
]
}'
# 4. Verify cluster health
kubectl -n m3db port-forward svc/m3coordinator 7201:7201
curl http://localhost:7201/api/v1/services/m3db/placement
curl http://localhost:7201/api/v1/services/m3db/namespace
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"2h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"2h"}}}'
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"12h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"12h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"24h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"24h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'
# 4. Wait for bootstrapping to complete (check shard state = AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
# 5. Get the LoadBalancer IP
kubectl -n m3db get svc m3coordinator-lb
```
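The placement payload in step 3 repeats the same fields for every node, so it can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: `placement_payload` is a hypothetical helper, and it assumes pods are named `m3dbnode-N`, that each replica gets its own isolation group (`zone-a`, `zone-b`, ...), and that the replication factor stays at 3 as in the command above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a placement/init payload for N m3dbnode replicas.
# Arguments: replica count (default 3), shard count (default 64).
# The replication factor is fixed at 3, so pass at least 3 replicas.
placement_payload() {
  local replicas="${1:-3}" shards="${2:-64}"
  local letters=(a b c d e f g h)   # supports up to 8 isolation groups
  local i sep=""
  printf '{"num_shards": %d, "replication_factor": 3, "instances": [' "$shards"
  for ((i = 0; i < replicas; i++)); do
    printf '%s{"id": "m3dbnode-%d", "isolation_group": "zone-%s", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-%d.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-%d", "port": 9000}' \
      "$sep" "$i" "${letters[i]}" "$i" "$i"
    sep=", "
  done
  printf ']}\n'
}

# Usage (requires cluster access):
#   kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST \
#     http://localhost:7201/api/v1/services/m3db/placement/init \
#     -H "Content-Type: application/json" -d "$(placement_payload 3 64)"
```

Generating the payload keeps the `id`, `hostname`, and `endpoint` fields consistent with each other, which is the part most likely to drift when the JSON is edited manually.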
## Testing
**Quick connectivity test:**
```bash
./test-metrics.sh <LB_IP>
```
This script verifies:
1. Coordinator health endpoint responds
2. Placement is configured with all 3 m3dbnode instances
3. All 3 namespaces are created (default, agg_10s_30d, agg_1m_1y)
4. PromQL queries work
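
Check 3 can be approximated with a plain string match against the namespace listing. This is a sketch of the idea only, not the code in `test-metrics.sh`: `missing_namespaces` is a hypothetical helper, and it assumes each namespace name appears as a quoted string somewhere in the JSON returned by `/api/v1/services/m3db/namespace`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the expected namespaces that do not appear
# in a JSON body. Arguments: the JSON body, then the expected names.
# Prints a space-separated list, or an empty line if none are missing.
missing_namespaces() {
  local body="$1"; shift
  local ns missing=()
  for ns in "$@"; do
    case "$body" in
      *"\"$ns\""*) ;;            # name found as a quoted string
      *) missing+=("$ns") ;;     # name absent from the response
    esac
  done
  printf '%s\n' "${missing[*]}"
}

# Usage (requires network access to the coordinator):
#   body="$(curl -s "http://<LB_IP>:7201/api/v1/services/m3db/namespace")"
#   missing_namespaces "$body" default agg_10s_30d agg_1m_1y
```

A string match is cruder than parsing the JSON, but it avoids depending on `jq` and on the exact response layout.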
**Full read/write test (Python):**
```bash
pip install requests python-snappy
python3 test-metrics.py <LB_IP>
```
Writes a test metric via Prometheus remote_write and reads it back.
## Prometheus Configuration (Replacing Mimir)
Update your Prometheus config to point at M3 Coordinator instead of Mimir:
Update your Prometheus config to point at M3 Coordinator.
**In-cluster (same VKE cluster):**
```yaml
# prometheus.yml
remote_write:
@@ -64,13 +112,33 @@ remote_read:
read_recent: true
```
**External (cross-region/cross-cluster):**
```yaml
# prometheus.yml
remote_write:
- url: "http://<LB-IP>:7201/api/v1/prom/remote/write"
queue_config:
capacity: 10000
max_shards: 30
max_samples_per_send: 5000
batch_send_deadline: 5s
remote_read:
- url: "http://<LB-IP>:7201/api/v1/prom/remote/read"
read_recent: true
```
Get the LoadBalancer IP:
```bash
kubectl -n m3db get svc m3coordinator-lb
```
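
To avoid copying the IP by hand, the lookup and the URL construction can be scripted. A minimal sketch, assuming the Vultr LoadBalancer reports an IP address (not a hostname) in the Service status; `lb_ip` and `remote_urls` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Look up the external IP assigned to the m3coordinator-lb Service.
# Assumes .status.loadBalancer.ingress[0].ip is populated.
lb_ip() {
  kubectl -n m3db get svc m3coordinator-lb \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print the remote_write and remote_read URLs for a given IP.
remote_urls() {
  local ip="$1"
  printf 'http://%s:7201/api/v1/prom/remote/write\n' "$ip"
  printf 'http://%s:7201/api/v1/prom/remote/read\n' "$ip"
}

# Usage (requires cluster access):
#   remote_urls "$(lb_ip)"
```

If `lb_ip` prints nothing, the LoadBalancer is most likely still being provisioned and the lookup should be retried.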
## Grafana Datasource
Add a **Prometheus** datasource in Grafana pointing to:
```
http://m3coordinator.m3db.svc.cluster.local:7201
```
- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `http://<LB-IP>:7201`
All existing PromQL dashboards will work without modification.
@@ -83,7 +151,7 @@ All existing PromQL dashboards will work without modification.
## Tuning for Vultr
- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `high_perf` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry. Recommend at least 8GB RAM nodes on Vultr. The manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.
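
When changing the shard count or adding nodes, it helps to estimate how many shard replicas each node will own. A rough sketch of the arithmetic, assuming equal instance weights so that replicas spread evenly; `shard_replicas_per_node` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Average shard replicas per node = shards * replication_factor / nodes.
# Arguments: shard count, replication factor, node count.
shard_replicas_per_node() {
  local shards="$1" rf="$2" nodes="$3"
  echo $(( shards * rf / nodes ))
}

shard_replicas_per_node 64 3 3    # current layout: every node holds all 64 shards
shard_replicas_per_node 128 3 3   # doubling the shard count
shard_replicas_per_node 64 3 6    # doubling the node count instead
```

With 3 nodes and a replication factor of 3, every node holds a replica of every shard, so raising the shard count alone does not reduce the data stored per node. That only happens once the node count exceeds the replication factor.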
@@ -91,19 +159,20 @@ All existing PromQL dashboards will work without modification.
## Useful Commands
```bash
# Check placement
curl http://localhost:7201/api/v1/services/m3db/placement | jq
# Get LoadBalancer IP
kubectl -n m3db get svc m3coordinator-lb
# Check namespace readiness
curl http://localhost:7201/api/v1/services/m3db/namespace/ready \
-d '{"name":"default"}'
# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health
# Write a test metric
curl -X POST http://localhost:7201/api/v1/prom/remote/write \
-H "Content-Type: application/x-protobuf"
# Check placement (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq
# Query via PromQL
curl "http://localhost:7201/api/v1/query?query=up"
# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
# Query via PromQL (external)
curl "http://<LB-IP>:7201/api/v1/query?query=up"
# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init