Fix m3dbnode port conflict, update README, fix test script

- Remove duplicate db.metrics section (port 7203 conflict)
- Fix coordinator health endpoint (/health not /api/v1/services/m3db/health)
- Update README: remove NodePort references, always use LoadBalancer
- Add bootstrap instructions (workaround for init job chicken-and-egg)
- Fix test-metrics.sh: correct health endpoint and JSON parsing
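
For illustration, the corrected health check could look roughly like the sketch below. It is not the contents of test-metrics.sh: the function names `health_url` and `wait_for_coordinator` are hypothetical, port 7201 is taken from the README changes in this commit, and the sketch assumes that `/health` answers with a successful HTTP status once the coordinator is up.

```bash
# Hypothetical helpers, not copied from test-metrics.sh.

# Print the coordinator health URL.
#   $1 = host or IP, $2 = port (assumed default: 7201)
health_url() {
  printf 'http://%s:%s/health\n' "$1" "${2:-7201}"
}

# Poll the health endpoint until it responds successfully.
#   $1 = host, $2 = max attempts (default 30), $3 = delay in seconds (default 2)
# Returns 0 on the first successful response, 1 if every attempt fails.
wait_for_coordinator() {
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (status >= 400).
    if curl -sf --max-time 5 "$(health_url "$1")" >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A caller might run `wait_for_coordinator <LB_IP> || exit 1` before checking placement and namespaces, so the later checks only run against a responsive coordinator.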
2026-03-31 15:49:59 +00:00
parent ac13c30905
commit a8469f79d7
10 changed files with 488 additions and 79 deletions
--- a/README.md
+++ b/README.md
@@ -5,16 +5,23 @@ Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, d
 ## Architecture

 ```
-Prometheus ──remote_write──▶ M3 Coordinator (Deployment, 2 replicas)
-Grafana   ──PromQL query──▶       │
-                                  │
-                          ┌───────┴───────┐
-                          │   M3DB Nodes  │  (StatefulSet, 3 replicas)
-                          │  Vultr Block  │  (100Gi SSD per node)
-                          │   Storage     │
-                          └───────┬───────┘
-                                  │
-                            etcd cluster   (StatefulSet, 3 replicas)
+                     ┌─────────────────────────────────────────────────────┐
+                     │                 Vultr VKE Cluster                   │
+                     │                                                     │
+External Prometheus ─┼──remote_write──▶ Vultr LoadBalancer (m3coordinator-lb)
+External Grafana    ─┼──PromQL query──▶         │ (managed, provisioned by CCM)
+                     │                           │
+In-cluster Prometheus┼──remote_write──▶ M3 Coordinator (Deployment, 2 replicas)
+In-cluster Grafana   ┼──PromQL query──▶       │
+                     │                        │
+                     │                ┌───────┴───────┐
+                     │                │   M3DB Nodes  │  (StatefulSet, 3 replicas)
+                     │                │  Vultr Block  │  (100Gi NVMe per node)
+                     │                │   Storage     │
+                     │                └───────┬───────┘
+                     │                        │
+                     │                  etcd cluster   (StatefulSet, 3 replicas)
+                     └─────────────────────────────────────────────────────┘
 ```

 ## Retention Tiers
@@ -28,27 +35,68 @@ Grafana   ──PromQL query──▶       │
 ## Deployment

 ```bash
-# 1. Apply everything (except the init job won't succeed until pods are up)
+# 1. Apply everything
 kubectl apply -k .

-# 2. Wait for all pods to be Ready
+# 2. Wait for all pods to be Running
 kubectl -n m3db get pods -w

-# 3. Once all m3dbnode and m3coordinator pods are Running, the init job
-#    will bootstrap the cluster (placement + namespaces).
-#    Monitor it:
-kubectl -n m3db logs -f job/m3db-cluster-init
+# 3. Bootstrap the cluster (placement + namespaces)
+#    The init job waits for coordinator health, which requires m3db to be bootstrapped.
+#    Bootstrap directly via m3dbnode's embedded coordinator:
+kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
+  -H "Content-Type: application/json" -d '{
+    "num_shards": 64,
+    "replication_factor": 3,
+    "instances": [
+      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
+      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
+      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
+    ]
+  }'

-# 4. Verify cluster health
-kubectl -n m3db port-forward svc/m3coordinator 7201:7201
-curl http://localhost:7201/api/v1/services/m3db/placement
-curl http://localhost:7201/api/v1/services/m3db/namespace
+kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
+  -H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"2h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"2h"}}}'
+
+kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
+  -H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"12h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"12h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'
+
+kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
+  -H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"24h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"24h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'
+
+# 4. Wait for bootstrapping to complete (check shard state = AVAILABLE)
+kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
+
+# 5. Get the LoadBalancer IP
+kubectl -n m3db get svc m3coordinator-lb
 ```

+## Testing
+
+**Quick connectivity test:**
+```bash
+./test-metrics.sh <LB_IP>
+```
+
+This script verifies:
+1. Coordinator health endpoint responds
+2. Placement is configured with all 3 m3dbnode instances
+3. All 3 namespaces are created (default, agg_10s_30d, agg_1m_1y)
+4. PromQL queries work
+
+**Full read/write test (Python):**
+```bash
+pip install requests python-snappy
+python3 test-metrics.py <LB_IP>
+```
+
+Writes a test metric via Prometheus remote_write and reads it back.
+
 ## Prometheus Configuration (Replacing Mimir)

-Update your Prometheus config to point at M3 Coordinator instead of Mimir:
+Update your Prometheus config to point at M3 Coordinator.

+**In-cluster (same VKE cluster):**
 ```yaml
 # prometheus.yml
 remote_write:
@@ -64,13 +112,33 @@ remote_read:
    read_recent: true
 ```

+**External (cross-region/cross-cluster):**
+```yaml
+# prometheus.yml
+remote_write:
+  - url: "http://<LB-IP>:7201/api/v1/prom/remote/write"
+    queue_config:
+      capacity: 10000
+      max_shards: 30
+      max_samples_per_send: 5000
+      batch_send_deadline: 5s
+
+remote_read:
+  - url: "http://<LB-IP>:7201/api/v1/prom/remote/read"
+    read_recent: true
+```
+
+Get the LoadBalancer IP:
+```bash
+kubectl -n m3db get svc m3coordinator-lb
+```
+
 ## Grafana Datasource

 Add a **Prometheus** datasource in Grafana pointing to:

-```
-http://m3coordinator.m3db.svc.cluster.local:7201
-```
+- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
+- **External:** `http://<LB-IP>:7201`

 All existing PromQL dashboards will work without modification.

@@ -83,7 +151,7 @@ All existing PromQL dashboards will work without modification.

 ## Tuning for Vultr

-- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `high_perf` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
+- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
 - **Node sizing**: M3DB is memory-hungry. Recommend at least 8GB RAM nodes on Vultr. The manifest requests 4Gi per m3dbnode pod.
 - **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
 - **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.
@@ -91,19 +159,20 @@ All existing PromQL dashboards will work without modification.
 ## Useful Commands

 ```bash
-# Check placement
-curl http://localhost:7201/api/v1/services/m3db/placement | jq
+# Get LoadBalancer IP
+kubectl -n m3db get svc m3coordinator-lb

-# Check namespace readiness
-curl http://localhost:7201/api/v1/services/m3db/namespace/ready \
-  -d '{"name":"default"}'
+# Check cluster health (from inside cluster)
+kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health

-# Write a test metric
-curl -X POST http://localhost:7201/api/v1/prom/remote/write \
-  -H "Content-Type: application/x-protobuf"
+# Check placement (from inside cluster)
+kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq

-# Query via PromQL
-curl "http://localhost:7201/api/v1/query?query=up"
+# Check m3dbnode bootstrapped status
+kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
+
+# Query via PromQL (external)
+curl "http://<LB-IP>:7201/api/v1/query?query=up"

 # Delete the init job to re-run (if needed)
 kubectl -n m3db delete job m3db-cluster-init