Fix m3dbnode port conflict, update README, fix test script
- Remove duplicate db.metrics section (port 7203 conflict)
- Fix coordinator health endpoint (/health not /api/v1/services/m3db/health)
- Update README: remove NodePort references, always use LoadBalancer
- Add bootstrap instructions (workaround for init job chicken-and-egg)
- Fix test-metrics.sh: correct health endpoint and JSON parsing
README.md: 139 lines changed
Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage.
## Architecture

```
┌─────────────────────────────────────────────────────┐
│                  Vultr VKE Cluster                  │
│                                                     │
External Prometheus ─┼──remote_write──▶ Vultr LoadBalancer (m3coordinator-lb)
External Grafana   ──┼──PromQL query──▶ │ (managed, provisioned by CCM)
│                                       │
In-cluster Prometheus┼──remote_write──▶ M3 Coordinator (Deployment, 2 replicas)
In-cluster Grafana ──┼──PromQL query──▶ │
│                                       │
│                               ┌───────┴───────┐
│                               │  M3DB Nodes   │ (StatefulSet, 3 replicas)
│                               │  Vultr Block  │ (100Gi NVMe per node)
│                               │    Storage    │
│                               └───────┬───────┘
│                                       │
│                 etcd cluster (StatefulSet, 3 replicas)
└─────────────────────────────────────────────────────┘
```

## Retention Tiers
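The tier parameters live in the namespace payloads created during bootstrap (see Deployment below). As a quick consistency check, the number of on-disk blocks a namespace keeps is its retention divided by its block size; the figures here are taken from the namespace JSON in this README:

```python
# Retention tiers as encoded in the namespace payloads (Deployment, step 3):
# (retention_hours, block_size_hours) per namespace — values from this README.
tiers = {
    "default":     (48,   2),   # raw samples, 48h
    "agg_10s_30d": (720,  12),  # 10s resolution, 30 days
    "agg_1m_1y":   (8760, 24),  # 1m resolution, 1 year
}

for name, (retention_h, block_h) in tiers.items():
    # Number of time-series blocks M3DB keeps on disk for this namespace.
    print(f"{name}: {retention_h // block_h} blocks of {block_h}h")
```

Larger blocks for longer retention keep the block count (and index overhead) bounded.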
## Deployment

```bash
# 1. Apply everything
kubectl apply -k .

# 2. Wait for all pods to be Running
kubectl -n m3db get pods -w

# 3. Bootstrap the cluster (placement + namespaces).
#    The init job waits for coordinator health, which requires m3db to be
#    bootstrapped, so bootstrap directly via m3dbnode's embedded coordinator:
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
  -H "Content-Type: application/json" -d '{
    "num_shards": 64,
    "replication_factor": 3,
    "instances": [
      {"id": "m3dbnode-0", "isolation_group": "zone-a", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-0.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-0", "port": 9000},
      {"id": "m3dbnode-1", "isolation_group": "zone-b", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-1.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-1", "port": 9000},
      {"id": "m3dbnode-2", "isolation_group": "zone-c", "zone": "embedded", "weight": 100, "endpoint": "m3dbnode-2.m3dbnode.m3db.svc.cluster.local:9000", "hostname": "m3dbnode-2", "port": 9000}
    ]
  }'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"2h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"2h"}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_10s_30d","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"720h","blockSizeDuration":"12h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"12h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"10s"}}]}}}'

kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
  -H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"24h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"24h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'

# 4. Wait for bootstrapping to complete (check shard state = AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

# 5. Get the LoadBalancer IP
kubectl -n m3db get svc m3coordinator-lb
```

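After initializing the placement, you can also confirm bootstrap progress by checking that every shard replica reports `AVAILABLE` via the placement API. A minimal sketch of that check (the `all_shards_available` helper and the abbreviated sample payload are illustrative, not part of this repo; the real response lists 64 shards per node):

```python
import json

def all_shards_available(placement_json: str) -> bool:
    """Return True when every shard on every instance reports AVAILABLE."""
    placement = json.loads(placement_json)["placement"]
    for instance in placement["instances"].values():
        for shard in instance["shards"]:
            if shard["state"] != "AVAILABLE":
                return False
    return True

# Abbreviated sample of the shape /api/v1/services/m3db/placement returns
# (shape assumed from the M3 placement API).
sample = json.dumps({
    "placement": {
        "instances": {
            "m3dbnode-0": {"shards": [{"id": 0, "state": "AVAILABLE"}]},
            "m3dbnode-1": {"shards": [{"id": 0, "state": "INITIALIZING"}]},
        }
    }
})

print(all_shards_available(sample))  # False until every replica finishes bootstrapping
```

New shards start `INITIALIZING` and flip to `AVAILABLE` once their node has bootstrapped the data for them.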
## Testing

**Quick connectivity test:**
```bash
./test-metrics.sh <LB_IP>
```

This script verifies:
1. Coordinator health endpoint responds
2. Placement is configured with all 3 m3dbnode instances
3. All 3 namespaces are created (default, agg_10s_30d, agg_1m_1y)
4. PromQL queries work

**Full read/write test (Python):**
```bash
pip install requests python-snappy
python3 test-metrics.py <LB_IP>
```

Writes a test metric via Prometheus remote_write and reads it back.

## Prometheus Configuration (Replacing Mimir)

Update your Prometheus config to point at M3 Coordinator.

**In-cluster (same VKE cluster):**
```yaml
# prometheus.yml
remote_write:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/write"

remote_read:
  - url: "http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true
```

**External (cross-region/cross-cluster):**
```yaml
# prometheus.yml
remote_write:
  - url: "http://<LB-IP>:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s

remote_read:
  - url: "http://<LB-IP>:7201/api/v1/prom/remote/read"
    read_recent: true
```

Get the LoadBalancer IP:
```bash
kubectl -n m3db get svc m3coordinator-lb
```

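As a sanity check on the `queue_config` values above: if every shard flushed a full batch only when the 5s deadline fired, the floor on drain rate would be `max_shards × max_samples_per_send / batch_send_deadline`. In practice shards send as soon as a batch fills, so sustained throughput is bounded by request latency instead; still, the back-of-envelope is useful for sizing:

```python
# queue_config values from the external remote_write block above.
max_shards = 30
max_samples_per_send = 5000
batch_send_deadline_s = 5

# Guaranteed drain rate if each shard sends one full batch per deadline window.
floor_samples_per_s = max_shards * max_samples_per_send / batch_send_deadline_s
print(floor_samples_per_s)  # 30000.0
```

30k samples/s comfortably covers a mid-sized Prometheus; raise `max_shards` before raising batch size if the queue backs up.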
## Grafana Datasource

Add a **Prometheus** datasource in Grafana pointing to:

- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `http://<LB-IP>:7201`

All existing PromQL dashboards will work without modification.

## Tuning for Vultr

- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
- **Node sizing**: M3DB is memory-hungry. Recommend at least 8GB RAM nodes on Vultr. The manifest requests 4Gi per m3dbnode pod.
- **Shards**: The init job creates 64 shards across 3 nodes. For higher cardinality, increase to 128 or 256.
- **Volume expansion**: The StorageClass has `allowVolumeExpansion: true` — you can resize PVCs online via `kubectl edit pvc`.

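For intuition on the shard bullet: with replication factor 3 and 3 nodes, every node carries a replica of every shard, so per-node shard count equals the total shard count. A quick sketch using the figures above:

```python
# Figures from the shard bullet and placement init in this README.
num_shards = 64
replication_factor = 3
num_nodes = 3

total_shard_replicas = num_shards * replication_factor   # every shard stored 3x
replicas_per_node = total_shard_replicas // num_nodes    # evenly spread by weight
print(total_shard_replicas, replicas_per_node)  # 192 64
```

This is why adding a 4th node (without changing `num_shards`) immediately reduces per-node load: 192 replicas spread over 4 nodes is 48 each.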
## Useful Commands

```bash
# Get LoadBalancer IP
kubectl -n m3db get svc m3coordinator-lb

# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health

# Check placement (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/api/v1/services/m3db/placement | jq

# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health

# Query via PromQL (external)
curl "http://<LB-IP>:7201/api/v1/query?query=up"

# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
```
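The bootstrapped-status check above returns JSON, and parsing it correctly is exactly the kind of fix this commit made in test-metrics.sh. A hedged sketch of that parsing; the sample response shape is an assumption about m3dbnode's port-9002 `/health` endpoint, not captured output:

```python
import json

# Assumed shape of m3dbnode's /health response on port 9002; the
# "bootstrapped" flag is what Deployment step 4 waits on.
sample_health = '{"ok": true, "status": "up", "bootstrapped": true}'

health = json.loads(sample_health)
ready = health.get("ok") and health.get("bootstrapped")
print("ready" if ready else "still bootstrapping")  # ready
```

Parsing the JSON (rather than grepping for a substring) avoids false positives when the response contains the word "bootstrapped" in an error message.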