Replace LB with Traefik ingress for TLS + basic auth

- Remove m3coordinator LoadBalancer service (was using deprecated AutoSSL)
- Add Traefik ingress controller with Let's Encrypt ACME
- Add basic auth middleware for external access
- Update test scripts with auth support and fixed protobuf encoding
- Add multi-tenancy documentation (label-based isolation)
- Update README with Traefik deployment instructions
2026-04-01 05:19:14 +00:00
parent 5eb58d1864
commit a6c59d6a65
6 changed files with 368 additions and 197 deletions

README.md

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.
## Prerequisites
- **kubectl** — for applying manifests
- **helm** — for installing Traefik Ingress Controller
```bash
# Install helm (macOS/Linux with Homebrew)
brew install helm
```
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Vultr VKE Cluster                      │
│                                                             │
External Prometheus ──remote_write──▶ Traefik Ingress (LoadBalancer)
External Grafana ─────PromQL query──▶ TLS termination, basic auth
│                              │                              │
│                       ┌──────┴──────┐                       │
In-cluster Prometheus ─remote_write─▶│M3 Coordinator│ (Deployment, 2 replicas)
In-cluster Grafana ────PromQL query─▶│             │
│                       └──────┬──────┘                       │
│                              │                              │
│                         ┌────┴────┐                         │
│                         │  M3DB   │ (StatefulSet, 3 replicas)
│                         │  Nodes  │ (100Gi NVMe per node,
│                         │         │  Vultr Block Storage)
│                         └────┬────┘                         │
│                              │                              │
│             etcd cluster (StatefulSet, 3 replicas)          │
└─────────────────────────────────────────────────────────────┘
```
**External access flow:**
```
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
```
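
The routing itself lives in the repo's manifests. As a rough sketch of what the Traefik side looks like (resource and middleware names here are assumptions; check `08-basic-auth-middleware.yaml` and the ingress manifest for the real ones):

```bash
# Hypothetical sketch of the IngressRoute wiring; names are illustrative,
# not copied from the repo manifests.
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: m3coordinator
  namespace: m3db
spec:
  entryPoints:
    - websecure                  # HTTPS entrypoint
  routes:
    - match: Host(`m3db.vultrlabs.dev`)
      kind: Rule
      middlewares:
        - name: basic-auth       # assumed middleware name
      services:
        - name: m3coordinator
          port: 7201
  tls:
    certResolver: letsencrypt    # the resolver configured during the Traefik install
EOF
```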
## Retention Tiers
| Namespace | Resolution | Retention | Use Case |
|-----------|------------|-----------|----------|
| `default` | raw | 48h | recent, full-resolution queries |
| `agg_10s_30d` | 10s | 30d | medium-term dashboards |
| `agg_1m_1y` | 1m | 1y | long-term trends and capacity planning |
## Deployment
```bash
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
--namespace traefik --create-namespace \
--set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
--set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
--set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
# Note: ACME requires single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.
# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP
# 3. Apply M3DB manifests
kubectl apply -k .
# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
```
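
If you script the DNS update, the external IP can be polled once the Vultr CCM provisions the load balancer (a small convenience sketch, not part of the repo):

```bash
# Poll until the Traefik service has an external IP, then print it
until LB_IP=$(kubectl -n traefik get svc traefik \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}') && [ -n "$LB_IP" ]; do
  sleep 5
done
echo "Point m3db.vultrlabs.dev at $LB_IP"
```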
## Bootstrap M3DB Cluster
The init job waits for coordinator health, which requires m3db to be bootstrapped.
Bootstrap directly via m3dbnode's embedded coordinator:
```bash
# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
-H "Content-Type: application/json" -d '{
"num_shards": 64,
"instances": [
  ...
]
}'
# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"2h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"2h"}}}'
# Create the agg_10s_30d namespace the same way (10s resolution, 30d retention; full JSON omitted here)
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"24h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"24h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'
# Wait for bootstrapping to complete (check shard state = AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
```
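
To inspect what the bootstrap created, the coordinator also serves GET variants of the same endpoints:

```bash
# Inspect the placement (instances should all reach state AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:7201/api/v1/services/m3db/placement

# List the registered namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:7201/api/v1/services/m3db/namespace
```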
## Authentication
External access is protected by HTTP basic auth. Update the password in `08-basic-auth-middleware.yaml`:
```bash
# Generate new htpasswd entry
htpasswd -nb <username> <password>
# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
```
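
Credentials can also be rotated without editing the manifest; a sketch, assuming the secret is named `m3db-basic-auth` in the `m3db` namespace (check `08-basic-auth-middleware.yaml` for the actual name and key):

```bash
# Rotate credentials in place; secret name and namespace are assumptions
kubectl -n m3db create secret generic m3db-basic-auth \
  --from-literal=users="$(htpasswd -nb alice 'new-password')" \
  --dry-run=client -o yaml | kubectl apply -f -
```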
## Testing
**Quick connectivity test:**
```bash
./test-metrics.sh <BASE_URL> [USERNAME] [PASSWORD]

# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
```

This script verifies:
1. Coordinator health endpoint responds
2. Placement is configured with all 3 m3dbnode instances
3. All 3 namespaces are created (default, agg_10s_30d, agg_1m_1y)
4. PromQL queries work
**Full read/write test (Python):**
```bash
pip install requests python-snappy

python3 test-metrics.py <BASE_URL> [USERNAME] [PASSWORD]

# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
```

Writes a test metric via Prometheus remote_write and reads it back.
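
For ad-hoc runs from a workstation without going through Traefik, a port-forward gives the same unauthenticated path as in-cluster access:

```bash
# Forward the coordinator locally, then test without auth
kubectl -n m3db port-forward svc/m3coordinator 7201:7201 &
python3 test-metrics.py http://localhost:7201
```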
## Prometheus Configuration (Replacing Mimir)
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
```
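
Prometheus ships `promtool`, which catches indentation and field typos in the block above before you reload:

```bash
# Validate the config before reloading Prometheus
promtool check config prometheus.yml
```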
## Grafana Datasource
Add a **Prometheus** datasource in Grafana pointing to:
- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `https://m3db.vultrlabs.dev` (with basic auth)
All existing PromQL dashboards will work without modification.
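
If Grafana itself is deployed declaratively, the datasource can be provisioned from a file instead of the UI (a sketch; the file name and mount path depend on your Grafana setup):

```bash
# Example Grafana provisioning file (typically mounted under
# /etc/grafana/provisioning/datasources/)
cat <<'EOF' > m3db-datasource.yaml
apiVersion: 1
datasources:
  - name: M3DB
    type: prometheus
    access: proxy
    url: http://m3coordinator.m3db.svc.cluster.local:7201
EOF
```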
## Migrating from Mimir
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission Mimir components.
## Multi-Tenancy (Label-Based)
M3DB uses Prometheus-style labels for tenant isolation. Add labels like `tenant`, `service`, `env` to your metrics to differentiate between sources.
**Write metrics with tenant labels:**
```python
# In your Prometheus remote_write client
labels = {
"tenant": "acme-corp",
"service": "api-gateway",
"env": "prod"
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
```
**Query by tenant:**
```bash
# All metrics from a specific tenant
# (use -G/--data-urlencode so curl does not glob the {} selector)
curl -s -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{tenant="acme-corp"}'

# Filter by service within a tenant
curl -s -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{tenant="acme-corp",service="api-gateway"}'

# Filter by environment
curl -s -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{env="prod"}'
```
**Prometheus configuration with labels:**
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
```
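
To confirm the relabeling took effect end to end, query the standard Prometheus label-values API that the coordinator serves:

```bash
# Should list acme-corp once relabeled samples have been written
curl -s -u example:example "https://m3db.vultrlabs.dev/api/v1/label/tenant/values"
```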
## Tuning for Vultr
- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
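
For reference, a StorageClass along these lines is presumably what the repo ships (a sketch only; `disk_type: nvme` comes from the note above, the other fields are illustrative choices):

```bash
# Sketch of the NVMe StorageClass; compare with the repo's actual manifest
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vultr-block-storage-m3db
provisioner: block.csi.vultr.com
parameters:
  disk_type: nvme
volumeBindingMode: WaitForFirstConsumer  # a common choice for zonal block storage
EOF
```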
## Useful Commands
```bash
# Get the Traefik LoadBalancer IP
kubectl -n traefik get svc traefik
# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health
# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
# Query via PromQL (external, with basic auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"
# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml
# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik
```
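
To verify that ACME issuance succeeded and Traefik is not serving its self-signed default certificate:

```bash
# Check the issuer and validity dates of the certificate Traefik is serving
openssl s_client -connect m3db.vultrlabs.dev:443 -servername m3db.vultrlabs.dev </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```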