Replace LB with Traefik ingress for TLS + basic auth

- Remove m3coordinator LoadBalancer service (was using deprecated AutoSSL)
- Add Traefik ingress controller with Let's Encrypt ACME
- Add basic auth middleware for external access
- Update test scripts with auth support and fixed protobuf encoding
- Add multi-tenancy documentation (label-based isolation)
- Update README with Traefik deployment instructions
2026-04-01 05:19:14 +00:00
parent 5eb58d1864
commit a6c59d6a65
6 changed files with 368 additions and 197 deletions

README.md

Drop-in Mimir replacement using M3DB for long-term Prometheus metrics storage, deployed on Vultr VKE with Vultr Block Storage CSI.
## Prerequisites
- **kubectl** — for applying manifests
- **helm** — for installing Traefik Ingress Controller
```bash
# Install helm (macOS/Linux with Homebrew)
brew install helm
```
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Vultr VKE Cluster                      │
│                                                             │
External Prometheus ──remote_write──▶ Traefik Ingress (LoadBalancer)
External Grafana ─────PromQL query──▶ TLS termination, basic auth
│                              │                              │
│                       ┌──────┴──────┐                       │
In-cluster Prometheus ─remote_write─▶│M3 Coordinator│ (Deployment, 2 replicas)
In-cluster Grafana ────PromQL query─▶│             │
│                       └──────┬──────┘                       │
│                              │                              │
│                         ┌────┴────┐                         │
│                         │  M3DB   │ (StatefulSet, 3 replicas)
│                         │  Nodes  │ (100Gi NVMe per node,
│                         │         │  Vultr Block Storage)
│                         └────┬────┘                         │
│                              │                              │
│             etcd cluster (StatefulSet, 3 replicas)          │
└─────────────────────────────────────────────────────────────┘
```
**External access flow:**
```
Internet → Vultr LoadBalancer → Traefik (TLS + basic auth) → m3coordinator:7201
```
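
The routing itself lives in the repo's manifests. As a rough sketch of what the Traefik side looks like (resource and middleware names here are assumptions; check `08-basic-auth-middleware.yaml` and the ingress manifest for the real ones):

```bash
# Hypothetical sketch of the IngressRoute wiring; names are illustrative,
# not copied from the repo manifests.
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: m3coordinator
  namespace: m3db
spec:
  entryPoints:
    - websecure                  # HTTPS entrypoint
  routes:
    - match: Host(`m3db.vultrlabs.dev`)
      kind: Rule
      middlewares:
        - name: basic-auth       # assumed middleware name
      services:
        - name: m3coordinator
          port: 7201
  tls:
    certResolver: letsencrypt    # the resolver configured during the Traefik install
EOF
```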
## Retention Tiers
| Namespace | Resolution | Retention | Use Case |
|-----------|------------|-----------|----------|
| `default` | raw | 48h | recent, full-resolution queries |
| `agg_10s_30d` | 10s | 30d | medium-term dashboards |
| `agg_1m_1y` | 1m | 1y | long-term trends and capacity planning |
## Deployment
```bash
# 1. Install Traefik Ingress Controller (handles TLS + basic auth)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik \
--namespace traefik --create-namespace \
--set 'additionalArguments[0]=--certificatesresolvers.letsencrypt.acme.email=your-email@example.com' \
--set 'additionalArguments[1]=--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json' \
--set 'additionalArguments[2]=--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
# Note: ACME requires single replica. For HA, use external cert management
# or Traefik Enterprise with distributed ACME storage.
# 2. Get the Traefik LoadBalancer IP and update DNS
kubectl -n traefik get svc traefik
# Point your domain (e.g., m3db.vultrlabs.dev) to this IP
# 3. Apply M3DB manifests
kubectl apply -k .
# 4. Wait for all pods to be Running
kubectl -n m3db get pods -w
```
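
If you script the DNS update, the external IP can be polled once the Vultr CCM provisions the load balancer (a small convenience sketch, not part of the repo):

```bash
# Poll until the Traefik service has an external IP, then print it
until LB_IP=$(kubectl -n traefik get svc traefik \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}') && [ -n "$LB_IP" ]; do
  sleep 5
done
echo "Point m3db.vultrlabs.dev at $LB_IP"
```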
## Bootstrap M3DB Cluster
The init job waits for coordinator health, which requires m3db to be bootstrapped.
Bootstrap directly via m3dbnode's embedded coordinator:
```bash
# Initialize placement
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/placement/init \
-H "Content-Type: application/json" -d '{
"num_shards": 64,
"instances": [
  ...
]
}'
# Create namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"default","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"repairEnabled":false,"retentionOptions":{"retentionPeriodDuration":"48h","blockSizeDuration":"2h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"2h"}}}'
# Create the agg_10s_30d namespace the same way (10s resolution, 30d retention; full JSON omitted here)
kubectl -n m3db exec m3dbnode-0 -- curl -s -X POST http://localhost:7201/api/v1/services/m3db/namespace \
-H "Content-Type: application/json" -d '{"name":"agg_1m_1y","options":{"bootstrapEnabled":true,"flushEnabled":true,"writesToCommitLog":true,"cleanupEnabled":true,"snapshotEnabled":true,"retentionOptions":{"retentionPeriodDuration":"8760h","blockSizeDuration":"24h","bufferFutureDuration":"10m","bufferPastDuration":"10m"},"indexOptions":{"enabled":true,"blockSizeDuration":"24h"},"aggregationOptions":{"aggregations":[{"aggregated":true,"attributes":{"resolutionDuration":"1m"}}]}}}'
# Wait for bootstrapping to complete (check shard state = AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
```
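
To inspect what the bootstrap created, the coordinator also serves GET variants of the same endpoints:

```bash
# Inspect the placement (instances should all reach state AVAILABLE)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:7201/api/v1/services/m3db/placement

# List the registered namespaces
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:7201/api/v1/services/m3db/namespace
```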
## Authentication
External access is protected by HTTP basic auth. Update the password in `08-basic-auth-middleware.yaml`:
```bash
# Generate new htpasswd entry
htpasswd -nb <username> <password>
# Update the secret stringData.users field and apply
kubectl apply -f 08-basic-auth-middleware.yaml
```
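
Credentials can also be rotated without editing the manifest; a sketch, assuming the secret is named `m3db-basic-auth` in the `m3db` namespace (check `08-basic-auth-middleware.yaml` for the actual name and key):

```bash
# Rotate credentials in place; secret name and namespace are assumptions
kubectl -n m3db create secret generic m3db-basic-auth \
  --from-literal=users="$(htpasswd -nb alice 'new-password')" \
  --dry-run=client -o yaml | kubectl apply -f -
```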
## Testing
**Quick connectivity test:**
```bash
./test-metrics.sh <BASE_URL> [USERNAME] [PASSWORD]

# With basic auth (external)
./test-metrics.sh https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
./test-metrics.sh http://m3coordinator.m3db.svc.cluster.local:7201
```

This script verifies:
1. Coordinator health endpoint responds
2. Placement is configured with all 3 m3dbnode instances
3. All 3 namespaces are created (default, agg_10s_30d, agg_1m_1y)
4. PromQL queries work
**Full read/write test (Python):**
```bash
pip install requests python-snappy

python3 test-metrics.py <BASE_URL> [USERNAME] [PASSWORD]

# With basic auth (external)
python3 test-metrics.py https://m3db.vultrlabs.dev example example

# Without auth (in-cluster or port-forward)
python3 test-metrics.py http://m3coordinator.m3db.svc.cluster.local:7201
```

Writes a test metric via Prometheus remote_write and reads it back.
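
For ad-hoc runs from a workstation without going through Traefik, a port-forward gives the same unauthenticated path as in-cluster access:

```bash
# Forward the coordinator locally, then test without auth
kubectl -n m3db port-forward svc/m3coordinator 7201:7201 &
python3 test-metrics.py http://localhost:7201
```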
## Prometheus Configuration (Replacing Mimir)
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    queue_config:
      capacity: 10000
      max_shards: 30
      batch_send_deadline: 5s

remote_read:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/read"
    basic_auth:
      username: example
      password: example
    read_recent: true
```
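
Prometheus ships `promtool`, which catches indentation and field typos in the block above before you reload:

```bash
# Validate the config before reloading Prometheus
promtool check config prometheus.yml
```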
## Grafana Datasource
Add a **Prometheus** datasource in Grafana pointing to:
- **In-cluster:** `http://m3coordinator.m3db.svc.cluster.local:7201`
- **External:** `https://m3db.vultrlabs.dev` (with basic auth)
All existing PromQL dashboards will work without modification.
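
If Grafana itself is deployed declaratively, the datasource can be provisioned from a file instead of the UI (a sketch; the file name and mount path depend on your Grafana setup):

```bash
# Example Grafana provisioning file (typically mounted under
# /etc/grafana/provisioning/datasources/)
cat <<'EOF' > m3db-datasource.yaml
apiVersion: 1
datasources:
  - name: M3DB
    type: prometheus
    access: proxy
    url: http://m3coordinator.m3db.svc.cluster.local:7201
EOF
```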
## Migrating from Mimir
3. **Cutover**: Once retention in M3DB covers your needs, remove the Mimir remote_write target.
4. **Cleanup**: Decommission Mimir components.
## Multi-Tenancy (Label-Based)
M3DB uses Prometheus-style labels for tenant isolation. Add labels like `tenant`, `service`, `env` to your metrics to differentiate between sources.
**Write metrics with tenant labels:**
```python
# In your Prometheus remote_write client
labels = {
"tenant": "acme-corp",
"service": "api-gateway",
"env": "prod"
}
# Metric: http_requests_total{tenant="acme-corp", service="api-gateway", env="prod"}
```
**Query by tenant:**
```bash
# All metrics from a specific tenant
# (use -G/--data-urlencode so curl does not glob the {} selector)
curl -s -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{tenant="acme-corp"}'

# Filter by service within a tenant
curl -s -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{tenant="acme-corp",service="api-gateway"}'

# Filter by environment
curl -s -u example:example -G "https://m3db.vultrlabs.dev/api/v1/query" \
  --data-urlencode 'query=http_requests_total{env="prod"}'
```
**Prometheus configuration with labels:**
```yaml
# prometheus.yml
remote_write:
  - url: "https://m3db.vultrlabs.dev/api/v1/prom/remote/write"
    basic_auth:
      username: example
      password: example
    # Add tenant labels to all metrics from this Prometheus
    write_relabel_configs:
      - target_label: tenant
        replacement: acme-corp
      - target_label: env
        replacement: prod
```
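
To confirm the relabeling took effect end to end, query the standard Prometheus label-values API that the coordinator serves:

```bash
# Should list acme-corp once relabeled samples have been written
curl -s -u example:example "https://m3db.vultrlabs.dev/api/v1/label/tenant/values"
```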
## Tuning for Vultr
- **Storage**: The `vultr-block-storage-m3db` StorageClass uses `disk_type: nvme` (NVMe SSD). Adjust `storage` in the VolumeClaimTemplates based on your cardinality and retention.
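
For reference, a StorageClass along these lines is presumably what the repo ships (a sketch only; `disk_type: nvme` comes from the note above, the other fields are illustrative choices):

```bash
# Sketch of the NVMe StorageClass; compare with the repo's actual manifest
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vultr-block-storage-m3db
provisioner: block.csi.vultr.com
parameters:
  disk_type: nvme
volumeBindingMode: WaitForFirstConsumer  # a common choice for zonal block storage
EOF
```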
## Useful Commands
```bash
# Get the Traefik LoadBalancer IP
kubectl -n traefik get svc traefik
# Check cluster health (from inside cluster)
kubectl -n m3db exec m3dbnode-0 -- curl -s http://m3coordinator.m3db.svc.cluster.local:7201/health
# Check m3dbnode bootstrapped status
kubectl -n m3db exec m3dbnode-0 -- curl -s http://localhost:9002/health
# Query via PromQL (external, with basic auth)
curl -u example:example "https://m3db.vultrlabs.dev/api/v1/query?query=up"
# Delete the init job to re-run (if needed)
kubectl -n m3db delete job m3db-cluster-init
kubectl apply -f 06-init-and-pdb.yaml
# View Traefik logs
kubectl -n traefik logs -l app.kubernetes.io/name=traefik
```
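
To verify that ACME issuance succeeded and Traefik is not serving its self-signed default certificate:

```bash
# Check the issuer and validity dates of the certificate Traefik is serving
openssl s_client -connect m3db.vultrlabs.dev:443 -servername m3db.vultrlabs.dev </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```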