M3DB Backfill Runbook (Revised)

Context

Backfilling ~3 weeks of vLLM + DCGM metrics from Mimir to M3DB.

Blocker discovered: bufferPast is immutable on existing namespaces, so it cannot be widened in place, and the downsample pipeline rejects historical writes.

Solution: Create new backfill namespaces with bufferPast=504h (21 days).
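
Before creating the namespaces, it may be worth sanity-checking the planned retention options. The sketch below assumes M3DB enforces bufferPast < blockSize and blockSize <= retentionPeriod when a namespace is created. That constraint is recalled from M3's retention validation and is not stated in this runbook, so confirm it against the deployed version. The helper name check_retention_opts is hypothetical, and all durations are whole hours.

```shell
#!/usr/bin/env bash
# Pre-flight check for planned retention options (all values in whole hours).
# ASSUMPTION: M3DB requires bufferPast < blockSize and blockSize <= retention.
# Confirm against the deployed M3 version before relying on this.
check_retention_opts() {
  local name="$1" retention_h="$2" block_h="$3" buffer_past_h="$4"
  local rc=0
  if [ "$buffer_past_h" -ge "$block_h" ]; then
    echo "${name}: bufferPast ${buffer_past_h}h is not smaller than blockSize ${block_h}h"
    rc=1
  fi
  if [ "$block_h" -gt "$retention_h" ]; then
    echo "${name}: blockSize ${block_h}h exceeds retention ${retention_h}h"
    rc=1
  fi
  return "$rc"
}

# Values planned in Step 1: name, retention, blockSize, bufferPast.
check_retention_opts default_backfill 168 2 504 || true
check_retention_opts agg_10s_backfill 2160 24 504 || true
check_retention_opts agg_1m_backfill 8760 24 504 || true
```

Any line printed here means the matching create request in Step 1 may be rejected under that assumption. With the Step 1 values, all three calls report bufferPast as not smaller than blockSize, so the options are worth confirming before proceeding.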


Step 1 — Create Backfill Namespaces

COORD="http://m3coordinator.m3db.svc.cluster.local:7201"

# default_backfill: 7d retention, 21d bufferPast
curl -sSf -X POST "${COORD}/api/v1/services/m3db/namespace" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "default_backfill",
    "options": {
      "retentionOptions": {
        "retentionPeriodDuration": "168h",
        "blockSizeDuration": "2h",
        "bufferFutureDuration": "10m",
        "bufferPastDuration": "504h"
      }
    }
  }'

# agg_10s_backfill: 90d retention, 10s resolution, 21d bufferPast
curl -sSf -X POST "${COORD}/api/v1/services/m3db/namespace" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "agg_10s_backfill",
    "options": {
      "retentionOptions": {
        "retentionPeriodDuration": "2160h",
        "blockSizeDuration": "24h",
        "bufferFutureDuration": "10m",
        "bufferPastDuration": "504h"
      },
      "aggregationOptions": {
        "aggregations": [{
          "aggregated": true,
          "attributes": {
            "resolutionNanos": "10000000000",
            "downsampleOptions": {"all": true}
          }
        }]
      }
    }
  }'

# agg_1m_backfill: 1y retention, 1m resolution, 21d bufferPast
curl -sSf -X POST "${COORD}/api/v1/services/m3db/namespace" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "agg_1m_backfill",
    "options": {
      "retentionOptions": {
        "retentionPeriodDuration": "8760h",
        "blockSizeDuration": "24h",
        "bufferFutureDuration": "10m",
        "bufferPastDuration": "504h"
      },
      "aggregationOptions": {
        "aggregations": [{
          "aggregated": true,
          "attributes": {
            "resolutionNanos": "60000000000",
            "downsampleOptions": {"all": true}
          }
        }]
      }
    }
  }'
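
After the three POST requests, listing the namespace registry is a quick way to confirm that all three namespaces were created. This sketch assumes the coordinator answers a GET on the same path with JSON keyed by namespace name. The helpers missing_namespaces and check_backfill_namespaces are hypothetical, and the match is a crude quoted-string search, not real JSON parsing.

```shell
#!/usr/bin/env bash
COORD="${COORD:-http://m3coordinator.m3db.svc.cluster.local:7201}"

# missing_namespaces REGISTRY_JSON NAME [NAME...]
# Prints each expected name that does not appear as a quoted string in the
# registry JSON. Crude text match, not a JSON parser.
missing_namespaces() {
  local registry_json="$1"
  shift
  local ns
  for ns in "$@"; do
    if ! printf '%s' "$registry_json" | grep -qF "\"${ns}\""; then
      echo "$ns"
    fi
  done
}

# Fetch the registry and report any backfill namespace that is absent.
# ASSUMPTION: GET on the namespace path returns the current registry.
check_backfill_namespaces() {
  local registry
  registry="$(curl -sSf "${COORD}/api/v1/services/m3db/namespace")" || return 1
  missing_namespaces "$registry" default_backfill agg_10s_backfill agg_1m_backfill
}
```

Running check_backfill_namespaces prints nothing when all three names are present, and prints the missing names otherwise.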

Step 2 — Update Coordinator ConfigMap

Add the new namespaces to the m3coordinator-config ConfigMap:

clusters:
  - namespaces:
      - namespace: default
        type: unaggregated
        retention: 168h
      - namespace: default_backfill
        type: unaggregated
        retention: 168h
      - namespace: agg_10s_30d
        type: aggregated
        retention: 2160h
        resolution: 10s
      - namespace: agg_10s_backfill
        type: aggregated
        retention: 2160h
        resolution: 10s
      - namespace: agg_1m_1y
        type: aggregated
        retention: 8760h
        resolution: 1m
      - namespace: agg_1m_backfill
        type: aggregated
        retention: 8760h
        resolution: 1m

Also add downsample rules for backfill namespaces.


Step 3 — Restart Coordinators

kubectl rollout restart deployment/m3coordinator -n m3db
kubectl rollout status deployment/m3coordinator -n m3db --timeout=120s
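
The rollout status command confirms that the rollout finished. A short poll against the coordinator can additionally confirm that it is answering requests before the backfill starts. This sketch assumes the coordinator exposes a /health endpoint on port 7201, which should be checked against the deployed version. The helpers retry and coordinator_healthy are hypothetical.

```shell
#!/usr/bin/env bash
COORD="${COORD:-http://m3coordinator.m3db.svc.cluster.local:7201}"

# retry ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on success, otherwise the exit status of the last failed run.
retry() {
  local attempts="$1" delay="$2" i rc=1
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    rc=$?
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return "$rc"
}

# ASSUMPTION: the coordinator serves GET /health; adjust if it does not.
coordinator_healthy() {
  curl -sSf -o /dev/null "${COORD}/health"
}

# Usage (not executed here): retry 12 5 coordinator_healthy
```

Polling 12 times at 5 second intervals gives the coordinator about a minute to come up. Both numbers are arbitrary starting points.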

Step 4 — Run Backfill

Write directly to the default_backfill namespace using the __namespace__ label:

# In the protobuf write request, add label:
# __namespace__ = "default_backfill"

Or use the coordinator endpoint:

POST http://m3coordinator:7201/api/v1/prom/remote/write?namespace=default_backfill

Backfill time range: 2026-03-11T00:00:00Z to 2026-04-01T00:00:00Z
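
The range above spans exactly 21 days, the same length as the 504h bufferPast, so whether the oldest samples are accepted depends on when the backfill actually runs. The 168h retention on default_backfill is 7 days, which is shorter than the range itself. The sketch below compares the age of the oldest sample against a limit in hours. It assumes writes older than bufferPast or older than the retention period are rejected. The helper within_window is hypothetical, and timestamps are Unix epoch seconds.

```shell
#!/usr/bin/env bash
# within_window NOW_EPOCH OLDEST_EPOCH LIMIT_HOURS
# Succeeds when the oldest sample is no older than LIMIT_HOURS at NOW_EPOCH.
within_window() {
  local now="$1" oldest="$2" limit_h="$3"
  local age_s=$(( now - oldest ))
  [ "$age_s" -le $(( limit_h * 3600 )) ]
}

START=1773187200   # 2026-03-11T00:00:00Z as epoch seconds
NOW="$(date -u +%s)"

# ASSUMPTION: samples older than these limits are rejected on write.
within_window "$NOW" "$START" 504 \
  || echo "oldest sample is outside bufferPast (504h)"
within_window "$NOW" "$START" 168 \
  || echo "oldest sample is outside default_backfill retention (168h)"
```

If either message prints, the start of the range may need to move forward, or the namespace options may need revisiting, before the backfill is attempted.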


Step 5 — Verify

curl -sS "http://m3coordinator:7201/api/v1/query" \
  --data-urlencode 'query=vllm:prompt_tokens_total' \
  --data-urlencode 'time=2026-03-20T12:00:00Z'
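
A single instant query only shows one point in the backfilled range. To spot gaps, the same query can be repeated at one timestamp per day across the whole range. This sketch assumes the query endpoint accepts Unix epoch values for the time parameter, as the Prometheus HTTP API does. The helpers probe_times and verify_backfill are hypothetical.

```shell
#!/usr/bin/env bash
COORD="${COORD:-http://m3coordinator.m3db.svc.cluster.local:7201}"

# probe_times START_EPOCH END_EPOCH STEP_SECONDS
# Prints epoch timestamps from START (inclusive) to END (exclusive).
probe_times() {
  local t="$1" end="$2" step="$3"
  while [ "$t" -lt "$end" ]; do
    echo "$t"
    t=$(( t + step ))
  done
}

# Run the Step 5 query once per day over the backfill range
# (2026-03-11T00:00:00Z to 2026-04-01T00:00:00Z) and print raw responses.
verify_backfill() {
  local t
  for t in $(probe_times 1773187200 1775001600 86400); do
    echo "== time=${t}"
    curl -sS "${COORD}/api/v1/query" \
      --data-urlencode 'query=vllm:prompt_tokens_total' \
      --data-urlencode "time=${t}"
    echo
  done
}
```

Calling verify_backfill issues 21 queries, one per day. An empty result for a given day suggests that part of the range was not written.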

Step 6 — Revert bufferPast (After Backfill)

# After the backfill completes, bufferPast cannot be shrunk back to 10m in place:
# only retentionPeriod is mutable, so reverting requires recreating the namespace.
# Alternatively, leave it as-is, since these are backfill-only namespaces.

Performance Notes

  • M3DB has been fast so far
  • New namespaces won't impact existing query performance
  • Queries can fan out to both old and new namespaces in parallel
  • After the backfill, consider consolidating the backfill namespaces with the existing ones (optional)