Files
irsa-webhook/TROUBLESHOOTING.md
2025-12-10 10:43:57 -05:00

10 KiB

Troubleshooting Guide

Common Issues and Solutions

1. Webhook Not Responding

Symptoms:

  • Pods fail to create with timeout errors
  • Events show webhook timeout
  • kubectl get pods hangs

Diagnosis:

# Check webhook pod status
kubectl get pods -n irsa-system

# View webhook logs
kubectl logs -n irsa-system -l app=irsa-webhook

# Check webhook service
kubectl get svc -n irsa-system irsa-webhook
kubectl get endpoints -n irsa-system irsa-webhook

Solutions:

  1. Webhook pods not running:

    kubectl describe pods -n irsa-system -l app=irsa-webhook
    # Fix image pull issues, resource constraints, etc.
    
  2. TLS certificate issues:

    # Regenerate certificates
    ./generate-certs.sh
    kubectl rollout restart deployment -n irsa-system irsa-webhook
    
  3. Service not routing correctly:

    # Check if service selectors match pod labels
    kubectl get svc -n irsa-system irsa-webhook -o yaml
    kubectl get pods -n irsa-system -l app=irsa-webhook --show-labels
    

2. Pods Not Being Mutated

Symptoms:

  • Pods create successfully but don't have injected configuration
  • Environment variables missing
  • Volume not mounted

Diagnosis:

# Check if ServiceAccount has annotation
kubectl get sa <service-account-name> -o yaml | grep vultr.com/role-arn

# Check webhook configuration
kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml

# View webhook logs for the specific pod creation
kubectl logs -n irsa-system -l app=irsa-webhook --tail=100

Solutions:

  1. ServiceAccount annotation missing:

    kubectl annotate sa <service-account-name> \
      vultr.com/role-arn="arn:aws:iam::123456789012:role/your-role"
    
  2. Namespace excluded from webhook: Check the namespaceSelector in the MutatingWebhookConfiguration:

    kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml
    
  3. Webhook not receiving requests:

    # Check webhook logs for incoming requests
    kubectl logs -n irsa-system -l app=irsa-webhook --tail=50
    
    # Verify webhook configuration matches service
    kubectl get mutatingwebhookconfiguration irsa-webhook -o jsonpath='{.webhooks[0].clientConfig}'
    

3. RBAC Permission Errors

Symptoms:

  • Webhook logs show "forbidden" or "unauthorized" errors
  • Error fetching ServiceAccounts

Diagnosis:

# Check webhook ServiceAccount permissions
kubectl auth can-i get serviceaccounts \
  --as=system:serviceaccount:irsa-system:irsa-webhook \
  --all-namespaces

# View RBAC resources
kubectl get clusterrole irsa-webhook -o yaml
kubectl get clusterrolebinding irsa-webhook -o yaml

Solutions:

  1. Missing RBAC permissions:

    # Reapply RBAC configuration
    kubectl apply -f deploy.yaml
    
  2. ServiceAccount not bound to role:

    kubectl get clusterrolebinding irsa-webhook -o yaml
    # Verify subjects include the correct ServiceAccount
    

4. TLS/Certificate Issues

Symptoms:

  • "x509: certificate signed by unknown authority"
  • "TLS handshake error"
  • Webhook returns 401 or 403

Diagnosis:

# Check certificate in secret
kubectl get secret -n irsa-system irsa-webhook-certs -o yaml

# Verify CA bundle in webhook config
kubectl get mutatingwebhookconfiguration irsa-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d

Solutions:

  1. Regenerate certificates:

    ./generate-certs.sh
    
  2. Manually update CA bundle:

    CA_BUNDLE=$(kubectl get secret -n irsa-system irsa-webhook-certs \
      -o jsonpath='{.data.ca\.crt}')
    
    kubectl patch mutatingwebhookconfiguration irsa-webhook \
      --type='json' \
      -p="[{'op': 'replace', 'path': '/webhooks/0/clientConfig/caBundle', 'value':'${CA_BUNDLE}'}]"
    
  3. Verify certificate SANs:

    kubectl get secret -n irsa-system irsa-webhook-certs \
      -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
    

5. AWS Credential Issues

Symptoms:

  • Pods can't authenticate with AWS
  • "Unable to locate credentials" error
  • "InvalidIdentityToken" error from AWS STS

Diagnosis:

# Check injected environment variables
kubectl exec <pod-name> -- env | grep AWS

# Verify token file exists
kubectl exec <pod-name> -- ls -la /var/run/secrets/vultr.com/serviceaccount/

# Check token contents (first 50 chars)
kubectl exec <pod-name> -- head -c 50 /var/run/secrets/vultr.com/serviceaccount/token

# Test AWS STS
kubectl exec <pod-name> -- aws sts get-caller-identity

Solutions:

  1. Token not mounted:

    • Verify pod has the volume and volume mount
    • Check webhook logs for mutation
    • Delete and recreate the pod
  2. IAM role trust policy issue: Ensure your IAM role has the correct trust policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/YOUR_OIDC_PROVIDER"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "YOUR_OIDC_PROVIDER:aud": "vultr"
            }
          }
        }
      ]
    }
    
  3. Wrong audience in token:

    • Verify the projected token has audience "vultr"
    • Check webhook configuration uses correct tokenAudience constant

6. Performance Issues

Symptoms:

  • Pod creation is slow
  • Webhook timeout warnings
  • High resource usage

Diagnosis:

# Check webhook resource usage
kubectl top pods -n irsa-system

# View webhook latency in logs
kubectl logs -n irsa-system -l app=irsa-webhook | grep "Processing pod"

# Check for throttling
kubectl describe pods -n irsa-system -l app=irsa-webhook

Solutions:

  1. Increase webhook timeout:

    kubectl patch mutatingwebhookconfiguration irsa-webhook \
      --type='json' \
      -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]'
    
  2. Scale webhook deployment:

    kubectl scale deployment -n irsa-system irsa-webhook --replicas=3
    
  3. Increase resource limits: Edit deploy.yaml and increase CPU/memory limits:

    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 512Mi
    

7. JSON Patch Generation Errors

Symptoms:

  • "Failed to generate patches" in webhook logs
  • Malformed patch errors
  • Array index out of bounds

Diagnosis:

# Enable verbose logging (add to deployment)
# Set LOG_LEVEL=debug in environment

# Check specific pod that failed
kubectl logs -n irsa-system -l app=irsa-webhook --tail=100 | grep -A 10 "Failed"

Solutions:

  1. Review pod specification:

    • Ensure pod spec is valid JSON
    • Check for unusual container configurations
  2. Update webhook logic:

    • Fix any bugs in generatePatches function
    • Add error handling for edge cases

8. Multiple Webhooks Conflict

Symptoms:

  • Pod mutations from other webhooks interfering
  • Unexpected pod configuration
  • Volume/env var conflicts

Diagnosis:

# List all mutating webhooks
kubectl get mutatingwebhookconfigurations

# Check webhook order
kubectl get mutatingwebhookconfigurations -o yaml | grep -A 5 "name:"

Solutions:

  1. Adjust webhook order: Webhooks are processed alphabetically by name. Rename if needed:

    # Add a prefix to control order
    kubectl patch mutatingwebhookconfiguration irsa-webhook \
      --type='json' \
      -p='[{"op": "replace", "path": "/metadata/name", "value": "01-irsa-webhook"}]'
    
  2. Add reinvocationPolicy:

    webhooks:
    - name: irsa.vultr.com
      reinvocationPolicy: IfNeeded  # or Never
    

Debug Commands Cheat Sheet

# View all webhook-related resources
kubectl get all -n irsa-system
kubectl get mutatingwebhookconfiguration irsa-webhook
kubectl get clusterrole irsa-webhook
kubectl get clusterrolebinding irsa-webhook

# Test webhook directly
kubectl run test-pod --image=nginx --dry-run=client -o yaml | \
  kubectl create -f - --namespace=default

# Watch webhook logs in real-time
kubectl logs -n irsa-system -l app=irsa-webhook -f

# Check webhook pod health
kubectl get pods -n irsa-system -l app=irsa-webhook -o wide
kubectl describe pods -n irsa-system -l app=irsa-webhook

# View recent events
kubectl get events -n irsa-system --sort-by='.lastTimestamp'

# Test ServiceAccount annotation
kubectl get sa -A -o jsonpath='{range .items[?(@.metadata.annotations.vultr\.com/role-arn)]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.metadata.annotations.vultr\.com/role-arn}{"\n"}{end}'

# Validate webhook configuration
kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml | grep -E "(caBundle|service|path|port)"

Getting Help

If you're still experiencing issues:

  1. Collect diagnostic information:

    # Run this script and save output
    kubectl get all -n irsa-system > diagnostics.txt
    kubectl logs -n irsa-system -l app=irsa-webhook --tail=200 >> diagnostics.txt
    kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml >> diagnostics.txt
    kubectl get events -n irsa-system >> diagnostics.txt
    
  2. Check webhook version:

    kubectl get deployment -n irsa-system irsa-webhook -o jsonpath='{.spec.template.spec.containers[0].image}'
    
  3. Review logs with timestamps:

    kubectl logs -n irsa-system -l app=irsa-webhook --timestamps=true --tail=100
    
  4. Test in isolation:

    • Create a separate test namespace
    • Deploy a simple test pod
    • Monitor webhook behavior

Prevention

Best Practices to Avoid Issues:

  1. Always test in a non-production cluster first
  2. Set failurePolicy: Ignore during initial deployment
  3. Monitor webhook performance and logs
  4. Keep certificates up to date (rotate every 365 days)
  5. Use resource limits to prevent webhook from consuming too much
  6. Implement readiness and liveness probes
  7. Scale webhook deployment for high-traffic clusters
  8. Document all ServiceAccount annotations