first commit

This commit is contained in:
2025-12-10 10:43:57 -05:00
commit 15d959a6c8
7 changed files with 1505 additions and 0 deletions

390
TROUBLESHOOTING.md Normal file
View File

@@ -0,0 +1,390 @@
# Troubleshooting Guide
## Common Issues and Solutions
### 1. Webhook Not Responding
**Symptoms:**
- Pods fail to create with timeout errors
- Events show webhook timeout
- `kubectl get pods` hangs
**Diagnosis:**
```bash
# Check webhook pod status
kubectl get pods -n irsa-system
# View webhook logs
kubectl logs -n irsa-system -l app=irsa-webhook
# Check webhook service
kubectl get svc -n irsa-system irsa-webhook
kubectl get endpoints -n irsa-system irsa-webhook
```
**Solutions:**
1. **Webhook pods not running:**
```bash
kubectl describe pods -n irsa-system -l app=irsa-webhook
# Fix image pull issues, resource constraints, etc.
```
2. **TLS certificate issues:**
```bash
# Regenerate certificates
./generate-certs.sh
kubectl rollout restart deployment -n irsa-system irsa-webhook
```
3. **Service not routing correctly:**
```bash
# Check if service selectors match pod labels
kubectl get svc -n irsa-system irsa-webhook -o yaml
kubectl get pods -n irsa-system -l app=irsa-webhook --show-labels
```
### 2. Pods Not Being Mutated
**Symptoms:**
- Pods create successfully but don't have injected configuration
- Environment variables missing
- Volume not mounted
**Diagnosis:**
```bash
# Check if ServiceAccount has annotation
kubectl get sa <service-account-name> -o yaml | grep vultr.com/role-arn
# Check webhook configuration
kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml
# View webhook logs for the specific pod creation
kubectl logs -n irsa-system -l app=irsa-webhook --tail=100
```
**Solutions:**
1. **ServiceAccount annotation missing:**
```bash
kubectl annotate sa <service-account-name> \
vultr.com/role-arn="arn:aws:iam::123456789012:role/your-role"
```
2. **Namespace excluded from webhook:**
Check the `namespaceSelector` in the MutatingWebhookConfiguration:
```bash
kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml
```
3. **Webhook not receiving requests:**
```bash
# Check webhook logs for incoming requests
kubectl logs -n irsa-system -l app=irsa-webhook --tail=50
# Verify webhook configuration matches service
kubectl get mutatingwebhookconfiguration irsa-webhook -o jsonpath='{.webhooks[0].clientConfig}'
```
### 3. RBAC Permission Errors
**Symptoms:**
- Webhook logs show "forbidden" or "unauthorized" errors
- Error fetching ServiceAccounts
**Diagnosis:**
```bash
# Check webhook ServiceAccount permissions
kubectl auth can-i get serviceaccounts \
--as=system:serviceaccount:irsa-system:irsa-webhook \
--all-namespaces
# View RBAC resources
kubectl get clusterrole irsa-webhook -o yaml
kubectl get clusterrolebinding irsa-webhook -o yaml
```
**Solutions:**
1. **Missing RBAC permissions:**
```bash
# Reapply RBAC configuration
kubectl apply -f deploy.yaml
```
2. **ServiceAccount not bound to role:**
```bash
kubectl get clusterrolebinding irsa-webhook -o yaml
# Verify subjects include the correct ServiceAccount
```
### 4. TLS/Certificate Issues
**Symptoms:**
- "x509: certificate signed by unknown authority"
- "TLS handshake error"
- Webhook returns 401 or 403
**Diagnosis:**
```bash
# Check certificate in secret
kubectl get secret -n irsa-system irsa-webhook-certs -o yaml
# Verify CA bundle in webhook config
kubectl get mutatingwebhookconfiguration irsa-webhook \
-o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
```
**Solutions:**
1. **Regenerate certificates:**
```bash
./generate-certs.sh
```
2. **Manually update CA bundle:**
```bash
CA_BUNDLE=$(kubectl get secret -n irsa-system irsa-webhook-certs \
-o jsonpath='{.data.ca\.crt}')
kubectl patch mutatingwebhookconfiguration irsa-webhook \
--type='json' \
-p="[{'op': 'replace', 'path': '/webhooks/0/clientConfig/caBundle', 'value':'${CA_BUNDLE}'}]"
```
3. **Verify certificate SANs:**
```bash
kubectl get secret -n irsa-system irsa-webhook-certs \
-o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
```
### 5. AWS Credential Issues
**Symptoms:**
- Pods can't authenticate with AWS
- "Unable to locate credentials" error
- "InvalidIdentityToken" error from AWS STS
**Diagnosis:**
```bash
# Check injected environment variables
kubectl exec <pod-name> -- env | grep AWS
# Verify token file exists
kubectl exec <pod-name> -- ls -la /var/run/secrets/vultr.com/serviceaccount/
# Check token contents (first 50 chars)
kubectl exec <pod-name> -- head -c 50 /var/run/secrets/vultr.com/serviceaccount/token
# Test AWS STS
kubectl exec <pod-name> -- aws sts get-caller-identity
```
**Solutions:**
1. **Token not mounted:**
- Verify pod has the volume and volume mount
- Check webhook logs for mutation
- Delete and recreate the pod
2. **IAM role trust policy issue:**
Ensure your IAM role has the correct trust policy:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/YOUR_OIDC_PROVIDER"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"YOUR_OIDC_PROVIDER:aud": "vultr"
}
}
}
]
}
```
3. **Wrong audience in token:**
- Verify the projected token has audience "vultr"
- Check webhook configuration uses correct tokenAudience constant
### 6. Performance Issues
**Symptoms:**
- Pod creation is slow
- Webhook timeout warnings
- High resource usage
**Diagnosis:**
```bash
# Check webhook resource usage
kubectl top pods -n irsa-system
# View webhook latency in logs
kubectl logs -n irsa-system -l app=irsa-webhook | grep "Processing pod"
# Check for throttling
kubectl describe pods -n irsa-system -l app=irsa-webhook
```
**Solutions:**
1. **Increase webhook timeout:**
```bash
kubectl patch mutatingwebhookconfiguration irsa-webhook \
--type='json' \
-p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]'
```
2. **Scale webhook deployment:**
```bash
kubectl scale deployment -n irsa-system irsa-webhook --replicas=3
```
3. **Increase resource limits:**
Edit deploy.yaml and increase CPU/memory limits:
```yaml
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
```
### 7. JSON Patch Generation Errors
**Symptoms:**
- "Failed to generate patches" in webhook logs
- Malformed patch errors
- Array index out of bounds
**Diagnosis:**
```bash
# Enable verbose logging (add to deployment)
# Set LOG_LEVEL=debug in environment
# Check specific pod that failed
kubectl logs -n irsa-system -l app=irsa-webhook --tail=100 | grep -A 10 "Failed"
```
**Solutions:**
1. **Review pod specification:**
- Ensure pod spec is valid JSON
- Check for unusual container configurations
2. **Update webhook logic:**
- Fix any bugs in generatePatches function
- Add error handling for edge cases
### 8. Multiple Webhooks Conflict
**Symptoms:**
- Pod mutations from other webhooks interfering
- Unexpected pod configuration
- Volume/env var conflicts
**Diagnosis:**
```bash
# List all mutating webhooks
kubectl get mutatingwebhookconfigurations
# Check webhook order
kubectl get mutatingwebhookconfigurations -o yaml | grep -A 5 "name:"
```
**Solutions:**
1. **Adjust webhook order:**
Webhooks are processed alphabetically by name. Rename if needed:
```bash
# Add a prefix to control order
kubectl patch mutatingwebhookconfiguration irsa-webhook \
--type='json' \
-p='[{"op": "replace", "path": "/metadata/name", "value": "01-irsa-webhook"}]'
```
2. **Add reinvocationPolicy:**
```yaml
webhooks:
- name: irsa.vultr.com
reinvocationPolicy: IfNeeded # or Never
```
## Debug Commands Cheat Sheet
```bash
# View all webhook-related resources
kubectl get all -n irsa-system
kubectl get mutatingwebhookconfiguration irsa-webhook
kubectl get clusterrole irsa-webhook
kubectl get clusterrolebinding irsa-webhook
# Test webhook directly
kubectl run test-pod --image=nginx --dry-run=client -o yaml | \
kubectl create -f - --namespace=default
# Watch webhook logs in real-time
kubectl logs -n irsa-system -l app=irsa-webhook -f
# Check webhook pod health
kubectl get pods -n irsa-system -l app=irsa-webhook -o wide
kubectl describe pods -n irsa-system -l app=irsa-webhook
# View recent events
kubectl get events -n irsa-system --sort-by='.lastTimestamp'
# Test ServiceAccount annotation
kubectl get sa -A -o jsonpath='{range .items[?(@.metadata.annotations.vultr\.com/role-arn)]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.metadata.annotations.vultr\.com/role-arn}{"\n"}{end}'
# Validate webhook configuration
kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml | grep -E "(caBundle|service|path|port)"
```
## Getting Help
If you're still experiencing issues:
1. **Collect diagnostic information:**
```bash
# Run this script and save output
kubectl get all -n irsa-system > diagnostics.txt
kubectl logs -n irsa-system -l app=irsa-webhook --tail=200 >> diagnostics.txt
kubectl get mutatingwebhookconfiguration irsa-webhook -o yaml >> diagnostics.txt
kubectl get events -n irsa-system >> diagnostics.txt
```
2. **Check webhook version:**
```bash
kubectl get deployment -n irsa-system irsa-webhook -o jsonpath='{.spec.template.spec.containers[0].image}'
```
3. **Review logs with timestamps:**
```bash
kubectl logs -n irsa-system -l app=irsa-webhook --timestamps=true --tail=100
```
4. **Test in isolation:**
- Create a separate test namespace
- Deploy a simple test pod
- Monitor webhook behavior
## Prevention
**Best Practices to Avoid Issues:**
1. Always test in a non-production cluster first
2. Set `failurePolicy: Ignore` during initial deployment
3. Monitor webhook performance and logs
4. Keep certificates up to date (rotate every 365 days)
5. Use resource limits to prevent webhook from consuming too much
6. Implement readiness and liveness probes
7. Scale webhook deployment for high-traffic clusters
8. Document all ServiceAccount annotations