Troubleshooting Runbooks

📦 v1.0.0 · 📅 2026-04-28 · 🔄 Updated 2026-04-28 · 👤 Admin Team
Tags: administration, troubleshooting, runbooks, ops

Step-by-step procedures for the most common production incidents.


Runbook 1: Core Service Down

Symptoms:

  • "Core is unavailable — creating and starting campaigns is temporarily disabled" banner in the UI
  • POST /api/campaigns returns 503
  • Diagnostics page shows Core API as Unavailable

Impact: Campaign creation and moderation actions (approve/start/reject/delete) are blocked. Existing campaigns continue running — Core is still dispatching, only the BFF-to-Core link is broken.

Steps:

  1. Check Core service health directly:

    curl http://<core-host>:8092/health
    

    If this fails, Core itself is down — escalate to the Core operator.

  2. Check network connectivity from the Message Center pod:

    kubectl exec -it <pod> -- curl -k "https://<core-host>:8080/api/v1/jobs?limit=1" \
      -H "X-API-Key: $CORE_ADMIN_API_KEY"
    
  3. Check logs for TLS errors (see mTLS Certificates if certificates are the cause):

    kubectl logs -l app=core-admin --tail=200 | grep -E "core|tls|certificate"
    
  4. Once Core recovers, the banner clears automatically within 60 seconds (the health probe polling interval); no restart is needed. The polling sketch after this list watches for recovery.

  5. Verify active campaigns: open the Campaigns list and confirm in-progress campaigns show current status (status updates resume as the BFF can contact Core again).
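
To watch for recovery from a terminal, a small polling loop works. A minimal sketch, assuming the health endpoint answers plain HTTP 200 when Core is healthy (substitute <core-host> as above):

    # Poll Core health every 10 s until it answers with HTTP 200.
    until [ "$(curl -s -o /dev/null -w '%{http_code}' http://<core-host>:8092/health)" = "200" ]; do
      echo "Core still unhealthy; retrying in 10 s..."
      sleep 10
    done
    echo "Core is healthy; the UI banner should clear within ~60 s."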


Runbook 2: Audit Fallback Overflow

Symptoms:

  • [audit-fallback-overflow] entries in application logs
  • GET /api/diagnostics/audit shows fallbackFileSize at or near 209715200 bytes (200 MB)
  • Audit events are being silently dropped

Impact: New audit log entries are lost until MongoDB is restored and the fallback file is drained. Compliance events for the outage window cannot be recovered.

Steps:

  1. Identify why MongoDB is unreachable:

    kubectl logs -l app=core-admin | grep -i mongo
    mongosh "$MONGODB_URI" --eval 'db.adminCommand({ping:1})'
    
  2. Restore MongoDB connectivity (fix network, scale up, restore from backup as needed).

  3. Once MongoDB is reachable, the drain runs automatically on the next logAuditEvent call. Trigger it by performing any auditable action (e.g., load a page, invite a member).

  4. Verify the drain completed (a polling sketch follows this list):

    curl -H "Cookie: ..." https://<host>/api/diagnostics/audit
    # Expected: { "fallbackFileSize": 0, "fallbackLines": 0 }
    
  5. Document the outage window. Events during the overflow are permanently lost, so record the time range for compliance purposes; the log queries in the sketch below help bound it.
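
If the drain is slow, the diagnostics check in step 4 can be wrapped in a loop. A minimal sketch, assuming jq is installed and $SESSION_COOKIE holds a valid admin session cookie (both are assumptions, not part of the documented tooling); the log queries at the end bound the outage window for step 5:

    # Poll the audit diagnostics endpoint until the fallback file is drained.
    until [ "$(curl -s -H "Cookie: $SESSION_COOKIE" "https://<host>/api/diagnostics/audit" \
                | jq -r '.fallbackFileSize')" = "0" ]; do
      echo "Fallback file not yet drained; retrying in 15 s..."
      sleep 15
    done
    echo "Drain complete: fallbackFileSize is 0."

    # Bound the outage window from the first and last overflow log entries.
    kubectl logs -l app=core-admin --timestamps | grep 'audit-fallback-overflow' | head -1
    kubectl logs -l app=core-admin --timestamps | grep 'audit-fallback-overflow' | tail -1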


Runbook 3: Stale Shadow Status

Symptoms:

  • Campaigns on the list page show statuses that don't match Core (e.g., "New" for a campaign that is actually "Done" in Core)
  • synced_at field in the Mongo campaigns collection shows a timestamp more than a few minutes old for active campaigns

Background: The Mongo shadow's status field is updated lazily by the BFF when it lists campaigns, so a short lag is normal. This runbook covers the case where shadows are stuck; the query below helps confirm staleness before running the backfill.
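
To confirm how stale the shadows are, the campaigns collection can be queried directly. A minimal sketch using the status and synced_at fields described above; the 10-minute threshold is an arbitrary example:

    mongosh "$MONGODB_URI" --eval '
      db.campaigns.find(
        { synced_at: { $lt: new Date(Date.now() - 10 * 60 * 1000) } },  // older than 10 min
        { status: 1, synced_at: 1 }                                     // project only the relevant fields
      ).sort({ synced_at: 1 }).limit(20)
    '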

Steps:

  1. Run the backfill script (dry run first):

    make backfill-campaign-status    # dry run: prints what would change
    

    The script calls Core for each shadow campaign and updates status and synced_at. Once the preview looks correct, re-run it without --dry-run to apply the changes.

  2. If Core is down, the backfill will skip unreachable jobs. Retry after Core recovers.

  3. Verify: reload the Campaigns list — statuses should now match Core's view.


Runbook 4: Upload Timeouts

Symptoms:

  • Campaign creation fails during file upload phase
  • Errors like upload timed out or ETIMEDOUT in logs
  • Large files (hundreds of MB) fail consistently; small files succeed

Background: Recipient file uploads stream from the browser → BFF temp disk → Core, governed by a 30-minute bodyTimeout. The default comfortably handles 1 GB over a 10 Mbit/s WAN link (roughly 800 seconds of transfer time); the sizing arithmetic is sketched below.
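
As a rough sizing rule, the minimum transfer time is file size divided by link bandwidth; the timeout needs headroom on top. A quick shell check (the 2 GB / 10 Mbit/s inputs are example values, not defaults):

    # Minimum transfer time in seconds = (GB * 8000 Mbit) / (link speed in Mbit/s).
    # E.g., 1 GB over 10 Mbit/s: 8000 / 10 = 800 s ≈ 13.3 min, well inside the 30 min default.
    FILE_GB=2
    LINK_MBIT=10
    echo "minimum transfer time: $((FILE_GB * 8000 / LINK_MBIT)) seconds"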

Steps:

  1. Check the upload timeout setting:

    echo $CORE_UPLOAD_BODY_TIMEOUT_MS   # default: 1800000 (30 min)
    
  2. If the network between the BFF and Core is slow, raise the timeout (the timing probe after this list helps measure the actual link):

    # In ConfigMap:
    CORE_UPLOAD_BODY_TIMEOUT_MS: "3600000"  # 60 min
    
  3. Check temp disk space on the BFF pod:

    kubectl exec -it <pod> -- df -h /tmp
    

    If /tmp is full, delete stale temp files (they are named core-admin-uploads/*.bin):

    kubectl exec -it <pod> -- find /tmp/core-admin-uploads -name "*.bin" -mmin +60 -delete
    
  4. Verify Core's /api/v1/jobs/{id}/upload endpoint is reachable with the admin API key.

  5. For very large files (>1 GB): raise UPLOAD_MAX_BYTES and ensure /tmp has sufficient space.
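
To measure the BFF-to-Core link for step 2, curl's timing variables give a quick read. A minimal sketch reusing the jobs endpoint from Runbook 1; speed_download measures the response download, so treat it as a rough indicator of link quality rather than upload throughput:

    # Print connection timing and effective transfer speed from the BFF pod to Core.
    kubectl exec -it <pod> -- curl -sk -o /dev/null \
      -w 'connect: %{time_connect}s  total: %{time_total}s  speed: %{speed_download} B/s\n' \
      -H "X-API-Key: $CORE_ADMIN_API_KEY" \
      "https://<core-host>:8080/api/v1/jobs?limit=1"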


Runbook 5: OOM / High Memory

Symptoms:

  • Pod restarts with OOMKilled exit code
  • Memory usage consistently above 85% of limit
  • [core-large] log entries appearing frequently

Background: The normal memory consumers are the MongoDB driver's connection pool and Next.js SSR. The streaming upload path is O(chunk size) and should not cause OOM. Uploads going through coreUploadFile (the non-streaming path) would buffer the whole file in memory and could cause OOM; this should not occur in the current implementation, but it would show up in logs as very large allocations during campaigns with huge recipient lists.

Steps:

  1. Check which component is consuming memory (and confirm the OOMKilled reason with the sketch after this list):

    kubectl top pod -l app=core-admin
    
  2. If [core-large] appears frequently, Core is returning unusually large responses. Check CORE_MAX_RESPONSE_BYTES and whether the payload sizes are expected.

  3. If upload requests are suspected (large campaigns), verify the streaming upload path is in use: logs should not show [core-slow] entries immediately followed by large memory spikes.

  4. Short-term: raise the memory limit in k8s/deployment.yaml:

    limits:
      memory: 1Gi   # increase from 512Mi
    
  5. Long-term: reduce CORE_AGENT_CONNECTIONS to limit concurrent Core calls, or investigate which route triggers the large payload.
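
To confirm that restarts are memory-related (step 1), the container's last termination state records the reason. A sketch using kubectl's JSONPath output:

    # Print each pod's name and the reason its container last terminated;
    # "OOMKilled" confirms a memory-limit kill.
    kubectl get pods -l app=core-admin -o \
      jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'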


Runbook 6: "Too Many Requests" (Rate Limiting)

Symptoms:

  • UI shows Too many requests. Try again in Xs
  • 429 responses from the BFF

Steps:

  1. This is expected behavior under high load or during automated testing. Wait for the cooldown period shown in the error message.

  2. For production load: check which endpoint is being rate-limited by filtering logs (a per-endpoint breakdown is sketched after this list):

    kubectl logs -l app=core-admin | grep "429"
    
  3. If the rate limit is triggered by legitimate usage patterns, contact the development team to review the rate limit thresholds.
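
To see which routes dominate the 429s (step 2), the matching lines can be grouped by path. A minimal sketch that assumes each log line carries the request path as a whitespace-separated token starting with "/"; adjust the awk to the actual log format:

    # Count 429 responses per endpoint over the last hour.
    kubectl logs -l app=core-admin --since=1h | grep ' 429' \
      | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }' \
      | sort | uniq -c | sort -rn | head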


Next Steps