# Troubleshooting Runbooks
Step-by-step procedures for the most common production incidents.
## Runbook 1: Core Service Down
Symptoms:
- "Core is unavailable — creating and starting campaigns is temporarily disabled" banner in the UI
- `POST /api/campaigns` returns `503`
- Diagnostics page shows Core API as Unavailable
Impact: Campaign creation and moderation actions (approve/start/reject/delete) are blocked. Existing campaigns continue running — Core is still dispatching, only the BFF-to-Core link is broken.
Steps:
- Check Core service health directly:

  ```bash
  curl http://<core-host>:8092/health
  ```

  If this fails, Core itself is down — escalate to the Core operator.
- Check network connectivity from the Message Center pod:

  ```bash
  kubectl exec -it <pod> -- curl -k "https://<core-host>:8080/api/v1/jobs?limit=1" \
    -H "X-API-Key: $CORE_ADMIN_API_KEY"
  ```
- Check logs for TLS errors (see mTLS Certificates if certificates are the cause):

  ```bash
  kubectl logs -l app=core-admin --tail=200 | grep -E "core|tls|certificate"
  ```
- If Core recovers: the banner clears automatically within 60 seconds (the health probe polling interval). No restart is needed.
- Verify active campaigns: open the Campaigns list and confirm in-progress campaigns show current status (status updates resume once the BFF can contact Core again).
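While waiting for recovery, you can watch the same signal the banner uses. A minimal sketch, assuming the Diagnostics page is backed by a JSON endpoint that reports Core status (the `/api/diagnostics` path and the `coreApi` field here are assumptions; adjust both to your deployment):

```bash
# Hypothetical watch loop: poll the BFF diagnostics until Core reports healthy.
while true; do
  core=$(curl -s -H "Cookie: ..." "https://<host>/api/diagnostics" | jq -r '.coreApi')
  echo "$(date -u +%H:%M:%SZ) coreApi=$core"
  [ "$core" != "Unavailable" ] && break
  sleep 10   # well under the 60 s probe interval
done
```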
## Runbook 2: Audit Fallback Overflow
Symptoms:
- `[audit-fallback-overflow]` entries in application logs
- `GET /api/diagnostics/audit` shows `fallbackFileSize` at or near 209715200 (200 MB)
- Audit events are being silently dropped
Impact: New audit log entries are lost until MongoDB is restored and the fallback file is drained. Compliance events for the outage window cannot be recovered.
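Before touching MongoDB, it can be worth checking how close the fallback file actually is to the cap. A quick check against the same diagnostics endpoint used to verify the drain below (`jq` is only for readability):

```bash
# How close is the fallback file to the 209715200-byte (200 MB) cap?
curl -s -H "Cookie: ..." "https://<host>/api/diagnostics/audit" \
  | jq '{fallbackFileSize, fallbackLines}'
```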
Steps:
- Identify why MongoDB is unreachable:

  ```bash
  kubectl logs -l app=core-admin | grep -i mongo
  mongosh "$MONGODB_URI" --eval 'db.adminCommand({ping:1})'
  ```
- Restore MongoDB connectivity (fix network, scale up, restore from backup as needed).
- Once MongoDB is reachable, the drain runs automatically on the next `logAuditEvent` call. Trigger it by performing any auditable action (e.g., load a page, invite a member).
- Verify the drain completed:

  ```bash
  curl -H "Cookie: ..." https://<host>/api/diagnostics/audit
  # Expected: { "fallbackFileSize": 0, "fallbackLines": 0 }
  ```

- Document the outage window. Events during overflow are permanently lost — note the time range for compliance purposes (see the sketch after this list).
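For the compliance note in the last step, the lost-events window can be bounded from the pod logs. A sketch, assuming the standard `kubectl logs --timestamps` flag, which prefixes each line with an RFC3339 timestamp:

```bash
# First and last overflow entries bound the window of dropped audit events.
kubectl logs -l app=core-admin --timestamps \
  | grep "audit-fallback-overflow" \
  | awk '{print $1}' \
  | sed -n '1p;$p'   # print only the first and last timestamps
```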
## Runbook 3: Stale Shadow Status
Symptoms:
- Campaigns on the list page show statuses that don't match Core (e.g., "New" for a campaign that is actually "Done" in Core)
- `synced_at` field in the Mongo `campaigns` collection shows a timestamp more than a few minutes old for active campaigns
Background: The Mongo shadow's status field is updated lazily by the BFF when listing campaigns. A short lag is normal. This runbook covers the case where shadows are stuck.
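Before running the backfill, you can gauge how many shadows are actually stuck by querying the collection directly. A sketch, assuming `$MONGODB_URI` selects the right database and using a 10-minute staleness threshold:

```bash
# List shadow campaigns whose synced_at is more than 10 minutes old.
mongosh "$MONGODB_URI" --eval '
  db.campaigns.find(
    { synced_at: { $lt: new Date(Date.now() - 10 * 60 * 1000) } },
    { status: 1, synced_at: 1 }
  ).limit(20)
'
```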
Steps:
- Run the backfill script (dry run first):

  ```bash
  make backfill-campaign-status --dry-run   # preview — prints what would change
  make backfill-campaign-status             # run without --dry-run to apply
  ```

  This script calls Core for each shadow campaign and updates `status` and `synced_at`.
- If Core is down, the backfill will skip unreachable jobs. Retry after Core recovers.
- Verify: reload the Campaigns list — statuses should now match Core's view.
## Runbook 4: Upload Timeouts
Symptoms:
- Campaign creation fails during the file upload phase
- Errors like `upload timed out` or `ETIMEDOUT` in logs
- Large files (hundreds of MB) fail consistently; small files succeed
Background: Recipient file uploads stream from the browser → BFF temp disk → Core, governed by a 30-minute `bodyTimeout`. The default should handle 1 GB over a 10 Mbit WAN link.
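That default is easy to sanity-check: 1 GB is 8,000 Mbit, so at 10 Mbit/s the transfer takes about 800 seconds (roughly 13 minutes), well inside 30 minutes. A tiny helper for other sizes and link speeds:

```bash
# Estimate transfer time to sanity-check CORE_UPLOAD_BODY_TIMEOUT_MS.
# Usage: transfer_secs <size in GB> <rate in Mbit/s>
transfer_secs() { echo "$(( $1 * 8000 / $2 )) seconds"; }
transfer_secs 1 10   # -> 800 seconds (~13 min)
```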
Steps:
- Check the upload timeout setting:

  ```bash
  echo $CORE_UPLOAD_BODY_TIMEOUT_MS   # default: 1800000 (30 min)
  ```
- If the network between BFF and Core is slow, raise the timeout:

  ```yaml
  # In the ConfigMap:
  CORE_UPLOAD_BODY_TIMEOUT_MS: "3600000"   # 60 min
  ```
- Check temp disk space on the BFF pod:

  ```bash
  kubectl exec -it <pod> -- df -h /tmp
  ```

  If `/tmp` is full, delete stale temp files (they are named `core-admin-uploads/*.bin`):

  ```bash
  kubectl exec -it <pod> -- find /tmp -name "*.bin" -mmin +60 -delete
  ```
- Verify Core's `/api/v1/jobs/{id}/upload` endpoint is reachable with the admin API key (see the sketch after this list).
- For very large files (>1 GB): raise `UPLOAD_MAX_BYTES` and ensure `/tmp` has sufficient space.
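A sketch for the reachability check above, reusing the host, port, and API-key pattern from Runbook 1. `<job-id>` is any existing job; the status code returned for a bodyless request is deployment-specific, and the point is simply to get an HTTP response from Core rather than a timeout or TLS failure:

```bash
# Probe the upload route without sending a file; any HTTP status (even 405)
# proves the route and API key reach Core.
curl -k -s -o /dev/null -w "%{http_code}\n" \
  -H "X-API-Key: $CORE_ADMIN_API_KEY" \
  "https://<core-host>:8080/api/v1/jobs/<job-id>/upload"
```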
## Runbook 5: OOM / High Memory
Symptoms:
- Pod restarts with `OOMKilled` exit code
- Memory usage consistently above 85% of the limit
- `[core-large]` log entries appearing frequently
Background: Normal memory consumers are MongoDB driver connection pooling and Next.js SSR. The streaming upload path is O(chunk size) and should not cause OOM. File uploads going through `coreUploadFile` (the non-streaming path) would cause OOM; this should not occur in the current implementation, but it would appear in logs as very large allocations during campaigns with huge recipient lists.
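To correlate `[core-large]` entries with actual memory growth, sample usage over time rather than spot-checking. A sketch, assuming metrics-server is installed (it backs `kubectl top`):

```bash
# Sample pod memory every 30 s; line the spikes up against [core-large] log times.
while true; do
  echo "$(date -u +%FT%TZ) $(kubectl top pod -l app=core-admin --no-headers)"
  sleep 30
done | tee memory-samples.log
```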
Steps:
- Check which component is consuming memory:

  ```bash
  kubectl top pod -l app=core-admin
  ```
- If `[core-large]` appears frequently, Core is returning unusually large responses. Check `CORE_MAX_RESPONSE_BYTES` and whether the payload sizes are expected.
- If upload requests are suspected (large campaigns), verify the streaming upload path is in use — logs should NOT show `[core-slow]` immediately followed by large memory spikes.
- Short-term: raise the memory limit in `k8s/deployment.yaml`:

  ```yaml
  limits:
    memory: 1Gi   # increase from 512Mi
  ```
- Long-term: reduce `CORE_AGENT_CONNECTIONS` to limit concurrent Core calls, or investigate which route triggers the large payload.
## Runbook 6: "Too Many Requests" (Rate Limiting)
Symptoms:
- UI shows `Too many requests. Try again in Xs`
- `429` responses from the BFF
Steps:
- This is expected behavior under high load or during automated testing. Wait for the cooldown period shown in the error message.
- For production load: check which endpoint is being rate-limited by filtering logs:

  ```bash
  kubectl logs -l app=core-admin | grep "429"
  ```
- If the rate limit is triggered by legitimate usage patterns, contact the development team to review the rate limit thresholds (see the sketch after this list).
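To see which threshold a workload is hitting, a short burst from a shell shows the split between successes and 429s. The endpoint, cookie, and request count here are placeholders:

```bash
# Fire a burst and tally status codes; the 429 count shows the limiter engaging.
for i in $(seq 1 50); do
  curl -s -o /dev/null -w "%{http_code}\n" -H "Cookie: ..." "https://<host>/api/campaigns"
done | sort | uniq -c
```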
## Next Steps
- Monitoring & Alerts — set up alerts so incidents are caught early
- Backups & Recovery — restore procedures for data loss scenarios