Monitoring & Alerts
Message Center emits structured log sentinels for key operational events and exposes a diagnostics API for system health. This page covers what to watch for in production.
Log Sentinels
The application emits structured warning messages with distinctive prefixes when important thresholds are crossed. These are the primary signals to alert on.
| Sentinel | Trigger | Severity | Action |
|---|---|---|---|
| [core-slow] | Core API call took more than 5 seconds | Warning | Check Core service load; consider raising CORE_BODY_TIMEOUT_MS if legitimate |
| [core-large] | Core API response body exceeded 8 MB (half of CORE_MAX_RESPONSE_BYTES) | Warning | Investigate what endpoint is returning large payloads |
| [audit-fallback] | Audit fallback file growing above 50 MB | Warning | MongoDB may be down or slow; check DB connectivity |
| [audit-fallback-overflow] | Audit fallback file reached 200 MB hard cap; events are being dropped | Critical | Restore MongoDB immediately; events are lost |
Example log lines
[core-slow] GET /api/v1/jobs 6342ms
[core-large] GET /api/v1/jobs/ext/abc123 9437542 bytes
[audit-fallback] file size 54321234 bytes
[audit-fallback-overflow] dropping audit event: campaign.created
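As a rough illustration of routing on these prefixes, the shell sketch below maps a single log line to the severity listed in the sentinel table. The function name sentinel_severity is hypothetical and not part of Message Center; it only pattern-matches the documented prefixes.

```shell
#!/bin/sh
# Hypothetical helper: print the severity for a log line based on its sentinel prefix.
# Prints "critical", "warning", or "none" (no sentinel found).
sentinel_severity() {
  case "$1" in
    # Quoted brackets are matched literally, so "[audit-fallback]" cannot
    # accidentally match an "[audit-fallback-overflow]" line.
    *"[audit-fallback-overflow]"*) echo "critical" ;;
    *"[core-slow]"*|*"[core-large]"*|*"[audit-fallback]"*) echo "warning" ;;
    *) echo "none" ;;
  esac
}

# Example: classify each line of a log stream read from stdin.
# while IFS= read -r line; do sentinel_severity "$line"; done
```

A wrapper like this could feed a pager or chat hook in environments without Loki; with Loki available, the log-based alert rules later on this page are likely the better fit.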
Diagnostics API
System health endpoint
GET /api/diagnostics — visible on the Diagnostics page in the UI, accessible to all authenticated users.
Returns Core API health, DB schema version, and audit fallback status.
Audit-specific diagnostics (super_admin only)
GET /api/diagnostics/audit
{
  "fallbackFileSize": 0,
  "fallbackLines": 0,
  "retentionDays": 90,
  "schemaVersion": 9
}
- fallbackFileSize: bytes in the disk-fallback file. Should be 0 in a healthy system.
- fallbackLines: number of unsynced audit events waiting to be drained.
- A non-zero fallbackFileSize means MongoDB was unavailable at some point and events are queued for replay.
Grafana Dashboard
Set GRAFANA_PUBLIC_URL to embed an existing Grafana dashboard at /monitoring. Grafana must have GF_SECURITY_ALLOW_EMBEDDING=true configured.
No Grafana-side configuration is provided by Message Center — you bring your own dashboards. Useful panels to include:
- Pod CPU and memory from Kubernetes metrics
- MongoDB connection pool depth and operation latency
- Request rate and error rate on port 3000
- Core API call latency (supplement with [core-slow] alerts)
Recommended Alert Rules
Loki (log-based alerts)
groups:
  - name: message-center
    rules:
      - alert: CoreApiSlow
        expr: |
          count_over_time({app="core-admin"} |= "[core-slow]" [5m]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Core API slow calls (>3 in 5m)"
      - alert: AuditFallbackOverflow
        expr: |
          count_over_time({app="core-admin"} |= "[audit-fallback-overflow]" [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Audit events are being dropped; restore MongoDB immediately"
      - alert: AuditFallbackGrowing
        expr: |
          count_over_time({app="core-admin"} |= "[audit-fallback]" [15m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Audit fallback file growing; check MongoDB connectivity"
Prometheus / kube-state-metrics
- alert: MessageCenterPodDown
  expr: |
    kube_deployment_status_replicas_available{deployment="core-admin"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "All Message Center pods are down"
- alert: MessageCenterHighMemory
  expr: |
    container_memory_working_set_bytes{container="core-admin"}
      / container_spec_memory_limit_bytes{container="core-admin"} > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Message Center memory above 85% of limit"
Health Check Endpoint
GET /api/health — returns 200 OK with {"status":"ok"} when the Next.js server is running. Used by Kubernetes liveness and readiness probes.
This endpoint does not test MongoDB or Core connectivity — it only verifies the application process is alive. Use the Diagnostics page for deeper health checks.
Audit Fallback Monitoring
The audit fallback file at $TMPDIR/core-admin-audit-fallback.jsonl grows when MongoDB is unreachable. On the next call to logAuditEvent after MongoDB recovers, the file is drained back into the database automatically (streaming — no memory spike).
To monitor fallback state without the UI:
# File size (should be 0 in a healthy system)
du -sh "$TMPDIR/core-admin-audit-fallback.jsonl" 2>/dev/null || echo "file absent (normal)"
# Line count
wc -l "$TMPDIR/core-admin-audit-fallback.jsonl" 2>/dev/null || echo "0"
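Building on the commands above, the following sketch compares the file size against the 50 MB warning and 200 MB hard-cap thresholds from the sentinel table. The function names are hypothetical, and treating "MB" as 1024 × 1024 bytes is an assumption; adjust the constants if the application uses decimal megabytes.

```shell
#!/bin/sh
# Thresholds from the sentinel table. Interpreting MB as MiB is an assumption.
WARN_BYTES=$((50 * 1024 * 1024))
CAP_BYTES=$((200 * 1024 * 1024))

# Hypothetical helper: print the size of a file in bytes, or 0 if it is absent.
fallback_file_bytes() {
  if [ -f "$1" ]; then
    # tr strips the leading padding that some wc implementations emit.
    wc -c < "$1" | tr -d ' '
  else
    echo 0
  fi
}

# Hypothetical helper: print "ok", "warning", or "critical" for a byte count.
fallback_level() {
  bytes=$1
  if [ "$bytes" -ge "$CAP_BYTES" ]; then
    echo "critical"   # hard cap reached: events are being dropped
  elif [ "$bytes" -gt "$WARN_BYTES" ]; then
    echo "warning"    # file has grown above the warning threshold
  else
    echo "ok"
  fi
}

# Example:
# fallback_level "$(fallback_file_bytes "$TMPDIR/core-admin-audit-fallback.jsonl")"
```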
For automated drain verification, call the diagnostics API after deployment and assert fallbackFileSize === 0.
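One possible shape for such a post-deployment check is sketched below. It reads the diagnostics JSON on stdin and fails unless fallbackFileSize is present and equal to 0. The function names, BASE_URL, and AUTH_HEADER are hypothetical; how a super_admin session authenticates is not described on this page, so the curl call is left as a commented example. The sed extraction keeps the sketch dependency-free; a JSON-aware tool such as jq would be more robust if it is available.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric fallbackFileSize from JSON on stdin.
# Works for both pretty-printed and compact responses.
audit_fallback_size() {
  sed -n 's/.*"fallbackFileSize"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only when the fallback file is fully drained.
# Returns 2 if the field is missing, 1 if it is non-zero, 0 otherwise.
assert_fallback_drained() {
  size=$(audit_fallback_size)
  if [ -z "$size" ]; then
    echo "fallbackFileSize not found in response" >&2
    return 2
  fi
  if [ "$size" -ne 0 ]; then
    echo "audit fallback not drained: $size bytes pending" >&2
    return 1
  fi
  echo "audit fallback drained"
}

# Example call (BASE_URL and AUTH_HEADER are assumptions for your deployment):
# curl -fsS -H "$AUTH_HEADER" "$BASE_URL/api/diagnostics/audit" | assert_fallback_drained
```

Distinguishing a missing field from a non-zero value lets a pipeline tell an authorization or routing problem apart from a genuinely undrained fallback file.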
Next Steps
- Backups & Recovery — MongoDB backup strategy
- Troubleshooting Runbooks — step-by-step incident response