Monitoring & Alerts

v1.0.0 | 2026-04-28 | Updated 2026-04-28 | Admin Team

Message Center emits structured log sentinels for key operational events and exposes a diagnostics API for system health. This page covers what to watch for in production.

Log Sentinels

The application emits structured warning messages with distinctive prefixes when important thresholds are crossed. These are the primary signals to alert on.

Sentinel	Trigger	Severity	Action
`[core-slow]`	Core API call took more than 5 seconds	Warning	Check Core service load; consider raising `CORE_BODY_TIMEOUT_MS` if the slow calls are legitimate
`[core-large]`	Core API response body exceeded 8 MB (half of `CORE_MAX_RESPONSE_BYTES`)	Warning	Investigate which endpoint is returning large payloads
`[audit-fallback]`	Audit fallback file has grown above 50 MB	Warning	MongoDB may be down or slow; check DB connectivity
`[audit-fallback-overflow]`	Audit fallback file reached the 200 MB hard cap; events are being dropped	Critical	Restore MongoDB immediately; dropped events are lost

Example log lines

[core-slow] GET /api/v1/jobs 6342ms
[core-large] GET /api/v1/jobs/ext/abc123 9437542 bytes
[audit-fallback] file size 54321234 bytes
[audit-fallback-overflow] dropping audit event: campaign.created
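For log pipelines that do not use Loki, the sentinels can be picked out with a small parser. The following is a minimal sketch, not part of Message Center: `parse_sentinel` and `SentinelEvent` are hypothetical names, and it assumes the bracketed prefix may be preceded by a timestamp or log level, so it searches the whole line instead of anchoring at the start.

```python
import re
from typing import NamedTuple, Optional

# Severity per sentinel, copied from the table above.
SENTINEL_SEVERITY = {
    "core-slow": "warning",
    "core-large": "warning",
    "audit-fallback": "warning",
    "audit-fallback-overflow": "critical",
}

# The closing bracket is part of the match, so "[audit-fallback]" never
# matches a "[audit-fallback-overflow]" line by accident.
_SENTINEL_RE = re.compile(
    r"\[(core-slow|core-large|audit-fallback-overflow|audit-fallback)\]\s*(.*)"
)


class SentinelEvent(NamedTuple):
    sentinel: str   # prefix without brackets, e.g. "core-slow"
    severity: str   # "warning" or "critical"
    detail: str     # remainder of the line after the prefix


def parse_sentinel(line: str) -> Optional[SentinelEvent]:
    """Return the sentinel found in a log line, or None if there is none."""
    match = _SENTINEL_RE.search(line)
    if match is None:
        return None
    name = match.group(1)
    return SentinelEvent(name, SENTINEL_SEVERITY[name], match.group(2).strip())
```

Calling `parse_sentinel` on each incoming line and routing on `severity` is enough to page on `audit-fallback-overflow` while only recording the warning-level sentinels.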

Diagnostics API

System health endpoint

GET /api/diagnostics — visible on the Diagnostics page in the UI, accessible to all authenticated users.

Returns Core API health, DB schema version, and audit fallback status.

Audit-specific diagnostics (super_admin only)

GET /api/diagnostics/audit

{
  "fallbackFileSize": 0,
  "fallbackLines": 0,
  "retentionDays": 90,
  "schemaVersion": 9
}

fallbackFileSize: bytes in the disk-fallback file. Should be 0 in a healthy system.
fallbackLines: number of unsynced audit events waiting to be drained.
A non-zero fallbackFileSize means MongoDB was unavailable at some point and events are queued for replay.
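A script that polls this endpoint can turn the payload into a single status. This is a sketch under stated assumptions: the function name and the status labels are hypothetical, the 50 MB and 200 MB thresholds come from the sentinel table, and "MB" is assumed to mean 1024 × 1024 bytes, which the source does not confirm.

```python
import json

# Assumption: 1 MB = 1024 * 1024 bytes.
WARN_BYTES = 50 * 1024 * 1024       # [audit-fallback] warning threshold
HARD_CAP_BYTES = 200 * 1024 * 1024  # [audit-fallback-overflow] hard cap


def classify_audit_diagnostics(payload: dict) -> str:
    """Map a /api/diagnostics/audit payload to a coarse status label."""
    size = int(payload.get("fallbackFileSize", 0))
    lines = int(payload.get("fallbackLines", 0))
    if size >= HARD_CAP_BYTES:
        return "critical"   # at the cap: new events are being dropped
    if size > WARN_BYTES:
        return "warning"    # file is large; MongoDB is likely still unavailable
    if size > 0 or lines > 0:
        return "draining"   # events are queued for replay
    return "healthy"


sample = json.loads(
    '{"fallbackFileSize": 0, "fallbackLines": 0, "retentionDays": 90, "schemaVersion": 9}'
)
print(classify_audit_diagnostics(sample))
```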

Grafana Dashboard

Set GRAFANA_PUBLIC_URL to embed an existing Grafana dashboard at /monitoring. Grafana must have GF_SECURITY_ALLOW_EMBEDDING=true configured.

No Grafana-side configuration is provided by Message Center — you bring your own dashboards. Useful panels to include:

Pod CPU and memory from Kubernetes metrics
MongoDB connection pool depth and operation latency
Request rate and error rate on port 3000
Core API call latency (supplement with [core-slow] alerts)

Recommended Alert Rules

Loki (log-based alerts)

groups:
  - name: message-center
    rules:
      - alert: CoreApiSlow
        expr: |
          sum(count_over_time({app="core-admin"} |= "[core-slow]" [5m])) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Core API slow calls (>3 in 5m)"

      - alert: AuditFallbackOverflow
        expr: |
          count_over_time({app="core-admin"} |= "[audit-fallback-overflow]" [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Audit events are being dropped — restore MongoDB immediately"

      - alert: AuditFallbackGrowing
        expr: |
          count_over_time({app="core-admin"} |= "[audit-fallback]" [15m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Audit fallback file growing — check MongoDB connectivity"

Prometheus / kube-state-metrics

groups:
  - name: message-center
    rules:
      - alert: MessageCenterPodDown
        expr: |
          kube_deployment_status_replicas_available{deployment="core-admin"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All Message Center pods are down"

      - alert: MessageCenterHighMemory
        expr: |
          container_memory_working_set_bytes{container="core-admin"}
            / container_spec_memory_limit_bytes{container="core-admin"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Message Center memory above 85% of limit"

Health Check Endpoint

GET /api/health — returns 200 OK with {"status":"ok"} when the Next.js server is running. Used by Kubernetes liveness and readiness probes.

This endpoint does not test MongoDB or Core connectivity — it only verifies the application process is alive. Use the Diagnostics page for deeper health checks.
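Outside Kubernetes, for example in a smoke test after deployment, the same check can be scripted. The sketch below uses only the Python standard library; `is_healthy` and `probe` are hypothetical helper names, and the base URL is whatever address your deployment exposes.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """True only for HTTP 200 with a JSON object whose "status" is "ok"."""
    if status_code != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def probe(base_url: str, timeout: float = 3.0) -> bool:
    """Call GET {base_url}/api/health; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

A passing probe only shows that the process is serving requests. It says nothing about MongoDB or Core, so pair it with the diagnostics endpoints when verifying a release.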

Audit Fallback Monitoring

The audit fallback file at $TMPDIR/core-admin-audit-fallback.jsonl grows while MongoDB is unreachable. After MongoDB recovers, the next call to logAuditEvent drains the file back into the database automatically. The drain is streamed, so it does not cause a memory spike.

To monitor fallback state without the UI:

# File size (should be 0, or the file absent, in a healthy system)
du -sh "$TMPDIR/core-admin-audit-fallback.jsonl" 2>/dev/null || echo "file absent (normal)"

# Line count (one line per queued audit event)
wc -l "$TMPDIR/core-admin-audit-fallback.jsonl" 2>/dev/null || echo "0"
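The same check can be run from a script, for example in a sidecar or cron job. This is a minimal sketch: `fallback_state` is a hypothetical helper, and it assumes the application resolves its temp directory the way Python's `tempfile.gettempdir()` does (honouring `TMPDIR`), which the source does not confirm.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple

FALLBACK_NAME = "core-admin-audit-fallback.jsonl"


def fallback_state(tmpdir: Optional[str] = None) -> Tuple[int, int]:
    """Return (size_bytes, line_count) for the fallback file.

    An absent file is the normal healthy state and is reported as (0, 0).
    """
    base = Path(tmpdir) if tmpdir is not None else Path(tempfile.gettempdir())
    path = base / FALLBACK_NAME
    try:
        size = path.stat().st_size
    except FileNotFoundError:
        return (0, 0)
    # Iterate the file line by line so a large fallback file is never
    # loaded into memory at once. Unlike `wc -l`, a final line without a
    # trailing newline is still counted.
    with path.open("rb") as fh:
        lines = sum(1 for _ in fh)
    return (size, lines)
```

A result other than `(0, 0)` means audit events are still queued on disk.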

For automated drain verification, call the diagnostics API after deployment and assert fallbackFileSize === 0.

Next Steps

Backups & Recovery — MongoDB backup strategy
Troubleshooting Runbooks — step-by-step incident response