Capacity Planning

Version 1.0.0 · Updated 2026-04-28 · Admin Team
Tags: administration, operations, capacity-planning

Message Center's resource consumption is modest for typical deployments. This page covers the variables that have the largest impact on throughput and memory under load.


CPU and Memory Baselines

Scenario                                  | CPU        | Memory
Idle (no requests)                        | ~5m        | ~180 MB
Steady-state web traffic                  | ~50–150m   | ~250–320 MB
Large file upload in progress             | ~100–200m  | ~250 MB (streaming; upload does not spike heap)
Peak: 10 concurrent campaign list pages   | ~200–350m  | ~300 MB

Memory consumption is dominated by the MongoDB driver connection pool and Next.js SSR cache. The streaming upload path is O(chunk size) and does not accumulate file data in heap.

Recommended minimums (Kubernetes):

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Raise the memory limit to 1Gi if you see OOM kills or if CORE_AGENT_CONNECTIONS is set above 64, as in the fragment below.
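For example, a minimal sketch of the raised-limit configuration; it assumes CORE_AGENT_CONNECTIONS is injected as an ordinary container environment variable, and the values are illustrative:

resources:
  limits:
    cpu: 500m
    memory: 1Gi            # raised from 512Mi to absorb the larger connection pool
env:
  - name: CORE_AGENT_CONNECTIONS
    value: "96"            # above 64, so the higher memory limit applies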


Core HTTP Client Tuning

CORE_AGENT_CONNECTIONS (default: 32)

Controls how many keep-alive connections the BFF maintains to Core. Each concurrent HTTP request to Core (including uploads) consumes one connection.

Scenario                                                              | Recommended value
Low-traffic deployment (< 10 concurrent users)                        | 8
Standard deployment                                                   | 32 (default)
High-throughput (frequent large campaign lists, heavy audit queries)  | 64
Core is rate-limiting or showing high load                            | Reduce to 8–16

Every additional connection holds ~2 KB of socket state. 32 connections ≈ 64 KB overhead — negligible.

CORE_BODY_TIMEOUT_MS (default: 60000)

The maximum idle time between response body chunks for non-upload calls. A value of 60 seconds is sufficient for all standard endpoints. Raise it if:

  • You see [core-slow] warnings followed by timeouts for legitimate large responses
  • Core is deployed on a high-latency WAN link

CORE_MAX_RESPONSE_BYTES (default: 16 MB)

Hard cap on response body size. The default 16 MB covers all standard list endpoints. Admin fanout routes (/api/admin/aerospike) can request up to 64 MB per call. Raise the default only if you see PAYLOAD_TOO_LARGE errors on standard endpoints.
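Taken together, a sketch of a high-latency tuning fragment for the container spec; both variables are assumed to take plain integers (milliseconds and bytes, as their names indicate), and the values are illustrative:

env:
  - name: CORE_BODY_TIMEOUT_MS
    value: "120000"        # 120 s, for a high-latency WAN link to Core
  - name: CORE_MAX_RESPONSE_BYTES
    value: "33554432"      # 32 MB; raise only after seeing PAYLOAD_TOO_LARGE on standard endpoints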


Upload Sizing

Server-side cap (UPLOAD_MAX_BYTES, default: 1 GB)

The policy limit for recipient file uploads via POST /api/uploads. The streaming implementation keeps heap usage at O(chunk size) regardless of file size, so the cap is purely a policy choice.
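If you want a stricter policy than the default, the cap can be lowered the same way; this sketch assumes UPLOAD_MAX_BYTES takes a byte count, as its name indicates:

env:
  - name: UPLOAD_MAX_BYTES
    value: "536870912"     # 512 MiB policy cap; heap stays O(chunk size) either way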

Temp disk space

During upload, a copy of the file is written to $TMPDIR/core-admin-uploads/. Ensure the pod has at least 2× the maximum file size of free space on /tmp.

Max file size               | Min /tmp space
50 MB (UI wizard default)   | 500 MB
500 MB                      | 1 GB
1 GB                        | 2 GB
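When /tmp is on node disk (see the emptyDir note below), you can also make the scheduler account for this scratch space with ephemeral-storage requests; this is a standard Kubernetes resource, shown here with illustrative values:

resources:
  requests:
    ephemeral-storage: 2Gi    # matches the 1 GB max-file row above (2× rule)
  limits:
    ephemeral-storage: 10Gi   # the pod is evicted if scratch usage exceeds this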

In Kubernetes, back /tmp with an emptyDir volume rather than leaving it on the container's writable layer. An emptyDir is memory-backed or disk-backed depending on its medium; prefer disk-backed for upload workloads:

volumes:
  - name: tmp
    emptyDir:
      medium: ""   # disk-backed (not Memory)
      sizeLimit: 10Gi
# and in the container spec, mount it at /tmp:
volumeMounts:
  - name: tmp
    mountPath: /tmp

Audit Retention

AUDIT_RETENTION_DAYS (default: 90)

The audit_logs collection grows by one document per auditable action. For a deployment with 50 active users performing 20 actions per day, this is ~90,000 documents per 90-day retention window.

Active users | Actions/day (total) | 90-day volume | Est. collection size
10           | 100                 | 9,000 docs    | ~2 MB
50           | 500                 | 45,000 docs   | ~15 MB
200          | 2,000               | 180,000 docs  | ~60 MB

After changing AUDIT_RETENTION_DAYS, run make migrate to update the TTL index — the TTL change does not take effect without the migration.


MongoDB Index Sizing

The most frequently read indexes (campaigns and audit_logs) are sized for workspaces of up to 200,000 campaigns and 500,000 audit entries. Beyond these scales, consider:

  • Separate MongoDB cluster per workspace (for very large tenants)
  • Increasing cursor.maxTimeMS indirectly via CORE_BODY_TIMEOUT_MS (though this is a Core-side concern)
  • Index-only queries: all list queries in the DAOs are covered by the compound indexes created in migrations v1–v8

Horizontal Scaling

Message Center is stateless (session state is in the next-auth JWT cookie; the BFF service JWT is cached per-process). Running multiple replicas is safe with these caveats:

  • Each replica maintains its own in-memory service JWT cache. On startup, each replica makes one proxy login call; the simultaneous burst is expected and harmless.
  • The audit fallback file ($TMPDIR/...) is per-pod. If a pod dies with a non-empty fallback file before draining, those events are lost. Use a PodDisruptionBudget with minAvailable: 1 and graceful shutdown to minimize this risk; examples follow.
# k8s/pdb.yaml (example)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: core-admin
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: core-admin
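
For the graceful-shutdown half, a sketch assuming the service drains the fallback file on SIGTERM; the preStop sleep and grace period are illustrative, not part of Message Center itself:

# deployment pod template (fragment, example)
spec:
  terminationGracePeriodSeconds: 60   # time for the pod to drain the audit fallback file
  containers:
    - name: core-admin
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # let the LB stop routing before the SIGTERM-triggered drain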

Next Steps