Self-Hosted Operations

Day-2 operations for self-hosted AGLedger: upgrades, backup/restore, rollback, secret rotation, observability, troubleshooting, and high availability.

This guide assumes AGLedger is installed and running. All commands use the scripts in the agledger-ai/install repo. Paths are relative to the repo root.

1. Upgrading

The upgrade.sh script automates the full upgrade workflow: backup, image pull, migration, restart, and verification.

Pre-upgrade checklist

Before running upgrade.sh:

  1. Confirm your current version — curl -s http://localhost:3001/conformance | jq .version. Note it in writing; the rollback path uses this number.
  2. Read the release notes — the agledger-ai/install CHANGELOG and any breaking-change call-outs between your current version and the target.
  3. Verify backup freshness — run ./scripts/backup.sh if the last backup is older than your RPO target. upgrade.sh takes its own pre-upgrade backup by default, but an explicit fresh backup that you have verified is safer.
  4. Run preflight on the current version — docker compose exec agledger-api node dist/scripts/preflight.js. A clean preflight before the upgrade means any post-upgrade failure is attributable to the upgrade itself, not pre-existing drift.
  5. Quiesce scheduled jobs — if external cron or CI calls AGLedger, pause them for the upgrade window. pg-boss handles in-flight jobs gracefully; external callers will see 503 during the API restart.
  6. Have the rollback path ready — know which backup file you would restore from (./backup/backup-<timestamp>.tar.gz) and which image tag you would redeploy. See §3 Rollback.
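Checklist items 3 and 6 can be enforced mechanically before invoking upgrade.sh. A minimal freshness gate, assuming backups land in ./backup with the naming scheme above (the function name is ours, not part of the install repo):

```shell
# Hypothetical pre-upgrade gate: succeed only if the newest
# backup-*.tar.gz in $1 is younger than $2 seconds (your RPO target).
backup_fresh() {
  dir="$1"; rpo="$2"
  latest=$(ls -1t "$dir"/backup-*.tar.gz 2>/dev/null | head -1)
  [ -n "$latest" ] || { echo "no backup found in $dir"; return 1; }
  # GNU stat first, BSD stat as fallback
  mtime=$(stat -c %Y "$latest" 2>/dev/null || stat -f %m "$latest")
  age=$(( $(date +%s) - mtime ))
  if [ "$age" -le "$rpo" ]; then
    echo "backup OK (${age}s old)"
  else
    echo "backup stale (${age}s old, RPO ${rpo}s) - run ./scripts/backup.sh"
    return 1
  fi
}
```

Usage: `backup_fresh ./backup 3600 || ./scripts/backup.sh` before running the upgrade.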

Basic upgrade

./scripts/upgrade.sh 0.19.17

The script performs these steps in order:

  1. Detects the current running version (from .env, VERSION file, or Docker image tag)
  2. Creates a pre-upgrade backup (calls ./scripts/backup.sh)
  3. Authenticates with the container registry (ECR)
  4. Pulls the target version image
  5. Stops the worker to prevent job processing during migration
  6. Runs database migrations using the new image
  7. Updates AGLEDGER_VERSION in .env and VERSION file
  8. Restarts all services
  9. Waits for the API to become healthy (up to 30 seconds)
  10. Runs preflight checks
  11. Verifies the new version via /health and /conformance

Skipping the pre-upgrade backup

./scripts/upgrade.sh 0.19.17 --skip-backup

Not recommended for production. Use only if you have a separate backup mechanism (e.g., RDS automated snapshots).

Deprecated environment variables

The upgrade script automatically handles environment variables that were removed between versions; no manual .env cleanup is needed for those.

Verifying after upgrade

curl -s http://localhost:3001/health | jq .
curl -s http://localhost:3001/conformance | jq .version
cd compose && docker compose ps

2. Backup and Restore

2.1 Creating a backup

./scripts/backup.sh

This creates a timestamped tarball containing a PostgreSQL custom-format dump (pg_dump -Fc). The default location is ./backup/backup-YYYY-MM-DD-HHMMSS.tar.gz.

Retention: By default, the script keeps the 7 most recent backups and deletes older ones.

# Keep last 14 backups
./scripts/backup.sh --keep 14

# Custom backup directory
BACKUP_DIR=/mnt/backups ./scripts/backup.sh
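The retention behavior is keep-N pruning over the timestamped filenames; a standalone sketch of that logic (not the actual backup.sh code):

```shell
# Sketch of keep-N pruning: delete everything beyond the N newest
# backup tarballs in a directory. $1 = backup dir, $2 = count to keep.
prune_backups() {
  dir="$1"; keep="$2"
  ls -1t "$dir"/backup-*.tar.gz 2>/dev/null | tail -n +"$((keep + 1))" |
  while read -r old; do
    echo "pruning $old"
    rm -f "$old"
  done
}
```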

Bundled vs. external database: The script auto-detects your database mode from DATABASE_URL in .env. For bundled PostgreSQL, it uses docker compose exec postgres pg_dump. For external databases (Aurora, RDS, Cloud SQL), it calls pg_dump directly — ensure the PostgreSQL client tools are installed on the host.

2.2 Restoring from backup

./scripts/restore.sh backup/backup-2026-03-14-120000.tar.gz

The restore script:

  1. Prompts for confirmation (the restore replaces all data in the database)
  2. Extracts the backup tarball to a temporary directory
  3. Stops the API and worker services
  4. Terminates active database connections
  5. Drops and recreates the database
  6. Runs pg_restore --no-owner --no-acl from the dump
  7. Restarts all services with docker compose up -d --wait
  8. Runs preflight checks to verify the restore

For non-interactive use (CI, automation):

./scripts/restore.sh --non-interactive backup/backup-2026-03-14-120000.tar.gz

External database restore: For external databases, the restore script uses psql and pg_restore directly. The database user must have CREATEDB privilege (or the ability to drop and recreate the target database). If DATABASE_URL_MIGRATE is set, the script uses that connection (which typically has owner-role privileges for DDL).

2.3 Managed database backups

If you run AGLedger on a managed PostgreSQL service, use the provider's native backup tools alongside (or instead of) the script-based backups:

| Provider | Backup mechanism | Notes |
|----------|------------------|-------|
| AWS RDS / Aurora | Automated snapshots + point-in-time recovery | Enable automated backups; set retention to at least 7 days |
| Google Cloud SQL | Automated backups + on-demand backups | Enable in the instance configuration |
| Azure Database for PostgreSQL | Automated backups (enabled by default) | Retention configurable 7-35 days |

Managed backups provide continuous protection with near-zero RPO. The script-based backups (pg_dump) are useful for cross-version migration and portable archives.

2.4 Post-restore verification

After restoring from any backup method, verify the instance:

# 1. Health check
curl -s http://localhost:3001/health | jq .

# 2. Vault chain integrity scan
curl -X POST http://localhost:3001/v1/admin/vault/scan \
  -H "Authorization: Bearer $PLATFORM_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

# 3. If using YAML provisioning, reload to reconcile state
curl -X POST http://localhost:3001/v1/admin/provisioning/reload \
  -H "Authorization: Bearer $PLATFORM_KEY" \
  -H "Content-Type: application/json"

# 4. Run a smoke-test lifecycle (create mandate, submit receipt, verify settlement)

The vault scan walks every mandate's hash chain and reports any broken chains. After a clean restore, all chains should be intact up to the restore point — PostgreSQL's transactional consistency guarantees that partial vault entries cannot exist.

2.5 Monthly restore-test procedure

Untested backups are untrusted. Verify quarterly at minimum; monthly for production systems with active mandate flow:

  1. Copy a recent backup to a non-production environment (another compose stack, a staging cluster, or a separate docker host with its own DATABASE_URL).
  2. Restore via ./scripts/restore.sh <backup-file> against the non-production database.
  3. Run a vault integrity scan:
    curl -X POST http://localhost:3001/v1/admin/vault/scan \
      -H "Authorization: Bearer $PLATFORM_KEY" \
      -H "Content-Type: application/json" \
      -d '{}'
    
    Broken chains should be zero.
  4. Record the result — timestamp, backup file, scan outcome. Store alongside your backup retention records.

The point is not just to validate that the backup exists; it is to validate that the restore procedure, the current version of restore.sh, and your platform can actually reconstruct a working instance. Every component of that chain changes over time.
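The record-keeping in step 4 fits a single NDJSON line per test run; a sketch in which the field names and log path are our choices, not a format AGLedger mandates:

```shell
# Append a restore-test record as one NDJSON line.
# $1 = backup file name, $2 = broken-chain count from the vault scan.
record_restore_test() {
  backup="$1"; broken="$2"
  log="${RESTORE_TEST_LOG:-restore-tests.ndjson}"
  printf '{"ts":"%s","backup":"%s","broken_chains":%s,"result":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$backup" "$broken" \
    "$([ "$broken" -eq 0 ] && echo pass || echo fail)" >> "$log"
}
```

Usage: `record_restore_test backup-2026-03-14-120000.tar.gz 0` after each test; keep the file with your backup retention records.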

2.6 RTO/RPO targets

Recovery targets depend on your backup mechanism:

| Mechanism | RPO (data-loss window) | RTO (time to restore) | Notes |
|-----------|------------------------|-----------------------|-------|
| backup.sh nightly | Up to 24 hours | Minutes (Compose) to 10-30 minutes (K8s) | Restore time depends on DB size and I/O |
| backup.sh hourly | Up to 1 hour | Same as above | Add a cron entry; ship tarballs off-host |
| RDS automated snapshots | ~5 minutes via point-in-time recovery | 10-30 minutes + API restart | Enabled by default on RDS/Aurora with a retention period set |
| Aurora Global Database | Seconds (replication lag) | Seconds (cross-region failover) | Secondary region read replica; promote on incident |

Production baseline: RPO ≤ 1 hour, RTO ≤ 30 minutes. Aurora deployments meet this with automated snapshots + PITR at default settings. For bundled PostgreSQL on Compose, run backup.sh on an hourly cron with off-host storage (S3, GCS, or an NFS mount).
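For the Compose baseline, the hourly cron plus off-host shipping can look like this; the install path, bucket name, and schedule are placeholders to adapt:

```shell
# Illustrative crontab entries (paths and bucket are placeholders).
# Hourly script-based backup, then ship the newest tarball off-host.
0 * * * *  cd /opt/agledger-install && ./scripts/backup.sh >> /var/log/agledger-backup.log 2>&1
15 * * * * aws s3 cp "$(ls -1t /opt/agledger-install/backup/backup-*.tar.gz | head -1)" s3://YOUR-BUCKET/agledger/
```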

All state is in PostgreSQL. A restored database plus a current-version AGLedger image and a valid Ed25519 vault signing key is a complete recovery — there is no out-of-band state to reconstruct.

3. Rollback

Database migrations are forward-only, so rollback means restoring the pre-upgrade backup and pointing the compose stack at the previous image. There is no dedicated rollback.sh — the three-step sequence below uses the scripts that ship in the install repo.

Data loss warning: All data written after the backup will be lost. This includes mandates, receipts, vault entries, and webhook deliveries created between the backup and the rollback.

Step 1: Identify the pre-upgrade version

upgrade.sh writes the previous version to ./backup/.pre-upgrade-version before applying the new image:

cat ./backup/.pre-upgrade-version
# → 0.19.16

If the file is missing, pick the target version from your release history.

Step 2: Restore the pre-upgrade backup

# List available backups
ls -1t ./backup/backup-*.tar.gz | head -5

# Restore the most recent pre-upgrade backup (interactive)
./scripts/restore.sh ./backup/backup-2026-03-25-120000.tar.gz

# Or non-interactive
./scripts/restore.sh --non-interactive ./backup/backup-2026-03-25-120000.tar.gz

restore.sh drops and recreates the database, runs pg_restore, and restarts all services.

Step 3: Pin the previous image version

Update AGLEDGER_VERSION in compose/.env and restart with the old image:

cd compose

# Edit .env in place
sed -i 's/^AGLEDGER_VERSION=.*/AGLEDGER_VERSION=0.19.16/' .env

docker compose pull agledger-api agledger-worker
docker compose up -d --wait

Verify the rollback:

curl -s http://localhost:3001/conformance | jq .version
# → "0.19.16"
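Step 3's .env edit can be wrapped in a tiny helper that echoes the pinned line back for verification; the function is ours, not something the install repo ships:

```shell
# Pin AGLEDGER_VERSION in a .env file and echo the resulting line.
# A .bak copy of the file is kept by sed.
pin_version() {
  env_file="$1"; version="$2"
  sed -i.bak "s/^AGLEDGER_VERSION=.*/AGLEDGER_VERSION=${version}/" "$env_file"
  grep '^AGLEDGER_VERSION=' "$env_file"
}
```

After pinning, run `docker compose pull agledger-api agledger-worker && docker compose up -d --wait` as above.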

4. Secret Rotation

Two secrets may need rotation: API_KEY_SECRET (HMAC hash key for API keys) and VAULT_SIGNING_KEY (Ed25519 signing key for audit vault entries). Both rotations are manual edits against compose/.env using in-container helper scripts — there is no orchestrator script in the install repo.

4.1 API_KEY_SECRET rotation

Six-step process with a dual-secret window to avoid downtime. Between steps 3 and 6, both the old and new secrets validate incoming keys, so integrations keep working while you verify.

cd compose

Step 1 — save the current secret as the previous:

CURRENT=$(grep '^API_KEY_SECRET=' .env | cut -d= -f2-)
echo "API_KEY_SECRET_PREVIOUS=${CURRENT}" >> .env

Step 2 — generate and install a new secret:

NEW=$(openssl rand -hex 32)
sed -i "s/^API_KEY_SECRET=.*/API_KEY_SECRET=${NEW}/" .env

Step 3 — restart so dual-secret mode activates:

docker compose up -d --wait

Step 4 — re-hash all API keys with the new secret (irreversible):

docker compose exec agledger-api node dist/scripts/rehash-api-keys.js

Step 5 — restart and verify that API keys still authenticate:

docker compose up -d --wait
curl -s -H "Authorization: Bearer $PLATFORM_KEY" http://localhost:3001/v1/auth/me | jq .

Step 6 — remove the previous secret (old hashes stop validating):

sed -i '/^API_KEY_SECRET_PREVIOUS=/d' .env
docker compose up -d --wait

4.2 VAULT_SIGNING_KEY rotation

Vault signing key rotation is simpler because historical signatures remain valid — each vault entry is signed at creation time and is not re-verified against the current key. For zero-downtime rotation, set VAULT_SIGNING_KEY_PREVIOUS so historical entries can still be verified by the running instance.

cd compose

# 1. Generate a new Ed25519 key
NEW_KEY=$(docker compose run --rm agledger-api node dist/scripts/generate-signing-key.js)

# 2. Move the current key to PREVIOUS, install the new one
CURRENT=$(grep '^VAULT_SIGNING_KEY=' .env | cut -d= -f2-)
echo "VAULT_SIGNING_KEY_PREVIOUS=${CURRENT}" >> .env
sed -i "s|^VAULT_SIGNING_KEY=.*|VAULT_SIGNING_KEY=${NEW_KEY}|" .env

# 3. Restart
docker compose up -d --wait

New vault entries are signed with the new key. Old entries retain their original signatures and continue to verify using the key registry. For key lifecycle management — registry rotation, offline verification, recovery — see the Vault Signing Key Guide.

5. Observability

5.1 Monitoring stack (Docker Compose)

The Docker Compose configuration includes an optional monitoring profile with OTel Collector, Jaeger, Prometheus, and Grafana:

cd compose
docker compose --profile monitoring \
  -f docker-compose.yml \
  -f docker-compose.postgres.yml \
  up -d --wait

Or re-run the installer with --with-monitoring to wire it up from scratch.

This starts four additional services:

| Service | Port | Purpose |
|---------|------|---------|
| OTel Collector | 4317 (gRPC) | Receives traces from AGLedger, exports to Jaeger |
| Jaeger | 16686 | Distributed trace viewer (Jaeger v2) |
| Prometheus | 9090 | Metrics scraping and storage |
| Grafana | 3003 | Dashboard UI (default password: admin) |

5.2 Health endpoints

| Endpoint | Port | Auth | Use for |
|----------|------|------|---------|
| GET /health | 3001 (API), 3002 (Worker) | None | Liveness probes. Returns {"status":"ok"}. |
| GET /health/ready | 3001 (API), 3002 (Worker) | None | Readiness probes. Returns 503 if database is unreachable. |
| GET /status | 3001 | None | Public status page with database health check. |
| GET /v1/admin/system-health | 3001 | Platform key | Detailed system health: DB latency, pool stats, memory, pg-boss queue counts. |

For Kubernetes deployments:

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10

5.3 Prometheus metrics

Both the API (port 3000) and Worker (port 3001) expose GET /metrics in Prometheus exposition format. No authentication required — intended for internal network scraping only.

Business metrics:

| Metric | Type | Description |
|--------|------|-------------|
| agledger_mandate_transitions_total | Counter | State transitions, labeled from_status and to_status |
| agledger_verification_duration_seconds | Histogram | Phase 2 verification latency by contract_type and result |
| agledger_worker_jobs_processed_total | Counter | Worker job completions by queue and status |

Infrastructure metrics:

| Metric | Type | Description |
|--------|------|-------------|
| agledger_http_request_duration_seconds | Histogram | HTTP latency by method, route, status_code |
| agledger_db_pool_total_connections | Gauge | Total PostgreSQL pool connections |
| agledger_db_pool_idle_connections | Gauge | Idle connections available |
| agledger_db_pool_waiting_connections | Gauge | Clients waiting for a connection (non-zero = pool exhaustion) |

Cache metrics:

| Metric | Type | Description |
|--------|------|-------------|
| agledger_auth_cache_hits_total | Counter | API key auth cache hits |
| agledger_auth_cache_misses_total | Counter | Auth cache misses (DB lookup required) |
| agledger_schema_cache_hits_total | Counter | Contract type schema cache hits |
| agledger_schema_cache_misses_total | Counter | Schema cache misses |

Process metrics (automatic, prefixed agledger_): CPU usage, resident memory, event loop lag, GC duration, active handles/requests.
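For ad-hoc checks without Prometheus, a single sample can be pulled straight out of the exposition text; a small awk helper (ours, not shipped with AGLedger):

```shell
# Print the value of an unlabeled metric sample from Prometheus
# exposition text on stdin. Comment lines (# HELP / # TYPE) are skipped
# because their first field is "#", not the metric name.
metric_value() {
  awk -v m="$1" '$1 == m { print $2 }'
}
```

Usage: `curl -s http://localhost:3000/metrics | metric_value agledger_db_pool_waiting_connections`. Labeled series (name{label="…"}) need a fuller matcher.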

5.4 Useful PromQL queries

# P95 request latency
histogram_quantile(0.95, rate(agledger_http_request_duration_seconds_bucket[5m]))

# Error rate (5xx)
sum(rate(agledger_http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
sum(rate(agledger_http_request_duration_seconds_count[5m]))

# Mandate transitions by destination state
sum by (to_status) (rate(agledger_mandate_transitions_total[5m]))

# DB pool saturation (alert if > 0)
agledger_db_pool_waiting_connections > 0

# Auth cache hit rate (target > 90%)
rate(agledger_auth_cache_hits_total[5m])
/
(rate(agledger_auth_cache_hits_total[5m]) + rate(agledger_auth_cache_misses_total[5m]))

# Worker job failure rate by queue
sum by (queue) (rate(agledger_worker_jobs_processed_total{status="failure"}[5m]))
/
sum by (queue) (rate(agledger_worker_jobs_processed_total[5m]))

5.5 Recommended Alertmanager rules

groups:
  - name: agledger
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(agledger_http_request_duration_seconds_count{status_code=~"5.."}[5m]))
          /
          sum(rate(agledger_http_request_duration_seconds_count[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AGLedger API error rate above 5%"

      - alert: VerificationSlow
        expr: |
          histogram_quantile(0.95, rate(agledger_verification_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Phase 2 verification P95 latency above 5 seconds"

      - alert: DBPoolExhaustion
        expr: agledger_db_pool_waiting_connections > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool has waiting clients"

      - alert: WorkerJobFailures
        expr: |
          rate(agledger_worker_jobs_processed_total{status="failure"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker job failure rate elevated on queue {{ $labels.queue }}"

      - alert: AuthCacheLowHitRate
        expr: |
          rate(agledger_auth_cache_hits_total[5m])
          /
          (rate(agledger_auth_cache_hits_total[5m]) + rate(agledger_auth_cache_misses_total[5m]))
          < 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth cache hit rate below 80%"

5.6 Distributed tracing (OpenTelemetry)

Tracing is disabled by default. Set OTEL_EXPORTER_OTLP_ENDPOINT to activate it.

| Variable | Default | Description |
|----------|---------|-------------|
| OTEL_EXPORTER_OTLP_ENDPOINT | (unset) | OTLP gRPC endpoint (e.g., http://jaeger:4317). Setting this activates tracing. |
| OTEL_PROVIDER | generic | generic (W3C Trace-Context) or xray (AWS X-Ray ID format). |
| OTEL_SERVICE_NAME | agledger-api | Service name in trace metadata. Worker uses agledger-worker. |

Auto-instrumented: HTTP requests (Fastify), PostgreSQL queries, outbound HTTP (webhook delivery). Custom span attributes include mandateId, agentId, enterpriseId, contractType.

5.7 Logging

AGLedger uses pino for structured JSON logging. Every log line is a JSON object with level, time, msg, and contextual fields (mandateId, reqId, etc.). Sensitive fields (API keys, secrets, passwords, tokens) are automatically redacted.
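pino encodes levels numerically (trace=10, debug=20, info=30, warn=40, error=50, fatal=60), so warnings and above can be filtered from a raw log stream even without jq; a grep sketch:

```shell
# Pass through only warn-and-above lines from pino NDJSON on stdin.
# pino numeric levels: trace=10 debug=20 info=30 warn=40 error=50 fatal=60.
pino_warnings() {
  grep -E '"level":(40|50|60)[,}]'
}
```

Usage: `docker compose logs agledger-api | pino_warnings` — the match works anywhere in the line, so compose's service-name prefix does not interfere.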

| Variable | Default | Description |
|----------|---------|-------------|
| LOG_LEVEL | info | Minimum level: trace, debug, info, warn, error, fatal |

5.8 SIEM integration

AGLedger can export audit events to your SIEM. Configure via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| SIEM_ENABLED | false | Enable SIEM event export |
| SIEM_FORMAT | ocsf | Export format: ocsf or raw |
| SIEM_FILE_ENABLED | true | Write events to a local file (when SIEM is enabled) |
| SIEM_FILE_PATH | /var/log/agledger/siem.ndjson | File path for NDJSON output |
| SIEM_HTTP_ENABLED | false | Push events to an HTTP endpoint |
| SIEM_HTTP_URL | (empty) | HTTP endpoint URL |
| SIEM_HTTP_AUTH_HEADER | (empty) | Authorization header value for HTTP push |
| SIEM_BATCH_SIZE | 50 | Events per batch |
| SIEM_FLUSH_INTERVAL_MS | 5000 | Flush interval in milliseconds |

6. Support Bundle

When contacting AGLedger support, generate a diagnostic bundle that collects system state with all secrets automatically redacted.

6.1 CLI support bundle

./scripts/support-bundle.sh

The bundle is saved as ./support-bundle-YYYY-MM-DD-HHMMSS.tar.gz with all secrets redacted. Send it to support@agledger.ai.

6.2 API support bundle

The admin API provides a JSON support bundle with 9 sections:

curl -s http://localhost:3001/v1/admin/support-bundle \
  -H "Authorization: Bearer $PLATFORM_KEY" | jq .

Response sections:

| Section | Contents |
|---------|----------|
| manifest | Bundle version, generation timestamp, section index |
| version | AGLedger version, Node.js version, operating mode (standalone/gateway/hub) |
| license | License tier, status, features |
| health | Database connectivity, component status |
| authCache | Cache size, max capacity, hit rate |
| config | Runtime configuration (secrets excluded) |
| database | PostgreSQL version, migration state, table sizes |
| environment | Platform, CPU count, memory |
| guidance | List of items not included in the bundle and why |

The API bundle requires a platform key. Enterprise and agent keys receive 403.

7. Troubleshooting Runbook

7.1 Connection pooler incompatibility

Symptom: Jobs silently stop processing. Webhook deliveries stall. Worker logs show no errors but no activity.

Cause: AGLedger uses pg-boss, which requires PostgreSQL LISTEN/NOTIFY and session-level advisory locks. Transaction-mode connection poolers break both features.

Incompatible poolers: PgBouncer in transaction or statement mode, AWS RDS Proxy (which does not support LISTEN/NOTIFY), and any pooler that multiplexes sessions across transactions. Session-mode pooling preserves both features but gives up most of the pooler's benefit.

Fix: Connect AGLedger directly to the PostgreSQL instance, bypassing any connection pooler. If you must use a pooler for other applications, configure AGLedger's DATABASE_URL to point to the direct endpoint.

7.2 SSL/TLS setup issues

Symptom: ECONNREFUSED or SSL handshake errors when connecting to a managed database.

For AWS RDS/Aurora: The Docker image bundles the RDS global CA bundle at /etc/ssl/certs/rds-global-bundle.pem. Set:

NODE_EXTRA_CA_CERTS=/etc/ssl/certs/rds-global-bundle.pem

And append ?sslmode=verify-full to your DATABASE_URL.

For other providers: Mount your CA certificate into the container and set NODE_EXTRA_CA_CERTS to its path.

7.3 Migration failures

Symptom: upgrade.sh fails at the migration step. The API container exits immediately.

Common causes: the migration role lacks DDL privileges (check DATABASE_URL_MIGRATE), a long-running transaction holds a lock on a table being altered, or the database is unreachable from the migration container.

Diagnosis:

# Check migration state
docker compose exec postgres psql -U agledger -d agledger \
  -c "SELECT * FROM _migrations ORDER BY id DESC LIMIT 5;"

# Run migrations manually with verbose output
docker compose run --rm agledger-migrate

7.4 Vault integrity scan

Run a vault scan after any restore, infrastructure incident, or suspected data integrity issue:

curl -X POST http://localhost:3001/v1/admin/vault/scan \
  -H "Authorization: Bearer $PLATFORM_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

The scan may return a jobId for asynchronous processing. Poll for results:

curl -s http://localhost:3001/v1/admin/vault/scan/$JOB_ID \
  -H "Authorization: Bearer $PLATFORM_KEY" | jq .

A healthy scan reports zero broken chains. If broken chains are found, the response indicates which mandate IDs are affected. Contact support with the scan results and a support bundle.
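The poll can be wrapped in a generic retry loop; a sketch with attempt count and interval as knobs:

```shell
# Retry a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = sleep seconds between attempts, rest = command.
poll_until() {
  attempts="$1"; interval="$2"; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, wrap the curl above as `poll_until 30 2 sh -c '…'` with a grep for a terminal state; the exact shape of the job payload is an assumption to verify against your version's API.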

7.5 Preflight check failures

The preflight script verifies database connectivity, migrations, and configuration. Run it inside the API container:

cd compose
docker compose exec agledger-api node dist/scripts/preflight.js

If the API container is not running, start a one-off container on the compose network:

docker compose run --rm agledger-api node dist/scripts/preflight.js

Common failures map directly to what preflight verifies: the database is unreachable (bad DATABASE_URL, network, or TLS configuration), migrations are pending, or a required configuration value is missing or malformed in .env.

8. High Availability

8.1 Multi-replica deployment

All API instances are stateless. They share a single PostgreSQL primary.

Docker Compose:

docker compose up -d --scale agledger-api=3 --scale agledger-worker=2

Kubernetes / Helm:

api:
  replicaCount: 3

worker:
  replicaCount: 2

8.2 Load balancer configuration

| Setting | Value | Why |
|---------|-------|-----|
| Health check path | /health/ready | Returns 503 during startup, shutdown, and DB outage |
| Health check interval | 5-10s | Fast enough to detect failover |
| Deregistration delay | 30s | Matches graceful shutdown drain period |
| Sticky sessions | Not required | All instances are stateless |

8.3 What is safe under concurrency

| Component | Mechanism |
|-----------|-----------|
| Vault hash chain | pg_advisory_xact_lock per mandate — concurrent writers serialize |
| Webhook sequence counters | UPDATE ... RETURNING — PostgreSQL row-level lock guarantees unique, monotonic values |
| Maintenance jobs | pg-boss singletonKey scheduling — only one instance picks up each job |
| Vault checkpoints | UNIQUE(mandate_id, chain_position) + ON CONFLICT DO NOTHING — dedup on retry |
| Provisioning reload | pg_try_advisory_lock — only one instance reconciles at a time |
| Auth/signing key caches | LISTEN/NOTIFY invalidation — changes propagate to all instances within seconds |

8.4 Connection failover

When PostgreSQL fails over (RDS Multi-AZ, Aurora failover), all connections drop simultaneously. AGLedger handles this automatically:

| Time | What happens |
|------|--------------|
| T+0s | Primary fails, connections drop |
| T+0-5s | Requests return 500, /health/ready returns 503 |
| T+5-10s | Load balancer deregisters unhealthy instances |
| T+10-30s | DNS/endpoint updates to new primary (varies by provider) |
| T+30-60s | Pool reconnects, health check passes, traffic resumes |

No manual intervention required. All state is in PostgreSQL. Transaction-scoped advisory locks are released automatically on connection drop.

8.5 Graceful shutdown

During rolling deploys, each instance:

  1. Receives SIGTERM
  2. /health/ready returns 503 immediately (stops new traffic)
  3. In-flight HTTP requests drain (up to HANDLER_TIMEOUT_MS, default 30s)
  4. pg-boss stops gracefully (up to 20s for active jobs)
  5. Connection pool closes
  6. Process exits

Set your orchestrator's termination grace period to at least 35 seconds. The Kubernetes default of 30 seconds may need increasing.
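On Kubernetes that means raising the pod spec's grace period above the default, e.g.:

```yaml
# Pod spec fragment: give the shutdown sequence above room to finish.
spec:
  terminationGracePeriodSeconds: 40
```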

8.6 Connection pool sizing

Each instance uses up to DATABASE_POOL_MAX connections (default 20). Total usage per instance: API pool + Worker pool (~10) + pg-boss overhead (~5).

Total connections = (API replicas + Worker replicas) * DATABASE_POOL_MAX

Ensure PostgreSQL's max_connections exceeds this total. For multi-replica deployments, reduce DATABASE_POOL_MAX:

DATABASE_POOL_MAX=10   # 5 instances * 10 = 50 connections
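The arithmetic is worth checking before a rollout; a throwaway helper for the formula above:

```shell
# Total PostgreSQL connections implied by the sizing formula above.
# Args: API replicas, worker replicas, DATABASE_POOL_MAX.
total_connections() {
  echo $(( ($1 + $2) * $3 ))
}

total_connections 5 2 10   # → 70
```

Compare the result against `SHOW max_connections;` on the database, and leave headroom for superuser and maintenance connections.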

Reference sizing: a single API + worker pair at the default DATABASE_POOL_MAX=20 uses up to 40 connections; 3 API + 2 worker replicas at DATABASE_POOL_MAX=10 use up to 50. PostgreSQL's default max_connections of 100 covers both with headroom.

8.7 Rate limiting in multi-replica deployments

By default, rate limit counters are in-memory (per-process). With multiple replicas, each enforces limits independently, effectively multiplying the limit by the replica count.

For accurate cross-replica rate limiting, switch to the PostgreSQL-backed store:

RATE_LIMIT_STORE=postgresql

The PostgreSQL store uses an UNLOGGED table for performance. If the database crashes, counters reset — this only means a brief window of unenforced limits, not data loss.
