2026-04-13 · updated 2026-04-18
Engineering
Zero Mocks, Real Infrastructure: How AGLedger Tests an Accountability Engine
By Michael Cooper · Founder
An accountability system that hides bugs is worse than no accountability system. If the audit trail says PASS when the behavior was wrong, you have a liability, not a feature. That principle — aligned with NIST's minimum standards for developer verification — drives everything about how we test AGLedger.
Summary
AGLedger’s independent testbed runs 36 tests against a live API — real EKS cluster, real Aurora database, real webhook delivery, real LLM agents. No mocks. No stubs. Every test produces a structured JSON result with three possible outcomes: PASS, FAIL, or SKIP. Failures are documented with finding numbers (F-NNN) and tracked to resolution. The testbed has cataloged over 390 findings since inception.
Last week the suite had 152 tests. We deleted 117 of them. The ones we kept focus on behavior you can only learn from a real deployment — the rest were duplicating the AGLedger API repo’s own integration suite and adding noise without signal.
Why we deleted 117 tests
A testbed only earns its keep when it catches things other tests can’t. If the API repo already has a state-machine unit test for mandate transitions, re-running that logic against a live deployment doesn’t add information — it adds latency. It also dilutes the signal when a real-infrastructure bug slips in, because the failing test is buried in a long list of tests that have nothing to do with deployment.
On 2026-04-16 we audited every test against a single question: does this only manifest on deployed infrastructure, or against a customer-facing surface? Tests that failed that question were deleted. 117 of them did. What remains: 36 tests that exercise the ALB, the WAF, Aurora, TLS, the SDKs, the CLI, the MCP server, webhook delivery, federation, HA failover, and day-2 operations.
Fewer tests, higher signal. Test runtime dropped from 54.5 minutes to roughly 5. Every failure now points to something the testbed is uniquely positioned to catch.
Philosophy
The goal is product improvement, not passing tests.
A test that marks a broken thing as PASS is worse than no test. A test that skips instead of failing hides the problem. We have three rules for test authors:
1. Never mark a broken thing as PASS
2. Never skip-as-PASS — use t.skip() with a reason
3. Assert the specific thing you're testing, not just HTTP 200
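Rule 3 in practice looks something like the sketch below. The response shape, field names, and helper are illustrative stand-ins, not the real AGLedger API contract; the point is that a 200 status alone proves only that the server answered.

```typescript
// Illustrative response shape; not the actual AGLedger API contract.
interface VerdictResponse {
  status: number;
  body: { mandateId: string; verdict: "PASS" | "FAIL"; evidenceHash?: string };
}

// Rule 3: assert the specific behavior under test, not just HTTP 200.
function assertVerdict(res: VerdictResponse, expected: "PASS" | "FAIL"): void {
  // Weak assertion: only proves the server responded at all.
  if (res.status !== 200) throw new Error(`expected 200, got ${res.status}`);
  // Specific assertions: prove the verdict and its supporting evidence.
  if (res.body.verdict !== expected) {
    throw new Error(`expected verdict ${expected}, got ${res.body.verdict}`);
  }
  if (expected === "PASS" && !res.body.evidenceHash) {
    throw new Error("PASS verdict missing evidenceHash");
  }
}
```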
Every failure gets a finding number (F-001 through F-390+), a severity, expected vs. actual behavior, reproduction steps, and an impact assessment. Findings are tracked in a catalog any team member can read. Resolved findings move to an archive. The open ones stay visible.
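A finding record with those fields might look like the following sketch. The field names and severity levels are assumptions for illustration; the real catalog schema may differ.

```typescript
// Illustrative finding record, based on the fields described above
// (finding number, severity, expected vs. actual, repro, impact).
// The real catalog schema may differ.
interface Finding {
  id: string; // e.g. "F-042"
  severity: "low" | "medium" | "high" | "critical";
  expected: string;
  actual: string;
  reproduction: string[]; // ordered steps
  impact: string;
  status: "open" | "resolved";
}

const example: Finding = {
  id: "F-042",
  severity: "medium",
  expected: "Webhook retried with backoff after a 5xx response",
  actual: "Retry fired immediately, exhausting the attempt budget",
  reproduction: [
    "Register a webhook endpoint that returns 503",
    "Trigger a mandate settlement",
    "Inspect delivery attempts in the dead-letter queue",
  ],
  impact: "Transient receiver outages burn all retries within seconds",
  status: "resolved",
};
```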
Infrastructure: no mocks
The testbed runs against a live deployment — not a mock server, not a test double, not an in-memory stub. The stack:
Compute — EKS cluster (dedicated testbed namespace)
Database — Aurora Serverless v2, PostgreSQL 17
Networking — ALB Ingress, WAF with IP allowlist
Deployment — Helm chart, same Docker image as production
Workers — Real async verification, settlement, and webhook delivery
Fresh namespace per test run. Ephemeral database. Auto-provisioned keys. Deterministic cleanup. The testbed tests what customers actually run.
What we test
36 tests organized by deployment surface and customer interface. Each test is a standalone executable TypeScript file — no shared state, no fixtures, no implicit ordering.
| Profile | What it validates |
|---|---|
| infra | ALB health, WAF rules, TLS cert handling, Aurora failover behavior, HA support bundle |
| onboarding | First-run install, license provisioning, fresh registration, day-one customer path |
| integration | Orchestrator chains, mixed-chain (ERP + AI) workflows, entity references, scope profiles |
| sdk | TypeScript SDK (@agledger/sdk), Python SDK (agledger), CLI, MCP server — zero-scaffolding discovery |
| webhooks | HMAC signatures, delivery reliability, retry logic, dead-letter queue, failure recovery |
| federation | Cross-boundary signing, settlement signal emission, hub/gateway interoperability |
| compliance | SOC 2 control mapping (CC1.1, CC6.x, CC7.2, PI1), audit trail completeness, tamper detection |
| day-2 | Vault signing keys, scope profile management, YAML provisioning, operational runbooks |
Logic tests — state machines, RBAC matrices, schema validation, cryptographic primitives — live in the AGLedger API repo’s own integration suite. That is where they belong. The testbed never re-runs them.
Test profiles
Not every change needs the full 36 tests. Named profiles run the relevant subset:
smoke — Core lifecycle plus basic auth. Run on every PR.
infra — ALB, WAF, Aurora, TLS behavior.
onboarding — Fresh customer install path.
integration — Orchestrator, mixed chains, entity references.
soc2 — Mapped to SOC 2 controls (CC1.1, CC6.x, CC7.2, PI1).
all — 36 tests. ~5 minutes. Full suite on release.
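Conceptually, a profile is just a named subset of test files, with `all` as the deduplicated union. The mapping below is a hypothetical sketch; the file names and profile contents are illustrative, not the testbed's actual layout.

```typescript
// Hypothetical profile-to-test-file mapping; names are illustrative.
const profiles: Record<string, string[]> = {
  smoke: ["lifecycle.test.ts", "auth-basic.test.ts"],
  infra: ["alb.test.ts", "waf.test.ts", "aurora-failover.test.ts", "tls.test.ts"],
  soc2: ["cc1-1.test.ts", "cc6.test.ts", "cc7-2.test.ts", "pi1.test.ts"],
};

// "all" is the union of every profile, deduplicated.
const all: string[] = [...new Set(Object.values(profiles).flat())];
```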
Stress tests with pg_stat_statements
Correctness tests catch bugs. Stress tests catch cost. We run a separate audit-read stress harness — 20 concurrent readers for 5 minutes against 2,260 mandates on Aurora Serverless v2 — and record pg_stat_statements for every release. The output is a perf-snapshot directory with before/after query breakdowns and a findings file that flags regressions.
This is how we caught a 6-query aggregation pattern in the enterprise-report endpoint that was burning 70 seconds of database CPU every 5 minutes. The rewrite cut it to 39 seconds. We wrote it up, including the honest caveat about why wall-clock p50 didn’t improve the way you’d expect.
Testing with real LLM agents
Four of our tests use real AI models — not simulated agents, not scripted calls. We give Claude Haiku, Gemini Flash, GPT-4o-mini, and Amazon Nova the API tool definitions and a business task. No documentation. No examples. No hand-holding.
The question: can an agent that has never seen AGLedger before discover and complete a full mandate lifecycle from tool descriptions alone? Research shows LLMs score 84–89% on synthetic benchmarks but only 25–34% on real-world tasks. We test the real-world number.
We measure: discovery rate, error recovery, steps to completion, which providers get stuck, and where. Randomized business scenarios — procurement, analysis, coordination, infrastructure. Max 15 tool calls per task before timeout.
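The budget-and-metrics loop behind those measurements can be sketched as follows. Here `nextStep` stands in for a real model call plus tool dispatch, and the metric names are illustrative rather than the testbed's exact schema.

```typescript
// Sketch of the 15-tool-call budget loop. `nextStep` is a stand-in
// for a real model call plus tool dispatch; metric names are illustrative.
interface AgentMetrics {
  steps: number;
  errors: number;
  completed: boolean;
}

type Step = { ok: boolean; done: boolean };

function runBudgetedTask(
  nextStep: (history: Step[]) => Step,
  maxCalls = 15,
): AgentMetrics {
  const history: Step[] = [];
  while (history.length < maxCalls) {
    const step = nextStep(history);
    history.push(step);
    if (step.ok && step.done) break; // task completed within budget
  }
  const last = history[history.length - 1];
  return {
    steps: history.length,
    errors: history.filter((s) => !s.ok).length,
    completed: Boolean(last && last.ok && last.done),
  };
}
```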
This is how we find out whether our API is usable, not whether it is correct. Correctness comes from the API repo’s own suite. Usability comes from watching real agents try.
Customer reality tests
The onboarding profile simulates a real customer's first day. Fresh registration. No pre-configured accounts. SDK only (no raw API shortcuts). No rate limit exemptions. Unicode throughout. Error message quality checks.
If the onboarding path is broken, this profile catches it before a customer does.
What the numbers look like
Latest baseline (v0.19.10, 2026-04-17):
Profile — all (36 tests)
Assertions — 951
Passed — 927
Failed — 7
Skipped — 17
Findings cataloged — 390+ (most resolved; open ones tracked to resolution)
We publish these numbers because hiding them defeats the purpose. Most failures are single assertions within a test — an edge case in delegation cascading, a timing issue in webhook retry, a schema validation gap. Each one has a finding number and gets fixed.
Core lifecycle, authentication, and security tests are stable. Delegation chains and federation signing have the most open findings. That is where the complexity lives, and that is where we focus.
Why this matters
If you are evaluating AGLedger, the testbed is how we hold ourselves accountable. The same lifecycle we ask you to use for your agents — structured commitment, evidence of delivery, verdict — is what we apply to our own software.
Every test hits a real deployment. No mocks.
Every failure is cataloged and tracked. No hiding.
Every finding has a number, a severity, and a resolution path.
Real LLM agents test usability, not just correctness.
Stress tests catch cost regressions, not just bugs.
An accountability engine that cannot account for its own quality is not worth running.
Sources & further reading
NIST SP 800-115 — Technical Guide to Information Security Testing and Assessment
NIST IR 8397 — Guidelines on Minimum Standards for Developer Verification of Software
AICPA SOC 2 — Trust Services Criteria
PostgreSQL pg_stat_statements — Query execution statistics
RFC 8032 — Edwards-Curve Digital Signature Algorithm (Ed25519)
RFC 8785 — JSON Canonicalization Scheme (JCS)
arXiv 2510.26130 — Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Tasks
AWS Aurora PostgreSQL — Best Practices