End-to-end testing a multi-language SDK platform

Unit tests prove components work in isolation. Integration tests prove they work together inside one service. Neither proves that a TypeScript SDK talking to the real API returns the right value when a flag is flipped through the Python management client. End-to-end tests prove that — and the first time we ran ours, they found three real bugs that every unit test in the codebase had cheerfully missed. The bugs are below; first, the machine that caught them.

Why a separate repository

The e2e suite lives in its own smplkit/e2e repo rather than inside any service, for three reasons. The tests are infrastructure-agnostic — they exercise the system the way a customer does, through the public API and SDKs, and don’t know which service does what. They don’t run on every push — they’re workflow_dispatch, triggered manually before releases, because e2e-on-every-commit buys you a slow pipeline and a steady diet of infrastructure flakiness. And they need real credentials — API keys, a deployed environment, the power to create and destroy accounts — which are better kept out of every product repo’s blast radius.

The structure

Each language has a test program: not a test framework, just a script that creates a fresh account, performs a realistic sequence (create a flag, set targeting rules, evaluate from the SDK, verify), asserts results, and deletes the account. Fresh accounts per run mean no accumulated state and no name collisions — every run gets to call its flag checkout-v2 like it’s the first time. The programs run locally too, so an SDK change can be verified against staging before it’s pushed.
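A minimal sketch of that script shape, assuming a hypothetical management client with create_account, create_flag, evaluate, and delete_account methods (these names are illustrative, not the real SDK surface). FakeClient is an in-memory stand-in so the sketch runs without credentials; the point is the try/finally that deletes the account whether the checks pass or fail.

```python
import uuid


class FakeClient:
    """In-memory stand-in for a management client (hypothetical API)."""

    def __init__(self):
        self.accounts = {}

    def create_account(self, name):
        self.accounts[name] = {"flags": {}}
        return name

    def create_flag(self, account, key, default):
        self.accounts[account]["flags"][key] = default

    def evaluate(self, account, key):
        return self.accounts[account]["flags"][key]

    def delete_account(self, account):
        self.accounts.pop(account, None)


def check(label, expected, actual):
    """Print a self-diagnosing line and raise on mismatch."""
    ok = expected == actual
    status = "OK" if ok else "FAIL"
    print(f"{label}... expected {expected!r}, got {actual!r} {status}")
    if not ok:
        raise AssertionError(label)


def run(client):
    """Create a fresh account, run the scenario, and always clean up."""
    account = client.create_account(f"e2e-test-{uuid.uuid4().hex[:8]}")
    try:
        client.create_flag(account, "checkout-v2", default="control")
        check("Evaluating default", "control",
              client.evaluate(account, "checkout-v2"))
        return 0  # exit code for the harness
    except AssertionError:
        return 1
    finally:
        # Runs on success and on failure alike.
        client.delete_account(account)

# A real script would finish with: sys.exit(run(real_client))
```

The return value doubles as the process exit code, which is all the layer above needs to see.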

The suite covers the three products end to end: flags (create, target, evaluate with matching and non-matching contexts, update, and verify the update reaches the SDK over WebSocket within a bounded five-second wait), config (parent inheritance, child overrides, live updates), and logging (install, auto-discovery, remote level change applied in the running process). Output is human-readable as it goes — “Evaluating with matching context… expected ‘treatment’, got ‘treatment’ OK” — so a failure diagnoses itself.
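The bounded wait can be sketched as a small polling helper. This is an assumption about the mechanism, not the suite's actual code: wait_for and its parameters are hypothetical, and read_value stands in for whatever call reads the flag from the SDK's local state.

```python
import time


def wait_for(read_value, expected, timeout=5.0, interval=0.1):
    """Poll read_value() until it returns expected or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value == expected:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"expected {expected!r}, last saw {value!r} after {timeout}s"
            )
        time.sleep(interval)
```

Using time.monotonic rather than wall-clock time keeps the bound honest if the system clock is adjusted mid-run. The bound is a correctness limit, not a latency measurement: the helper only reports whether the update arrived in time.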

Above the scripts sits a small pytest harness whose test functions shell out to each program and verify it exits 0. The harness doesn’t know any language-specific details. It just runs programs and checks exit codes — which is why adding a seventh language means writing one script and registering it, and why one language’s failure never contaminates another’s report. The harness also owns setup and teardown: credentials, environment variables, and making sure accounts get cleaned up even when a script dies mid-run.
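The core of such a harness fits in a few lines. The sketch below is illustrative only: PROGRAMS, run_program, and assert_exits_zero are hypothetical names, and the registered commands are inline Python one-liners standing in for real per-language scripts so the example runs anywhere. Under pytest, assert_exits_zero would typically become a test function parametrized over the registry keys.

```python
import subprocess
import sys

# Hypothetical registry: name -> command that runs that language's program.
# Adding a language means adding one entry here.
PROGRAMS = {
    "passing": [sys.executable, "-c", "print('all checks OK')"],
    "failing": [sys.executable, "-c", "import sys; sys.exit(1)"],
}


def run_program(name, env=None, timeout=300):
    """Run one registered program and return its CompletedProcess."""
    return subprocess.run(
        PROGRAMS[name],
        env=env,              # None inherits the harness's environment
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def assert_exits_zero(name):
    """The harness's only assertion: the program exited 0."""
    result = run_program(name)
    assert result.returncode == 0, (
        f"{name} exited {result.returncode}\n{result.stdout}{result.stderr}"
    )
```

Because each program runs in its own process, a crash in one shows up as a nonzero exit code for that entry and nothing more.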

The three bugs

Nested context attributes didn’t match. The targeting rule {"==": [{"var": "user.metadata.tier"}, "enterprise"]} failed against a context that plainly contained {"user": {"metadata": {"tier": "enterprise"}}}. The JSON Logic evaluator handled top-level attribute access but not dotted-path access below it. The unit tests never used a context shaped like real data; the e2e scenario did, and caught it immediately.
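For illustration, dotted-path lookup amounts to walking the context one key at a time. This is a sketch, not the evaluator's actual fix: resolve_var and evaluate_equals are hypothetical names, and the code handles only nested dicts and the single rule shape quoted above.

```python
def resolve_var(context, path, default=None):
    """Walk a dotted path such as 'user.metadata.tier' through nested dicts."""
    current = context
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def evaluate_equals(rule, context):
    """Evaluate only the {"==": [{"var": path}, literal]} rule shape."""
    var_expr, literal = rule["=="]
    return resolve_var(context, var_expr["var"]) == literal


rule = {"==": [{"var": "user.metadata.tier"}, "enterprise"]}
context = {"user": {"metadata": {"tier": "enterprise"}}}
print(evaluate_equals(rule, context))  # True
```

A unit test built on a context this shape would have caught the bug; the e2e scenario simply happened to be the first place one appeared.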

WebSocket reconnection lost its subscriptions. After a disconnect and reconnect, the SDK wasn’t re-subscribing to the right channels, so flag updates after reconnection silently never arrived. Invisible to unit tests, which mock the WebSocket layer — which is to say, the tests mocked away exactly the thing that was broken.
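One common way to avoid this class of bug is to keep the set of subscribed channels on the client rather than on the connection, and replay it after every reconnect. The sketch below assumes that design; SubscriptionClient, FakeSocket, the channel names, and the message format are all hypothetical and do not describe the SDK's real protocol.

```python
class FakeSocket:
    """Records outgoing messages; stands in for a real WebSocket connection."""

    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


class SubscriptionClient:
    def __init__(self, connect):
        self._connect = connect   # callable that returns a fresh socket
        self._channels = set()    # lives on the client, survives reconnects
        self._socket = connect()

    def subscribe(self, channel):
        self._channels.add(channel)
        self._socket.send({"type": "subscribe", "channel": channel})

    def reconnect(self):
        """Open a new socket and replay every remembered subscription."""
        self._socket = self._connect()
        for channel in sorted(self._channels):
            self._socket.send({"type": "subscribe", "channel": channel})
```

Note that the fake socket here is the same kind of mock that hid the original bug; it demonstrates the replay logic, while only a run against a real server shows that updates arrive after a reconnect.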

A deleted parent config produced a 500 on its child. The delete itself succeeded; resolving a child config whose parent had been deleted then returned a 500 instead of a useful 404. We found this one by accident: the cleanup step happened to delete a parent before its child.
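As a sketch of the intended behavior, resolution can treat a missing parent as an explicit not-found condition instead of letting a lookup error escape as a server error. Everything here is assumed for illustration: the dict-backed store, the "parent" and "values" fields, and the NotFound exception that an API layer would map to a 404.

```python
class NotFound(Exception):
    """An API layer would translate this into an HTTP 404 response."""
    status = 404


def resolve_config(store, config_id):
    """Merge a config's values over its parent chain, child values winning."""
    config = store.get(config_id)
    if config is None:
        raise NotFound(f"config {config_id!r} not found")
    parent_id = config.get("parent")
    if parent_id is None:
        return dict(config["values"])
    if parent_id not in store:
        # The case the cleanup step stumbled on: the parent was deleted first.
        raise NotFound(f"parent {parent_id!r} of config {config_id!r} not found")
    merged = resolve_config(store, parent_id)
    merged.update(config["values"])
    return merged
```

The same function also shows the inheritance rule the config scenarios exercise: parent values first, child overrides on top.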

All three were fixed before the TypeScript SDK existed, which means the TypeScript e2e run never saw them. That ordering is the quiet argument for building the harness early: bugs fixed at the API layer are fixed for every SDK that hasn’t been written yet.

Cleanup

Test programs delete their own accounts on the way out, success or failure — but a program that crashes before teardown orphans its account. A scheduled cleanup script in staging sweeps for accounts matching the e2e-test- naming prefix that are older than 24 hours and deletes them. Belt, suspenders, cron.
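The sweep's selection rule is simple enough to show directly. This sketch assumes accounts arrive as dicts with a name and a timezone-aware created_at; stale_accounts and those field names are hypothetical, and the actual deletion call is left out.

```python
from datetime import datetime, timedelta, timezone

PREFIX = "e2e-test-"
MAX_AGE = timedelta(hours=24)


def stale_accounts(accounts, now=None):
    """Return accounts with the e2e prefix that are older than MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    return [
        account for account in accounts
        if account["name"].startswith(PREFIX)
        and now - account["created_at"] > MAX_AGE
    ]
```

Requiring both the prefix and the age means a run that is still in progress is left alone, and so is any account that was not created by the suite.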

Still on the list

Three known gaps. The suite runs against source, not against published packages — automatically running e2e when a new SDK version hits PyPI or npm would verify what customers actually install. All scenarios are single-language — the genuinely interesting test, create-a-flag-in-Python-evaluate-it-in-TypeScript, exercises the cross-language contract and doesn’t exist yet. And nothing asserts on timing, so a latency regression sails through; correctness bounds exist, performance baselines don’t.