End-to-End Testing a Multi-Language SDK Platform

Unit tests tell you that components work in isolation. Integration tests tell you that components work together within a service. Neither tells you that a TypeScript SDK talking to the production API returns the right value when a flag is flipped through the Python management client.

End-to-end tests tell you that.

We have a separate smplkit/e2e repository whose sole job is to run real scenarios against real infrastructure and verify that everything works together. This post describes what we test, how the test runner is organized, and what we found the first time we ran it against staging.

Why a Separate Repository

The e2e tests could live in any of the product service repositories, or in a monorepo alongside the service code. We chose a separate repository for a few reasons.

First, the e2e tests are infrastructure-agnostic. They test the system as a customer would experience it: through the public API and the SDKs. They don’t know or care which service handles which request; they only know the API surface.

Second, the e2e tests don’t run on every push. They run on workflow_dispatch — manually triggered, not automatically triggered by code changes. Running e2e tests on every commit would require a stable test environment, slow down CI, and create noise from infrastructure flakiness. Triggering them manually before major releases gives us the safety of e2e coverage without the CI overhead.

Third, e2e tests require credentials. They need real API keys for the test environment, access to the deployed API, and the ability to create and clean up test accounts. Keeping these credentials out of product service repositories reduces the blast radius of a credential leak.

The Test Structure

Each SDK language has a test program in the e2e repo — a script that exercises the SDK’s management and runtime surface against a live account.

The test programs are not pytest suites, and they don’t use any other language-native test framework. They’re scripts that:

  1. Create a fresh account for the test run (via the management API)
  2. Perform a series of operations (create a flag, set targeting rules, evaluate the flag, verify the value)
  3. Assert the results
  4. Clean up the account

Using a fresh account per run means tests are independent of each other and don’t accumulate state. It also means a test can create resources with simple names (checkout-v2, payments-timeout) without worrying about collisions with previous runs.
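
The four steps above can be sketched as follows. This is a minimal illustration, not the actual test program: FakeManagementClient and its methods are hypothetical in-memory stand-ins for a real management client, and the e2e-test- prefix is borrowed from the cleanup convention described later in this post.

```python
import uuid


class FakeManagementClient:
    """Hypothetical in-memory stand-in for a management API client."""

    def __init__(self):
        self.accounts = {}

    def create_account(self, name):
        self.accounts[name] = {"flags": {}}
        return name

    def create_flag(self, account, key, default):
        self.accounts[account]["flags"][key] = default

    def evaluate(self, account, key):
        return self.accounts[account]["flags"][key]

    def delete_account(self, account):
        self.accounts.pop(account, None)


def run_scenario(client):
    # Step 1: create a fresh, uniquely named account for this run.
    account = client.create_account(f"e2e-test-{uuid.uuid4().hex[:8]}")
    try:
        # Step 2: simple resource names are safe because the account is new.
        client.create_flag(account, "checkout-v2", default="control")
        value = client.evaluate(account, "checkout-v2")
        # Step 3: assert the result.
        assert value == "control", f"expected 'control', got {value!r}"
    finally:
        # Step 4: clean up even when an assertion above fails.
        client.delete_account(account)
    return account
```

Putting the cleanup in a finally block covers assertion failures, but not a hard crash of the process.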

The test programs are runnable locally as well as in CI. A developer who makes a change to the Python SDK can run the Python e2e test program against the staging environment to verify it before pushing.

What the Tests Cover

The e2e suite covers three areas:

Flags. Create a flag. Set targeting rules (attribute-based, with constrained values). Evaluate the flag from an SDK client with contexts that match and don’t match the rules. Verify the correct values are returned. Update the flag. Verify the update propagates to the SDK client via WebSocket (with a bounded wait time — up to 5 seconds for the WebSocket event to arrive).
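
A bounded wait of this kind is usually a small polling helper. The sketch below assumes the test can observe the SDK's current value through a callable; wait_until, its parameters, and the polling interval are illustrative names and values, not part of any SDK.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate until it returns True or timeout seconds elapse.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage inside a test program:
#   assert wait_until(lambda: client.evaluate("checkout-v2") == "treatment")
```

Using time.monotonic rather than wall-clock time keeps the deadline stable if the system clock is adjusted during the run.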

Config. Create a config with parent inheritance. Verify that the child config inherits keys from the parent. Override a key in the child. Verify the override resolves correctly in the child and doesn’t affect the parent. Resolve the config from an SDK client. Verify live update propagation.
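
The inheritance behavior being verified can be modeled in a few lines. This is a sketch under assumed data shapes (a dict of configs, each with a "parent" name and a "values" dict); it is not the service's resolution code.

```python
def resolve_config(configs, name):
    """Merge a config with its ancestors; child keys override parent keys."""
    chain = []
    seen = set()
    current = name
    while current is not None:
        if current in seen:
            raise ValueError(f"inheritance cycle at {current!r}")
        seen.add(current)
        chain.append(configs[current])
        current = configs[current].get("parent")

    resolved = {}
    for config in reversed(chain):  # apply the root first so children win
        resolved.update(config["values"])
    return resolved


# Hypothetical example data:
configs = {
    "base": {"parent": None, "values": {"timeout": 30, "retries": 3}},
    "payments": {"parent": "base", "values": {"timeout": 10}},
}
```

With this data, resolving "payments" yields the overridden timeout plus the inherited retries, while resolving "base" is unchanged by the child's override.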

Logging. Install the SDK’s logging integration. Verify that loggers are auto-discovered. Change a logger’s level remotely. Verify that the level change is applied in the running process within the bounded wait time. Verify that the adapter correctly reports the current level after the change.
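
For Python, the discovery and level-change checks can be approximated with the standard logging module alone. The sketch below does not use the SDK's integration; apply_remote_level is a hypothetical stand-in for whatever the SDK does when a remote level change arrives.

```python
import logging


def discover_loggers():
    """Return the names of loggers already created in this process."""
    return sorted(
        name
        for name, obj in logging.root.manager.loggerDict.items()
        if isinstance(obj, logging.Logger)  # skip PlaceHolder entries
    )


def apply_remote_level(logger_name, level_name):
    """Apply a level from a (hypothetical) remote update; return the new level."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(level_name)  # setLevel accepts names such as "DEBUG"
    return logging.getLevelName(logger.getEffectiveLevel())
```

A test program would create a logger, confirm it appears in discover_loggers(), trigger the remote change, and then poll getEffectiveLevel() within the bounded wait.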

Each test program produces human-readable output as it runs: “Creating flag ‘checkout-v2’… OK”, “Setting targeting rule… OK”, “Evaluating with matching context… expected ‘treatment’, got ‘treatment’ OK”. Failures print the expected and actual values with enough context to diagnose.
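
Output in that style needs very little machinery. The helper below is an illustrative sketch (check is a hypothetical name, and the exact wording of the real output may differ): it prints one line per check and returns whether the check passed.

```python
def check(description, expected, actual, out=None):
    """Print one human-readable line per check; return True on success.

    out is any file-like object; None means standard output.
    """
    ok = expected == actual
    status = "OK" if ok else "FAILED"
    print(
        f"{description}... expected {expected!r}, got {actual!r} {status}",
        file=out,
    )
    return ok
```

A program built on this can collect the return values and exit with a non-zero status if any check failed, which is all an outer runner needs to see.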

The pytest Harness

The test programs are individual scripts, but they’re orchestrated by a pytest harness in the e2e repo. The harness has test functions that shell out to each language’s test program and verify it exits with code 0.

This architecture means:

  • The harness doesn’t know any language-specific details. It just runs programs and checks exit codes.
  • Adding a new language SDK requires adding a test program in that language and registering it in the harness. The harness logic itself doesn’t change.
  • Failures in one language’s tests don’t affect other languages — pytest reports them independently.

The harness also handles setup and teardown: creating AWS credentials for the test run, setting environment variables, and ensuring test accounts are cleaned up even if a test program crashes.
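
The core of such a harness can be sketched as a registry of commands plus a function that runs one and checks its exit code. The registry contents below are placeholders that invoke the Python interpreter so the example runs anywhere; the real commands, names, and timeout are assumptions.

```python
import subprocess
import sys

# Hypothetical registry: language name -> command that runs its test program.
PROGRAMS = {
    "python": [sys.executable, "-c", "print('Creating flag... OK')"],
    "failing": [sys.executable, "-c", "import sys; sys.exit(1)"],
}


def run_program(command, timeout=300):
    """Run one test program; return (exit_code, combined output)."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout + result.stderr


def check_program(language):
    """Fail with the program's output if it did not exit with code 0."""
    code, output = run_program(PROGRAMS[language])
    assert code == 0, f"{language} e2e program exited with {code}:\n{output}"
```

In a pytest harness, a test function decorated with pytest.mark.parametrize over the registry keys could call check_program, so each language is reported as its own test case and registering a new language is a one-line change.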

What We Found the First Time

We ran the e2e suite against staging for the first time three months into the project, after the Python SDK was feature-complete but before we’d integrated the TypeScript SDK. We found three real bugs.

Bug 1: Flag evaluation with nested context attributes. The targeting rule {"==": [{"var": "user.metadata.tier"}, "enterprise"]} was failing to match even when the context contained {"user": {"metadata": {"tier": "enterprise"}}}. The JSON Logic evaluator was handling top-level attribute access correctly but not nested attribute access through dotted paths. The Python SDK’s unit tests didn’t cover this case. The e2e test, which ran a realistic scenario with realistic context shapes, caught it immediately.
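
To make the failure mode concrete, here is a minimal sketch of dotted-path lookup. It is not the SDK's evaluator: resolve_var and evaluate_equals are hypothetical names, only the single rule shape from this bug is handled, and Python's == is used in place of JSON Logic's own equality semantics.

```python
def resolve_var(context, path, default=None):
    """Look up a dotted path such as 'user.metadata.tier' in nested dicts."""
    current = context
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return default  # any missing segment means "no value"
    return current


def evaluate_equals(rule, context):
    """Evaluate only the {"==": [{"var": path}, literal]} rule shape."""
    var_expr, literal = rule["=="]
    return resolve_var(context, var_expr["var"]) == literal


rule = {"==": [{"var": "user.metadata.tier"}, "enterprise"]}
context = {"user": {"metadata": {"tier": "enterprise"}}}
```

An evaluator that treats "user.metadata.tier" as a single top-level key finds nothing in this context, which matches the behavior described above; walking the path segment by segment makes the rule match.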

Bug 2: WebSocket reconnection state. After a WebSocket disconnection and reconnection, the SDK was not re-subscribing to the correct channels. Flag updates after reconnection were not being received by the client. This was invisible in unit tests (which mock the WebSocket layer) and only reproducible in a real environment where network conditions could be simulated.
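
One common way to avoid this class of bug is to keep the set of desired subscriptions on the client and replay it on every new connection. The sketch below illustrates that pattern with a fake transport; the class and method names are hypothetical and this is not the SDK's implementation.

```python
class FakeTransport:
    """Stand-in for a WebSocket connection; records subscribe messages."""

    def __init__(self):
        self.subscribed = []

    def send_subscribe(self, channel):
        self.subscribed.append(channel)


class RealtimeClient:
    def __init__(self, connect):
        self._connect = connect   # callable returning a new transport
        self._channels = set()    # desired subscriptions outlive connections
        self._transport = connect()

    def subscribe(self, channel):
        self._channels.add(channel)
        self._transport.send_subscribe(channel)

    def on_disconnect(self):
        # A new connection starts with no server-side subscriptions,
        # so replay every channel the client still wants.
        self._transport = self._connect()
        for channel in sorted(self._channels):
            self._transport.send_subscribe(channel)
```

A unit test that mocks the whole WebSocket layer never exercises on_disconnect against a server that has forgotten the subscriptions, which is why this only showed up end to end.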

Bug 3: Config inheritance with missing parent. The config resolution endpoint was returning 500 when a child config referenced a parent that had been deleted. It should have returned 404 with a useful error. The e2e test for config cleanup (deleting configs after the test run) happened to delete the parent before the child, triggering this code path.
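
The intended behavior can be sketched as a handler that turns a dangling parent reference into an explicit 404 instead of letting a lookup error surface as a 500. The data shapes, names, and error messages here are assumptions for illustration, not the service's code.

```python
class ParentNotFound(Exception):
    """Raised when a config references a parent that no longer exists."""


def resolve_with_parent(configs, name):
    config = configs[name]
    parent_name = config.get("parent")
    if parent_name is None:
        return dict(config["values"])
    if parent_name not in configs:
        raise ParentNotFound(
            f"config {name!r} references missing parent {parent_name!r}"
        )
    # Parent values first, then the child's overrides.
    return {**resolve_with_parent(configs, parent_name), **config["values"]}


def handle_resolve(configs, name):
    """Return (status, body) the way an HTTP handler might."""
    if name not in configs:
        return 404, {"error": f"config {name!r} not found"}
    try:
        return 200, resolve_with_parent(configs, name)
    except ParentNotFound as exc:
        return 404, {"error": str(exc)}
```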

All three bugs were fixed before we integrated the TypeScript SDK, so the TypeScript e2e tests never hit any of them.

This is the value of e2e testing: it found real bugs in production-realistic scenarios that unit and integration tests missed.

Handling Test Environment Cleanup

Tests create accounts. Accounts hold data. We want clean environments after tests, especially in staging.

The test programs are responsible for their own cleanup: at the end of a test run (successful or not), they call the account deletion API. If the test program crashes before the cleanup step, the account is orphaned.

We handle this with a cleanup script that runs on a schedule in the staging environment: it lists all accounts with the test marker in their name (we use a prefix like e2e-test- for all test account names), checks if they’re older than 24 hours, and deletes them. This catches orphaned accounts from crashed test runs.
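
The selection logic of such a cleanup script is simple enough to sketch. The account records below are assumed to be dicts with a "name" and a timezone-aware "created_at"; the actual account listing and deletion calls are left out because their API is not shown in this post.

```python
from datetime import datetime, timedelta, timezone

TEST_PREFIX = "e2e-test-"
MAX_AGE = timedelta(hours=24)


def find_orphans(accounts, now=None):
    """Return names of test accounts older than MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    return [
        account["name"]
        for account in accounts
        if account["name"].startswith(TEST_PREFIX)
        and now - account["created_at"] > MAX_AGE
    ]
```

The age threshold matters: without it, the scheduled job could delete the account of a test run that is still in progress.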

What We’d Revisit

Automated e2e on release candidates. Currently e2e tests are manually triggered. We’d like to automatically trigger them when a new version of any SDK is published — verifying the published package, not the local source. This requires the e2e environment to have access to the SDK registries (PyPI, npm) and to install the published version dynamically.

Cross-language scenarios. The current tests exercise each SDK in isolation. A useful test category we haven’t built yet: create a flag using the Python management SDK, evaluate it using the TypeScript runtime SDK, verify the value. This tests the API contract between languages, not just each SDK in isolation.

Performance baselines. The e2e tests verify correctness but not performance. A regression that causes flag evaluation latency to double wouldn’t be caught by the current tests. Adding timing assertions (flag evaluation must complete in under 100ms) would catch performance regressions.
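
A timing assertion of this kind might look like the sketch below. The function names, the sample count, and the choice of the median are assumptions; a real baseline would also need to account for network variance between the test runner and the API.

```python
import statistics
import time


def median_latency_ms(operation, samples=20):
    """Run operation repeatedly; return the median wall-clock time in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        operation()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def assert_latency_under(operation, budget_ms=100.0, samples=20):
    """Fail if the median latency of operation exceeds budget_ms."""
    observed = median_latency_ms(operation, samples)
    assert observed < budget_ms, (
        f"median {observed:.1f}ms exceeds {budget_ms}ms budget"
    )
    return observed
```

Asserting on the median rather than on a single call makes the check less sensitive to one slow outlier, which matters in an environment already prone to infrastructure flakiness.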

smplkit’s e2e test suite lives in a separate repository, runs on workflow_dispatch, and uses fresh accounts per run. On its first run against staging, it found three real bugs that unit tests had missed.