Designing a feature flag system that catches your typos

Feature flags sound simple until you design them seriously. A boolean switch — feature on, feature off — is the obvious implementation and it’s insufficient almost immediately: teams want string variants for A/B tests, numeric values, JSON payloads for configuration-shaped features, per-environment state, targeting rules over user attributes, and rollouts that don’t require code changes.

Smpl Flags is our design for all of that, and this post covers the core decisions. The one we’d defend hardest is also the least glamorous: making a certain class of typo impossible.

Four types, and the constrained/unconstrained split

We support four flag types: BOOLEAN, STRING, NUMERIC, and JSON. BOOLEAN is on/off. STRING and NUMERIC return a value rather than a state, which is what experiment infrastructure needs. JSON returns structured configuration, for features whose behavior is parameterized by more than one scalar.

Each type can also be constrained or unconstrained. A constrained flag declares its set of valid values up front: a constrained STRING flag with variants ["control", "treatment-a", "treatment-b"] can only ever return one of those three, and the console validates that targeting rules only produce defined variants. An unconstrained flag can return any value of its type — right for open-ended value spaces like a numeric timeout threshold, or a JSON experiment config that evolves during the experiment.

The split exists because of a bug class we’ve met before: unconstrained string flags are a typo trap. A targeting rule that’s supposed to return "treatment-a" but returns "treatement-a" (note the typo) is a silent bug — nothing errors, the experiment just quietly runs with a variant nobody’s code recognizes. Constrained flags catch it at configuration time, before it can become a week of confusing dashboard data. That’s the design decision in the title: the invalid value has nowhere to live.
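
A minimal sketch of that configuration-time check, assuming a hypothetical FlagDefinition shape (the field names and the undefined_outputs helper are illustrative, not the service's actual models):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FlagDefinition:
    """Hypothetical flag definition; field names are illustrative only."""
    key: str
    flag_type: str                          # "BOOLEAN", "STRING", "NUMERIC", or "JSON"
    variants: Optional[list[Any]] = None    # None means the flag is unconstrained
    rule_outputs: list[Any] = field(default_factory=list)  # values the targeting rules can return


def undefined_outputs(flag: FlagDefinition) -> list[Any]:
    """Return every rule output that is not one of the declared variants."""
    if flag.variants is None:
        return []  # unconstrained: any value of the flag's type is allowed
    return [value for value in flag.rule_outputs if value not in flag.variants]


experiment = FlagDefinition(
    key="checkout-experiment",
    flag_type="STRING",
    variants=["control", "treatment-a", "treatment-b"],
    rule_outputs=["treatment-a", "treatement-a"],  # second value is the typo
)
print(undefined_outputs(experiment))  # ['treatement-a']
```

A save path that rejects any flag with a non-empty result is enough to keep the misspelled variant from ever reaching an SDK.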

JSON Logic for targeting rules

Targeting rules decide what a flag returns for a given evaluation context. Our rule format is JSON Logic, an open specification for expressing logical conditions and computations as JSON data. The condition behind “return ‘treatment’ if the user’s plan is ‘enterprise’ or the user ID is in a specific list” looks like:

{
  "or": [
    {"==": [{"var": "user.plan"}, "enterprise"]},
    {"in": [{"var": "user.id"}, ["user-123", "user-456", "user-789"]]}
  ]
}

Why JSON Logic instead of a custom DSL? Three reasons. First, rules are data, not code: they can live in a database column, travel over HTTP, and reach SDK clients for local evaluation without anyone writing a parser for a custom grammar. Second, evaluators already exist in all six of our SDK languages (Python, TypeScript, Go, Java, C#, and Ruby), so we vet existing implementations for correctness instead of writing and maintaining six of our own. Third, it’s a known quantity: people who’ve used JSON Logic elsewhere recognize it on sight, whereas a proprietary expression language arrives with no documentation, no training, and no debugging tooling except what we’d have to build.
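
To make "rules are data" concrete, here is a toy evaluator that handles only the four operators in the example rule (var, ==, in, or). It is an illustration, not a JSON Logic implementation: the real specification defines many more operators, loose equality, and an "or" that returns the first truthy value rather than a boolean. The lookup and evaluate names are made up for this sketch; in practice a vetted library does this job.

```python
from typing import Any

rule = {
    "or": [
        {"==": [{"var": "user.plan"}, "enterprise"]},
        {"in": [{"var": "user.id"}, ["user-123", "user-456", "user-789"]]},
    ]
}


def lookup(path: str, context: dict) -> Any:
    """Resolve a dotted path such as "user.plan"; missing keys yield None."""
    current: Any = context
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def evaluate(node: Any, context: dict) -> Any:
    """Evaluate a tiny subset of JSON Logic: var, ==, in, or."""
    if isinstance(node, list):
        return [evaluate(item, context) for item in node]
    if not isinstance(node, dict):
        return node  # a literal such as "enterprise"
    op, args = next(iter(node.items()))  # each rule node has exactly one operator key
    if op == "var":
        return lookup(args, context)
    values = [evaluate(arg, context) for arg in args]
    if op == "==":
        return values[0] == values[1]  # strict comparison; the spec's == is looser
    if op == "in":
        return values[0] in values[1]
    if op == "or":
        return any(values)  # simplified to a boolean
    raise ValueError(f"unsupported operator: {op}")


print(evaluate(rule, {"user": {"plan": "enterprise", "id": "user-001"}}))  # True
print(evaluate(rule, {"user": {"plan": "free", "id": "user-456"}}))        # True
print(evaluate(rule, {"user": {"plan": "free", "id": "user-999"}}))        # False
```

The rule never stops being plain JSON, which is what lets it sit in a database column and ship to an SDK unchanged.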

The trade-off is a learning curve for non-technical users. The console handles that with a UI rule builder that generates JSON Logic without anyone hand-writing it; the raw JSON stays accessible for power users with conditions the UI doesn’t express.

One table, one JSONB column

Flags are stored in a single table. Each row is a flag; its full configuration — targeting rules, per-environment state, type information — lives in a JSONB data column.

Why not a relational model with separate tables for rules, environments, and values? Because a flag’s configuration is read and written as a unit. Updating a targeting rule means updating the whole flag — you need the current state anyway to validate that the update is consistent — so a JSONB column holding the full state maps exactly to the access pattern. And because a flag’s structure evolves with the product: adding tags, or last-modified-by, or evaluation analytics means adding a key to the blob, not running a schema migration. For a product still maturing, that flexibility is worth a lot.

The structure inside the JSONB is enforced by the service’s Pydantic models. The database stores data reliably; the application expresses the complicated validation rules — each layer doing the thing it’s good at.
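
A rough sketch of that read-validate-write cycle, using only the standard library. The validate_flag_data function stands in for the Pydantic models, and the blob keys ("type", "environments", "rules", "tags") are assumptions made for the example, not the actual schema.

```python
import json

VALID_TYPES = {"BOOLEAN", "STRING", "NUMERIC", "JSON"}


def validate_flag_data(data: dict) -> dict:
    """Stand-in for the application-level models: check the keys this sketch relies on."""
    if data.get("type") not in VALID_TYPES:
        raise ValueError(f"unknown flag type: {data.get('type')!r}")
    if not isinstance(data.get("environments"), dict):
        raise ValueError("environments must be a mapping")
    if not isinstance(data.get("rules", []), list):
        raise ValueError("rules must be a list")
    return data  # keys the validator does not know about pass through untouched


def update_flag(stored_json: str, changes: dict) -> str:
    """Read the whole blob, apply the changes, validate the result, serialize it back."""
    data = json.loads(stored_json)
    data.update(changes)
    return json.dumps(validate_flag_data(data))


stored = json.dumps(
    {"type": "BOOLEAN", "environments": {"staging": {"value": True}}, "rules": []}
)

# Adding a new key is just a write; no table definition changes.
updated = json.loads(update_flag(stored, {"tags": ["checkout"]}))
print(updated["tags"])  # ['checkout']
```

An update that leaves the blob inconsistent, such as setting "type" to a value outside the four supported types, raises before anything is written.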

Auto-discovery

Flags are declared in code but managed in the console. When an application calls client.flags.get("checkout-v2") and no such flag exists in the account, the SDK reports the access and the flags service creates a record in “discovered” state — not yet managed, but visible in the console. From there an operator can manage it (give it explicit targeting rules) or leave it discovered, in which case it returns the default value defined at the call site.

The result is that adding a flag to your code requires no registration step at all. Write the code, deploy it, and the console shows you what your code is actually asking about.
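
The flow can be sketched with an in-memory stand-in for the flags service. Everything here is hypothetical: the class names, the record layout, and the default parameter on get (the post says the default comes from the call site but does not show the signature).

```python
from typing import Any


class FakeFlagsService:
    """In-memory stand-in for the flags service; not the real API."""

    def __init__(self) -> None:
        self.records: dict[str, dict[str, Any]] = {}

    def report_access(self, key: str) -> None:
        # The first sighting of an unknown key creates a "discovered" record.
        self.records.setdefault(key, {"state": "discovered"})


class FlagsClient:
    """Hypothetical SDK surface, reduced to a single get method."""

    def __init__(self, service: FakeFlagsService) -> None:
        self.service = service

    def get(self, key: str, default: Any) -> Any:
        record = self.service.records.get(key)
        if record is None:
            self.service.report_access(key)
            return default
        if record["state"] == "discovered":
            return default  # visible in the console, but not yet managed
        return record["value"]


service = FakeFlagsService()
flags = FlagsClient(service)

print(flags.get("checkout-v2", default=False))  # False
print(service.records["checkout-v2"])           # {'state': 'discovered'}

# An operator takes over the flag in the console (simulated here by a direct write).
service.records["checkout-v2"] = {"state": "managed", "value": True}
print(flags.get("checkout-v2", default=False))  # True
```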

Per-environment state

A flag that’s true in staging and false in production is the normal state of a rollout in progress, so flags carry per-environment state — a map inside the flag’s JSONB data, one entry per environment the account has configured. SDKs include the environment in the evaluation context, get back that environment’s state, and receive real-time updates over WebSocket when any flag in their environment changes.
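
As an illustration of the lookup, here is one possible shape for that map and a resolver over it. The key names ("environments", "value", "default") and the fallback to a stored default are assumptions for this sketch; the post does not specify the layout.

```python
from typing import Any

# Hypothetical per-flag document, as it might sit in the JSONB column.
flag_data = {
    "type": "BOOLEAN",
    "default": False,
    "environments": {
        "staging": {"value": True},
        "production": {"value": False},
    },
}


def resolve(flag: dict, context: dict) -> Any:
    """Return the state for the environment named in the evaluation context."""
    state = flag["environments"].get(context.get("environment"))
    if state is None:
        return flag["default"]  # environment not configured for this flag
    return state["value"]


print(resolve(flag_data, {"environment": "staging"}))     # True
print(resolve(flag_data, {"environment": "production"}))  # False
print(resolve(flag_data, {"environment": "qa"}))          # False
```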

What we’d revisit

Audit history. If a flag value changes and something breaks, you can see the current state but not who changed it or what it was before. Audit history is being built as a platform capability (see our post on Smpl Audit), and flags will be among the first resources to get full version history.

Percentage rollout. We don’t yet have a native “roll out to 10% of users” control with consistent hashing so the same user always lands in the same bucket. Yes: we shipped typo-proof variants before we shipped percentage rollout, which is table stakes for the category. We stand by the ordering and expect approximately nobody to agree. It’s on the roadmap.
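
For readers unfamiliar with the "same user, same bucket" property, deterministic hash bucketing is one common way to get it. The sketch below is a generic illustration of that technique, not Smpl Flags code or a committed design; the bucket and in_rollout names are made up.

```python
import hashlib


def bucket(flag_key: str, user_id: str, buckets: int = 100) -> int:
    """Map a (flag, user) pair to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # The first 8 bytes of the digest give a large integer, reduced to a bucket index.
    return int.from_bytes(digest[:8], "big") % buckets


def in_rollout(flag_key: str, user_id: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets for this flag."""
    return bucket(flag_key, user_id) < percent


# The same inputs always produce the same bucket, with no stored assignment.
print(bucket("checkout-v2", "user-123") == bucket("checkout-v2", "user-123"))  # True
```

Including the flag key in the hash input keeps the same users from being the early cohort for every flag, and raising the percentage only ever adds users: anyone inside at 10% is still inside at 50%.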

Dependency tracking. A production incident caused by a flag change would be easier to diagnose if flag evaluations were correlated with application errors. Today we don’t instrument evaluations beyond the raw count.