When you’re trying to understand an application bug, outage, or performance regression, one of the first and most evergreen questions to ask is: “What changed?”
“Premature dynamic configuration” is the mortal enemy of this question. I try to resist it as long as I can. Here’s why.
What’s dynamic configuration?
“Dynamic configuration” can take many forms, and though there’s no standard definition that I know of, it typically means:
A data source, external to committed code, that controls and changes application behavior without requiring a commit or rebuild.
Some examples:
- Records in a datastore like Redis, PostgreSQL, or a SQLite database.
- Environment variables managed by your PaaS, like Heroku environment variables or Kubernetes secrets.
This data could be used to do things like:
- Enable or disable a feature globally.
- Change a quota, limit, or some other static numeric threshold.
- Manage other application policy, like whether a certain situation should cause an error or not.
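As a concrete (hypothetical) sketch, knobs like these often end up read from environment variables; the names ENABLE_SEARCH_V2 and UPLOAD_LIMIT_MB here are invented for illustration:

```typescript
// A feature toggle and a numeric limit pulled from the environment
// instead of from committed code. Nothing in the repo records when
// or why these values change.
const searchV2Enabled: boolean = process.env.ENABLE_SEARCH_V2 === "true";
const uploadLimitMb: number = Number(process.env.UPLOAD_LIMIT_MB ?? "25");

function maxUploadBytes(): number {
  return uploadLimitMb * 1024 * 1024;
}
```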
Why resist?
It boils down to a simple fact:
You probably have far more infrastructure and tooling around commits and deploys than you do around dynamic config systems.
How can you test this claim? Look at the qualities of your dynamic configuration system.
When a value is introduced, changed, or removed…
- Is there a system to review that change first (like a code review)?
- Does the “before” value get saved in a durable, auditable log somewhere (like a commit history)?
- On deploy, is there a team-visible broadcast that the change occurred (like a build & deploy log)?
- Does the application emit a version number in its logs (like a RELEASE=$(git rev-parse --short HEAD))?
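That last point is cheap to implement in any stack. A minimal sketch, assuming a RELEASE variable injected at deploy time (e.g. RELEASE=$(git rev-parse --short HEAD)):

```typescript
// Tag every structured log line with the deployed release so log events
// are trivial to correlate with deploys. "unknown" flags environments
// where the variable was never injected.
const release: string = process.env.RELEASE ?? "unknown";

function logLine(message: string): string {
  return JSON.stringify({ release, message, at: new Date().toISOString() });
}
```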
In almost every case of “premature dynamic configuration” I’ve seen, the answers to all of these questions are no, while the benefits of dynamism (low cardinality, low frequency of change) are marginal compared to managing the record in code.
Now imagine the answers are all no, and you’re the oncall responder trying to figure out why 500s increased around 2pm. Or you’re the data analyst trying to understand why revenue jumped. I hope you know how to use @channel in Slack.
Getting away with something simpler
At one pre-product/market fit startup, we had a few hundred customers—and about 6 that really mattered. We also had a pretty fancy quota system, and we needed a way to configure it to give those important customers higher rates.
We immediately dreamed up a cool quota configuration system. We’d use Redis, or maybe it was Consul, to set and adjust limits by customer. We’d build a little dashboard for it which Ops could use. And we’d be able to onboard new customers and change policy in seconds!
Of course, this was completely overkill for 6 records (which might be 7 next year) that rarely needed to change. Instead we did the low-tech thing first and added const VIP_CUSTOMER_IDS = [...] in code.
Yes, this solution meant it took 3-4 minutes to get a quota change out (and probably 30 when you added in PR overhead). But it introduced no new infrastructure, it gave us full observability of when and how a change was made, and it made those changes trivial to correlate with deploys.
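A minimal sketch of that low-tech approach, with customer IDs and limits invented for illustration:

```typescript
// The VIP list lives in code: every change to it gets review, a commit
// history, and a deploy log entry for free.
const VIP_CUSTOMER_IDS: Set<string> = new Set(["cust_1001", "cust_1002"]);

const DEFAULT_RATE_LIMIT = 100; // requests per minute
const VIP_RATE_LIMIT = 1000;

function rateLimitFor(customerId: string): number {
  return VIP_CUSTOMER_IDS.has(customerId) ? VIP_RATE_LIMIT : DEFAULT_RATE_LIMIT;
}
```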
Tolerable exceptions
You might have noticed me hedging a bit by only shaming premature dynamic config.
There are indeed many cases where dynamic config is not only unavoidable, but actually pretty great:
- Environment variables for rarely-changing core configuration (such as DATABASE_URL), managed by your cloud provider¹ or orchestration layer.
- We all love 12-factor apps.
- Secret stores, such as Vault and those built into basically every cloud provider.
- Secrets should never be stored in code.
- Mature feature flag systems like Flagsmith and LaunchDarkly.
- These systems tend to have better answers to the rubric above (e.g. they come with decent logging and auditability).
- These also usually bring quickly provable benefit by accelerating experimentation and iteration, a worthy tradeoff against production simplicity.
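One way to keep the door open to a flag service without adopting it prematurely is to hide lookups behind a single function. This sketch uses invented names, not any vendor’s API; in-code defaults come first, and a provider SDK can replace the function body later:

```typescript
type FlagName = "new-checkout" | "search-v2";

// In-code defaults today; swap the body of isEnabled for a provider SDK
// call (LaunchDarkly, Flagsmith, ...) once the rubric above is satisfied.
const FLAG_DEFAULTS: Record<FlagName, boolean> = {
  "new-checkout": false,
  "search-v2": true,
};

function isEnabled(flag: FlagName): boolean {
  return FLAG_DEFAULTS[flag];
}
```

Because call sites only ever see isEnabled(), migrating to a managed provider touches one function, not the whole codebase.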
Not never, but not now
Readers of this blog will know I’m reluctant, not opposed: most systems earn their dynamic configuration eventually, and the point is to wait until the need is real. Not never, but not now.
1. Boring old Heroku handles this excellently: any change to environment settings bumps the strictly-increasing HEROKU_VERSION number and causes an internal re-deploy.