The frustration. You changed a setting. You can see the new value on your own server. You open any outside tool and it still shows the old one. An hour passes. Still the old one. Nothing you do on your side seems to change anything.
This post is about why that happens, and the small, elegant trick some systems use to make it stop happening. The example is a real rollout we shipped this week on a domain we run. No email-security background required. By the end you will know why a configuration change can sit invisible for hours, and what the systems that solved this problem properly did differently.
The short answer: the world cached you
When one program asks another program for a piece of information, there are two possible outcomes. Either the second program answers fresh every time, or the first program keeps the answer around for a while and reuses it. The second option, reusing the old answer, is called caching. Almost everything on the internet caches almost everything by default, because the alternative is unbearably slow and expensive.
When you open a web page, your browser caches the images so the next page on the same site loads faster. When your laptop looks up sudory.com, the local DNS resolver caches the answer so the next lookup does not hit the internet. When your mail server delivers a message, it caches whatever it learned about the recipient domain so the next message is quicker. Every layer of the stack has its own cache, its own rules about how long to keep things, and its own quiet assumption that whatever you said a minute ago is still true.
Which is great, until you want to change something.
There are always two caches
When you change a value the outside world observes, you are racing two different caches at the same time.
The first cache is one you can reach. It is on your side: the content delivery network in front of your website, the cache in front of your database, the build artifact in your deploy pipeline. You control these. When you deploy, you can usually tell them to invalidate, or you can serve content with headers that ask them to revalidate every time. Either way, once the deploy is done, someone who asks your server directly gets the new answer.
The second cache is one you cannot reach. It is on the other side: every laptop, phone, mail server, DNS resolver, browser, and build tool that ever asked you for the thing you are now changing. Each of them has its own copy and its own rule about how long to keep it. You cannot make them let go of their copy. You can only hope that the rule you encoded lets them.
Everything in this post is about the second cache.
Two strategies for distant caches
There are, roughly, two ways to get distant caches to notice a change.
The first strategy is wait it out. You set a short expiry time on the cached copy, and the distant cache forgets the old value when the timer runs out. This is what DNS does with its TTL field. Every DNS record carries a number that says how long resolvers are allowed to remember the answer. Thirty seconds, an hour, a day: the operator picks it. If you know you want to change the record next Tuesday, you drop the TTL from an hour to thirty seconds this Friday, and by Tuesday the world is refreshing the record every thirty seconds. Your change flips cleanly. The price is operational prep: you have to plan the change window ahead of time, and you cannot rush a change that was not pre-staged.
The second strategy is send a signal. You design the system so the distant cache can check, cheaply, whether its copy is still valid. If yes, great, it keeps the cached copy. If no, it refetches. The check is something small: a string that functions as a version number. Change the string, every cache notices, every cache refetches. The system does not need to wait for any timer. This strategy needs more design up front (you have to bake the version number into the protocol), but it gives you the ability to force a refresh whenever you want.
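The two strategies can be sketched in a few lines of Python. Everything here is illustrative (the class and method names are mine, not from any library): a `TTLCache` that forgets entries when a timer runs out, and a `TaggedCache` that compares a cheap out-of-band tag before deciding whether to refetch.

```python
import time

class TTLCache:
    """Strategy 1: wait it out. Entries expire after a fixed number of seconds."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # timer ran out: forget the old answer
            return None
        return value

class TaggedCache:
    """Strategy 2: send a signal. A cheap out-of-band tag decides staleness."""
    def __init__(self):
        self._store = {}  # key -> (value, tag)

    def get(self, key, current_tag, fetch):
        entry = self._store.get(key)
        if entry is not None and entry[1] == current_tag:
            return entry[0]           # tag matches: cached copy is still valid
        value = fetch()               # tag changed (or no copy): refetch now
        self._store[key] = (value, current_tag)
        return value
```

Note the asymmetry: the `TTLCache` owner can only wait, while the `TaggedCache` owner can force every reader to refetch at any moment by publishing a new tag.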
The rest of this post is about one specific, elegant implementation of the second strategy, and what happens to systems that forgot to include it.
Our rollout this week
This week we updated a small configuration on the sudory.com domain. The configuration governs inbound mail: when someone out on the internet sends a message to an address at our domain, which rules must their mail server follow?
The mechanism has an unfriendly name. It is called MTA-STS, short for Mail Transfer Agent Strict Transport Security, defined in an internet standard called RFC 8461. The idea itself is simple: we publish a small text file on a specific subdomain of ours, mta-sts.sudory.com. The file contains rules. Sending mail servers that have heard of MTA-STS (all the big ones: Gmail, Microsoft 365, Apple, Proton, Fastmail) fetch this file before delivering mail to us, and they apply the rules on the way in. The primary rule is "encrypt the connection before handing the message over, or refuse to deliver."
The file has a mode field. In testing mode, we publish the rule but do not yet block anything: senders who cannot encrypt still deliver, and they send us a report saying "I couldn't encrypt". In enforce mode, the same rule has teeth: senders who cannot encrypt refuse to deliver and mail bounces. The normal rollout is testing for a few weeks, then a flip to enforce once the reports look clean.
So we flipped. Our file now reads:
```
version: STSv1
mode: enforce
mx: aspmx1.migadu.com
mx: aspmx2.migadu.com
max_age: 604800
```

max_age is the number of seconds a sending mail server is allowed to keep our policy in cache: 604800 is seven days. Read in plain English, the five lines say "this is version 1 of the standard; enforce the rule; only deliver to these two specific mail servers; feel free to remember this for up to seven days."
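For illustration, here is a minimal sketch of how a sending server might parse that file into a usable structure. The function name is mine; a real MTA-STS client must also validate each field per RFC 8461 (for example, rejecting unknown modes) rather than accepting whatever it finds.

```python
def parse_mta_sts_policy(text):
    """Parse an MTA-STS policy file into a dict.
    Repeated mx keys are collected into a list. Minimal sketch only:
    no validation of modes, hostnames, or max_age bounds."""
    policy = {"mx": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "mx":
            policy["mx"].append(value)
        else:
            policy[key] = value
    return policy

POLICY = """\
version: STSv1
mode: enforce
mx: aspmx1.migadu.com
mx: aspmx2.migadu.com
max_age: 604800
"""
```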
We deployed. Within a minute curl against https://mta-sts.sudory.com/.well-known/mta-sts.txt returned the new file with mode: enforce. We opened an external scanner to verify. It still said testing. We opened another one. Also testing. We waited. Still testing.
Nothing was broken. Every mail server that had cached our policy during the testing phase was allowed to keep using it for up to seven days. Without an extra nudge, our flip from testing to enforce would roll out at exactly that glacial pace.
The trick: a version tag, out of band
Here is the elegance. Alongside the policy file, MTA-STS requires a tiny second piece of public data: a DNS record at _mta-sts.sudory.com with this shape:
```
v=STSv1; id=20260423T161500Z
```

A DNS record is just a small labeled string that any computer on the internet can look up, cheaply, in milliseconds. Mail servers (and only mail servers) read this one when they need to decide whether to trust their cached copy of our policy.
The id field is an opaque string we pick. It is not a hash of anything. It is not a checksum. It is just a label: "this is the version of our policy we currently publish." The protocol says: every sending mail server that receives a message destined for our domain does this:
1. Look up `_mta-sts.sudory.com`. Note the `id`.
2. Compare it to the `id` attached to whatever copy of our policy the sender has in cache.
3. If they match, skip the download. Keep using the cached policy.
4. If they differ, or no cached copy exists, download the policy fresh from `https://mta-sts.sudory.com/.well-known/mta-sts.txt`, store it labeled with the new `id`, and use that one.
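The four steps condense into one function. This is a sketch, not a real client: the DNS lookup and the HTTPS download are passed in as callables so the cache logic stands on its own, and all names are mine.

```python
def get_policy(domain, cache, lookup_txt_id, fetch_policy):
    """The cache check a sending server performs before delivery (sketch).
    lookup_txt_id: cheap DNS read of the id from _mta-sts.<domain>.
    fetch_policy:  full HTTPS download of the policy file.
    cache:         dict mapping domain -> (policy, id)."""
    current_id = lookup_txt_id(domain)            # step 1: look up the id
    cached = cache.get(domain)
    if cached is not None and cached[1] == current_id:
        return cached[0]                          # steps 2-3: match, reuse
    policy = fetch_policy(domain)                 # step 4: stale or absent, refetch
    cache[domain] = (policy, current_id)
    return policy
```

Run against mocks, the economics are visible: the cheap DNS read happens every time, the expensive HTTPS fetch only when the id changes.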
Which means: changing the file and leaving the id alone changes nothing for anyone who already has a copy. They go on happily using the old one for up to seven days. Changing the id is how you tell every sender on the internet, at the same time, that their cache is stale.
So after deploying the new policy file, we edited the _mta-sts.sudory.com DNS record. We changed the id from 20260423T143000Z (the timestamp when we first published the testing policy) to 20260423T161500Z (the timestamp of the enforce flip). A three-character difference. The protocol specifies nothing about the contents of the string: only that a change of any kind forces a refresh.
Within minutes, scanners picked up the new policy. Mail servers that had our old policy in cache refetched on their next attempt to deliver mail to us and saw mode: enforce. The rollout had moved from a seven-day drag to a ten-minute blip.
The same pattern shows up everywhere
Once you have seen this trick, you cannot unsee it. Every system that caches by default and needs a way to force refresh has some version of it: a small, cheap-to-check identifier, carried separately from the heavy content, with a rule that says "refetch when the identifier changes."
| System | What the cache holds | The version tag | What a change looks like |
|---|---|---|---|
| MTA-STS (email policy) | The policy file | id in the _mta-sts DNS record | Edit the record, every sender refetches |
| Web page resources | An image, a script, a stylesheet | The ETag response header | Server returns a new ETag, browser drops the cached copy |
| Static frontend assets | Compiled JS and CSS bundles | A hash baked into the filename: app.9a3f.js | New build produces a new filename; old filename still cacheable forever |
| Container images | Image layers on every host | The image digest sha256:... | A new image has a new digest; pulling the new tag fetches only changed layers |
| Git | The tree of files | The commit SHA | Any change produces a new SHA; clients fetch only the diff to the new one |
| Service workers | An installed app shell in a browser | A version constant inside the worker script | Change the string, the browser installs the new worker and activates it on next load |
All of these look different on the surface: one uses DNS, another uses an HTTP header, another uses a filename, another uses a cryptographic hash. They are the same idea in different costumes. A version tag that lives next to the thing, is cheap to fetch, and tells you whether your cached copy is still good.
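The static-asset row deserves a concrete sketch, because it is the version tag pushed to its extreme: the tag is baked into the name itself. A minimal version, roughly in the spirit of what bundlers do (the function name and eight-character hash length are my choices, not any tool's convention):

```python
import hashlib

def fingerprinted_name(filename, content):
    """Bake a short content hash into a filename, e.g. app.js -> app.9a3f2c1d.js.
    Any change to the bytes produces a new name, so the old name never goes
    stale and can be cached forever."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, _, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}"
```

The elegant consequence: invalidation disappears entirely. Nobody ever refetches `app.9a3f.js`, because a changed bundle is a different URL; only the small HTML page that references it needs a short cache lifetime.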
What happens when the pattern is missing
Not every system got this right. A few important ones forgot to include a version tag, and the operational cost has been considerable.
HSTS
HSTS is a close cousin of MTA-STS for web traffic. A web server tells a browser "always speak HTTPS to me for the next N seconds" by sending a response header called Strict-Transport-Security. The browser respects that for N seconds. There is no version tag, no override, no way to invalidate the entry early. If you set the timeout to one year and discover a week later that your HTTPS setup is subtly broken, every browser that saw your header will refuse to speak to you over plain HTTP for a year. The only way out is to serve a new header with a timeout of zero, and hope the same browsers come back to see it before the old one expires.
This is why experienced operators always roll out HSTS by stages. Start with an hour. Confirm nothing broke. Try a day. Then a week. Then a month. Then a year. Each stage is a safety net, because the design has no abort switch.
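Written as data, the staged rollout is just a ladder of max-age values, with a zero-valued header as the only abort switch. The ladder below is illustrative, not prescriptive:

```python
# Each stage caps how long browsers will refuse plain HTTP
# if the HTTPS setup turns out to be subtly broken.
HSTS_STAGES = [3600, 86400, 604800, 2592000, 31536000]  # hour, day, week, ~month, year

def hsts_header(max_age):
    """Render the Strict-Transport-Security response header for one stage."""
    return f"Strict-Transport-Security: max-age={max_age}"

def hsts_abort_header():
    """The only escape hatch the design allows: a zero max-age tells a
    returning browser to forget the promise. Browsers that do not return
    before their stored timer expires never see it."""
    return hsts_header(0)
```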
The HSTS preload list
Worse is HSTS preload. Instead of shipping the "use HTTPS only" promise as a response header, operators can submit their domain to a static list of known-HTTPS-only domains inside the Chromium source code. Every browser derived from Chromium (Chrome, Edge, Brave, Tor, Opera) treats the domains on that list as HTTPS-required out of the box, and Firefox and Safari import the list on their own schedules. There is no header, no timeout, no version tag. You are on the list until Google removes you, and removal takes a browser release cycle to roll out in practice: months.
There is no cache invalidation problem here because there is no cache: the list is the state, shipped with the browser binary. Invalidation is a release process. This is fine for something narrow like "please use TLS." It was fatal for a similar mechanism called HTTP Public Key Pinning (HPKP, RFC 7469), which tried to apply the same idea to TLS certificates. Operators pinned keys they later lost or rotated wrongly, and every browser that had visited them refused to trust the recovered site until the pin's max-age elapsed. One of the most-cited public casualties was Smashing Magazine, which self-documented its own HPKP-induced outage in a post titled Be Afraid Of HTTP Public Key Pinning. Researcher Scott Helme called the failure pattern "HPKP suicide" in a widely read 2017 post. Chrome announced the deprecation of HPKP in late 2017, removed it in Chrome 72 in January 2019, and every other browser followed. A good reminder that ship-and-can't-unship only works when the rolled-out commitment is almost impossible to get wrong.
DNS itself
DNS has no version tag either. Every record carries a TTL, and resolvers cache for exactly that long. If you need to make a change and want the world to notice quickly, you have to lower the TTL ahead of time, wait for the old TTL to drain out of every resolver, then make the change with the new low TTL in place. It is "wait it out" in the most literal sense. Operators call this pre-lowering the TTL, and it is why mid-week DNS cutovers are a multi-day project and not a thirty-second edit.
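The pre-lowering arithmetic is simple enough to write down. Both function names are mine, purely for illustration:

```python
def when_to_lower_ttl(change_at, old_ttl_seconds):
    """Latest moment (epoch seconds) to pre-lower a record's TTL.
    A resolver that cached the record just before this point may serve the
    old answer for old_ttl_seconds more, so lowering any later leaves
    stale copies alive at change time."""
    return change_at - old_ttl_seconds

def earliest_fast_cutover(lowered_at, old_ttl_seconds):
    """Earliest moment a change propagates at the *new* low TTL's speed:
    every copy cached under the old TTL has drained out by then."""
    return lowered_at + old_ttl_seconds
```

For example, a record with a one-day TTL (86400 seconds) and a cutover planned for Tuesday noon means the TTL must be lowered no later than Monday noon.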
The sibling pattern: never flip what you have not watched
The other half of a boring policy rollout is not about invalidation at all. It is about making sure what you flip on actually works before you force it on the world. Every well-designed policy system ships with a shadow mode: publish the rule, watch the reports, do not yet enforce. The names differ. The pattern is the same.
| Policy | Shadow mode | Enforce mode | What senders report during shadow |
|---|---|---|---|
| MTA-STS (inbound email) | mode: testing | mode: enforce | Daily TLS reports ("I could not encrypt to you") |
| DMARC (email authentication) | p=none | p=quarantine or p=reject | Daily aggregate reports with pass/fail counts |
| CSP (browser content security) | Header named Content-Security-Policy-Report-Only | Header named Content-Security-Policy | Violation reports sent to a URL you specify |
| Software feature rollouts | Canary release, one percent of traffic | Full release, all traffic | Error rates, latency, explicit telemetry |
Shadow mode only works if you actually read the reports. The observation window depends on what traffic you care about: a month of reports catches the sender that only mails you monthly; a single hour catches the UI that breaks at peak traffic. Skipping the shadow window because "it will almost certainly work" is the universal source of outages you thought were safe to flip.
When you finally flip, an invalidatable policy (MTA-STS, CSP, anything with a version tag) is forgiving: if something surprises you, you flip back, bump the version tag, and the world goes back to the shadow state in minutes. An un-invalidatable one (HSTS preload, HPKP) is unforgiving: once it is out, it stays out. This is why the discipline matters more for the un-invalidatable policies, not less.
What we actually did, step by step
For people who want the literal sequence, here is our Thursday afternoon:
1. Edit the policy file. Change `mode: testing` to `mode: enforce` in `nuxt/public/.well-known/mta-sts.txt`. Merge to main. Vercel deploys it in around a minute. Verify by running `curl -s https://mta-sts.sudory.com/.well-known/mta-sts.txt` and reading the output.
2. Bump the id. Log in to the DNS provider. Open the `_mta-sts.sudory.com` TXT record. Change the `id` value to a new one (we use a UTC timestamp; the protocol does not care what you pick, as long as it differs). Save. Thirty seconds.
3. Wait a few minutes for the authoritative nameservers to announce the new record to caching resolvers. Verify with `dig +short TXT _mta-sts.sudory.com`, and again with `dig @1.1.1.1 +short TXT _mta-sts.sudory.com` from a different resolver to confirm propagation.
4. Confirm the reporting channel is still pointed at a real inbox: `dig +short TXT _smtp._tls.sudory.com`. This is where sending mail servers will send us failure reports once they start enforcing. If something goes wrong after the flip, this is where we find out.
5. Done. Any sender whose connection to our mail host cannot use encryption now refuses to deliver, and we hear about it. No mail is lost in silence.
Rollback is the same sequence in reverse: revert the file edit, deploy, bump the id again. Five minutes.
Three habits that make this boring
Every team that runs policy changes on a regular basis converges on the same three habits.
One. If the policy you are changing has an out-of-band version tag, always bump it when you change the content. If it does not have one (HSTS, DNS), assume the worst case for invalidation is whatever cache timer you set, and plan the rollout around that window.
Two. Never flip a policy you have not watched in shadow mode first. The reporting channels exist exactly for this: to let you see how senders, browsers, and downstream systems behave against a not-yet-enforcing rule. The instinct to skip the shadow window is always wrong, because the things shadow mode catches are the things you cannot think of in advance.
Three. Verify every change from at least two independent vantage points. Your own local DNS resolver is not the internet. Two different public resolvers plus a direct fetch is the minimum convincing check. When those three agree on the new state, you are done. While they disagree, you are in a window where different observers see different states, and that window is where the surprises live.
That is the entire discipline. Written down it sounds obvious. In practice, most policy outages come from skipping exactly one of the three.
FAQ
I updated my DNS record an hour ago. Why do some services still see the old value?
Because DNS is designed to be cached. Every DNS record carries a TTL in seconds that says how long resolvers are allowed to remember the old answer. Until that TTL expires on each resolver, they keep serving the old value. The trick is to lower the TTL well before you plan the change, wait for the old long TTL to drain out of the world, then change the record. Now the world refreshes fast.
Why do some systems let me force a refresh, and others make me wait?
It depends on whether the system was designed with an explicit version tag. Web resources have ETag headers. Container images have digests. MTA-STS policies have an id field in a separate DNS record. These systems let you change one string to tell the world "old copy is stale, refetch." Systems without a version tag (HSTS max-age, DNS itself, the HSTS preload list) force you to wait out the cache or the release cycle.
What is MTA-STS in plain terms?
A public announcement that says "when you send mail to my domain, you must use encryption." You publish it as a small text file on a specific subdomain, and you point to it from a DNS record. Major mail providers (Gmail, Microsoft 365, Apple, Fastmail) read the announcement before delivering mail to you. If the rules say encryption is required and their connection to you cannot do encryption, they refuse to deliver and send you a report. It is inbound-mail hardening, not outbound.
What is the difference between MTA-STS testing mode and enforce mode?
Testing mode publishes the rule but does not yet block anything. Senders who cannot satisfy the rule still deliver, and they send you a report saying "I could not encrypt." Enforce mode is the same rule with actual teeth: senders who cannot satisfy it refuse to deliver. The pattern is common. Ship in testing first for a couple of weeks, read the reports, fix any surprises, then flip to enforce.
How do I know my policy change has actually propagated?
Run at least two independent checks. One, fetch the resource directly and confirm the new content is there. Two, wait a few minutes and ask a different DNS resolver or a different network what it sees. Your local resolver is not the internet. If both reads show the new state, the change has landed everywhere that matters.