The frustration. You changed a setting. You can see the new value on your own server. You open any outside tool and it still shows the old one. An hour passes. Still the old one. Nothing you do on your side seems to change anything.
This post is about why that happens, and the small, elegant trick some systems use to make it stop happening. The example is a real rollout we shipped this week on a domain we run. No email-security background required. By the end you will know why a configuration change can sit invisible for hours, and what the systems that solved this problem properly did differently.
The short answer: the world cached you
When one program asks another program for a piece of information, there are two possible outcomes. Either the second program answers fresh every time, or the first program keeps the answer around for a while and reuses it. The second option, reusing the old answer, is called caching. Almost everything on the internet caches almost everything by default, because the alternative is unbearably slow and expensive.
When you open a web page, your browser caches the images so the next page on the same site loads faster. When your laptop looks up sudory.com, the local DNS resolver caches the answer so the next lookup does not hit the internet. When your mail server delivers a message, it caches whatever it learned about the recipient domain so the next message is quicker. Every layer of the stack has its own cache, its own rules about how long to keep things, and its own quiet assumption that whatever you said a minute ago is still true.
Which is great, until you want to change something.
There are always two caches
When you change a value the outside world observes, you are racing two different caches at the same time.
The first cache is one you can reach. It is on your side: the content delivery network in front of your website, the cache in front of your database, the build artifact in your deploy pipeline. You control these. When you deploy, you can usually tell them to invalidate, or you can serve content with headers that ask them to revalidate every time. Either way, once the deploy is done, someone who asks your server directly gets the new answer.
The second cache is one you cannot reach. It is on the other side: every laptop, phone, mail server, DNS resolver, browser, and build tool that ever asked you for the thing you are now changing. Each of them has its own copy and its own rule about how long to keep it. You cannot make them let go of their copy. You can only hope that the rule you encoded lets them.
Everything in this post is about the second cache.
Two strategies for distant caches
There are, roughly, two ways to get distant caches to notice a change.
The first strategy is wait it out. You set a short expiry time on the cached copy, and the distant cache forgets the old value when the timer runs out. This is what DNS does with its TTL field. Every DNS record carries a number that says how long resolvers are allowed to remember the answer. Thirty seconds, an hour, a day: the operator picks it. If you know you want to change the record next Tuesday, you drop the TTL from an hour to thirty seconds this Friday, and by Tuesday the world is refreshing the record every thirty seconds. Your change flips cleanly. The price is operational prep: you have to plan the change window ahead of time, and you cannot rush a change that was not pre-staged.
The second strategy is send a signal. You design the system so the distant cache can check, cheaply, whether its copy is still valid. If yes, great, it keeps the cached copy. If no, it refetches. The check is something small: a string that functions as a version number. Change the string, every cache notices, every cache refetches. The system does not need to wait for any timer. This strategy needs more design up front (you have to bake the version number into the protocol), but it gives you the ability to force a refresh whenever you want.
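The two strategies can be sketched in a few lines of Python. Everything here is illustrative (the class and method names are mine, not from any library): a `TTLCache` that forgets entries when a timer runs out, and a `TaggedCache` that compares a cheap out-of-band tag before deciding whether to refetch.

```python
import time

class TTLCache:
    """Strategy 1: wait it out. Entries expire after a fixed number of seconds."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # timer ran out: forget the old answer
            return None
        return value

class TaggedCache:
    """Strategy 2: send a signal. A cheap out-of-band tag decides staleness."""
    def __init__(self):
        self._store = {}  # key -> (value, tag)

    def get(self, key, current_tag, fetch):
        entry = self._store.get(key)
        if entry is not None and entry[1] == current_tag:
            return entry[0]           # tag matches: cached copy is still valid
        value = fetch()               # tag changed (or no copy): refetch now
        self._store[key] = (value, current_tag)
        return value
```

Note the asymmetry: the `TTLCache` owner can only wait, while the `TaggedCache` owner can force every reader to refetch at any moment by publishing a new tag.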
The rest of this post is about one specific, elegant implementation of the second strategy, and what happens to systems that forgot to include it.
Our rollout this week
This week we updated a small configuration on the sudory.com domain. The configuration governs inbound mail: when someone out on the internet sends a message to an address at our domain, which rules must their mail server follow?
The mechanism has an unfriendly name. It is called MTA-STS, short for Mail Transfer Agent Strict Transport Security, defined in an internet standard called RFC 8461. The idea itself is simple: we publish a small text file on a specific subdomain of ours, mta-sts.sudory.com. The file contains rules. Sending mail servers that have heard of MTA-STS (all the big ones: Gmail, Microsoft 365, Apple, Proton, Fastmail) fetch this file before delivering mail to us, and they apply the rules on the way in. The primary rule is "encrypt the connection before handing the message over, or refuse to deliver."
The file has a mode field. In testing mode, we publish the rule but do not yet block anything: senders who cannot encrypt still deliver, and they send us a report saying "I couldn't encrypt". In enforce mode, the same rule has teeth: senders who cannot encrypt refuse to deliver and mail bounces. The normal rollout is testing for a few weeks, then a flip to enforce once the reports look clean.
So we flipped. Our file now reads:
```
version: STSv1
mode: enforce
mx: aspmx1.migadu.com
mx: aspmx2.migadu.com
max_age: 604800
```

max_age is the number of seconds a sending mail server is allowed to keep our policy in cache: 604800 is seven days. Read in plain English, the five lines say "this is version 1 of the standard; enforce the rule; only deliver to these two specific mail servers; feel free to remember this for up to seven days."
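For illustration, here is a minimal sketch of how a sending server might parse that file into a usable structure. The function name is mine; a real MTA-STS client must also validate each field per RFC 8461 (for example, rejecting unknown modes) rather than accepting whatever it finds.

```python
def parse_mta_sts_policy(text):
    """Parse an MTA-STS policy file into a dict.
    Repeated mx keys are collected into a list. Minimal sketch only:
    no validation of modes, hostnames, or max_age bounds."""
    policy = {"mx": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "mx":
            policy["mx"].append(value)
        else:
            policy[key] = value
    return policy

POLICY = """\
version: STSv1
mode: enforce
mx: aspmx1.migadu.com
mx: aspmx2.migadu.com
max_age: 604800
"""
```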
We deployed. Within a minute curl against https://mta-sts.sudory.com/.well-known/mta-sts.txt returned the new file with mode: enforce. We opened an external scanner to verify. It still said testing. We opened another one. Also testing. We waited. Still testing.
Nothing was broken. Every mail server that had cached our policy during the testing phase was allowed to keep using it for up to seven days. Without an extra nudge, our flip from testing to enforce would roll out at exactly that glacial pace.
The trick: a version tag, out of band
Here is the elegance. Alongside the policy file, MTA-STS requires a tiny second piece of public data: a DNS record at _mta-sts.sudory.com with this shape:
```
v=STSv1; id=20260423T161500Z
```

A DNS record is just a small labeled string that any computer on the internet can look up, cheaply, in milliseconds. Mail servers (and only mail servers) read this one when they need to decide whether to trust their cached copy of our policy.
The id field is an opaque string we pick. It is not a hash of anything. It is not a checksum. It is just a label: "this is the version of our policy we currently publish." The protocol says: every sending mail server that receives a message destined for our domain does this:
1. Look up `_mta-sts.sudory.com`. Note the `id`.
2. Compare it to the `id` attached to whatever copy of our policy the sender has in cache.
3. If they match, skip the download. Keep using the cached policy.
4. If they differ, or no cached copy exists, download the policy fresh from `https://mta-sts.sudory.com/.well-known/mta-sts.txt`, store it labeled with the new `id`, and use that one.
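The four steps condense into one function. This is a sketch, not a real client: the DNS lookup and the HTTPS download are passed in as callables so the cache logic stands on its own, and all names are mine.

```python
def get_policy(domain, cache, lookup_txt_id, fetch_policy):
    """The cache check a sending server performs before delivery (sketch).
    lookup_txt_id: cheap DNS read of the id from _mta-sts.<domain>.
    fetch_policy:  full HTTPS download of the policy file.
    cache:         dict mapping domain -> (policy, id)."""
    current_id = lookup_txt_id(domain)            # step 1: look up the id
    cached = cache.get(domain)
    if cached is not None and cached[1] == current_id:
        return cached[0]                          # steps 2-3: match, reuse
    policy = fetch_policy(domain)                 # step 4: stale or absent, refetch
    cache[domain] = (policy, current_id)
    return policy
```

Run against mocks, the economics are visible: the cheap DNS read happens every time, the expensive HTTPS fetch only when the id changes.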
Which means: changing the file and leaving the id alone changes nothing for anyone who already has a copy. They go on happily using the old one for up to seven days. Changing the id is how you tell every sender on the internet, at the same time, that their cache is stale.
So after deploying the new policy file, we edited the _mta-sts.sudory.com DNS record. We changed the id from 20260423T143000Z (the timestamp when we first published the testing policy) to 20260423T161500Z (the timestamp of the enforce flip). A three-character difference. The protocol specifies nothing about the contents of the string: only that a change of any kind forces a refresh.
Within minutes, scanners picked up the new policy. Mail servers that had our old policy in cache refetched on their next attempt to deliver mail to us and saw mode: enforce. The rollout had moved from a seven-day drag to a ten-minute blip.
The same pattern shows up everywhere
Once you have seen this trick, you cannot unsee it. Every system that caches by default and needs a way to force refresh has some version of it: a small, cheap-to-check identifier, carried separately from the heavy content, with a rule that says "refetch when the identifier changes."
| System | What the cache holds | The version tag | What a change looks like |
|---|---|---|---|
| MTA-STS (email policy) | The policy file | id in the _mta-sts DNS record | Edit the record, every sender refetches |
| Web page resources | An image, a script, a stylesheet | The ETag response header | Server returns a new ETag, browser drops the cached copy |
| Static frontend assets | Compiled JS and CSS bundles | A hash baked into the filename: app.9a3f.js | New build produces a new filename; old filename still cacheable forever |
| Container images | Image layers on every host | The image digest sha256:... | A new image has a new digest; pulling the new tag fetches only changed layers |
| Git | The tree of files | The commit SHA | Any change produces a new SHA; clients fetch only the diff to the new one |
| Service workers | An installed app shell in a browser | A version constant inside the worker script | Change the string, the browser installs the new worker and activates it on next load |
All of these look different on the surface: one uses DNS, another uses an HTTP header, another uses a filename, another uses a cryptographic hash. They are the same idea in different costumes. A version tag that lives next to the thing, is cheap to fetch, and tells you whether your cached copy is still good.
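The static-asset row deserves a concrete sketch, because it is the version tag pushed to its extreme: the tag is baked into the name itself. A minimal version, roughly in the spirit of what bundlers do (the function name and eight-character hash length are my choices, not any tool's convention):

```python
import hashlib

def fingerprinted_name(filename, content):
    """Bake a short content hash into a filename, e.g. app.js -> app.9a3f2c1d.js.
    Any change to the bytes produces a new name, so the old name never goes
    stale and can be cached forever."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, _, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}"
```

The elegant consequence: invalidation disappears entirely. Nobody ever refetches `app.9a3f.js`, because a changed bundle is a different URL; only the small HTML page that references it needs a short cache lifetime.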
What happens when the pattern is missing
Not every system got this right. A few important ones forgot to include a version tag, and the operational cost has been considerable.
HSTS
HSTS is a close cousin of MTA-STS for web traffic. A web server tells a browser "always speak HTTPS to me for the next N seconds" by sending a response header called Strict-Transport-Security. The browser respects that for N seconds. There is no version tag, no override, no way to invalidate the entry early. If you set the timeout to one year and discover a week later that your HTTPS setup is subtly broken, every browser that saw your header will refuse to speak to you over plain HTTP for a year. The only way out is to serve a new header with a timeout of zero, and hope the same browsers come back to see it before the old one expires.
This is why experienced operators always roll out HSTS by stages. Start with an hour. Confirm nothing broke. Try a day. Then a week. Then a month. Then a year. Each stage is a safety net, because the design has no abort switch.
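Written as data, the staged rollout is just a ladder of max-age values, with a zero-valued header as the only abort switch. The ladder below is illustrative, not prescriptive:

```python
# Each stage caps how long browsers will refuse plain HTTP
# if the HTTPS setup turns out to be subtly broken.
HSTS_STAGES = [3600, 86400, 604800, 2592000, 31536000]  # hour, day, week, ~month, year

def hsts_header(max_age):
    """Render the Strict-Transport-Security response header for one stage."""
    return f"Strict-Transport-Security: max-age={max_age}"

def hsts_abort_header():
    """The only escape hatch the design allows: a zero max-age tells a
    returning browser to forget the promise. Browsers that do not return
    before their stored timer expires never see it."""
    return hsts_header(0)
```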
The HSTS preload list
Worse is HSTS preload. Instead of shipping the "use HTTPS only" promise as a response header, operators can submit their domain to a static list of known-HTTPS-only domains inside the Chromium source code. Every browser derived from Chromium (Chrome, Edge, Brave, Tor, Opera) treats the domains on that list as HTTPS-required out of the box, and Firefox and Safari import the list on their own schedules. There is no header, no timeout, no version tag. You are on the list until Google removes you, and removal takes a browser release cycle to roll out in practice: months.
There is no cache invalidation problem here because there is no cache: the list is the state, shipped with the browser binary. Invalidation is a release process. This is fine for something narrow like "please use TLS." It was fatal for a similar mechanism called HTTP Public Key Pinning (HPKP, RFC 7469), which tried to apply the same idea to TLS certificates. Operators pinned keys they later lost or rotated wrongly, and every browser that had visited them refused to trust the recovered site until the pin's max-age elapsed. One of the most-cited public casualties was Smashing Magazine, which self-documented its own HPKP-induced outage in a post titled Be Afraid Of HTTP Public Key Pinning. Researcher Scott Helme called the failure pattern "HPKP suicide" in a widely read 2017 post. Chrome announced the deprecation of HPKP in late 2017, removed it in Chrome 72 in January 2019, and every other browser followed. A good reminder that ship-and-can't-unship only works when the rolled-out commitment is almost impossible to get wrong.
DNS itself
DNS has no version tag either. Every record carries a TTL, and resolvers cache for exactly that long. If you need to make a change and want the world to notice quickly, you have to lower the TTL ahead of time, wait for the old TTL to drain out of every resolver, then make the change with the new low TTL in place. It is "wait it out" in the most literal sense. Operators call this pre-lowering the TTL, and it is why mid-week DNS cutovers are a multi-day project and not a thirty-second edit.
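The pre-lowering arithmetic is simple enough to write down. Both function names are mine, purely for illustration:

```python
def when_to_lower_ttl(change_at, old_ttl_seconds):
    """Latest moment (epoch seconds) to pre-lower a record's TTL.
    A resolver that cached the record just before this point may serve the
    old answer for old_ttl_seconds more, so lowering any later leaves
    stale copies alive at change time."""
    return change_at - old_ttl_seconds

def earliest_fast_cutover(lowered_at, old_ttl_seconds):
    """Earliest moment a change propagates at the *new* low TTL's speed:
    every copy cached under the old TTL has drained out by then."""
    return lowered_at + old_ttl_seconds
```

For example, a record with a one-day TTL (86400 seconds) and a cutover planned for Tuesday noon means the TTL must be lowered no later than Monday noon.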
The sibling pattern: never flip what you have not watched
The other half of a boring policy rollout is not about invalidation at all. It is about making sure what you flip on actually works before you force it on the world. Every well-designed policy system ships with a shadow mode: publish the rule, watch the reports, do not yet enforce. The names differ. The pattern is the same.
| Policy | Shadow mode | Enforce mode | What senders report during shadow |
|---|---|---|---|
| MTA-STS (inbound email) | mode: testing | mode: enforce | Daily TLS reports ("I could not encrypt to you") |
| DMARC (email authentication) | p=none | p=quarantine or p=reject | Daily aggregate reports with pass/fail counts |
| CSP (browser content security) | Header named Content-Security-Policy-Report-Only | Header named Content-Security-Policy | Violation reports sent to a URL you specify |
| Software feature rollouts | Canary release, one percent of traffic | Full release, all traffic | Error rates, latency, explicit telemetry |
Shadow mode only works if you actually read the reports. The observation window depends on what traffic you care about: a month of reports catches the sender that only mails you monthly; a single hour catches the UI that breaks at peak traffic. Skipping the shadow window because "it will almost certainly work" is the universal source of outages you thought were safe to flip.
When you finally flip, an invalidatable policy (MTA-STS, CSP, anything with a version tag) is forgiving: if something surprises you, you flip back, bump the version tag, and the world goes back to the shadow state in minutes. An un-invalidatable one (HSTS preload, HPKP) is unforgiving: once it is out, it stays out. This is why the discipline matters more for the un-invalidatable policies, not less.
What we actually did, step by step
For people who want the literal sequence, here is our Thursday afternoon:
1. Edit the policy file. Change `mode: testing` to `mode: enforce` in `nuxt/public/.well-known/mta-sts.txt`. Merge to main. Vercel deploys it in around a minute. Verify by running `curl -s https://mta-sts.sudory.com/.well-known/mta-sts.txt` and reading the output.
2. Bump the id. Log in to the DNS provider. Open the `_mta-sts.sudory.com` TXT record. Change the `id` value to a new one (we use a UTC timestamp; the protocol does not care what you pick, as long as it differs). Save. Thirty seconds.
3. Wait a few minutes for the authoritative nameservers to announce the new record to caching resolvers. Verify with `dig +short TXT _mta-sts.sudory.com`, and again with `dig @1.1.1.1 +short TXT _mta-sts.sudory.com` from a different resolver to confirm propagation.
4. Confirm the reporting channel is still pointed at a real inbox: `dig +short TXT _smtp._tls.sudory.com`. This is where sending mail servers will send us failure reports once they start enforcing. If something goes wrong after the flip, this is where we find out.
5. Done. Any sender whose connection to our mail host cannot use encryption now refuses to deliver, and we hear about it. No mail is lost in silence.
Rollback is the same sequence in reverse: revert the file edit, deploy, bump the id again. Five minutes.
Three habits that make this boring
Every team that runs policy changes on a regular basis converges on the same three habits.
One. If the policy you are changing has an out-of-band version tag, always bump it when you change the content. If it does not have one (HSTS, DNS), assume the worst case for invalidation is whatever cache timer you set, and plan the rollout around that window.
Two. Never flip a policy you have not watched in shadow mode first. The reporting channels exist exactly for this: to let you see how senders, browsers, and downstream systems behave against a not-yet-enforcing rule. The instinct to skip the shadow window is always wrong, because the things shadow mode catches are the things you cannot think of in advance.
Three. Verify every change from at least two independent vantage points. Your own local DNS resolver is not the internet. Two different public resolvers plus a direct fetch is the minimum convincing check. When those three agree on the new state, you are done. While they disagree, you are in a window where different observers see different states, and that window is where the surprises live.
That is the entire discipline. Written down it sounds obvious. In practice, most policy outages come from skipping exactly one of the three.
FAQ
I updated my DNS record an hour ago. Why do some services still see the old value?
Because DNS is designed to be cached. Every DNS record carries a TTL in seconds that says how long resolvers are allowed to remember the old answer. Until that TTL expires on each resolver, they keep serving the old value. The trick is to lower the TTL well before you plan the change, wait for the old long TTL to drain out of the world, then change the record. Now the world refreshes fast.
Why do some systems let me force a refresh, and others make me wait?
It depends on whether the system was designed with an explicit version tag. Web resources have ETag headers. Container images have digests. MTA-STS policies have an id field in a separate DNS record. These systems let you change one string to tell the world "old copy is stale, refetch." Systems without a version tag (HSTS max-age, DNS itself, the HSTS preload list) force you to wait out the cache or the release cycle.
What is MTA-STS in plain terms?
A public announcement that says "when you send mail to my domain, you must use encryption." You publish it as a small text file on a specific subdomain, and you point to it from a DNS record. Major mail providers (Gmail, Microsoft 365, Apple, Fastmail) read the announcement before delivering mail to you. If the rules say encryption is required and their connection to you cannot do encryption, they refuse to deliver and send you a report. It is inbound-mail hardening, not outbound.
What is the difference between MTA-STS testing mode and enforce mode?
Testing mode publishes the rule but does not yet block anything. Senders who cannot satisfy the rule still deliver, and they send you a report saying "I could not encrypt." Enforce mode is the same rule with actual teeth: senders who cannot satisfy it refuse to deliver. The pattern is common. Ship in testing first for a couple of weeks, read the reports, fix any surprises, then flip to enforce.
How do I know my policy change has actually propagated?
Run at least two independent checks. One, fetch the resource directly and confirm the new content is there. Two, wait a few minutes and ask a different DNS resolver or a different network what it sees. Your local resolver is not the internet. If both reads show the new state, the change has landed everywhere that matters.