Day 14: The Bug That Didn't End the World, and the One That Still Might
On December 31, 1999, a measurable percentage of the developed world stockpiled bottled water, withdrew cash from ATMs, and stayed up to see if the lights would go out at midnight.
They didn’t. Planes did not fall from the sky. Power grids did not collapse. Bank balances did not reset. The new millennium arrived, the champagne was opened, and by January 3rd everyone agreed it had been a hoax.
It was not a hoax. It was a save.
Y2K was a global $300+ billion engineering effort spread across roughly five years and almost every government and Fortune 500 IT department on Earth. The reason nothing happened on January 1, 2000 is that for half a decade, an enormous number of people worked very hard so that nothing would happen. The bug was real. The fix worked. Most people forgot it was ever a problem.
Twelve years from now, a structurally identical bug detonates again. But first, let’s understand what happened in ’99.
Y2K: the bug
The Y2K bug is almost embarrassingly simple. From the 1960s through the 1980s, computer storage was expensive enough that programmers habitually represented years with two digits: 99 instead of 1999, 73 instead of 1973. It saved two bytes per date. Across a payroll system tracking millions of employees, that mattered.
The assumption baked into that decision was: we’ll have rewritten this system long before the century rolls over.
This is the most consistently wrong assumption in software engineering. Code outlives its authors’ confidence. By the late 1990s, vast amounts of critical infrastructure (bank ledgers, airline reservation systems, hospital records, utility billing, social security disbursement, military logistics, nuclear plant monitoring) were running on COBOL programs from the 60s and 70s that had been patched but never rewritten. The language is unfamiliar to many, but the fix-it-later approach is relatable to everyone. Those programs all quietly assumed that a later year would always be a larger number, so that nothing would ever come after 99.
When the rollover hit, 99-12-31 + 1 day = 00-01-01 looked, mathematically, like jumping back to 1900. Interest calculations would compute negative ages. Pensioners would suddenly be billed for a century of debt. Reservation systems would mark every upcoming flight as having departed in the past. Insurance policies would expire en masse.
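The failure is easy to reproduce. The sketch below is a hypothetical illustration in Python rather than the COBOL of the era; the function names and the pivot value of 50 are assumptions made for the example. It shows a two-digit age calculation going negative at the rollover, followed by windowing, one of the common remediation techniques, which guesses the century from a pivot year.

```python
def age_two_digit(birth_yy: int, current_yy: int) -> int:
    """Age computed the legacy way: both years stored as two digits."""
    return current_yy - birth_yy


def expand_year(yy: int, pivot: int = 50) -> int:
    """Windowing: treat yy below the pivot as 20yy, otherwise as 19yy.

    The pivot of 50 is an arbitrary choice for this sketch.
    """
    return 2000 + yy if yy < pivot else 1900 + yy


def age_windowed(birth_yy: int, current_yy: int) -> int:
    """Age computed after expanding both years to four digits."""
    return expand_year(current_yy) - expand_year(birth_yy)


print(age_two_digit(73, 99))  # 26: correct in 1999
print(age_two_digit(73, 0))   # -73: the same person one year later
print(age_windowed(73, 0))    # 27: windowing restores the right answer
```

Windowing only postpones the ambiguity. With a pivot of 50, a stored 50 means 1950, so this particular scheme breaks again for dates from 2050 onward.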
The reason planes did not fall is that, starting roughly in 1995, every major airline, manufacturer, FAA system, and air traffic controller began an exhaustive audit-and-fix campaign.
The reason the power grid did not collapse is that every utility company in North America and Europe ran the same campaign on their SCADA systems.
The reason your bank balance was still correct on January 1, 2000 is that someone, somewhere, spent late nights in 1997 reading printouts of code written before they were born.
The estimated total global cost: $300 to $600 billion. The amount of measurable damage on January 1, 2000: small enough that people argued for the next decade about whether the spend had been justified.
It was. The bug was real. The fix worked. The result of a successful preventive engineering campaign is that it looks, in retrospect, like the problem was never there.
Y2038: the same bug, different number
Twelve years from now, on January 19, 2038, at 03:14:07 UTC, a structurally identical bug fires for a different reason.
Unix time is stored, on a huge amount of legacy infrastructure, as a signed 32-bit integer. That gives you about 2.1 billion seconds of positive range from the 1970 epoch. 2.1 billion seconds is 68 years. 1970 + 68 = 2038.
At 03:14:07 UTC on that date, the counter hits its maximum value, 2,147,483,647. The next tick overflows. In two’s-complement signed integer arithmetic, the value rolls over to its most negative possible value: -2,147,483,648. Interpreted as a Unix timestamp, that’s December 13, 1901.
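The arithmetic can be checked in a few lines. This is a minimal sketch, not how any particular kernel does it: Python integers never overflow, so the helper wrap_int32 (a name invented for this example) emulates 32-bit two’s-complement wraparound by hand, and to_utc converts the result to a calendar date.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit two's-complement range."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch to a UTC datetime.

    Adding a timedelta avoids platform limits on negative timestamps.
    """
    return EPOCH + timedelta(seconds=seconds)


last_good = INT32_MAX
after_tick = wrap_int32(last_good + 1)

print(last_good, to_utc(last_good))    # 2147483647 2038-01-19 03:14:07+00:00
print(after_tick, to_utc(after_tick))  # -2147483648 1901-12-13 20:45:52+00:00
```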
Every 32-bit Unix-derived system that hasn’t been patched will, in the span of one tick, conclude that it is now the early 20th century. The effects are the same family of effects as Y2K, but applied to a much wider deployment surface.
File modification times become nonsensical. SSL certificates appear expired, or worse, not-yet-valid. NTP synchronization fails. Filesystems with 32-bit inode timestamps lose ordering. Embedded device firmware that schedules tasks based on wall-clock time begins executing at random intervals. Industrial control systems that latch state machines on “time since last event” calculations latch on negative durations and either freeze or behave unpredictably.
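The negative-duration case deserves a concrete picture. The sketch below is hypothetical: is_stale stands in for any watchdog or state machine that compares elapsed wall-clock time against a timeout, and it assumes the subtraction happens in a wider type than the 32-bit clock value (which Python’s unbounded integers do naturally).

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit two's-complement range."""
    return (value + 2**31) % 2**32 - 2**31


def is_stale(now: int, last_event: int, timeout: int = 30) -> bool:
    """Hypothetical watchdog check: has the timeout elapsed since last_event?"""
    return now - last_event > timeout


last_event = INT32_MAX - 10        # event seen 10 s before the rollover
now = wrap_int32(last_event + 60)  # 60 s later, read from an unpatched clock

elapsed = now - last_event
print(elapsed)                     # -4294967236
print(is_stale(now, last_event))   # False: the timeout never fires
```

Sixty real seconds have passed, but the computed duration is about minus 136 years, so a check like this one would wait indefinitely. Code that compares in the other direction, such as an expiry test, would fail the opposite way and fire immediately.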
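Modern desktop and server operating systems are mostly fine. Linux added 64-bit time_t support for 32-bit architectures in kernel 5.6 (2020), and glibc followed in version 2.34 (2021) with an opt-in build setting, _TIME_BITS=64, so 32-bit software still has to be rebuilt to benefit. 64-bit Linux has always used a 64-bit time_t. macOS and Windows have been 64-bit-clean for over a decade. AWS, GCP, and Azure all run 64-bit kernels.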
The problem is not where you are reading this. The problem is in the physical world that keeps everything running.
The long tail is enormous
Estimates of the number of currently deployed 32-bit embedded devices that interact with time_t in some way range from a few hundred million to several billion.
Industrial controllers, automotive ECUs, network routers, smart-meter firmware, point-of-sale terminals, medical imaging devices, GPS units, cable boxes, elevator controllers, traffic light systems, ATM internals, payment terminals, building HVAC, water-treatment SCADA, satellite firmware, oil rig control systems, and the embedded computer in your refrigerator.
Each one, depending on vintage and vendor, may or may not have been patched.
Many of these devices are not internet-connected and cannot be patched remotely. Many are running firmware whose source code has been lost. Many are running firmware whose vendor no longer exists. Many are in places where physical access is hard: a deep-sea oil platform, a satellite in geostationary orbit, a controller welded inside an industrial machine.
The Y2K fix worked because the affected systems were largely centralized: mainframes in data centers, software at named companies, code with active maintainers. You could audit it. You could rewrite it. You could ship a patch.
Y2038 is decentralized. The affected systems are everywhere.
The Buff Must Flow
On January 1, 2022, on-premises Microsoft Exchange servers around the world stopped delivering email. The cause was a 32-bit signed integer in Exchange’s anti-malware scanner that stored the date and time packed into a single decimal number. When the year changed, that number exceeded the integer’s maximum of 2,147,483,647 and the scanner refused to load. Mail queues backed up everywhere. Microsoft shipped an emergency script the next day. They called it Y2K22.
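As widely reported at the time, the value was shaped like YYMMDDHHMM, so the first stamp of 2022 began with 22 and no longer fit in 32 bits. The sketch below illustrates that arithmetic only; encode_yymmddhhmm and fits_int32 are names invented here and say nothing about how Exchange is actually written.

```python
from datetime import datetime

INT32_MAX = 2**31 - 1


def encode_yymmddhhmm(moment: datetime) -> int:
    """Pack a date and time into a decimal number shaped like YYMMDDHHMM."""
    return int(moment.strftime("%y%m%d%H%M"))


def fits_int32(value: int) -> bool:
    """True if the value is representable as a signed 32-bit integer."""
    return -2**31 <= value <= INT32_MAX


before = encode_yymmddhhmm(datetime(2021, 12, 31, 23, 59))
after = encode_yymmddhhmm(datetime(2022, 1, 1, 0, 1))

print(before, fits_int32(before))  # 2112312359 True
print(after, fits_int32(after))    # 2201010001 False
```

Nothing about the calendar was special on that day. The encoding simply had a ceiling that the designers reached twenty-odd years into the century.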
On April 6, 2019, the GPS week number counter rolled over. The failure mode was familiar: a field that its designers thought would be big enough (10 bits, or 1,024 weeks, a little under 20 years) turned out, decades later, not to be. NYC’s municipal wireless network went down. KLM grounded a flight. Older car and marine GPS units showed dates in 1999.
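The 1999 dates fall straight out of the arithmetic. The legacy GPS signal carries the week number in a 10-bit field counted from January 6, 1980, so a receiver has to guess how many times the counter has already wrapped. The sketch below works in whole days and ignores leap seconds; the function names are assumptions for the example, not any receiver’s real firmware.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)   # start of GPS week 0
WEEKS_PER_CYCLE = 2**10        # a 10-bit counter holds 1,024 values


def broadcast_week(day: date) -> int:
    """Week number as a 10-bit counter would carry it."""
    full_weeks = (day - GPS_EPOCH).days // 7
    return full_weeks % WEEKS_PER_CYCLE


def decode_week(week: int, assumed_rollovers: int) -> date:
    """Start date of a broadcast week, given a guess at the rollover count."""
    return GPS_EPOCH + timedelta(weeks=week + assumed_rollovers * WEEKS_PER_CYCLE)


week = broadcast_week(date(2019, 4, 7))
print(week)                  # 0
print(decode_week(week, 2))  # 2019-04-07
print(decode_week(week, 1))  # 1999-08-22
```

A receiver whose firmware assumed only one rollover decodes week 0 as August 22, 1999. The boundary falls at midnight GPS time on April 7, which lands a few seconds before midnight UTC on April 6 because GPS time does not apply leap seconds.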
Those are two examples of overflows hitting production and breaking real things. Y2038 will be that kind of failure everywhere at once, in places nobody is thinking about.
Y2038 is foreseeable. We know about it. We know what needs to be done. We have twelve years. We should get started sooner rather than later. A lot of important systems need to be replaced, and the fewer that fall through the cracks, the better.
There’s no checklist for the devices we’ve already forgotten about, but maybe there should be.
Tomorrow: The Smear, how Google, Amazon, and Meta quietly decided to stop telling the truth about leap seconds, and why everyone else followed.
Sources
- Year 2000 problem — Wikipedia
- Year 2038 problem — Wikipedia
- Microsoft Exchange year 2022 bug in FIP-FS breaks email delivery — BleepingComputer
- Microsoft Exchange Fixes Disruptive ‘Y2K22’ Bug — BankInfoSecurity
- GPS week number rollover — Wikipedia
- GPS Week Number Rollover — GPS.gov
- The impact and resolution of the GPS week number rollover of April 2019 — Geoscientific Instrumentation (Copernicus)
- Linux kernel 5.6 — 64-bit time_t support for 32-bit architectures (KernelNewbies)
- The Open Group Base Specifications: time.h