Losing Track of Time
On September 14, 2004, around eight hundred aircraft were making long-distance flights above Southern California. A mathematical mistake was about to threaten the lives of the tens of thousands of people onboard. Without warning, the Los Angeles Air Route Traffic Control Center lost radio voice contact with all the aircraft. A justifiable amount of panic ensued.
The radios were down for about three hours, during which time the controllers used their personal cell phones to contact other traffic control centers to get the aircraft to retune their communications. There were no accidents but, in the chaos, ten aircraft flew closer to each other than regulations allowed (five nautical miles horizontally or two thousand feet vertically); two pairs passed within two miles of each other. Four hundred flights on the ground were delayed and a further six hundred canceled. All because of a math error.
Official details are scant on the precise nature of what went wrong, but we do know it was due to a timekeeping error within the computers running the control center. It seems the air-traffic control system kept track of time by starting at 4,294,967,295 and counting down once a millisecond. This meant that it would take 49 days, 17 hours, 2 minutes, and 47.295 seconds to reach 0.
Usually, the machine would be restarted before that happened, and the countdown would begin again from 4,294,967,295. From what I can tell, some people were aware of the potential issue, so it was policy to restart the system at least every thirty days. But this was just a way of working around the problem; it did nothing to correct the underlying mathematical error, which was that nobody had checked how many milliseconds there would be in the probable runtime of the system. So, in 2004, it accidentally ran for fifty days straight, hit zero, and shut down. Eight hundred aircraft traveling through one of the world's biggest cities were put at risk because, essentially, someone didn't choose a big enough number.
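The 49-day figure is just the starting value of the countdown converted from milliseconds into larger units. Here is a minimal sketch in Python using repeated `divmod`. The helper name `ms_to_dhms` is ours, for illustration only.

```python
def ms_to_dhms(ms: int) -> tuple[int, int, int, float]:
    """Split a millisecond count into days, hours, minutes and seconds."""
    seconds, millis = divmod(ms, 1000)   # whole seconds and leftover milliseconds
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds + millis / 1000

# Starting value of the countdown described above: one tick per millisecond.
START = 4_294_967_295

days, hours, minutes, secs = ms_to_dhms(START)
print(f"{days} days, {hours} hours, {minutes} minutes, {secs:.3f} seconds")
# prints "49 days, 17 hours, 2 minutes, 47.295 seconds"
```

Set against a thirty-day restart policy, that leaves a margin of almost twenty days, which is why the problem only struck once the policy slipped.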
People were quick to blame the issue on a recent upgrade of the computer systems to run a variation of the Windows operating system. Some of the early versions of Windows (most notably Windows 95) suffered from exactly the same problem. Whenever you started your computer, Windows would count up once every millisecond to give the "system time" that would drive all the other programs. But once the Windows system time hit 4,294,967,295, it would loop back to zero. Some programs (drivers, which allow the operating system to interact with external devices) would have an issue with time suddenly racing backward. These drivers need to keep track of time to make sure the devices are regularly responding and do not freeze for too long. When Windows told them that time had abruptly started to go backward, they would crash and take the whole system down with them.
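To see why a wraparound looks like time running backward, here is a small Python simulation of a 32-bit millisecond counter. Python integers never overflow on their own, so the sketch masks values to 32 bits by hand. The helpers `tick`, `naive_elapsed`, and `wrap_safe_elapsed` are hypothetical and are not taken from Windows or any real driver; the last one shows a common defensive idiom of doing the subtraction modulo 2**32.

```python
MASK = 0xFFFF_FFFF  # a 32-bit counter keeps only the low 32 bits

def tick(counter: int, ms: int = 1) -> int:
    """Advance a 32-bit millisecond counter, wrapping to zero after 4,294,967,295."""
    return (counter + ms) & MASK

def naive_elapsed(start: int, now: int) -> int:
    # Plain subtraction: goes negative once the counter has wrapped,
    # which looks like time running backward.
    return now - start

def wrap_safe_elapsed(start: int, now: int) -> int:
    # Modular (unsigned) subtraction: correct as long as fewer than
    # 2**32 ms (about 49.7 days) pass between the two readings.
    return (now - start) & MASK

start = 4_294_967_000   # a reading taken just before the wrap
now = tick(start, 500)  # 500 ms later the counter has rolled over

print(now)                            # 204
print(naive_elapsed(start, now))      # -4294966796
print(wrap_safe_elapsed(start, now))  # 500
```

The wrap-safe version is still only correct if the two readings are less than about 49.7 days apart, so it moves the limit rather than removing it.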
It is unclear if Windows itself was directly to blame or if it was a new piece of computer code within the control center system itself. But, either way, we do know that the number 4,294,967,295 is to blame. It wasn't big enough for people's home desktop computers in the 1990s, and it was not big enough for air-traffic control in the early 2000s. Oh, and it was not big enough in 2015 for the Boeing 787 Dreamliner aircraft.
The problem with the Boeing 787 lay in the system that controlled the electrical power generators. It seems they kept track of time using a counter that would count up once every 10 milliseconds (so, a hundred times a second) and that topped out at 2,147,483,647 (suspiciously close to half of 4,294,967,295). This means that the Boeing 787 could lose electrical power if turned on continuously for 248 days, 13 hours, 13 minutes, and 56.47 seconds. This was long enough that most planes would be restarted before there was a problem but short enough that power could, feasibly, be lost. The Federal Aviation Administration described the situation like this:
The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode. If the four main GCUs (associated with the engine-mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.
I believe that "regardless of flight phase" is official FAA-speak for "This could go down midflight." Their official line on airworthiness was the requirement of "repetitive maintenance tasks for electrical power deactivation." That is to say, anyone with a Boeing 787 had to remember to turn it off and on again. It's the classic computer programmer fix. Boeing has since updated its program to fix the problem, so preparing the plane for takeoff no longer involves a quick restart.
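As a quick check of the 248-day figure, Python's standard `datetime.timedelta` can multiply the reported maximum count by the reported tick length. This is only arithmetic on the numbers above, assuming a counter that tops out at 2,147,483,647 and ticks every 10 milliseconds; it is not the generator control unit software.

```python
from datetime import timedelta

MAX_COUNT = 2_147_483_647           # reported top value of the GCU counter
TICK = timedelta(milliseconds=10)   # one tick every 10 ms (a centisecond)

time_to_overflow = MAX_COUNT * TICK  # timedelta times int is exact
print(time_to_overflow)              # 248 days, 13:13:56.470000
```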
When 4.3 Billion Milliseconds Is Just Not Enough
Why would Microsoft, the Los Angeles Air Route Traffic Control Center, and Boeing all limit themselves to this seemingly arbitrary number of around 4.3 billion (or half of it) when keeping track of time? It certainly seems to be a widespread problem. There is a massive clue if you look at the number 4,294,967,295 in binary. Written in the 1s and 0s of computer code, it becomes 11111111111111111111111111111111; a string of thirty-two consecutive ones.
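Python's built-in `format` with the `"b"` format spec is a quick way to check the binary claim, and to confirm that the Boeing limit is the same pattern with one fewer digit:

```python
N = 4_294_967_295
HALF = 2_147_483_647

bits = format(N, "b")          # binary digits, without Python's "0b" prefix
half_bits = format(HALF, "b")

print(bits)                    # 11111111111111111111111111111111
print(len(bits))               # 32
print(len(half_bits))          # 31: one fewer 1 than the full number
print(N == 2**32 - 1)          # True
print(HALF == N // 2)          # True: half of 4,294,967,295, rounded down
```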
Most humans never need to go near the actual circuits or binary code on which computers are built. They only need to worry about the programs and apps that run on their devices and, occasionally, the operating system on which those programs run (such as Windows or iOS). All these use the normal digits of 0 to 9 in the base-10 numbers we all know and love.
But beneath it all lies binary code. When people use Windows on a computer or iOS on a phone, they are interacting only with the graphical user interface, or GUI (delightfully pronounced "gooey"). Below the GUI is where it gets messy. There are layers of computer code taking the mouse clicks and swipe lefts of the human using the device and converting them into the harsh machine code of 1s and 0s that is the native language of computers.
If you had space for only five digits on a piece of paper, the largest number you could write down would be 99,999. You've filled every spot with the largest digit available. What the Microsoft, air-traffic control, and Boeing systems all had in common is that they were 32-bit binary-number systems, which means the default is that the largest number they can write down is thirty-two 1s in binary, or 4,294,967,295 in base-10.
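The paper analogy works in any base: with n places in base b, filling every place with the top digit gives b**n - 1. A one-line sketch, where `largest` is a hypothetical helper name:

```python
def largest(digits: int, base: int) -> int:
    """Largest whole number that fits in `digits` places of `base`.

    Filling every place with the top digit (base - 1) gives
    (base - 1) * (1 + base + base**2 + ... + base**(digits - 1)),
    which simplifies to base**digits - 1.
    """
    return base**digits - 1

print(largest(5, 10))  # 99999: five spaces on a piece of paper
print(largest(32, 2))  # 4294967295: thirty-two binary places
```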
It was slightly worse in systems that wanted to use one of the thirty-two spots for something else. If you wanted to use that piece of paper with room for five symbols to write down a negative number, you'd need to leave the first spot free for a positive or negative sign, which would mean that you could now write down all the whole numbers between −9,999 and +9,999. It's believed Boeing's system used such "signed numbers," so, with the first spot taken, they only had room for a maximum of thirty-one 1s, which translates into 2,147,483,647. Counting only centiseconds rather than milliseconds bought them some time, but not enough.
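The paper analogy reserves a whole symbol for the sign. Real processors usually store signed integers in two's complement instead, which gives the same largest value, 2,147,483,647, plus one extra negative value. A minimal way to see the 32-bit limits in Python is to ask the standard `struct` module to pack numbers into a 4-byte signed field; `fits_signed_32` is a hypothetical helper name.

```python
import struct

def fits_signed_32(n: int) -> bool:
    """True if n can be packed as a standard-size signed 32-bit integer."""
    try:
        struct.pack("<i", n)   # "<i": little-endian, 4-byte signed int
        return True
    except struct.error:       # raised when n is out of range
        return False

print(fits_signed_32(2_147_483_647))   # True: 2**31 - 1, thirty-one 1s
print(fits_signed_32(2_147_483_648))   # False: one more tick overflows
print(fits_signed_32(-2_147_483_648))  # True: two's complement's extra negative value
```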
Thankfully, this is a can that can be kicked far enough down the road that it does not matter. Modern computer systems are generally 64-bit, which allows for much bigger numbers by default. The maximum possible value is of course still finite, so any computer system is assuming that it will eventually be turned off and on again. But if a 64-bit system counts milliseconds, it will not hit that limit until 584.9 million years have passed. So you don't need to worry: it will need a restart only twice every billion years.
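The 584.9-million-year figure can be reproduced with plain arithmetic, assuming the 64-bit counter is unsigned (so it tops out at 2**64 - 1) and that a year is exactly 365 days. A signed 64-bit counter would run out in roughly half the time.

```python
MAX_64 = 2**64 - 1                       # largest unsigned 64-bit value
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365  # assumes 365-day years

years = MAX_64 / MS_PER_YEAR
print(f"{years / 1e6:.1f} million years")               # 584.9 million years
print(f"{1e9 / years:.2f} restarts per billion years")  # 1.71, i.e. roughly two
```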
The analog methods of timekeeping we used before the invention of computers would, at least, never run out of room. The hands of a clock can keep spinning around; new pages can be added to the calendar as the years go by. Forget milliseconds: with only good old-fashioned days and years to worry about, you will not have any math mistakes ruining your day.
Or so thought the Russian shooting team as they arrived at the 1908 Olympic Games in London a few days before the international shooting was scheduled to start on July 10. But if you look at the results of the 1908 Olympics, you'll see that all the other countries did well, and there are no Russian results for any shooting event. That is because what was July 10 for the Russians was July 23 in the UK (and indeed most of the rest of the world). The Russians were using a different calendar.
It seems odd that something as straightforward as a calendar can go so wrong that a team of international athletes shows up at the Olympics two weeks late. But calendars are far more complex than you'd expect; it seems that dividing the year up into predictable days is not easy and there are different solutions to the same problems.
The universe has given us only two units of time: the year and the day. Everything else is the creation of humankind to try to make life easier. As the protoplanetary disk congealed and separated into the planets as we know them, the Earth was made with a certain amount of angular momentum, sending it flying around the sun, spinning as it goes. The orbit we ended up in gave us the length of the year, and the rate of the Earth's spin gave us the length of the day.
Except they don't match. There is no reason they should! It was just where the chunks of rock from that protoplanetary disk happened to fall, billions of years ago. The yearlong orbit of the Earth around the sun now takes 365 days, 6 hours, 9 minutes, and 10 seconds. For simplicity, we can call that 365 and a quarter days.
This means that, if you celebrate New Year's Eve after a year of 365 days, the Earth still has a quarter of a day of movement before you'll be back to exactly where you were last New Year's Eve. The Earth is tearing around the sun at a speed of around 30 kilometers every second, so this New Year's Eve you will be over 650,000 kilometers away from wherever you were last year. So, if your New Year's resolution was to not be late for things, you're already way behind.
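The distance figure is just the leftover time multiplied by the orbital speed. This sketch uses the text's rounded speed of 30 kilometers per second and the full 6 hours, 9 minutes, and 10 seconds rather than an exact quarter day:

```python
# Extra time the Earth needs beyond a 365-day calendar year: 6 h 9 min 10 s.
extra_seconds = 6 * 3600 + 9 * 60 + 10  # 22,150 s, a little over a quarter day
orbital_speed_km_s = 30                 # rounded figure used in the text

distance_km = extra_seconds * orbital_speed_km_s
print(f"{distance_km:,} km")  # 664,500 km
```

With exactly a quarter of a day (21,600 seconds), the same calculation gives 648,000 kilometers, so the "over 650,000" figure relies on the extra nine minutes or so.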
This goes from being a minor inconvenience to becoming a major problem because the Earth's orbital year controls the seasons. The Northern Hemisphere summer occurs around the same point in the Earth's orbit every year because this is where the Earth's tilt aligns the north toward the position of the sun. After every 365-day year, the calendar year moves a quarter of a day away from the seasons. After four years, summer would start a day later. In less than four hundred years, within the lifespan of a civilization, the seasons would drift by three months. After eight hundred years, summer and winter would swap places completely.
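The drift figures follow from the same quarter-day approximation: the calendar slips about 0.25 days per year, so drifting by d days takes d / 0.25 years. A small sketch (the names are ours), taking a year as 365.25 days so that three months is a quarter of it:

```python
DRIFT_PER_YEAR = 0.25  # days the calendar slips per 365-day year (text's approximation)
YEAR = 365.25          # days in the orbital year, to the same approximation

def years_until_drift(days: float) -> float:
    """Years of 365-day calendars needed to slip by `days` against the seasons."""
    return days / DRIFT_PER_YEAR

print(years_until_drift(1))         # 4.0: summer starts a day late
print(years_until_drift(YEAR / 4))  # 365.25: three months, a quarter of the seasons
print(years_until_drift(YEAR / 2))  # 730.5: summer and winter swapped
```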
To fix this, we had to tweak the calendar to have the same number of days as the orbit. Somehow, we needed to break away from having the same number of days every year, but without having a fraction of a day; people get upset if you restart the day at a time other than midnight. We needed to link a year to the Earth's orbit without breaking the tie between a day and the Earth's rotation.
The solution that most civilizations came up with was to vary the number of days in any given year so there is a fractional number of days per year on average. But there is no single way to do that, which is why there are still a few competing calendars around today (which all start at different points in history). If you ever have access to a friend's phone, go into the settings and change their calendar to the Buddhist one. Suddenly, they're living in the 2560s. Maybe try to convince them they have just woken up from a coma.
Our main modern calendar is a descendant of the Roman Republican calendar. It had only 355 days, which was substantially fewer than required, so every few years an entire extra month was inserted between February and March, adding an extra twenty-two or twenty-three days to the year. In theory, this adjustment could be used to keep the calendar aligned with the solar year. In practice, it was up to the reigning politicians to decide when the extra month should be inserted. As this decision could either lengthen their year of ruling or shorten that of an opponent, the motivation was not always to keep the calendar aligned.
A political committee is rarely a good solution to a mathematical problem. The years leading up to 46 BCE were known as the "years of confusion," as extra months came and went, with little relation to when they were needed. A lack of notice could also mean that people traveling away from Rome would have to guess what the date back at home was.
In 46 BCE Julius Caesar decided to fix this with a new, predictable calendar. Every year would have 365 days (the closest whole number to the true value), and the bonus quarter days would be saved up until every fourth year, which would have a single bonus day. The leap year with an extra leap day was born!
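Caesar's rule is simple enough to write as a one-line test. This is a sketch of the rule as described here, applied to ordinary positive year numbers only; the helper names are hypothetical, and BCE numbering is deliberately ignored. Averaging over one four-year cycle with exact fractions recovers the quarter day:

```python
from fractions import Fraction

def is_julian_leap(year: int) -> bool:
    """Every fourth year gets one bonus day (positive year numbers only)."""
    return year % 4 == 0

def julian_year_length(year: int) -> int:
    return 366 if is_julian_leap(year) else 365

# Average length over one four-year cycle.
cycle = [julian_year_length(y) for y in range(1, 5)]
average = Fraction(sum(cycle), len(cycle))
print(cycle)           # [365, 365, 365, 366]
print(average)         # 1461/4
print(float(average))  # 365.25
```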
To get everything back into alignment in the first place, the year 46 BCE had a possible-world-record 445 days. In addition to the bonus month between February and March, two more months were inserted between November and December. Then, from 45 BCE onward, leap years were inserted every four years to keep the calendar in sync.
Well, almost. There was an initial clerical error, by which the last year in a four-year period was double-counted as the first year of the next period, so leap years were actually put in every three years. But this was spotted and fixed, and by 3 CE everything was on track.
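One way to model the clerical error is to lay out four-year periods and put the leap day in each period's final year. If that final year is also counted as the first year of the next period, consecutive leap years end up three years apart instead of four. The function below is a hypothetical illustration with abstract year numbers starting at 1, not a reconstruction of the actual Roman dates:

```python
def leap_years(periods: int, double_count_boundary: bool, start: int = 1) -> list[int]:
    """Leap years produced by `periods` consecutive four-year periods."""
    leaps = []
    for _ in range(periods):
        last = start + 3    # a four-year period: start, start+1, start+2, start+3
        leaps.append(last)  # the leap day goes in the period's final year
        # Correct: the next period begins the following year.
        # Mistake: the final year is counted again as the next period's first year.
        start = last if double_count_boundary else last + 1
    return leaps

print(leap_years(5, double_count_boundary=False))  # [4, 8, 12, 16, 20]
print(leap_years(5, double_count_boundary=True))   # [4, 7, 10, 13, 16]
```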