Google’s Site Reliability team is responsible for keeping Google’s services and data centers up and running 24/7. In a blog post, Christopher Pascoe, Site Reliability Engineer talk about a project the Site Reliability Engineers took on to make sure that the fluctuations of time don’t adversely affect Google’s products and services.
Pascoe writes “It turns out that being on a revolving imperfect sphere floating in space, being reshaped by earthquakes and volcanic eruptions, and being dragged around by gravitational forces makes your rotation somewhat irregular. Fluctuations in Earth’s rotational speed mean that even very accurate clocks, like the atomic clocks used by global timekeeping services, occasionally have to be adjusted slightly to bring them in line with “solar time.” There have been 24 such adjustments, called “leap seconds,” since they were introduced in 1972. Their effect on technology has become more and more profound as people come to rely on fast, accurate and reliable technology.”
“Very large-scale distributed systems, like ours, demand that time be well-synchronized and expect that time always moves forwards. Computers traditionally accommodate leap seconds by setting their clock backwards by one second at the very end of the day. But this “repeated” second can be a problem.”[…]
“Our systems are engineered for data integrity, and some will refuse to work if their time is sufficiently “wrong.” We saw some of our clustered systems stop accepting work on a small scale during the leap second in 2005, and while it didn’t affect the site or any of our data, we wanted to fix such issues once and for all,” stated Pascoe.
This was the problem that a group of our engineers identified during 2008, with a leap second scheduled for December 31. Given our observations in 2005, we wanted to be ready this time, and in the future. How could we make sure everything at Google stays running as if nothing happened, when all our server clocks suddenly see the same second happening twice? Also, how could we make this solution scale? Would we need to audit every line of code that cares about the time? (That’s a lot of code!)”
“The solution we came up with came to be known as the “leap smear.” Pascoe writes “We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day. All of our servers were then able to continue as normal with the new year, blissfully unaware that a leap second had just occurred. We plan to use this “leap smear” technique again in the future, when new leap seconds are announced by the IERS.””
“Usually when a leap second is almost due, the NTP protocol says a server must indicate this to its clients by setting the “Leap Indicator” (LI) field in its response. This indicates that the last minute of that day will have 61 seconds, or 59 seconds. (Leap seconds can, in theory, be used to shorten a day too, although that hasn’t happened to date.) Rather than doing this, we applied a patch to the NTP server software on our internal Stratum 2 NTP servers to not set LI, and tell a small “lie” about the time, modulating this “lie” over a time window w before midnight: lie(t) = (1.0 – cos(pi * t / w)) / 2.0,” explains Pascoe.[…]
He siad “In an experiment, we performed two smears–one negative then one positive–and tested this setup using about 10,000 servers. We’d previously added monitoring to plot the skew between atomic time, our Stratum 2 servers and all those NTP clients, allowing us to constantly evaluate the performance of our time infrastructure. We were excited to see monitoring showing plots of those servers’ clocks tracking our model’s predictions, and that we were continuing to serve users’ requests without errors.”
“Following the successful test, we reconfigured all our production Stratum 2 NTP servers with details of the actual leap second, ready for New Year’s Eve, when they would automatically activate the smear for all production machines, without any further human intervention required. We had a “big red button” opt-out that allowed us to stop the smear in case anything went wrong.”