Google’s Site Reliability Engineering (SRE) team for traffic and load balancing recently reported an overheating incident at one of its edge network points of presence, caused by crushed rack wheels.
The team was alerted that some of its data center machines were producing an abnormally high number of errors. In response, it immediately removed the affected machines from serving, eliminating errors that could have degraded service for customers.
“Why would a single rack be overheating to the point of CPU throttling when its neighbors were totally unaffected? What is it about the physical support for machines that would cause kernel errors? It didn’t add up.
“The SRE then sent the machine to repairs, which means that they filed a bug in our company-wide issue tracking system. In this case, the bug was sent to the on-site hardware operations and management team,” said Steve McGhee, Solutions Architect at Google Cloud, in the report.
The base system error log contained the following message:
Package temperature above threshold, CPU clock throttled (total events = 1596886)
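A message like this can be picked out of a log programmatically. The following is a minimal sketch in Python, assuming the line has the shape quoted above; the pattern and the function name `parse_throttle_event` are illustrative assumptions, not part of Google's tooling, and real kernel logs may carry extra prefixes such as timestamps or CPU numbers.

```python
import re

# Pattern modeled on the message quoted in the report (an assumption about
# the general shape). We search rather than match so that any prefix, such
# as a timestamp or CPU number, is tolerated.
THROTTLE_RE = re.compile(
    r"(?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)",
    re.IGNORECASE,
)


def parse_throttle_event(line):
    """Return (scope, total_events) for a throttle message, else None."""
    match = THROTTLE_RE.search(line)
    if match is None:
        return None
    return match.group("scope").capitalize(), int(match.group("events"))


sample = (
    "Package temperature above threshold, "
    "CPU clock throttled (total events = 1596886)"
)
print(parse_throttle_event(sample))  # ('Package', 1596886)
```

An event count in the millions, as in this incident, suggests sustained throttling rather than a brief spike.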
The hardware team then took over the investigation and identified the physical fault behind this chain of software events.
They eventually found that the wheels supporting the rack had been crushed under the weight of the fully loaded rack. As a result, the rack had tilted forward, disrupting the flow of liquid coolant and causing some CPUs to heat up to the point of being throttled.
The team repaired the wheels and returned the rack to proper alignment. According to the report, it also reviewed other existing racks at risk of similar failures and carried out replacements on all racks with the same issue, without impact on any Google Cloud customer.