News & Events

CLOUDFLARE SUFFERS OUTAGE FROM UNPLUGGING REDUNDANT CABLES

DNS and CDN service provider Cloudflare has reported that its Wednesday outage was caused by the disconnection of multiple, redundant fibre connections from one of the company’s core data centers.

The unplugged cables, which provide all external connectivity to other Cloudflare data centers, were mounted on a patch panel housed in a cabinet marked for equipment decommissioning. During planned maintenance, a technician instructed to remove all unused equipment from the cabinet also disconnected the cables on that patch panel.

The disconnection caused an immediate outage that began at 15:31 UTC and lasted until 19:52 UTC. Because the affected data center hosts Cloudflare’s main control plane and database, the company’s Dashboard and API became unavailable at once.

“During the outage period, we worked simultaneously to cut over to our disaster recovery core data center and restore connectivity. Dozens of engineers worked in two virtual war rooms, as Cloudflare is mostly working remotely because of the COVID-19 emergency. One room dedicated to restoring connectivity, the other to disaster recovery failover,” said John Graham-Cumming, CTO at Cloudflare.

“We take this incident very seriously, and recognize the magnitude of impact it had. We have identified several steps we can take to address the risk of these sorts of problems from recurring in the future, and we plan to start working on these matters immediately.”

Rather than blame the technicians, the company attributed the outage to the unclear instructions they were given and to poor labeling of cables and panels. The poor labeling also slowed restoration, because the critical cables providing external connectivity were hard to identify.

Moving forward, Cloudflare said it plans to remove the single physical point of failure for external connectivity by spreading those connections across multiple parts of the facility, label cables and panels so that anyone working to remediate a problem can identify them quickly, and give technicians more detailed instructions for retiring hardware.