Sometimes web hosts have a bad day. We had one yesterday. We'd been seeing a series of abnormal issues with one of our UK servers, UKRS5, over the course of just over a week. The server would randomly reboot itself with no instruction from us: not from the command line, not from the ILO card, and not at PDU level.
We’re very experienced at handling server issues: hardware, software, upstream network, and so on. The hardware we choose is always the best, which minimises potential hardware issues (and maximises our credit card bills…), but it pays off in the stability and performance it offers.
In the UK specifically, we use HP and Supermicro exclusively. Around half our servers are HP Gen7 Xeon E3s and half are Supermicro Xeon E3s/E5s. All kitted out with tons of RAM, 4 x hard drives in RAID 10, RAID card, remote access card, etc. We really leave nothing to chance on hardware and we stock extensive spares just to be on the safe side.
So, back to UKRS5. The first port of call for a random reboot issue is the PDU, or power distribution unit. We use managed Raritan PDUs in our UK datacentre, giving full remote reboot capabilities and, basically, a fallback to reboot a server should the command line and the ILO / IPMI card be unavailable. Unlikely, but redundancy is key. The Raritans have a great reputation and have caused us no issues elsewhere, and we could see nothing in the logs to suggest this particular one was powering the server down.
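For the curious, part of this diagnosis means trawling the management card's event log for power events. As a minimal sketch of the idea (not our exact tooling): the function below filters log lines in the rough style of `ipmitool sel list` output for anything power-related. The sample entries are made up for illustration.

```python
# Illustrative sketch: filter IPMI System Event Log (SEL) lines for
# power-related entries. The sample lines below loosely mimic the
# format of `ipmitool sel list` output and are invented for this demo.

POWER_KEYWORDS = ("power", "ac lost", "psu")

def power_events(sel_lines):
    """Return only the SEL lines that mention a power-related event."""
    return [line for line in sel_lines
            if any(kw in line.lower() for kw in POWER_KEYWORDS)]

sample_sel = [
    "1 | 05/12/2014 | 02:13:44 | System Event | Timestamp Clock Sync",
    "2 | 05/12/2014 | 03:01:02 | Power Unit #0x01 | Power off/down | Asserted",
    "3 | 05/12/2014 | 03:01:05 | Power Unit #0x01 | Power off/down | Deasserted",
]

for event in power_events(sample_sel):
    print(event)
```

An empty result here (as we saw on the real kit) is what pushes the investigation on to the next component in the power chain.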
But just to be sure, we swapped the server to the B PDU in the rack (all racks run A + B PDUs).
With the PDU ruled out, the next thing to check was the PSU, or power supply unit, in the server itself. We swapped in a brand new PSU overnight last week.
Next up, we did a complete system replacement. This was actually the cause of the extended downtime incident, as we moved the 4 hard drives from the old UKRS5 server into a brand new UKRS5 server. We keep spare servers of exactly the same specification (motherboard, RAM configuration, network card) so there are no driver compatibility issues. Even with a clean shutdown of the old UKRS5, we had to do a filesystem check and RAID rebuild in the new server, and this took quite some time. This caused the downtime witnessed by some of you, and we made the mistake of underestimating how long it would take.
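To give a feel for why estimating a rebuild is tricky: Linux software RAID, for example, reports progress and a rolling time estimate in /proc/mdstat. Our servers use hardware RAID cards, so this is an analogy rather than our exact setup, and the sample text below is invented, but the sketch shows the kind of numbers an admin watches during a rebuild.

```python
import re

# Illustrative sketch: parse RAID rebuild progress in the style of
# Linux /proc/mdstat output. SAMPLE_MDSTAT is invented for this demo;
# the real file format comes from the md driver.

SAMPLE_MDSTAT = """\
md0 : active raid10 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      3906764800 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [=====>...............]  recovery = 27.5% (537182208/1953382400) finish=182.4min speed=129342K/sec
"""

def rebuild_status(mdstat_text):
    """Return (percent_done, minutes_remaining), or None if no rebuild is running."""
    m = re.search(r"recovery\s*=\s*([\d.]+)%.*finish=([\d.]+)min", mdstat_text)
    if not m:
        return None
    return float(m.group(1)), float(m.group(2))

print(rebuild_status(SAMPLE_MDSTAT))  # -> (27.5, 182.4)
```

The catch is that the "finish" figure is extrapolated from current speed, which drops whenever the array takes live I/O, which is exactly how an estimate ends up being optimistic.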
Fast forward to today. The server rebooted itself again. Huge frustration in our team, as other than underestimating the length of the RAID rebuild, we did everything by the book. Despite being sure of the health of the server, the PSU, the PDU and the UPS (battery backup), as a precaution we decided to bring in yet another PDU, connected to yet another power feed. So in effect, this is a C PDU connected to a third UPS (PDU A is connected to UPS A, PDU B is connected to UPS B, and so on). So far, things have been stable, but we continue to monitor the situation closely. I actually travelled to be onsite at the datacentre to oversee the build-out of the C phase / C PDU and the connecting of the server to it. I snapped a quick picture of power room A (which holds the UPS and power switchgear for the A side) while I was there. It’s only camera phone quality though.
So why this blog post? We got some very constructive feedback that customers want more information on what’s being done to rectify an issue and what happens behind the scenes. We’ve taken this on board and will post as much information as we have during any maintenance (scheduled or emergency) and do our level best to post a follow-up / post-mortem like this afterwards.