Over the past few days, several of you have told us that your recent experience with cPanel Linux hosting has been less than ideal, mostly because websites were taking a long time to load. Naturally, this concerned us greatly. This week, we undertook a comprehensive review of the Linux hosting setup to get to the bottom of the issue.
In the course of our investigation, we had to take some of the servers offline today (Tuesday) for brief periods, ranging from 5 minutes on a couple of servers to 45 minutes on one server, during which time your customers’ websites would have been unreachable. Our analysis revealed that there were indeed some issues with the infrastructure. For your reference, here is a recap of the issues:
- We use the CloudLinux kernel, which ensures fair distribution of resources across the accounts on each server. A bug in the CloudLinux kernel caused it to report incorrect values for the %iowait and %idle parameters.
- Due to this bug, the load average on the servers was being reported incorrectly, which is why it did not trigger our alerting systems.
- At the same time, we also disabled a few sites which seemed to be consuming excessive server resources – however, this didn’t help.
- We realized quite late that our storage devices (SANs) were running out of I/O capacity and were saturated. We should have caught this earlier, but didn’t, because the numbers reported by our hosting servers indicated otherwise.
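To illustrate the kind of check that does not depend on CPU-side counters such as %iowait, here is a minimal Python sketch that derives per-device utilisation from two samples of Linux's `/proc/diskstats`, in the same way `iostat` computes its `%util` column. This is not our actual monitoring code: the function names and the 90% threshold are hypothetical, and on SAN-backed devices a high utilisation figure is a warning sign rather than proof of saturation.

```python
def parse_io_ticks(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O.

    Expects text in /proc/diskstats format; the 13th column
    (index 12 after splitting) is the time spent doing I/O, in ms.
    """
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        ticks[fields[2]] = int(fields[12])
    return ticks


def device_util_percent(before, after, elapsed_ms):
    """Percentage of the sampling interval each device was busy."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return {
        dev: min(100.0, 100.0 * (after[dev] - before[dev]) / elapsed_ms)
        for dev in after
        if dev in before
    }


def saturated_devices(util, threshold=90.0):
    """Devices at or above the (hypothetical) alert threshold."""
    return sorted(dev for dev, pct in util.items() if pct >= threshold)


# Synthetic samples taken one second apart (made-up numbers).
sample_1 = "8 0 sda 100 0 800 50 200 0 1600 70 0 1000 120\n"
sample_2 = "8 0 sda 150 0 1200 80 260 0 2100 110 0 1950 240\n"

util = device_util_percent(
    parse_io_ticks(sample_1), parse_io_ticks(sample_2), elapsed_ms=1000
)
print(util, saturated_devices(util))
```

On a live server the two samples would come from reading `/proc/diskstats` twice with a known delay in between. Alerting on this figure alongside load average means a misreported %iowait alone cannot hide a busy storage device.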
It was a blunder on our part that we missed the %ioutil trends in our graphs 🙁 For this we unreservedly apologise, and we’ll ensure that this never recurs. This exercise did enable us to put together a plan of action to fix these issues permanently. What follows is a preview of this plan. Our product engineering team will actively comment on this post as we go forward with this process, to keep you up to speed:
- We were already working on building a new storage architecture which heavily utilizes SSDs to overcome I/O bottlenecks. This was on the cards for the upcoming weeks, but we are expediting it to begin in the next 2-3 days. This might require downtime on the servers, and we’ll make sure to let you know about it in advance.
- We’re communicating with the folks at CloudLinux to ensure that the bug causing the faulty stats reporting is fixed.
- Meanwhile, we’ve increased the memory (RAM) on all our hosting servers to improve MySQL performance by allowing it to cache more aggressively.
We’re confident that, with the steps we’re taking, the issues that customers have been facing will be fixed permanently. We recommend that you follow the comments on this page for our updates as we go forward.