Many web hosting companies advertise uptime guarantees. They often claim 99.9% uptime, but that figure is really based on their “network,” not the actual servers. It means they had Internet connectivity 99.9% of the time, not that the servers stayed online. When we guarantee uptime, we mean it.
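To put a figure like 99.9% in perspective, the short Python sketch below converts an uptime percentage into the downtime it still allows. The function name `allowed_downtime_hours`, the 30-day month, and the 365-day year are assumptions made for illustration; they are not the terms of any particular guarantee.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted by an uptime percentage over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


HOURS_PER_MONTH = 30 * 24   # assumption: a 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: a non-leap year

if __name__ == "__main__":
    for label, hours in (("month", HOURS_PER_MONTH), ("year", HOURS_PER_YEAR)):
        budget = allowed_downtime_hours(99.9, hours)
        print(f"99.9% uptime allows about {budget:.2f} hours of downtime per {label}")
```

Under these assumptions, 99.9% still leaves room for roughly 43 minutes of downtime per month, or just under nine hours per year, which is why it matters whether the number covers the servers or only the network.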
One of our major priorities is to offer extremely reliable infrastructure to our users. We know firsthand how much just one hour of downtime can affect an online business.
To achieve this, we have invested in top-of-the-line hardware and expensive software, and we have chosen a data center partner that shares the same goal. This was a tall order, and it is something that has to be constantly tweaked and monitored.
But once all of this is in place, we have to carefully monitor it so we can see the actual results.
To do this, we have several levels of monitoring:
- NOC (network operations center) Monitoring – To start, we have a dedicated 24x7x365 NOC monitoring team that watches our servers AND network constantly. Because monitoring is their only job, this also includes proactive monitoring, where they watch inbound network traffic to try to avoid an outage (such as when a DDoS attack may be starting). Once an issue is identified, they are able to begin working toward a resolution immediately. Most often, issues are identified long before they cause a problem that is visible to our customers. If an issue does create a service disruption, we immediately make every possible effort to restore service. This even includes completely replacing server hardware at a moment’s notice (especially if the situation is severe).
- Third-Party Server Uptime Monitoring – We also use Pingdom as a third-party monitoring service. We set up a specific location on each server that is constantly pinged to see if it is available. When a ping goes unanswered, we immediately get a notification (including on my personal cell phone). Some prefer to wait a few minutes to see if it’s just a small network blip, as it sometimes is, but we feel it’s best to begin an investigation immediately to minimize downtime in case it is a real issue.
- Server Response Monitoring – Another key metric many web hosts fail to monitor (or prefer not to) is server response time. For us, it’s very important to monitor how long it takes a server to respond to a request. If a server’s response time begins to trend upward, meaning it’s taking longer and longer to respond, an outage could be coming. We watch this closely to spot any underlying issue that may be approaching. Because our Web Hosting and Business Building packages run on “shared” infrastructure, we also have to constantly evaluate server resources to see if more need to be made available.
By closely analyzing these three monitoring layers, we are able to respond to issues both proactively and reactively. The majority of the time, we are also able to mitigate issues well before they cause any downtime for our customers.
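As an illustration of the response-time idea described under Server Response Monitoring, here is a minimal Python sketch that flags a server whose response times are trending upward. It fits a least-squares line through recent samples and compares the slope to a threshold. The function names, the sample values, and the 5 ms-per-check threshold are all hypothetical; this is one simple way to detect a trend, not a description of our actual monitoring tooling.

```python
from statistics import mean


def trend_slope(samples_ms: list[float]) -> float:
    """Least-squares slope of response time, in ms per check interval."""
    n = len(samples_ms)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)                      # check index: 0, 1, 2, ...
    x_bar = mean(xs)
    y_bar = mean(samples_ms)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples_ms))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def is_trending_up(samples_ms: list[float],
                   threshold_ms_per_check: float = 5.0) -> bool:
    """True when response time grows faster than the (assumed) threshold."""
    return trend_slope(samples_ms) > threshold_ms_per_check


if __name__ == "__main__":
    steady = [120, 118, 122, 119, 121]   # hypothetical healthy server
    rising = [120, 135, 150, 170, 190]   # hypothetical server slowing down
    print("steady server trending up:", is_trending_up(steady))
    print("rising server trending up:", is_trending_up(rising))
```

A slope is used rather than a single slow reading so that one momentary spike does not raise a flag, while a sustained climb does. In practice the window size and threshold would need tuning per server.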
The other part, beyond simply “monitoring for uptime,” is incident handling. It would be foolish of us to say we never have any issues because our infrastructure is “perfect.” Unfortunately, when dealing with servers and computers, there is no such thing. Even a server with the best hardware and the most expensive software, housed in the world’s best data center, is not exempt from having issues.
In short, things are still going to come up, but the important part of continuing to achieve high uptime is how we deal with them. We fend off DDoS attacks on a weekly basis, and sometimes we replace every piece of hardware in one server or add resources to another. Multiple updates have also been applied along the way.
With our three main levels of monitoring in place, we promptly take action to begin resolving incidents when they arise. As mentioned above, our entire network is monitored constantly, meaning we are aware of many issues before they ever cause an outage, and when an outage does occur, we know about it instantly.
Incidents that aren’t immediately causing an outage are handled promptly, with the work scheduled during non-peak hours. Incidents that cause a complete outage mean “all hands on deck”: they take priority over everything else. Even Krepling’s CEO jumps right into action when such a case arises. We do whatever it takes to bring a service back online.
Regardless of the size of the incident, we also believe in transparency and in keeping information flowing to our customers. The Network Status page inside our customer interface lets you see exactly what is going on and how it is being handled.