Downtime on a website can be embarrassing and costly. Just ask Google or any other major website. On February 24, 2009, a major outage in Google’s Gmail system made the national press. Downtime also costs money, because sales are lost while the website is unavailable. Website outages are something all CIOs and IT directors should try to avoid.
There are a number of steps CIOs and IT managers can take to prevent unplanned downtime. From network monitoring tools to change control processes, an IT operations department can achieve much lower downtime without buying expensive failover and clustering hardware. Hardware solutions definitely help, but it’s better to pick the cheaper low-hanging fruit first.
It is important to have monitoring software on network management systems to check the status of all devices in the enterprise. Commercial SNMP (Simple Network Management Protocol) monitoring tools such as HP OpenView and CA Unicenter automatically monitor all devices for problems and report any issues to staff quickly.
There are also free open source products that do this, such as Nagios. Without any monitoring tools, outages can last much longer because nobody in the IT organization knows that the website is down.
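To make the idea concrete, here is a minimal sketch of the kind of check a monitoring tool performs. It is a plain HTTP poll written in Python, not SNMP and not how Nagios or any commercial product is implemented. The function names `fetch_status` and `check_site` are made up for this illustration, and treating any HTTP 5xx response as "down" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url; raise OSError if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code


def check_site(url: str, fetch: Callable[[str], int] = fetch_status) -> str:
    """Classify a site as UP or DOWN (hypothetical rule: 5xx means down)."""
    try:
        status = fetch(url)
    except OSError as err:
        # URLError, timeouts and refused connections are all OSError subclasses.
        return f"DOWN: {url} unreachable ({err})"
    if status >= 500:
        return f"DOWN: {url} returned HTTP {status}"
    return f"UP: {url} returned HTTP {status}"
```

A scheduler such as cron could run `check_site` every minute and page the operations staff whenever the result starts with "DOWN". The `fetch` parameter exists only so the check can be exercised without a real network call.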
Best Practices for IT Maintenance
The fewer changes made to a production system, the lower the chance of server downtime. On the other hand, neglecting patches, such as those relating to security, may expose the website to other risks. IT directors and managers need to weigh the pros and cons when deciding how much change is necessary on a production site. As long as the production environment doesn’t resemble the lawless Wild West of the 1800s, the number of changes and improvements is probably not excessive.
All proposed changes to a production website must first be tested in a separate test environment, often called a QA (quality assurance) environment. The hardware in QA may not be exactly the same for cost reasons, but the QA environment should mirror the production environment as closely as possible.
Many site outages are caused by human error during undocumented maintenance work, and they become long outages because there is no good record of the changes made to the website. The best way to prevent this is to have a change control board, where all changes have to be documented and approved by the board. If an incident occurs anyway, the documentation will help staff resolve the outage quickly.
Many software ticketing systems such as Remedy can be customized to create change control requests as well. The advantage of using a ticketing system is that the software enforces all change control requirements (e.g., all tickets must contain a backout plan). The change control board can also announce approved changes so that everyone affected, such as operations staff or the network operations center (NOC), is notified.
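As an illustration of what "enforcing change control requirements" can look like, the Python sketch below validates a change request before it may go to the board. It is not Remedy's data model or API. The `ChangeRequest` fields and the `validation_errors` function are hypothetical, and only the backout plan requirement comes from the text above; the other rules are assumed for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    """A hypothetical change control ticket."""
    summary: str
    tested_in_qa: bool = False
    backout_plan: str = ""
    approved_by: str = ""


def validation_errors(request: ChangeRequest) -> List[str]:
    """Return the reasons a request may not proceed; empty means it may."""
    errors = []
    if not request.summary.strip():
        errors.append("summary is required")
    if not request.tested_in_qa:
        errors.append("change must be tested in QA first")
    if not request.backout_plan.strip():
        errors.append("backout plan is required")
    if not request.approved_by.strip():
        errors.append("change control board approval is required")
    return errors
```

A ticket with no backout plan, for example, comes back with "backout plan is required" in its error list and can be rejected automatically instead of relying on a reviewer to notice the gap.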
Scheduled Maintenance Times
It is important to schedule maintenance during hours of low usage on the website. For sites serving United States customers, the ideal time to start maintenance is 11 PM Eastern. That leaves around six hours of buffer, so maintenance that runs long still does not affect too many customers.
Many websites do recurring maintenance at the same time every week so that customers know when the website will be down. Even though IT workers may grumble at this suggestion, Saturday evenings are usually the best time of the week for maintenance.
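The weekly window described above can be worked out with a few lines of Python. This is a sketch under stated assumptions: all times are naive datetimes already expressed in US Eastern local time, daylight saving transitions are ignored, and the names `next_window` and `in_window` are invented for the example.

```python
from datetime import datetime, timedelta
from typing import Tuple

SATURDAY = 5                        # datetime.weekday(): Monday is 0
WINDOW_START_HOUR = 23              # 11 PM Eastern, as suggested above
WINDOW_LENGTH = timedelta(hours=6)  # the roughly six-hour buffer


def next_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) of the current or next Saturday 11 PM window."""
    days_ahead = (SATURDAY - now.weekday()) % 7
    start = (now + timedelta(days=days_ahead)).replace(
        hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0)
    # Early on Sunday we may still be inside the window that opened last night.
    previous_start = start - timedelta(days=7)
    if previous_start <= now < previous_start + WINDOW_LENGTH:
        start = previous_start
    return start, start + WINDOW_LENGTH


def in_window(now: datetime) -> bool:
    """True if maintenance work is allowed at this moment."""
    start, end = next_window(now)
    return start <= now < end
```

A deployment script could call `in_window(datetime.now())` and refuse to touch production outside the window, and `next_window` gives the dates to publish to customers in advance.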
Service Level Agreements and Escalation Paths
Service Level Agreements (SLAs) also give IT operations departments and clients clear expectations of how quickly downtime needs to be dealt with. For example, a typical SLA at a major bank may require the IT operations department to notify the Vice President of Online Banking if an outage lasts longer than two hours. This lets management monitor the situation closely and confirm that everything is being done to bring the website back up as quickly as possible, rather than finding out about the outage on the news.
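An escalation path like this can be written down as data, which makes it easy to review and automate. In the Python sketch below, only the two-hour notification to the Vice President of Online Banking comes from the example above. The earlier tiers, their thresholds, and the name `contacts_to_notify` are hypothetical.

```python
from datetime import timedelta
from typing import List, Sequence, Tuple

# (outage must last longer than this, who is notified)
ESCALATION_PATH: Sequence[Tuple[timedelta, str]] = (
    (timedelta(minutes=0), "On-call operations engineer"),     # hypothetical tier
    (timedelta(minutes=30), "IT operations manager"),          # hypothetical tier
    (timedelta(hours=2), "Vice President of Online Banking"),  # from the SLA example
)


def contacts_to_notify(
    outage_duration: timedelta,
    path: Sequence[Tuple[timedelta, str]] = ESCALATION_PATH,
) -> List[str]:
    """Return everyone who should have been notified by this point."""
    return [contact for threshold, contact in path
            if outage_duration > threshold]
```

Forty-five minutes into an outage, the function returns the on-call engineer and the operations manager. The vice president is added only once the outage has lasted longer than two hours.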
With processes in place to control the production environment, a website can significantly reduce server downtime. None of the above steps requires expensive high availability hardware, and many problems, such as unauthorized changes in the production IT environment, cannot be fixed by hardware alone. The first step for an IT director should be to fix processes before going on an IT hardware shopping spree.