Server Downtime Preparedness
If you have been watching the news lately, you'll have seen the recent headlines about Amazon's cloud platform, Amazon Web Services (AWS), going down. This meant that not just Amazon devices like Alexa stopped responding; national brands that store their data on AWS servers were also down, like Netflix, Tinder, and Disney+. Even dog food bowls with automatic feeders weren't working properly. This article covers how to manage your company's server downtime and how to determine whether your company has a plan for server downtime preparedness.
Why does this happen? Business Insider explains “Netflix, for instance, uses AWS ‘for nearly all its computing and storage needs, including databases, analytics, recommendation engines, video transcoding, and more,’ according to Amazon. But the services that Amazon’s AWS customers each use vary widely, from remote database software used for tracking business to remote storage for files and much more.”
But how did it happen? In Amazon's own words: "The issue was caused by network congestion between parts of the AWS Backbone and a subset of Internet Service Providers, which was triggered by AWS traffic engineering, executed in response to congestion outside of our network.
“This traffic engineering incorrectly moved more traffic than expected to parts of the AWS Backbone that affected connectivity to a subset of Internet destinations. The issue has been resolved, and we do not expect a recurrence.”
In plainer terms, AWS responded to congestion outside its network, but its own traffic engineering mistakenly shifted more traffic than expected onto the AWS "Backbone," which degraded internet connectivity for several different companies.
How can a company prepare for server downtime?
First things first: what is server downtime? In the most basic sense, it means that a company's dedicated server (either physical or virtual) is not accessible via the internet. Most downtime can be categorized into three types of issues: physical (like cabling or electrical), software (like updates or bugs), or human (like hackers or data input errors). The best way to ensure constant and complete protection for your server is to make a plan.
Why? Well, Fortune.com makes important points as to why preparing for server downtime is an important part of our emerging dependence on cloud infrastructure: "Even as more core services weave these products into daily operations, software outages aren't going to magically stop. What companies are building now far exceeds what any single person, team, or organization can mentally model regarding how these systems are built, much less how they function under unanticipated pressure or otherwise unexpected conditions. On top of that, user and organizational demands are only accelerating the pace and complexity of software development."
Most companies will need to recognize that some downtime is inevitable, even with a top-tier tech environment. Instead, the focus should lie in creating backups and a downtime preparedness strategy. The last thing anyone wants is to be alerted that a whole website, or even an entire operation, is down, but it's best not to get caught unaware. Instead, use the time before any issues arise to garner buy-in and set up a proper plan. A downtime preparedness strategy will include planning, maintenance, preparation and training if it is going to be successful.
Downtime preparedness strategy: planning, maintenance, preparation and training
The first (and possibly best) way to protect your server is with monitoring and alerting systems. This includes alerts to your IT team about server load, disk space, hardware health, program load times, and software status. It's best to ensure there is a technician who is accountable for keeping the server in the best health possible at all times.
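As a rough illustration of what such monitoring looks like under the hood, here is a minimal sketch of a health check using only the Python standard library. The thresholds are illustrative assumptions, not universal recommendations, and a real setup would use a dedicated monitoring tool rather than a script like this.

```python
import os
import shutil

# Assumed thresholds for illustration only; tune these for your environment.
DISK_FREE_MIN_GB = 10      # alert if free disk space drops below this
LOAD_PER_CPU_MAX = 1.5     # alert if 1-minute load average per core exceeds this

def check_server_health(path="/"):
    """Return a list of alert messages; an empty list means all checks passed."""
    alerts = []

    # Disk space check
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < DISK_FREE_MIN_GB:
        alerts.append(f"Low disk space: {free_gb:.1f} GB free on {path}")

    # Server load check (os.getloadavg is POSIX-only)
    load_1min, _, _ = os.getloadavg()
    if load_1min / os.cpu_count() > LOAD_PER_CPU_MAX:
        alerts.append(f"High load average: {load_1min:.2f}")

    return alerts

for alert in check_server_health():
    print("ALERT:", alert)   # in practice, route to email, Slack, or a pager
```

In production, checks like these run on a schedule and feed an alerting system, so the accountable technician is notified before users notice a problem.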
The next is to make sure there is high server availability. This usually means having a primary and a secondary server so that in case of an event (like a traffic surge), the system can fail over to the other server when the first is overloaded.
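The failover idea can be sketched in a few lines: probe the primary's health endpoint, and fall back to the secondary if it doesn't respond. The URLs below are hypothetical placeholders, and real high-availability setups use load balancers or DNS failover rather than application code like this.

```python
import urllib.request

# Hypothetical endpoints for illustration; substitute your real health-check URLs.
SERVERS = [
    "https://primary.example.com/health",
    "https://secondary.example.com/health",
]

def first_healthy_server(servers, timeout=2):
    """Return the first server whose health endpoint answers with HTTP 200,
    or None if every server is unreachable."""
    for url in servers:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # connection refused, timeout, DNS failure, etc.
    return None
```

The key design point is that the check happens automatically and continuously, so traffic shifts to the secondary server without waiting for a human to notice the outage.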
Different from high availability is geographic redundancy, which means the backups you create are not stored in the same place as the originals (physically or virtually), since a single event could otherwise affect them both. If power outages or natural disasters occur, not storing data in the same place is the most logical safeguard. It is also advised that the farther the distance between the backups, the better! Though more of an investment, it ensures there is as little downtime as possible.
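Conceptually, geographic redundancy just means every backup is written to more than one independent location. The sketch below copies a backup file to multiple target directories; the `/mnt/backup-*` paths are placeholder assumptions standing in for mounts or object-storage buckets in different regions.

```python
import shutil
from pathlib import Path

# Placeholder locations; in production these would be storage in separate regions.
BACKUP_TARGETS = [Path("/mnt/backup-east"), Path("/mnt/backup-west")]

def replicate_backup(source, targets):
    """Copy one backup file to every target location; return the paths written."""
    written = []
    for target_dir in targets:
        target_dir.mkdir(parents=True, exist_ok=True)  # create location if missing
        written.append(shutil.copy2(source, target_dir))
    return written
```

The important property is that the copy to the second location happens as part of the backup routine itself, not as an occasional manual step, so no single site failure can take out both the data and its backup.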
Our last recommendation is to make sure there are fail-safes for the things that do need human interaction, like patches or updates. That means making sure you are able to revert code when necessary, and that changes are being approved by multiple parties. Think about enabling versioning, so that any changes made are clearly labeled with when they were made and by whom.
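To make the versioning-and-rollback idea concrete, here is a toy sketch of keeping every revision of a setting along with its author and timestamp, so any change can be traced and reverted. Real teams would use version control (like Git) and a change-approval workflow instead; this is only a minimal model of the concept.

```python
from datetime import datetime, timezone

class VersionedConfig:
    """Keep every revision of a value with its author and timestamp,
    so any change can be traced and rolled back."""

    def __init__(self, initial, author):
        self.history = [(initial, author, datetime.now(timezone.utc))]

    @property
    def current(self):
        return self.history[-1][0]

    def update(self, value, author):
        """Record a new revision; who changed it and when are kept automatically."""
        self.history.append((value, author, datetime.now(timezone.utc)))

    def rollback(self):
        """Revert to the previous revision (no-op if only one exists)."""
        if len(self.history) > 1:
            self.history.pop()
        return self.current
```

Because each entry records the author and time, the labeling the article recommends comes for free, and reverting a bad change is one call rather than an emergency reconstruction.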
If your company is struggling with downtime and would like an entirely free network assessment, KPInterface is available. Simply use the form below to get in contact with us, and we’re happy to review what changes you could make to improve your IT environment! Or take our assessment!