Reducing 2AM headaches part 1: Standardize
One of the most effective ways to reduce firefighting in daily administration is to standardize your operating environments and automate deployment and configuration. A standard operating environment (SOE) that can support multiple use cases is a more robust and better tested platform to build upon. It provides a uniform environment for troubleshooting when something goes awry. Limiting the differences between systems to the changes that are strictly necessary also reduces the overall complexity of multi-tier environments. Using centralized automation tools to define, build, and deploy these standards streamlines the process even more.
Standardization is not a new concept, nor is it a disruptive way of thought. The industrial revolution owes part of its existence to standardization. Computers and gadgets get reviewed and reviled based on whether they adhere to standardized parts and ports. Yet for some reason, every environment I've worked in has had one-off, bespoke systems to one degree or another. Some had admins who thought it was easier, better, smarter, or more secure to build custom environments. Some had admins who wrote wrapper scripts around standard UNIX utilities because they didn't like the way a particular error was handled. The only real outcome was that the systems became harder to maintain and replicate. While a bespoke suit will fit better at first, you'd best be prepared to work hard to maintain your exact shape. Otherwise you're in for regular and expensive tailoring.
Building a common platform that can service all the needs of the environment will identify the moving parts and reduce the number of variables in play for the environment. During a hair-on-fire maintenance window at 2AM, this means there are fewer things to check and account for while troubleshooting. When standing up new applications, there are fewer potential avenues of investigation for requirements and pitfalls. As the auditors scour the environment for security and compliance, there is a smaller potential vulnerable surface to explain and manage. And you can finally use some of that accumulated vacation time since you are no longer the only team member who understands the arcane workings of a particular set of services.
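To make "the number of variables in play" a little more concrete, here is a minimal sketch that groups hosts by their installed package set and counts how many distinct configurations exist. The host names, the package lists, and the `group_by_configuration` helper are all made up for illustration; in a real environment the inventory would come from whatever tooling you already use to query your systems.

```python
from collections import defaultdict


def group_by_configuration(fleet):
    """Group hosts that share an identical package set.

    fleet: dict mapping host name -> iterable of installed package names.
    Returns a dict mapping a frozenset of packages -> sorted list of hosts.
    """
    groups = defaultdict(list)
    for host, packages in fleet.items():
        groups[frozenset(packages)].append(host)
    return {config: sorted(hosts) for config, hosts in groups.items()}


if __name__ == "__main__":
    # Illustrative inventory, not real data.
    fleet = {
        "web01": ["httpd", "openssh-server", "rsyslog"],
        "web02": ["httpd", "openssh-server", "rsyslog"],
        "web03": ["httpd", "openssh-server", "rsyslog", "gcc"],
    }
    groups = group_by_configuration(fleet)
    print(f"{len(fleet)} hosts, {len(groups)} distinct configurations")
    for config, hosts in groups.items():
        print(hosts, sorted(config))
```

Every extra group in that output is one more variant someone has to keep in their head at 2AM. Ideally the number of groups is close to the number of roles you actually run.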
Simply installing world+dog from the OS install disc is not a viable standardization strategy. That method creates a raft of potential security holes, introduces unnecessary failure vectors, and increases the amount of investigation needed to troubleshoot failures or possible interactions. And if that isn't enough, it increases the amount of time needed to do simple things like deploy a new system or apply vendor patches. Applying 300+MB of patches to every world+dog system during 'patch day' maintenance windows is a time-consuming process that increases the possibility of introducing new problems. No one wants maintenance windows that take vast amounts of time and wind up destabilizing production systems.
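As a rough illustration of the alternative, the sketch below compares a host's installed packages against a small SOE baseline and reports what falls outside it. The package names, the `SOE_BASELINE` set, and the `audit_host` helper are hypothetical; the installed list would normally come from your package manager's query output.

```python
# Hypothetical SOE baseline: the packages every system is expected to carry.
SOE_BASELINE = {"openssh-server", "rsyslog", "chrony", "sudo"}


def audit_host(installed, baseline=SOE_BASELINE):
    """Return (extras, missing) for a host relative to the baseline.

    installed: iterable of package names found on the host.
    """
    installed = set(installed)
    extras = sorted(installed - baseline)    # not part of the SOE
    missing = sorted(baseline - installed)   # SOE packages the host lacks
    return extras, missing


if __name__ == "__main__":
    # Illustrative host inventory, not real data.
    host_packages = ["openssh-server", "rsyslog", "sudo", "gcc", "xorg-x11-server"]
    extras, missing = audit_host(host_packages)
    print("outside baseline:", extras)
    print("missing from host:", missing)
```

Anything in the "outside baseline" list is a package that has to be patched, audited, and ruled out during troubleshooting without being part of what the system is supposed to do.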
Take the time to build a small, controlled SOE tailored to your needs. You will gain knowledge of the hard requirements of your applications, you will reduce the number of variables in play, and in the long run you will gain valuable time for more important work than figuring out which user Apache was installed under on this particular system. Make sure that you apply the same approach to building and managing the SOE itself. If you don't have a standard way to manage changes to the SOE, you will wind up back in the same boat. Next time we will look at automation tools that help you manage not only your systems but also the SOEs.