Building and maintaining technology systems that are always available to their users, as in 24 hours a day, 365 days per year, is not a simple endeavor. Engineering such high availability for a given system or application requires not only a mix of advanced, unfailingly reliable technologies and services, but also a particular culture within the organization responsible for keeping said system perpetually online.
For the systems that support the federal government’s most critical missions, high availability is an absolute necessity. These mission-critical systems must keep essential government services accessible and functioning properly and, in the rare and unfortunate event of a service disruption, must be returned to operational status as close to immediately as possible.
Mission-critical systems require high availability because they fundamentally support mission execution. Whether it’s an agent in the field relying on a system for real-time data to inform decision making or a citizen accessing benefits, these systems are critical to carrying out the mission of an agency.
For example, a disruption in the Internal Revenue Service’s tax processing system during tax season could delay the processing of tax returns, refunds, and payments. Such a delay could cause significant financial consequences for the government, businesses, and citizens alike.
Critical mission systems often must integrate modern components with legacy architectures while maintaining strict security and compliance with government regulations. They also typically handle immense amounts of data. These aspects contribute to the complexity of maintaining their never-fail status.
Technology is often cited as the key factor when discussing the feasibility and success of highly available systems, but the human component—or culture—is equally important.
Culture plays a central role in the success or failure of these systems because they rely on people and processes to drive and maintain their operational excellence. Operational excellence means consistently achieving optimal performance through meticulous planning, robust infrastructure, clear operational processes, and proactive IT management to maintain functionality under normal and adverse system conditions. Technology alone cannot guarantee continuous, always-available service without the strategic and collaborative efforts of the people developing and managing the systems in play. That is why a high availability culture matters.
A high availability culture tends to focus on preventing failure. Keeping failure in focus creates an environment where discussing and addressing it is not only accepted but encouraged. This openness is critical because it destigmatizes failure and reinforces a culture of improvement and learning.
Based on our experience in establishing high availability cultures across numerous government agencies, we have found some common cultural characteristics that promote success.