Gaining an Edge with AIOps

The CrowdStrike outage highlighted the fragility of today’s IT systems. This incident is far from an isolated case. Research has consistently shown that unplanned system outages can have significant impacts on critical infrastructure sectors. The interconnectedness of these systems means that a failure in one area can affect multiple critical services simultaneously. As early as 2012, MIT Technology Review noted that the most highly interconnected systems can give rise to catastrophic domino effects.²

The proliferation of AI agents capable of performing tasks with minimal human oversight has the potential to fundamentally change how IT systems are managed and lessen the significant cognitive burden IT professionals carry. The introduction of agents also comes with a set of complex questions. Which tasks within IT environments should be automated? How do enterprises ensure human IT professionals can focus on the critical tasks they alone can handle? What role should humans play in overseeing the work of agents?

In this article, we use the Endsley Situational Awareness Model—a three-level theoretical model of situational awareness—to deconstruct situational awareness and decision making in IT environments, provide prescriptive guidance on how to apply artificial intelligence (AI) for IT Operations (AIOps) to address many of the challenges human IT operators face, and identify necessary cognitive supports for times when humans are required to coordinate with and oversee AI agents.

Perception: Cutting Through the Noise

According to the Endsley model, the first level of situational awareness entails perceiving the status, attributes, and dynamics of relevant elements in a given environment. Processing and recalling vast amounts of information can overtax human memory and attention and lead to burnout and critical errors. This cognitive overload can manifest in two ways: focusing on irrelevant data points (errors of commission) or overlooking crucial information (errors of omission). Both scenarios compromise perception and degrade the ability to make informed decisions, which is why it comes as no surprise that this is the stage where most errors occur—70 to 80 percent of situational awareness mistakes happen during perception.³

The impact of information overload on perception is not just theoretical. Research shows that information overload is associated with serious performance losses, especially in connection with disruptions and interruptions.⁴ In IT operations, where quick responses to system alerts and anomalies are crucial, such performance losses can have far-reaching consequences, such as:

Delayed Detection of Security Breaches: An overwhelmed IT team might miss or delay responding to critical security alerts, potentially allowing cybercriminals more time to exfiltrate sensitive data or spread malware across the network.
Extended Service Outages: If an operator fails to notice or properly interpret system performance degradation alerts, it could lead to full-scale service outages.
Cascading Infrastructure Failures: In interconnected systems, a small issue left unaddressed due to information overload can cascade into larger, systemwide failures. For instance, an overlooked server overload could lead to a data center shutdown.
Critical Service Disruptions: In sectors like healthcare or emergency services, delays in addressing IT issues could directly impact life-critical systems.

Figure 1: Endsley Model

Applying AIOps to Enhance Perception

AIOps addresses the perception-related challenges inherent in these tasks by filtering vast amounts of data to highlight only the most relevant information. Presenting human operators with actionable insights will result in reduced cognitive load. AIOps can go one step further by identifying subtle patterns or anomalies that might escape human attention, thereby enhancing perception and reducing the risk that an important signal will be overlooked. In addition, AIOps can provide or enhance several critical tasks:

Real-Time Data Monitoring: AI systems continuously monitor network traffic, system logs, and user activities to analyze the rapidly changing data within an IT environment. This real-time data monitoring helps identify issues as they occur and ensure teams stay aware of the evolving circumstances.
Data Collection Automation: Generative AI (GenAI) and machine learning can automate the process of gathering contextually relevant data alongside information that highlights their dynamics and relationships. Retrieving and presenting this data to operators ensures that teams have immediate access to critical environmental elements without manual searching.
Anomaly and Pattern Detection: Applying machine learning models to specific data can identify patterns or deviations that might be too complex or subtle for humans to notice, making it easier to detect potential security threats or system issues as they emerge.
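As a rough illustration of the anomaly detection task above, the following minimal sketch flags a metric sample when it deviates sharply from the samples that immediately precede it. It uses a simple rolling z-score in place of a trained machine learning model. The function name, the window size, the threshold, and the sample CPU readings are all assumptions chosen for illustration, not values taken from any specific AIOps product.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history has no spread, so treat any change as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        # z-score: how many standard deviations the sample sits from the recent mean.
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization readings (percent), one per minute.
cpu_percent = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(find_anomalies(cpu_percent))  # → [7]
```

Only the spike at index 7 is surfaced to the operator; the nine unremarkable readings are filtered out, which is the noise reduction the perception stage calls for.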

Comprehension: Making Sense of the Data

The second level of the Endsley model addresses the importance of understanding the significance of the perceived elements in relation to operational goals. This is where context becomes crucial, and patterns start to emerge.

AIOps supports comprehension by correlating data from multiple sources to provide a holistic view of the IT environment, offering context-aware insights that explain the potential impact of observed phenomena and reducing the time required to diagnose issues by automatically identifying root causes. With improved comprehension, operators at all experience levels can make sense of the environment and make more informed decisions about how to respond to various situations:

Data Analysis and Interpretation: Use a combination of GenAI and traditional machine learning models to analyze collected data to identify correlations and root causes of detected anomalies or performance issues.
Contextual Understanding: Integrate information from various sources, like traditional machine learning models, to give meaning to the signals present in the data while providing a nuanced understanding of how different elements in the IT environment interact and impact each other.
Prioritization of Concerns: By understanding both the current situational context and the significance of the perceived elements, AI can help prioritize and focus the operators on the most crucial issues first.
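One way to picture correlation and prioritization working together is sketched below. Alerts are grouped by the upstream component they share, and each group is ranked by its worst severity multiplied by the number of affected services. The `Alert` class, the `DEPENDS_ON` map, the service names, the 1 to 5 severity scale, and the scoring rule are all hypothetical; a real deployment would derive the topology from its own inventory and would likely use a richer scoring model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: int  # 1 (informational) to 5 (critical); an assumed scale


# Hypothetical dependency map: service -> upstream component it relies on.
DEPENDS_ON = {
    "checkout-api": "orders-db",
    "billing-worker": "orders-db",
    "search-ui": "search-index",
}


def prioritize(alerts):
    """Group alerts by shared upstream component and rank the groups."""
    groups = defaultdict(list)
    for alert in alerts:
        # Services with no known upstream are treated as their own root.
        root = DEPENDS_ON.get(alert.service, alert.service)
        groups[root].append(alert)

    ranked = []
    for root, members in groups.items():
        affected = len({a.service for a in members})
        worst = max(a.severity for a in members)
        ranked.append((root, worst * affected))
    # Highest score first, so the operator sees the broadest, most severe issue on top.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


alerts = [
    Alert("checkout-api", 4),
    Alert("billing-worker", 3),
    Alert("search-ui", 5),
]
print(prioritize(alerts))  # → [('orders-db', 8), ('search-index', 5)]
```

In this example the single critical search alert ranks below the database, because two services point back to the same probable root cause.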

Projection: Anticipating Future States

The highest level of situational awareness involves projecting future actions and states of elements in the environment. In IT operations, this translates to predicting potential issues before they occur and understanding the likely outcomes of different actions.

AIOps enhances projection capabilities by using machine learning models to forecast system behavior and potential failures, simulating the impact of proposed changes before implementation, and providing decision support by evaluating multiple courses of action. By improving projection capabilities, AIOps empowers operators to take proactive measures, preventing issues before they impact critical services:

Predictive Analytics: Traditional machine learning models for trend analysis and forecasting can model future system behaviors, such as slowing system performance, potential failures resulting from hardware degradation, or increased demand for resources.
Capacity Planning: By predicting future resource needs, AI aids in strategic planning to scale infrastructure proactively.
Evaluation of Potential Outcomes: As decisions need to be made, AIOps can help analyze the impacts of various actions. Assessing potential impacts to performance, security posture, and resiliency can all help operators make better decisions.
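The trend analysis and capacity planning ideas above can be sketched with an ordinary least-squares line fitted to recent usage, then extended forward to estimate when a resource fills up. This is a deliberately simple stand-in for the forecasting models the article refers to. The function names, the daily sampling interval, and the disk figures are assumptions made for illustration, and the sketch expects at least two samples.

```python
def fit_trend(values):
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def days_until_full(values, capacity):
    """Estimate days from the latest sample until the trend line reaches capacity."""
    slope, intercept = fit_trend(values)
    if slope <= 0:
        # Usage is flat or shrinking, so the trend never reaches capacity.
        return None
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (len(values) - 1))


# Hypothetical daily disk usage in GB on a 600 GB volume.
disk_used_gb = [500, 510, 520, 530, 540]
print(days_until_full(disk_used_gb, capacity=600))  # → 6.0
```

A projection like this gives operators lead time to scale infrastructure before the limit is reached, rather than reacting to a full-disk alert.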

Building Trust Toward a More Automated Future

While decision authority within IT environments currently resides with human operators, this paradigm may not remain feasible as the size and complexity of our IT landscape continue to grow. To keep pace with this accelerating growth, augmenting the work of human operators with autonomous AI agents will become unavoidable. Agent-based architectures, which handle non-deterministic scenarios by integrating specialized AI agents capable of perceiving their environment, making decisions, and taking actions to achieve specific goals, are poised to become more prevalent across industry.

The Endsley Situational Awareness model provides insights on how best to enable agent autonomy. As organizations begin to appreciate the immense capabilities of these AI systems, the role of humans in the decision-making process will undergo a significant transformation.

This transition isn’t just a matter of technological capability but also of necessity. According to an Oracle study, 70% of business leaders would trust a robot more than a human to make financial decisions. This startling statistic shows the growing recognition that AI systems may be better equipped to handle certain complex decision-making tasks, particularly in data-rich environments like IT operations.

Initially, humans will remain in the loop, actively overseeing and guiding the actions of AI agents to ensure accuracy and alignment with predefined goals and ethical standards. AI agents will become responsible for perceiving the environment, analyzing the data to determine its significance, and providing possible projections about what may happen next, but humans will retain the decision authority. Having teams of agents create recommendations for human consideration is a crucial first step in building trust and confidence in the AI’s capabilities. It also allows human operators to intervene when necessary.

As the AI agents work together and demonstrate increasing reliability and effectiveness, humans will transition to on-the-loop roles. In this capacity, they will provide oversight and intervention only when required. The same teams of agents will still create recommendations structured by perception, comprehension, and projection; however, another agent will oversee these insights and make decisions about which course of action to take. Maintaining the same human-centric situational awareness structure will be critical for building trust in the AI agent’s decision making because it allows humans to better understand the “thinking” of the AI agent.

Trust is not a binary state but a continuum that develops over time through consistent, reliable performance. Research by Zhang, Liao, and Bellamy (2020) has shown that providing explanations for AI recommendations significantly improves accuracy and trust calibration. By further structuring the decision-making thought process in human terms, the development of trust can accelerate, and human oversight becomes natural rather than investigative.

From On the Loop to Out of the Loop

As AIOps systems evolve, the relationship between human operators and AI will transform, with humans finding themselves out-of-the-loop and entrusting AI agents with full autonomy for many decision-making processes. This evolutionary process will require a new framework for gradually delegating decisions to the AI systems and placing human oversight at the macro level. This new framework enabling agentic autonomy should be comprehensive and adaptable, and it must consider several factors:

Potential Impact and Risk: High-stakes decisions with far-reaching consequences may require a longer timeline for autonomy than routine, low-risk decisions. These decisions will require a high degree of trust that can only be built through consistent, reliable performance. Additionally, ceding decision authority when lives are on the line may require acknowledgment that the AI’s capabilities exceed our own human performance.
Decision Latency: In scenarios where split-second decisions are crucial, AI systems may be better equipped to respond. However, for decisions that allow for more deliberation, a balanced approach involving a human might be more appropriate. Human reaction time has limitations, which machine intelligence can surpass with ease.
Model Maturity: More mature, well-tested models might be trusted with greater autonomy, while newer or less proven models may require closer human supervision. As models continue to advance, reinstating oversight for certain processes may become necessary until trust in the new model can be reestablished. Less predictable models or models without adequate red teaming to root out any potential deceptive, manipulative, or malicious behaviors should be treated skeptically as they are introduced.
Responsible AI Risk Factors: Ethical considerations, potential biases, and legal implications of AI decisions should be integral to the framework. The potential harm to vulnerable or marginalized groups should be considered before allowing high degrees of autonomy in related decisions.
Division of Labor: The framework should clearly define the roles of AI and humans in the decision-making process. Many complex decisions can be decomposed into smaller, more discrete decisions. This may enable offloading much of the load to AI without fully ceding decision authority.
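To make the factors above more concrete, the sketch below encodes them as a small policy function that assigns an oversight mode to a class of decisions. Everything in it is hypothetical: the `DecisionProfile` fields, the 1 to 5 impact scale, the thresholds, and the order in which the factors are checked are assumptions for illustration only. An actual framework would be set by each organization's governance process and would weigh these factors with far more nuance.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    IN_THE_LOOP = "human approves each decision"
    ON_THE_LOOP = "agent decides, human can intervene"
    OUT_OF_THE_LOOP = "agent decides, human reviews at the macro level"


@dataclass
class DecisionProfile:
    impact: int            # 1 (routine) to 5 (life-critical); an assumed scale
    needs_subsecond: bool  # True when a response cannot wait for a human
    model_proven: bool     # True once the model has a tested track record
    ethical_flags: bool    # True when bias, legal, or vulnerable-group risk applies


def oversight_for(profile):
    """Map a decision profile to an oversight mode using illustrative thresholds."""
    # Responsible AI risk and model maturity: unresolved concerns keep a human approving.
    if profile.ethical_flags or not profile.model_proven:
        return Oversight.IN_THE_LOOP
    # Impact and latency: high-stakes decisions stay with humans unless speed rules that out.
    if profile.impact >= 4:
        if profile.needs_subsecond:
            return Oversight.ON_THE_LOOP
        return Oversight.IN_THE_LOOP
    if profile.impact >= 2:
        return Oversight.ON_THE_LOOP
    # Routine, low-risk decisions from a proven model can be fully delegated.
    return Oversight.OUT_OF_THE_LOOP


# Hypothetical example: rotating logs on a non-critical host.
log_rotation = DecisionProfile(impact=1, needs_subsecond=False,
                               model_proven=True, ethical_flags=False)
print(oversight_for(log_rotation).name)  # → OUT_OF_THE_LOOP
```

Because the policy is evaluated per decision class, a complex decision can be decomposed so that its routine parts are delegated while its high-impact parts keep a human approver, which mirrors the division of labor described above.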

While operating an IT environment will still require thousands of decisions each day, during this final stage, AI agents will make most of those decisions independently on our behalf. The transition to this level of autonomy will be gradual and carefully monitored, ensuring that the AI systems are fully capable of handling the complexities and nuances of their designated tasks. Agent-based architectures represent a paradigm shift in the way organizations approach decision making and operations. The transition from human-in-the-loop to human-on-the-loop, and eventually to human-out-of-the-loop, will require careful planning, continuous evaluation, and a commitment to ethical AI practices. As trust in AI systems grows, so will the number of autonomous AI agents we rely on each day.

Powering Mission Success with AIOps

As the complexity and scale of IT operations grow, so does the risk of inaction. AIOps offers the vital support needed to transform overwhelmed operations into efficient, proactive systems capable of withstanding even the most daunting challenges. Every moment of delay increases vulnerability to costly failures, missed opportunities, and disrupted human lives.

By leveraging AIOps, organizations can significantly reduce downtime, optimize resource allocation, and enhance overall system performance. These benefits directly translate to improved mission outcomes, whether in government agencies, healthcare systems, or critical infrastructure.

Rather than replace human insight, AIOps empowers it. AIOps creates a strong partnership between human expertise and AI, allowing teams to make faster, more informed decisions while reducing the cognitive strain of managing countless data points. By using the Endsley model to deconstruct the human decision-making process, organizations have a blueprint for meshing together AI with human insights to form this human-AI partnership. The future of IT operations lies in evolving this partnership, with AI agents handling a number of routine tasks and decisions and allowing operators to focus on critical mission objectives.

Enhancing human decision making is a necessity. To meet the growing challenges of our day and ensure the systems that power our nation remain resilient, we must learn to rely on AIOps solutions and agents. Building trust will take time and the careful, gradual relinquishing of control. By understanding ourselves and our own cognition, we begin the journey of shaping this future partnership with AI and take the first step to overcoming the challenges of IT operations today.

Key Takeaways

The complexity and scale of modern enterprise IT environments can overwhelm human decision-making abilities and lead to delayed actions, missed insights, or incorrect decisions that ultimately prolong issues or have unintended downstream impacts on these critical systems.
We can enable humans to be more effective decision makers in these fast-paced, information-rich environments by using what we know about human psychology to re-envision the processes and technology solutions that underpin IT operations.
Agent-based architectures that embed teams of fit-for-purpose AI agents in support of human decision makers will become increasingly common, and as trust in their capabilities builds, human IT operators will shift from being in-the-loop to on-the-loop and eventually out-of-the-loop in many circumstances.

Meet the Authors

Jesse Russell is a research development leader in Booz Allen's Chief Technology Office. He specializes in intelligent automation, distributed systems, cloud architecture, and DevSecOps methodologies.

Maryrose Blank, Psy.D., provides expertise in psychology, health, and cognitive performance to the Booz Allen Human Performance team.

Alexa Hoffman is a generative AI strategist and product management leader with Booz Allen's Chief Technology Office.

Jennifer Sheppard is a chief engineer on Booz Allen's Civil Tech team, with expertise in enhancing large-scale software delivery through reuse, improving developer experience, and integrating AI throughout the software development lifecycle.

References

¹.
²“The Perils of Highly Interconnected Systems,” MIT Technology Review, 2012.
³.
⁴“Dealing with Information Overload: A Comprehensive Review,” National Library of Medicine, 2023.

VELOCITY MAGAZINE

Booz Allen's annual publication dissecting issues at the center of mission and innovation.
