The Case for Eyeballs-Near-LLM Usage

Discover why agentic LLMs are incredibly powerful, and why maintaining human oversight and applying them with intention is critical to scaling insight safely and reliably. 

At Arctic Wolf, we believe the future of cybersecurity is built on AI guided by human expertise. Staying at the forefront of security operations means not just adopting new technology but deeply understanding how and where it should be applied. In this blog series, Ken Ray, SVP and Chief Innovation Architect at Arctic Wolf, along with other Arctic Wolf experts and leaders from the wider security community, share their perspectives on large language models (LLMs), what they are good at, where they fall short, and what it really takes to use them responsibly in real-world security operations. 

These posts offer a candid look at technical perspectives from across the industry and contribute to an open conversation about the future of AI in cybersecurity. In this post, they share their thoughts on why agentic LLMs are incredibly powerful, and why maintaining human oversight and applying them with intention is critical to scaling insight safely and reliably. 

As organizations continue to explore the transformative potential of large language models (LLMs) — including in areas like autonomous workflows and strategic decision making — important nuances are being uncovered in their capabilities.  

Agentic LLMs [1] are proving tremendously useful for automation, demonstrating great strength in understanding open-ended problems, extrapolating knowledge found on the internet, and dynamically applying it to new contexts. This same strength can introduce challenges when precision, accuracy, and completeness are paramount, especially in scenarios where errors may be hard to detect.

With regard to LLMs, organizations should be deliberate about when and how they pair an LLM’s fresh insight or nuanced understanding with acting on that knowledge. Once organizations know what needs to be automated, more deterministic tools are often better suited for carrying it out reliably.

Challenges with LLMs 

Regression Testing 

It is telling that there is a large and growing body of research and development focused on regression testing — a form of quality assurance used to determine if code updates have caused a piece of software to malfunction or provide a poorer output. Experts in the field are continually seeking new methods to separate genuine understanding — such as depth of reasoning, nuance, and practical application — from superficial performance or pattern recognition. This ongoing effort reflects the broader challenge of distinguishing true capability and adaptability from memorization or mechanical repetition in both software engineering and AI evaluation. Some of those efforts are reflected below:  

1. Agentic Metrics via Azure AI Evaluation Library

Microsoft’s Azure AI Evaluation library introduces specialized metrics for agentic workflows: 

  • Task Adherence: Measures how well the agent’s output aligns with the original goal 
  • Tool Call Accuracy: Evaluates whether the agent used the correct tools appropriately 
  • Intent Resolution: Assesses if the agent understood and followed the user’s intent  
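
These checks can be approximated even without a particular framework. The sketch below is a minimal, library-agnostic illustration of what such metrics measure; the trace structure, tool names, and keyword heuristic are assumptions for illustration, not the Azure AI Evaluation API.

```python
# Minimal, library-agnostic sketch of the kinds of checks these agentic
# metrics automate. The trace structure, allowed tools, and keyword proxy
# are illustrative assumptions, not the Azure AI Evaluation API.

ALLOWED_TOOLS = {"lookup_asset", "query_logs"}   # tools the agent may call
REQUIRED_KEYWORDS = {"asset", "owner"}           # crude proxy for the task goal

def tool_call_accuracy(trace: dict) -> float:
    """Fraction of tool calls that used an allowed tool with non-empty arguments."""
    calls = trace.get("tool_calls", [])
    if not calls:
        return 0.0
    ok = sum(1 for c in calls if c["name"] in ALLOWED_TOOLS and c.get("args"))
    return ok / len(calls)

def task_adherence(trace: dict) -> float:
    """Very rough keyword-overlap proxy for 'did the answer address the goal'."""
    answer = trace.get("final_answer", "").lower()
    hits = sum(1 for kw in REQUIRED_KEYWORDS if kw in answer)
    return hits / len(REQUIRED_KEYWORDS)

trace = {
    "goal": "Identify the owner of the affected asset.",
    "tool_calls": [{"name": "lookup_asset", "args": {"id": "srv-042"}}],
    "final_answer": "The owner of asset srv-042 is the finance team.",
}
print("tool call accuracy:", tool_call_accuracy(trace))
print("task adherence:", task_adherence(trace))
```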

2. Human-in-the-Loop + AI-Assisted QA

  • As described in testRigor’s blog, combining automated testing with human oversight is key. This includes: 
    • Defining testable behaviors (e.g., tool use, plan execution) 
    • Using AI to assist in grading and anomaly detection 
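
One lightweight way to combine the two is to let an automated grader score every run and route only low-scoring or anomalous runs to a person. A minimal sketch, where the grader, threshold, and outlier rule are illustrative assumptions:

```python
# Sketch of AI-assisted QA with a human in the loop: automated grades
# decide which runs a person actually reviews. The grader, threshold,
# and outlier rule are illustrative assumptions.

from statistics import mean, pstdev

def auto_grade(run: dict) -> float:
    """Stand-in for an automated grader (e.g., a model- or rule-based score in [0, 1])."""
    return run["score"]

def needs_human_review(runs: list, threshold: float = 0.7, z: float = 2.0) -> list:
    """Flag runs that score below a floor or are statistical outliers."""
    scores = [auto_grade(r) for r in runs]
    mu, sigma = mean(scores), pstdev(scores)
    flagged = []
    for r, s in zip(runs, scores):
        if s < threshold or (sigma > 0 and abs(s - mu) > z * sigma):
            flagged.append(r)
    return flagged

runs = [{"id": i, "score": s} for i, s in enumerate([0.91, 0.88, 0.42, 0.93, 0.90])]
for r in needs_human_review(runs):
    print(f"run {r['id']} queued for analyst review (score={r['score']})")
```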

3. Anthropic’s Evaluation Approaches Overview 

  • This resource provides a comprehensive overview of Anthropic’s evaluation strategies for LLMs, including: 
    • Human-Graded Evaluations: Considered the gold standard for nuanced or subjective tasks 
    • Code-Graded Evaluations: Automated checks using deterministic logic, ideal for structured outputs 
    • Model-Graded Evaluations: Using other models to assess outputs, enabling scalable regression testing 
    • Integration with tools like Promptfoo and Anthropic Workbench for traceable, repeatable testing pipelines 
    • Emphasis on early and systematic evals to reduce downstream debugging and improve product quality [deepwiki.com] 
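
Model-graded evaluation in particular scales well for regression testing. The sketch below shows the general shape of such a check; call_model is a hypothetical stand-in for whichever LLM client is in use, and the rubric and pass criterion are assumptions for illustration.

```python
# General shape of a model-graded (LLM-as-judge) regression check.
# `call_model` is a hypothetical stand-in for your LLM client; the rubric
# and pass criterion are illustrative assumptions.

GRADING_PROMPT = """You are grading an answer for a regression test.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with only PASS or FAIL."""

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call made via your provider's SDK."""
    raise NotImplementedError("wire this to your LLM client")

def model_graded_check(question: str, reference: str, candidate: str) -> bool:
    verdict = call_model(GRADING_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    return verdict.strip().upper().startswith("PASS")

# Usage: run this over a fixed set of question/reference pairs on every
# prompt or model change, and track the pass rate release over release.
```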

4. Anthropic’s Statistical Framework for Model Evaluation 

This paper introduces a rigorous statistical methodology for evaluating LLMs: 

  • Advocates for using the Central Limit Theorem to estimate true model performance across unseen question distributions 
  • Encourages reporting Standard Error of the Mean (SEM) alongside eval scores to quantify uncertainty 
  • Helps teams distinguish between real performance differences and random variance — critical for regression testing over time 
  • Supports benchmark stability and scientific reproducibility in LLM evaluations [anthropic.com] 
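
The SEM reporting recommended in the paper is straightforward to compute. A minimal sketch, assuming each eval question yields a pass/fail (1/0) score:

```python
# Mean eval score with its Standard Error of the Mean (SEM), so two runs
# can be compared against sampling noise rather than eyeballed.

import math

def mean_and_sem(scores: list) -> tuple:
    n = len(scores)
    mu = sum(scores) / n
    var = sum((s - mu) ** 2 for s in scores) / (n - 1)   # sample variance
    return mu, math.sqrt(var / n)

baseline = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]    # pass/fail per eval question
candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]

mu_b, sem_b = mean_and_sem(baseline)
mu_c, sem_c = mean_and_sem(candidate)
print(f"baseline  {mu_b:.2f} ± {sem_b:.2f}")
print(f"candidate {mu_c:.2f} ± {sem_c:.2f}")
# If the intervals overlap heavily, the "improvement" may just be variance.
```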

Variability of Responses 

When repeatedly asking an LLM an identical question about facts (e.g., the distance to the moon), in most cases you will get accurate facts. You will also see wide variability in how that fact is communicated, in what context, and how it is explained. You will likely also receive a variety of adjacent facts: the average distance between the centers of mass, the exact distance on a specific day, the shortest or longest distance over a specific period of time, etc.

And when you have reason to clearly specify how to give those answers, there are clever and powerful methods to accomplish this.   

When you ask an LLM for an opinion, it will frequently respond with material differences in the answer. This is because it looks for a plausible match to its prompt (system, orchestration, user), with any ambiguity in the data or the prompt increasing the response variability. While there are many tools to improve prompts, the point is that even the same prompt can produce varying responses. And the longer your prompt, the greater the chance of increased (unintended) ambiguity [2].
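
One way to make this variability concrete is to measure it: send the identical prompt many times and count how many materially different answers come back. A minimal sketch, where call_model is a hypothetical stand-in for your LLM client:

```python
# Measure response variability: send the identical prompt N times and
# count distinct answers after light normalization. `call_model` is a
# hypothetical stand-in for your LLM client.

from collections import Counter

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM client")

def response_spread(prompt: str, n: int = 20) -> Counter:
    answers = Counter()
    for _ in range(n):
        text = call_model(prompt)
        answers[" ".join(text.lower().split())] += 1   # normalize whitespace and case
    return answers

# A single dominant answer suggests a stable prompt; a long tail of
# variants is a signal to tighten the prompt or constrain the output format.
```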

Variability of Execution Time 

The way an LLM solves a problem also has high variability, for the same reasons. An agent’s ability to identify multiple ways to solve a problem or satisfy a prompt is powerful, as is its ability to apply knowledge to identify the best way to orchestrate the answer. It also means that the further down the path you go toward autonomy, the more challenging it will be to tune its performance.
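
The same measurement approach applies to how the agent gets to an answer: time repeated runs of the same task and count the steps taken, then look at the spread rather than a single number. A minimal sketch, where run_agent is a hypothetical stand-in for your agent loop:

```python
# Profile run-to-run variability of an agent: wall-clock time and number
# of steps over repeated executions of the same task. `run_agent` is a
# hypothetical stand-in for your agent loop.

import time
from statistics import median

def run_agent(task: str) -> list:
    """Placeholder: returns the list of steps the agent executed."""
    raise NotImplementedError("wire this to your agent")

def profile(task: str, n: int = 10) -> None:
    durations, step_counts = [], []
    for _ in range(n):
        start = time.perf_counter()
        steps = run_agent(task)
        durations.append(time.perf_counter() - start)
        step_counts.append(len(steps))
    print(f"latency  p50={median(durations):.2f}s  max={max(durations):.2f}s")
    print(f"steps    p50={median(step_counts)}  max={max(step_counts)}")
```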

Debuggability 

Debugging an LLM’s confusion over ambiguous inputs is challenging and usually requires asking the LLM to explain the differences in interpretation of its inputs (including the answers it got back from follow-up questions). While logs can be generated for each run — expressly indicating the options evaluated, investigation steps executed, thinking performed, etc. — the logs themselves can vary in detail. And the effort put into “locking down” how to log can hamper the LLM’s creativity and make the whole system increasingly fragile.  

For this reason, there are great tools to help spot problems and reduce their prevalence: 

  1. Using the Evaluation Tool – Claude Docs  
  2. Use examples (multishot prompting) to guide Claude’s behavior – Claude Docs  
  3. Extended thinking tips – Claude Docs  
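
Beyond these tools, one practical middle ground is to record a structured trace of every run without constraining how the agent reasons: log what was called and what came back, and keep the free-text “thinking” as an opaque field. A minimal sketch; the event names and fields are illustrative assumptions:

```python
# Structured trace of an agent run: one JSON line per event, so runs can
# be diffed and searched without forcing the agent into a rigid format.
# Event names and fields are illustrative assumptions.

import json, time, uuid

class RunTrace:
    def __init__(self, task: str):
        self.run_id = str(uuid.uuid4())
        self.task = task

    def log(self, event: str, **fields) -> None:
        record = {"run_id": self.run_id, "ts": time.time(), "event": event, **fields}
        print(json.dumps(record))          # or append to your log store

trace = RunTrace("investigate alert 1234")
trace.log("tool_call", tool="query_logs", args={"host": "srv-042"})
trace.log("thinking", text="Two candidate explanations; checking auth logs first.")
trace.log("conclusion", verdict="benign", confidence="medium")
```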

Optimization 

The point of using an agentic LLM is to include the nuance of the current situation in answers, rather than prescribing the steps the LLM takes to reach a conclusion. While one could provide the LLM with the relative expense of each step and ask it to be efficient, that efficiency calculation is then made at execution time, run by run. This makes optimization challenging as well.
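
If the agent is going to make that efficiency calculation at run time, you can at least hand it explicit costs and enforce a hard budget in deterministic code outside the model. A minimal sketch of that idea; the tools, costs, and budget are illustrative assumptions:

```python
# Attach explicit relative costs to tools and enforce a hard budget in
# deterministic code, so the LLM's run-time "efficiency" choices are
# bounded. Tools, costs, and budget are illustrative assumptions.

TOOL_COSTS = {"lookup_asset": 1, "query_logs": 5, "full_packet_capture": 50}

class BudgetExceeded(RuntimeError):
    pass

class BudgetedToolRunner:
    def __init__(self, budget: int):
        self.budget = budget
        self.spent = 0

    def call(self, tool: str, fn, *args, **kwargs):
        cost = TOOL_COSTS[tool]
        if self.spent + cost > self.budget:
            raise BudgetExceeded(f"{tool} would exceed budget ({self.spent}+{cost}>{self.budget})")
        self.spent += cost
        return fn(*args, **kwargs)

# The agent still chooses which tools to use and in what order, but the
# most expensive mistakes are stopped deterministically.
```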

Ambiguity 

Engineers spend a considerable amount of their time designing tests to validate the changes they make to their systems, and this is especially true when building with LLMs.

Part of the goal of using an LLM is to allow it to come up with answers based on the specific context, improving the results. This can be challenging, however, because prompts that are added to an existing set of instructions, even if they are meant to add value and context to the original prompt, can throw off the entire system. To an LLM, there is no separability between an original prompt and additional instructions: it will always read and interpret the entirety of both with every new input. And the more instructions that are added to a prompt, the more ways the LLM might start to infer meaning differently than intended.

What that means for engineers is that they may need to re-test the full functionality of a prompt whenever they change even a single character, because that one character adds to the risk of greater ambiguity, a known risk that can leave a previously well-vetted scenario no longer working in its edge cases.
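
In practice this pushes teams toward treating the prompt like code: any change to its text, however small, triggers the full evaluation suite before it ships. A minimal sketch of such a gate; the file names and harness hook are illustrative assumptions:

```python
# Treat the system prompt like code: if its hash changes, the entire
# scenario suite must be re-run before the change ships. File names and
# the harness hook are illustrative assumptions.

import hashlib, json, pathlib

PROMPT_FILE = pathlib.Path("system_prompt.txt")
BASELINE_FILE = pathlib.Path("prompt_baseline.json")

def prompt_hash() -> str:
    return hashlib.sha256(PROMPT_FILE.read_bytes()).hexdigest()

def run_full_suite() -> dict:
    """Placeholder: run every eval scenario and return {scenario: passed}."""
    raise NotImplementedError("wire this to your eval harness")

def gate() -> None:
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    if baseline.get("prompt_hash") == prompt_hash():
        print("prompt unchanged; full suite not required")
        return
    results = run_full_suite()               # every scenario, not a sample
    failures = [name for name, passed in results.items() if not passed]
    assert not failures, f"regressions: {failures}"
    BASELINE_FILE.write_text(json.dumps({"prompt_hash": prompt_hash(), "results": results}))
```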

An Example of a System Prompt Edit Gone Wrong 

An edit to a system prompt can bleed into answers on unrelated queries, as happened when Elon Musk’s xAI publicly admitted that an “unauthorized modification” to its Grok chatbot’s system prompt was responsible for the AI’s repeated, unprompted mentions of the false claim of “white genocide” in South Africa.

As part of its response to the controversy, xAI has stated that it will begin publishing the system prompts that shape Grok’s behavior on a public GitHub repository. This is one approach AI organizations can use to increase transparency — making it possible for external stakeholders to track how foundational instructions change over time. 

xAI has also announced plans to add tighter controls to prevent unreviewed edits to system prompts, reinforcing a broader industry need for clear audit trails and approval processes for model-level changes. In addition, the company has stated that it will create a team dedicated to continuous monitoring of the model’s outputs, a practice that can help AI organizations catch issues that automated systems miss. 

Taken together, these steps illustrate several measures AI developers can adopt to reduce the risk of unauthorized changes and improve oversight going forward.

“Near Eyeballs” Use Cases 

The phrase “near eyeballs” describes usage patterns that have a higher probability of catching mistakes like omissions, subtle flaws in logic, and extra unneeded steps. There are plenty of ways to harness the power of agentic LLMs to your advantage while keeping them “near eyeballs.”

At Arctic Wolf, a practical example of this is how we process our internal, plain-English records of a customer’s specific circumstances and communication preferences. 

If we used an LLM to interpret these notes in real time for every investigation, we would introduce significant risks like data exposure and inconsistent results. 

To avoid this, we apply a core engineering principle: caching the results of expensive processing. Our system is designed to run the LLM transformation only when the underlying data changes. When an analyst updates the notes, the LLM generates a structured summary of that information, which is then stored. This cached result is used across all subsequent investigations, making the process both efficient and secure. This method also ensures human oversight, as the analyst who updates the notes can validate the transformation, and it is reviewed again if our AI escalates an investigation to a human analyst for support.  
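
A minimal sketch of that caching pattern follows; summarize_with_llm and the in-memory cache are hypothetical stand-ins for the LLM transformation and whatever store is actually used:

```python
# Cache the LLM transformation of analyst notes, keyed by a hash of the
# notes themselves, so the model runs only when the underlying data
# changes. Helper names and the in-memory store are hypothetical.

import hashlib

_cache = {}   # notes_hash -> structured summary

def summarize_with_llm(notes: str) -> str:
    """Placeholder for the LLM call that turns free-text notes into a structured summary."""
    raise NotImplementedError("wire this to your LLM client")

def get_customer_summary(notes: str) -> str:
    key = hashlib.sha256(notes.encode("utf-8")).hexdigest()
    if key not in _cache:
        summary = summarize_with_llm(notes)   # expensive call, runs only on change
        # This is the point where the analyst who edited the notes can
        # review the generated summary before it is stored.
        _cache[key] = summary
    return _cache[key]
```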

Additional Examples of “Near Eyeballs”

  • Interactive Dialog and Execution: Direct Chatbot & Analyst/Customer Interaction  
  • Case Explainer — Customer: This is the skill, or set of instructions the AI agent follows, that powers how we succinctly communicate the important findings and consequences to a person, whether that is the customer, a new analyst picking up the case, an auditor, or anyone revisiting the investigation later. “Eyeballs” here are the readers of the tickets and the analysts receiving the cases, with the ability to dialog about the findings within.
  • Case Explainer — Quick Brief: We need to summarize and digest the history of a case: the details of investigation steps, what was found, conclusions drawn, and questions still unanswered. There are multiple target audiences for those summaries, such as customer tickets, dialog with the “AI Security Assistant,” transfer from one analyst to another to get them up to speed quickly, and long-after-the-fact review to remind analysts of the important details.
  • Formatting Data for Situational Reports: Creating a uniform executive overview of the state of a customer’s estate. What are the highlights, known issues, active or recent investigations and conclusions, etc. 
  • Reviewing and Executing on Inbound/Response from the Customer: Understanding, triaging, and routing customer requests (inbound)
  • Reports, Insights and Trends: Allowing the agent to determine the data to pull and the way to pull it, execute, review/analyze, and present the results back to the user. Here again, it’s appropriate and easy to strictly specify the output format. 
  • And many others!  

All this is in the service of harvesting the tremendous value of agentic LLMs while mitigating their weaknesses. A critical question left unanswered in this blog, and one that is central to effective LLM management, is how to run an LLM effectively without a nearby eyeball. Arctic Wolf will have more on this in coming research.

Conclusion 

Agentic LLMs are extraordinarily powerful, and that power must be applied with intention. The closer an LLM operates to autonomy, the more variability, ambiguity, and risk it introduces—often in ways that are difficult to test, debug, or optimize. By keeping LLMs “near eyeballs,” organizations can preserve what these systems do best, which is reasoning, synthesis, and contextual understanding, while maintaining human oversight where precision and accountability matter most. This approach allows teams to scale insight without sacrificing reliability, and to move faster without losing control. 

As you evaluate where and how to deploy agentic AI, start by asking not just what can be automated, but where human review meaningfully reduces risk. To learn how Arctic Wolf applies this philosophy in real-world security operations—from Alpha AI to analyst workflows—explore our approach to building AI that’s powerful, practical, and accountable. 

By Kenneth Ray, SVP and Chief of Innovation at Arctic Wolf & Mike Mylrea, AI Technical Fellow and Architect at Arctic Wolf

  1. An LLM is a language model that responds to prompts (text in, text out), while an agentic LLM (or AI Agent) uses an LLM as its “brain” to autonomously plan, reason, use tools (APIs, databases), learn from interactions, and complete complex, multi-step goals without constant human intervention, making it a proactive system vs. a reactive one. Think of an LLM as a brilliant but passive writer, while an agentic system is a proactive assistant that can use that writer (and other tools) to achieve objectives. 
  2. https://community.openai.com/t/playground-prompt-length-what-is-too-long/627123/2 