The Case for Eyeballs-Near-LLM Usage

Discover why agentic LLMs are incredibly powerful, and why maintaining human oversight and applying them with intention is critical to scaling insight safely and reliably. 

At Arctic Wolf, we believe the future of cybersecurity is built on AI guided by human expertise. Staying at the forefront of security operations means not just adopting new technology but deeply understanding how and where it should be applied. In this blog series, Ken Ray, SVP and Chief Innovation Architect at Arctic Wolf, along with other Arctic Wolf experts and leaders from the wider security community, share their perspectives on large language models (LLMs), what they are good at, where they fall short, and what it really takes to use them responsibly in real-world security operations. 

These posts offer a candid look at technical perspectives from across the industry and contribute to an open conversation about the future of AI in cybersecurity. In this post, they share their thoughts on why agentic LLMs are incredibly powerful, and why maintaining human oversight and applying them with intention is critical to scaling insight safely and reliably. 

As organizations continue to explore the transformative potential of large language models (LLMs) — including in areas like autonomous workflows and strategic decision making — important nuances are being uncovered in their capabilities.  

Agentic LLMs [1] are proving tremendously useful for automation, demonstrating great strength in understanding open-ended problems, extrapolating knowledge found on the internet, and dynamically applying it to new contexts. This same strength can introduce challenges when precision, accuracy, and completeness are paramount, especially in scenarios where errors may be hard to detect.

With regard to LLMs, organizations should be deliberate about when and how they pair an LLM’s fresh insight or nuanced understanding with acting on that knowledge. Once organizations know what needs to be automated, more deterministic tools are often better suited for carrying it out reliably.

Challenges with LLMs 

Regression Testing 

It is telling that there is a large and growing body of research and development focused on regression testing — a form of quality assurance used to determine if code updates have caused a piece of software to malfunction or provide a poorer output. Experts in the field are continually seeking new methods to separate genuine understanding — such as depth of reasoning, nuance, and practical application — from superficial performance or pattern recognition. This ongoing effort reflects the broader challenge of distinguishing true capability and adaptability from memorization or mechanical repetition in both software engineering and AI evaluation. Some of those efforts are reflected below:  

1. Agentic Metrics via Azure AI Evaluation Library

Microsoft’s Azure AI Evaluation library introduces specialized metrics for agentic workflows: 

  • Task Adherence: Measures how well the agent’s output aligns with the original goal 
  • Tool Call Accuracy: Evaluates whether the agent used the correct tools appropriately 
  • Intent Resolution: Assesses if the agent understood and followed the user’s intent  
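
These checks can be approximated even without a particular framework. The sketch below is a minimal, library-agnostic illustration of what such metrics measure; the trace structure, tool names, and keyword heuristic are assumptions for illustration, not the Azure AI Evaluation API.

```python
# Minimal, library-agnostic sketch of the kinds of checks these agentic
# metrics automate. The trace structure, allowed tools, and keyword proxy
# are illustrative assumptions, not the Azure AI Evaluation API.

ALLOWED_TOOLS = {"lookup_asset", "query_logs"}   # tools the agent may call
REQUIRED_KEYWORDS = {"asset", "owner"}           # crude proxy for the task goal

def tool_call_accuracy(trace: dict) -> float:
    """Fraction of tool calls that used an allowed tool with non-empty arguments."""
    calls = trace.get("tool_calls", [])
    if not calls:
        return 0.0
    ok = sum(1 for c in calls if c["name"] in ALLOWED_TOOLS and c.get("args"))
    return ok / len(calls)

def task_adherence(trace: dict) -> float:
    """Very rough keyword-overlap proxy for 'did the answer address the goal'."""
    answer = trace.get("final_answer", "").lower()
    hits = sum(1 for kw in REQUIRED_KEYWORDS if kw in answer)
    return hits / len(REQUIRED_KEYWORDS)

trace = {
    "goal": "Identify the owner of the affected asset.",
    "tool_calls": [{"name": "lookup_asset", "args": {"id": "srv-042"}}],
    "final_answer": "The owner of asset srv-042 is the finance team.",
}
print("tool call accuracy:", tool_call_accuracy(trace))
print("task adherence:", task_adherence(trace))
```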

2. Human-in-the-Loop + AI-Assisted QA

  • As described in testRigor’s blog, combining automated testing with human oversight is key. This includes: 
    • Defining testable behaviors (e.g., tool use, plan execution) 
    • Using AI to assist in grading and anomaly detection 
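
One lightweight way to combine the two is to let an automated grader score every run and route only low-scoring or anomalous runs to a person. A minimal sketch, where the grader, threshold, and outlier rule are illustrative assumptions:

```python
# Sketch of AI-assisted QA with a human in the loop: automated grades
# decide which runs a person actually reviews. The grader, threshold,
# and outlier rule are illustrative assumptions.

from statistics import mean, pstdev

def auto_grade(run: dict) -> float:
    """Stand-in for an automated grader (e.g., a model- or rule-based score in [0, 1])."""
    return run["score"]

def needs_human_review(runs: list, threshold: float = 0.7, z: float = 2.0) -> list:
    """Flag runs that score below a floor or are statistical outliers."""
    scores = [auto_grade(r) for r in runs]
    mu, sigma = mean(scores), pstdev(scores)
    flagged = []
    for r, s in zip(runs, scores):
        if s < threshold or (sigma > 0 and abs(s - mu) > z * sigma):
            flagged.append(r)
    return flagged

runs = [{"id": i, "score": s} for i, s in enumerate([0.91, 0.88, 0.42, 0.93, 0.90])]
for r in needs_human_review(runs):
    print(f"run {r['id']} queued for analyst review (score={r['score']})")
```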

3. Anthropic’s Evaluation Approaches Overview 

  • This resource provides a comprehensive overview of Anthropic’s evaluation strategies for LLMs, including: 
    • Human-Graded Evaluations: Considered the gold standard for nuanced or subjective tasks 
    • Code-Graded Evaluations: Automated checks using deterministic logic, ideal for structured outputs 
    • Model-Graded Evaluations: Using other models to assess outputs, enabling scalable regression testing 
    • Integration with tools like Promptfoo and Anthropic Workbench for traceable, repeatable testing pipelines 
    • Emphasis on early and systematic evals to reduce downstream debugging and improve product quality [deepwiki.com] 
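
Model-graded evaluation in particular scales well for regression testing. The sketch below shows the general shape of such a check; call_model is a hypothetical stand-in for whichever LLM client is in use, and the rubric and pass criterion are assumptions for illustration.

```python
# General shape of a model-graded (LLM-as-judge) regression check.
# `call_model` is a hypothetical stand-in for your LLM client; the rubric
# and pass criterion are illustrative assumptions.

GRADING_PROMPT = """You are grading an answer for a regression test.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with only PASS or FAIL."""

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call made via your provider's SDK."""
    raise NotImplementedError("wire this to your LLM client")

def model_graded_check(question: str, reference: str, candidate: str) -> bool:
    verdict = call_model(GRADING_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    return verdict.strip().upper().startswith("PASS")

# Usage: run this over a fixed set of question/reference pairs on every
# prompt or model change, and track the pass rate release over release.
```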

4. Anthropic’s Statistical Framework for Model Evaluation 

This paper introduces a rigorous statistical methodology for evaluating LLMs: 

  • Advocates for using the Central Limit Theorem to estimate true model performance across unseen question distributions 
  • Encourages reporting Standard Error of the Mean (SEM) alongside eval scores to quantify uncertainty 
  • Helps teams distinguish between real performance differences and random variance — critical for regression testing over time 
  • Supports benchmark stability and scientific reproducibility in LLM evaluations [anthropic.com] 
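
The SEM reporting recommended in the paper is straightforward to compute. A minimal sketch, assuming each eval question yields a pass/fail (1/0) score:

```python
# Mean eval score with its Standard Error of the Mean (SEM), so two runs
# can be compared against sampling noise rather than eyeballed.

import math

def mean_and_sem(scores: list) -> tuple:
    n = len(scores)
    mu = sum(scores) / n
    var = sum((s - mu) ** 2 for s in scores) / (n - 1)   # sample variance
    return mu, math.sqrt(var / n)

baseline = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]    # pass/fail per eval question
candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]

mu_b, sem_b = mean_and_sem(baseline)
mu_c, sem_c = mean_and_sem(candidate)
print(f"baseline  {mu_b:.2f} ± {sem_b:.2f}")
print(f"candidate {mu_c:.2f} ± {sem_c:.2f}")
# If the intervals overlap heavily, the "improvement" may just be variance.
```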

Variability of Responses 

When repeatedly asking an LLM an identical question about facts (e.g., the distance to the moon), in most cases you will get accurate facts. You will also see wide variability in how that fact is communicated, in what context, and how it is explained. You will likely also receive a variety of adjacent facts: the average distance between the centers of mass, the exact distance on a specific day, the shortest or longest distance over a specific period of time, etc.

And when you have reason to clearly specify how to give those answers, there are clever and powerful methods to accomplish this.   

When you ask an LLM for an opinion, it will frequently respond with material differences in the answer. This is because it looks for a plausible match to its prompt (system, orchestration, user), with any ambiguity in the data or the prompt increasing the response variability. While there are many tools to improve prompts, the point is that even the same prompt can produce varying responses. And the longer your prompt, the greater the chance of increased (unintended) ambiguity [2].
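
One way to make this variability concrete is to measure it: send the identical prompt many times and count how many materially different answers come back. A minimal sketch, where call_model is a hypothetical stand-in for your LLM client:

```python
# Measure response variability: send the identical prompt N times and
# count distinct answers after light normalization. `call_model` is a
# hypothetical stand-in for your LLM client.

from collections import Counter

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM client")

def response_spread(prompt: str, n: int = 20) -> Counter:
    answers = Counter()
    for _ in range(n):
        text = call_model(prompt)
        answers[" ".join(text.lower().split())] += 1   # normalize whitespace and case
    return answers

# A single dominant answer suggests a stable prompt; a long tail of
# variants is a signal to tighten the prompt or constrain the output format.
```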

Variability of Execution Time 

The way an LLM solves a problem also has high variability, for the same reasons. An agent’s ability to identify multiple ways to solve a problem or satisfy a prompt is powerful, as is its ability to apply knowledge to identify the best way to orchestrate the answer. It also means that the further down the path you go toward autonomy, the more challenging it will be to tune its performance.
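
The same measurement approach applies to how the agent gets to an answer: time repeated runs of the same task and count the steps taken, then look at the spread rather than a single number. A minimal sketch, where run_agent is a hypothetical stand-in for your agent loop:

```python
# Profile run-to-run variability of an agent: wall-clock time and number
# of steps over repeated executions of the same task. `run_agent` is a
# hypothetical stand-in for your agent loop.

import time
from statistics import median

def run_agent(task: str) -> list:
    """Placeholder: returns the list of steps the agent executed."""
    raise NotImplementedError("wire this to your agent")

def profile(task: str, n: int = 10) -> None:
    durations, step_counts = [], []
    for _ in range(n):
        start = time.perf_counter()
        steps = run_agent(task)
        durations.append(time.perf_counter() - start)
        step_counts.append(len(steps))
    print(f"latency  p50={median(durations):.2f}s  max={max(durations):.2f}s")
    print(f"steps    p50={median(step_counts)}  max={max(step_counts)}")
```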

Debuggability 

Debugging an LLM’s confusion over ambiguous inputs is challenging and usually requires asking the LLM to explain the differences in interpretation of its inputs (including the answers it got back from follow-up questions). While logs can be generated for each run — expressly indicating the options evaluated, investigation steps executed, thinking performed, etc. — the logs themselves can vary in detail. And the effort put into “locking down” how to log can hamper the LLM’s creativity and make the whole system increasingly fragile.  

For this reason, there are great tools to help spot problems and reduce their prevalence: 

  1. Using the Evaluation Tool – Claude Docs  
  2. Use examples (multishot prompting) to guide Claude’s behavior – Claude Docs  
  3. Extended thinking tips – Claude Docs  
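
Beyond these tools, one practical middle ground is to record a structured trace of every run without constraining how the agent reasons: log what was called and what came back, and keep the free-text “thinking” as an opaque field. A minimal sketch; the event names and fields are illustrative assumptions:

```python
# Structured trace of an agent run: one JSON line per event, so runs can
# be diffed and searched without forcing the agent into a rigid format.
# Event names and fields are illustrative assumptions.

import json, time, uuid

class RunTrace:
    def __init__(self, task: str):
        self.run_id = str(uuid.uuid4())
        self.task = task

    def log(self, event: str, **fields) -> None:
        record = {"run_id": self.run_id, "ts": time.time(), "event": event, **fields}
        print(json.dumps(record))          # or append to your log store

trace = RunTrace("investigate alert 1234")
trace.log("tool_call", tool="query_logs", args={"host": "srv-042"})
trace.log("thinking", text="Two candidate explanations; checking auth logs first.")
trace.log("conclusion", verdict="benign", confidence="medium")
```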

Optimization 

The point of using an agentic LLM is to include the nuance of the current situation in answers, rather than prescribing the steps the LLM takes to reach a conclusion. While one could provide the LLM with the relative expense of each step and ask it to be efficient, that efficiency calculation is then made at execution time, run by run. This makes optimization challenging as well.
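
If the agent is going to make that efficiency calculation at run time, you can at least hand it explicit costs and enforce a hard budget in deterministic code outside the model. A minimal sketch of that idea; the tools, costs, and budget are illustrative assumptions:

```python
# Attach explicit relative costs to tools and enforce a hard budget in
# deterministic code, so the LLM's run-time "efficiency" choices are
# bounded. Tools, costs, and budget are illustrative assumptions.

TOOL_COSTS = {"lookup_asset": 1, "query_logs": 5, "full_packet_capture": 50}

class BudgetExceeded(RuntimeError):
    pass

class BudgetedToolRunner:
    def __init__(self, budget: int):
        self.budget = budget
        self.spent = 0

    def call(self, tool: str, fn, *args, **kwargs):
        cost = TOOL_COSTS[tool]
        if self.spent + cost > self.budget:
            raise BudgetExceeded(f"{tool} would exceed budget ({self.spent}+{cost}>{self.budget})")
        self.spent += cost
        return fn(*args, **kwargs)

# The agent still chooses which tools to use and in what order, but the
# most expensive mistakes are stopped deterministically.
```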

Ambiguity 

Engineers spend a considerable amount of their time designing tests to validate the changes they make to their systems, and this is especially true when building with LLMs.

Part of the goal of using an LLM is to allow it to come up with answers based on the specific context, improving the results. This can be challenging, however, because prompts that are added to an existing set of instructions, even if they are meant to add value and context to the original prompt, can throw off the entire system. To an LLM, there is no separability between an original prompt and additional instructions: it will always read and interpret the entirety of both with every new input. And the more instructions that are added to a prompt, the more ways the LLM might start to infer meaning differently than intended.

What that means for engineers is that they may need to re-test the full functionality of a prompt whenever they change even a single character, because that one character adds to the risk of greater ambiguity, a known risk that can leave a previously well-vetted scenario no longer working in its edge cases.
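
In practice this pushes teams toward treating the prompt like code: any change to its text, however small, triggers the full evaluation suite before it ships. A minimal sketch of such a gate; the file names and harness hook are illustrative assumptions:

```python
# Treat the system prompt like code: if its hash changes, the entire
# scenario suite must be re-run before the change ships. File names and
# the harness hook are illustrative assumptions.

import hashlib, json, pathlib

PROMPT_FILE = pathlib.Path("system_prompt.txt")
BASELINE_FILE = pathlib.Path("prompt_baseline.json")

def prompt_hash() -> str:
    return hashlib.sha256(PROMPT_FILE.read_bytes()).hexdigest()

def run_full_suite() -> dict:
    """Placeholder: run every eval scenario and return {scenario: passed}."""
    raise NotImplementedError("wire this to your eval harness")

def gate() -> None:
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    if baseline.get("prompt_hash") == prompt_hash():
        print("prompt unchanged; full suite not required")
        return
    results = run_full_suite()               # every scenario, not a sample
    failures = [name for name, passed in results.items() if not passed]
    assert not failures, f"regressions: {failures}"
    BASELINE_FILE.write_text(json.dumps({"prompt_hash": prompt_hash(), "results": results}))
```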

An Example of a System Prompt Edit Gone Wrong 

An edit to a system prompt can bleed into answers on unrelated queries, as happened when Elon Musk’s xAI publicly admitted that an “unauthorized modification” to its Grok chatbot’s system prompt was responsible for the AI’s repeated, unprompted mentions of the false claim of “white genocide” in South Africa.

As part of its response to the controversy, xAI has stated that it will begin publishing the system prompts that shape Grok’s behavior on a public GitHub repository. This is one approach AI organizations can use to increase transparency — making it possible for external stakeholders to track how foundational instructions change over time. 

xAI has also announced plans to add tighter controls to prevent unreviewed edits to system prompts, reinforcing a broader industry need for clear audit trails and approval processes for model-level changes. In addition, the company has stated that it will create a team dedicated to continuous monitoring of the model’s outputs, a practice that can help AI organizations catch issues that automated systems miss. 

Taken together, these steps illustrate several measures AI developers can adopt to reduce the risk of unauthorized changes and improve oversight going forward.

“Near Eyeballs” Use Cases 

The phrase “near eyeballs” describes usage patterns that have a higher probability of catching mistakes like omissions, subtle flaws in logic, and extra unneeded steps. There are plenty of ways to harness the power of agentic LLMs to your advantage while keeping them “near eyeballs.”

At Arctic Wolf, a practical example of this is how we process our internal, plain-English records of a customer’s specific circumstances and communication preferences. 

If we used an LLM to interpret these notes in real time for every investigation, we would introduce significant risks like data exposure and inconsistent results. 

To avoid this, we apply a core engineering principle: caching the results of expensive processing. Our system is designed to run the LLM transformation only when the underlying data changes. When an analyst updates the notes, the LLM generates a structured summary of that information, which is then stored. This cached result is used across all subsequent investigations, making the process both efficient and secure. This method also ensures human oversight, as the analyst who updates the notes can validate the transformation, and it is reviewed again if our AI escalates an investigation to a human analyst for support.  
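
A minimal sketch of that caching pattern follows; summarize_with_llm and the in-memory cache are hypothetical stand-ins for the LLM transformation and whatever store is actually used:

```python
# Cache the LLM transformation of analyst notes, keyed by a hash of the
# notes themselves, so the model runs only when the underlying data
# changes. Helper names and the in-memory store are hypothetical.

import hashlib

_cache = {}   # notes_hash -> structured summary

def summarize_with_llm(notes: str) -> str:
    """Placeholder for the LLM call that turns free-text notes into a structured summary."""
    raise NotImplementedError("wire this to your LLM client")

def get_customer_summary(notes: str) -> str:
    key = hashlib.sha256(notes.encode("utf-8")).hexdigest()
    if key not in _cache:
        summary = summarize_with_llm(notes)   # expensive call, runs only on change
        # This is the point where the analyst who edited the notes can
        # review the generated summary before it is stored.
        _cache[key] = summary
    return _cache[key]
```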

Additional Examples of “Near Eyeballs”

  • Interactive Dialog and Execution: Direct Chatbot & Analyst/Customer Interaction  
  • Case Explainer — Customer: This is the skill, or set of instructions the AI agent follows, that powers how we succinctly communicate the important findings and consequences to a person, whether that is the customer, a new analyst picking up the case, an auditor, or anyone revisiting the investigation later. “Eyeballs” here are the readers of the tickets and the analysts receiving the cases, with the ability to dialog about the findings within.
  • Case Explainer — Quick Brief: We need to summarize and digest the history of a case: the details of investigation steps, what was found, conclusions drawn, and questions still unanswered. There are multiple target audiences for those summaries, such as customer tickets, dialog with the “AI Security Assistant,” transfer from one analyst to another to get them up to speed quickly, and long-after-the-fact review to remind analysts of the important details.
  • Formatting Data for Situational Reports: Creating a uniform executive overview of the state of a customer’s estate. What are the highlights, known issues, active or recent investigations and conclusions, etc. 
  • Reviewing and Executing on Inbound/Response from the Customer: Understanding, triaging, and routing customer requests (inbound)
  • Reports, Insights and Trends: Allowing the agent to determine the data to pull and the way to pull it, execute, review/analyze, and present the results back to the user. Here again, it’s appropriate and easy to strictly specify the output format. 
  • And many others!  

All this is in the service of harvesting the tremendous value of agentic LLMs while mitigating their weaknesses. A critical question left unanswered in this blog, and one that is central to effective LLM management, is how to run an LLM effectively without a nearby eyeball. Arctic Wolf will have more on this in coming research.

Conclusion 

Agentic LLMs are extraordinarily powerful, and that power must be applied with intention. The closer an LLM operates to autonomy, the more variability, ambiguity, and risk it introduces—often in ways that are difficult to test, debug, or optimize. By keeping LLMs “near eyeballs,” organizations can preserve what these systems do best, which is reasoning, synthesis, and contextual understanding, while maintaining human oversight where precision and accountability matter most. This approach allows teams to scale insight without sacrificing reliability, and to move faster without losing control. 

As you evaluate where and how to deploy agentic AI, start by asking not just what can be automated, but where human review meaningfully reduces risk. To learn how Arctic Wolf applies this philosophy in real-world security operations—from Alpha AI to analyst workflows—explore our approach to building AI that’s powerful, practical, and accountable. 

By Kenneth Ray, SVP and Chief of Innovation at Arctic Wolf & Mike Mylrea, AI Technical Fellow and Architect at Arctic Wolf

  1. An LLM is a language model that responds to prompts (text in, text out), while an agentic LLM (or AI Agent) uses an LLM as its “brain” to autonomously plan, reason, use tools (APIs, databases), learn from interactions, and complete complex, multi-step goals without constant human intervention, making it a proactive system vs. a reactive one. Think of an LLM as a brilliant but passive writer, while an agentic system is a proactive assistant that can use that writer (and other tools) to achieve objectives. 
  2. https://community.openai.com/t/playground-prompt-length-what-is-too-long/627123/2 