
What Makes an AI Agent Safe for Use? Inside Our Agentic Validation Work with Arva AI.

There’s a joke in the AI world that an agent is just “three prompts in a trench coat.”

In regulated workflows, that joke stops being funny.

If your agent is touching sanctions screening, KYC, fraud, or investigations, the downside isn’t just a bad recommendation. It’s regulatory exposure. Missed risk. Or a compliance team that quietly stops trusting the system.

That’s the context behind our work with Arva AI, and why we’re sharing what we learned about validating agentic systems alongside our podcast conversation with Arva AI CEO Rhim Shah, CAMS.

TL;DR

  • Agentic systems break the “validate once, monitor drift” model.
  • Early architectural decisions determine downstream validation burden.
  • The missing ingredient in regulated AI is domain-grounded testing — what we call SMEvals.

Watch the conversation with Rhim.


The Validation Problem Changes When the System Can Change

Traditional model validation assumes relative stability.

You validate a model. You monitor for drift. You revalidate periodically.

Agentic AI breaks that assumption.

Agents combine LLMs with rules engines, tools, retrieval layers, data sources, and decision logic. Even with similar inputs, they can take different reasoning paths, invoke different tools, and produce different rationales.

And critically: the system can change even when you don’t.

A vendor updates the foundation model. A document enters the retrieval corpus. A prompt template evolves. A tool permission expands. The output shifts.

Validation frameworks designed for static statistical models don’t account for that.
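One practical response is to treat the agent’s full configuration as a versioned artifact, so that any change to the model, prompts, corpus, or tool permissions is detectable and can trigger revalidation. Here is a minimal sketch in Python; the function and field names are illustrative assumptions, not a description of any particular vendor’s system:

```python
import hashlib
import json

def system_fingerprint(model_version, prompt_templates, corpus_doc_ids, tool_permissions):
    """Hash every component that can change agent behavior."""
    payload = json.dumps({
        "model": model_version,
        "prompts": prompt_templates,
        "corpus": sorted(corpus_doc_ids),
        "tools": sorted(tool_permissions),
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Fingerprint recorded at the last validation run (illustrative values).
validated = system_fingerprint("base-model-v1", {"triage": "prompt-v3"},
                               ["doc-001", "doc-002"], ["entity_search"])

# A silent vendor model update changes the fingerprint, even though
# none of your own code changed.
current = system_fingerprint("base-model-v2", {"triage": "prompt-v3"},
                             ["doc-001", "doc-002"], ["entity_search"])

if current != validated:
    print("Agent configuration changed since last validation; re-run the eval suite.")
```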

In agentic AI, everyone talks about “evaluation.” In regulated workflows, the question is more specific:

Where are the domain tests?

For sanctions, KYC, and fraud operations, there is no public benchmark you can point to and say, “this is what good looks like.”

That gap is exactly why we built SMEvals — our subject-matter-expert–driven evaluation libraries — and why we used them in our work with Arva AI to validate their agents against real-world scenarios, edge cases, and known failure modes.

What Makes an AI Agent Validatable in the First Place?

Validation doesn’t start at deployment. It starts at design.

The architectural decisions a team makes early determine how testable, auditable, and governable the system will be downstream.

Rhim Shah led the financial crime product team at Revolut Business and managed hundreds of analysts doing the work Arva’s agents now support. That operator background shows up in Arva’s product design.

In our work with Arva, three patterns stood out — patterns any team building or buying agentic AI in regulated environments should study.

1) LLM Minimization

Use the LLM only where you have to. Anything that can be expressed as deterministic logic should be. The smaller the LLM surface area, the more predictable the workflow, the clearer the test cases, and the tighter the governance.
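As a minimal sketch of the pattern (the field names and the `llm_adjudicate` stub are illustrative assumptions, not Arva’s logic):

```python
def llm_adjudicate(entity, candidates):
    # Stand-in for the one narrow LLM call this workflow allows.
    return {"decision": "review"}

def screen_entity(entity, watchlist):
    """Route a screening decision: deterministic rules first, LLM last."""
    # Deterministic: an exact watchlist hit never needs a model.
    if entity["normalized_name"] in watchlist:
        return {"decision": "escalate", "path": "rule:exact_match"}

    # Deterministic: no fuzzy candidates means nothing to adjudicate.
    if not entity.get("fuzzy_candidates"):
        return {"decision": "clear", "path": "rule:no_candidates"}

    # Only the genuinely ambiguous residue reaches the LLM.
    result = llm_adjudicate(entity, entity["fuzzy_candidates"])
    return {"decision": result["decision"], "path": "llm:adjudication"}
```

A side benefit: the `path` field doubles as an audit trail, so you can report exactly what fraction of decisions ever touched the model.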

2) Independent Confidence Scoring

Don’t let the LLM grade its own homework. A separate scoring layer that evaluates decision quality and uncertainty outside the model is one of the most important architectural decisions a team can make. In our experience, self-evaluation is one of the most common failure modes in agentic systems.
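A sketch of what “outside the model” can mean in practice. The signals and deduction weights below are illustrative assumptions; the point is that none of them come from asking the LLM how confident it feels:

```python
def independent_confidence(output, evidence):
    """Score decision quality without asking the producing model.

    Starts at 1.0 and deducts for externally checkable defects.
    """
    score = 1.0

    # Every citation in the rationale must point at real retrieved evidence.
    cited = {c for claim in output["claims"] for c in claim["citations"]}
    available = {doc["id"] for doc in evidence}
    if not cited or not cited.issubset(available):
        score -= 0.5  # uncited or fabricated references

    # The structured decision must agree with the rationale's conclusion.
    if output["decision"] != output["rationale_conclusion"]:
        score -= 0.4  # internal inconsistency is a strong warning sign

    return max(score, 0.0)

# Below some threshold, the case routes to a human analyst instead of
# being auto-decided.
output = {"decision": "clear", "rationale_conclusion": "clear",
          "claims": [{"citations": ["doc-1"]}]}
assert independent_confidence(output, [{"id": "doc-1"}]) == 1.0
```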

3) SOP-Driven Workflow Decomposition

There’s a big difference between pasting an SOP into a prompt and decomposing it into structured, auditable steps that mirror how a human expert actually works. The second approach creates validation checkpoints at every stage, not just at the final output. This makes governance structural.
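A minimal sketch of the decomposition pattern, with hypothetical structures; a real SOP would have many more steps and richer checks:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SOPStep:
    name: str
    run: Callable    # produces this step's structured output
    check: Callable  # validation checkpoint for this step alone

def run_sop(case, steps):
    """Execute an SOP as discrete, auditable steps, not one big prompt."""
    trail = []
    for step in steps:
        result = step.run(case)
        passed = step.check(result)
        trail.append({"step": step.name, "result": result, "passed": passed})
        if not passed:
            # Fail closed at the checkpoint, not at the final output.
            return {"decision": "escalate", "failed_step": step.name, "trail": trail}
        case = {**case, **result}
    return {"decision": case.get("decision", "clear"), "trail": trail}
```

Because each step carries its own `check`, every step can also be tested in isolation, which is what makes the workflow validatable rather than merely observable.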

These are deliberate design choices. They separate agents you can defend to a regulator from agents you simply hope work.

What We’re Building at FairPlay for Agentic Validation

Our work with Arva AI is part of FairPlay’s broader expansion into agentic evaluation for regulated industries.

At the center of that effort is something we call SMEvals.

SMEvals are domain-specific libraries of tests built by subject matter experts and grounded in how real compliance and risk teams evaluate outputs.

Each library includes (a sketch of one case follows the list):

  • Structured scenario sets
  • Edge cases
  • Known failure modes
  • Explicit rubrics defining what “good” looks like
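As an illustration, a single SMEval case might carry this shape. The schema below is a hypothetical sketch, not FairPlay’s actual format:

```python
from dataclasses import dataclass, field

@dataclass
class SMEvalCase:
    scenario: str            # structured scenario the agent must handle
    is_edge_case: bool       # flagged as such by the SME who authored it
    known_failure_mode: str  # the specific way agents tend to get this wrong
    rubric: dict = field(default_factory=dict)  # explicit definition of "good"

case = SMEvalCase(
    scenario="Customer shares a name and date of birth with a listed person, "
             "but secondary identifiers (nationality, address) diverge.",
    is_edge_case=True,
    known_failure_mode="auto-clears on the name mismatch without checking "
                       "secondary identifiers",
    rubric={
        "expected_disposition": "escalate",
        "must_cite": ["secondary identifiers reviewed"],
        "must_not": ["clearance without identifier comparison"],
    },
)
```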

For financial crime workflows, we are building SMEvals across:

  • KYC (CDD/EDD)
  • Sanctions screening
  • Fraud operations
  • Collections and loss mitigation

SMEvals sit alongside continuous evaluation of the full decision pipeline — including confidence calibration, stability across segments and data conditions, explainability under stress, and regression testing after system updates.
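As an illustration, the regression-testing piece can be as simple as re-running every case after a system update and diffing dispositions against the last validated baseline. A sketch, with an assumed harness shape rather than FairPlay’s actual pipeline:

```python
def regression_check(suite, run_agent, baseline):
    """Re-run every eval case after a system update and diff the results.

    `run_agent` invokes the agent under test; `baseline` maps case IDs
    to the dispositions recorded at the last validated run.
    """
    regressions = []
    for case in suite:
        disposition = run_agent(case["scenario"])
        if disposition != baseline[case["id"]]:
            regressions.append({"case": case["id"],
                                "was": baseline[case["id"]],
                                "now": disposition})
    # Any non-empty result should block promotion to production.
    return regressions
```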

Validation is not a launch ritual. It is continuous.

The Bottom Line

The companies that win in regulated agentic AI will not be the ones with the most autonomy. They will be the ones with the most control.

They will be the ones who can show how their agents are governed, how their decisions are scored, how edge cases are handled, and how agent behavior is monitored over time.

Arva AI understands this. They sought validation early. They built for testability and governance. And they are raising the bar for how agentic AI should be deployed in financial services.

If you care about deploying agents in regulated workflows, this conversation is worth your time.


Building or buying agentic systems for compliance or risk? See how FairPlay validates agent behavior with SMEvals, independent confidence checks, and end-to-end testing.
