Agent Benchmarks Are Missing the Boring Part: Data Readiness

Agent benchmarks score reasoning and tool use while ignoring the data-readiness layer (freshness, schema, validation, lineage, and permissions) that decides whether production agents can be trusted.

Watch enough agent demos and you’d swear the hard part is reasoning. Give the model a goal, hand it a browser or a shell or an MCP server, and it slices the task into steps, calls a few tools, and hands you something that looks like work.

It’s a good trick, but reasoning is almost never what breaks once you put one of these systems in front of real data. The position file is a day stale and nothing flagged it. Someone changed the schema overnight, so status_cd doesn’t mean what it meant yesterday. The trade export has duplicate rows because an upstream job ran twice. The custodian’s API returned a clean 200 and quietly dropped half the rows on the floor. In every case the agent had all the tool access it could want, and no way to know whether what it was reading was current, complete, correct, or even allowed to leave the building.

Nobody demos that part. It’s also the part that decides whether you can trust the output enough to act on it.

Benchmarks measure the layer that already works

Most agent benchmarks score task completion: can it browse the site, write the function, pull the answer out of a document set, call the right tool in the right order. Reasonable things to measure. The trouble is what they assume going in — a clean environment, where the documents are present, the answer is in the context, and the grader can say cleanly whether the result was right.

Production hands you none of that. A real agent runs inside systems that are messy and changing while it works, and it has to keep asking the unglamorous questions: which dataset is the source of truth, when was it last refreshed, did validation reject anything, which chunks fed this answer, can I retry without creating duplicates. If the platform can’t answer those, the agent is guessing. A smarter model just guesses more convincingly.
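To make the last of those questions concrete, here is a minimal sketch of a retry that cannot create duplicates. Everything in it is an assumption for illustration: the content-hash key scheme, the IdempotentSink class, and the sample trade are hypothetical, and the in-memory dict stands in for a real table with a unique constraint.

```python
import hashlib
import json


def idempotency_key(record: dict) -> str:
    """Derive a stable key from the record's content (hypothetical scheme)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotentSink:
    """In-memory stand-in for a table with a unique constraint on the key."""

    def __init__(self):
        self.rows = {}

    def write(self, record: dict) -> bool:
        """Return True if the record was newly written, False if it was a replay."""
        key = idempotency_key(record)
        if key in self.rows:
            return False  # same content already stored, so the retry is a no-op
        self.rows[key] = record
        return True


sink = IdempotentSink()
trade = {"trade_id": "T-1001", "counterparty": "ACME", "notional": 5_000_000}

first = sink.write(trade)        # original attempt
retry = sink.write(dict(trade))  # the same payload sent again after a timeout
```

The useful part is the return value: the agent learns whether its retry did anything, instead of assuming it did.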

Tool access isn’t data readiness

MCP was a real step forward. Before it, every integration was a bespoke adapter somebody had to write and babysit; now an agent can discover what’s available and call it through one protocol. But discovery is layer one. A tool called query_counterparty_exposure is fine. A tool that also tells you the schema, freshness, validation results, row count, job status, lineage, and permission boundary is a different animal. The first lets an agent call a function. The second lets it run a process and know what happened.
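One way to picture that second kind of tool is as a result envelope that carries the metadata alongside the rows. This is a sketch under stated assumptions, not an MCP schema or any platform’s actual API: the ToolResult fields, the sample rows, and the lineage strings are all made up for illustration. Only the tool name comes from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ToolResult:
    """Hypothetical envelope: the data plus what the agent needs to trust it."""
    rows: list
    schema: dict             # column name -> type name
    refreshed_at: datetime   # when the source last loaded successfully
    validation_errors: list  # empty means validation passed
    job_status: str          # e.g. "succeeded", "running", "failed"
    lineage: list            # files and jobs that produced the rows
    exportable: bool         # permission boundary enforced by the platform

    @property
    def row_count(self) -> int:
        return len(self.rows)


def query_counterparty_exposure() -> ToolResult:
    """Illustrative implementation with hard-coded sample data."""
    rows = [
        {"counterparty": "ACME", "exposure_usd": 12_500_000.0},
        {"counterparty": "GLOBEX", "exposure_usd": 3_200_000.0},
    ]
    return ToolResult(
        rows=rows,
        schema={"counterparty": "str", "exposure_usd": "float"},
        refreshed_at=datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
        validation_errors=[],
        job_status="succeeded",
        lineage=["positions_2024-01-02.csv", "job:load_positions"],
        exportable=False,
    )


result = query_counterparty_exposure()
```

With an envelope like this, the agent can report the row count, the refresh time, and the fact that the result isn’t cleared for export without guessing at any of them.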

Take a question like: which counterparties breached their exposure limits last quarter, and does our credit policy explain the approved exceptions? It sounds like one question, but it crosses structured position records, counterparty reference data, policy and committee memos, and approval history, and answering it takes both SQL and vector search. The agent has to move across all of it and know what happened at every hop: check freshness, query the breaches, search the policy docs, and answer with citations and row counts it can stand behind. None of that is exotic. It’s ordinary back-office work, and the agent ecosystem just isn’t good yet at making that loop routine.
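Here is a rough sketch of that loop end to end, with heavy assumptions. The table layout, the sample figures, the memo text, and the 24-hour freshness threshold are invented, and a plain keyword match stands in for vector search so the example stays self-contained.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

NOW = datetime(2024, 4, 2, 9, 0, tzinfo=timezone.utc)  # fixed clock for a reproducible example
MAX_AGE = timedelta(hours=24)                          # assumed freshness threshold
LAST_REFRESH = datetime(2024, 4, 2, 6, 0, tzinfo=timezone.utc)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exposure (counterparty TEXT, exposure_usd REAL, limit_usd REAL);
    INSERT INTO exposure VALUES
        ('ACME', 12500000, 10000000),
        ('GLOBEX', 3200000, 5000000),
        ('INITECH', 8000000, 7500000);
""")

# Stand-in for a document store; a real system would use vector search here.
POLICY_DOCS = {
    "memo-17": "Credit committee approved a temporary exception for ACME through Q2.",
    "policy-3": "Exposure limits are reviewed quarterly by the credit committee.",
}


def answer_breach_question(now: datetime) -> dict:
    # Hop 1: refuse to answer from a stale source.
    if now - LAST_REFRESH > MAX_AGE:
        return {"status": "stale_source", "last_refresh": LAST_REFRESH.isoformat()}

    # Hop 2: structured query for the breaches.
    cursor = conn.execute(
        "SELECT counterparty FROM exposure "
        "WHERE exposure_usd > limit_usd ORDER BY counterparty"
    )
    breaches = [row[0] for row in cursor]

    # Hop 3: look for documented exceptions, keeping document ids as citations.
    citations = {
        name: [
            doc_id
            for doc_id, text in POLICY_DOCS.items()
            if name in text and "exception" in text
        ]
        for name in breaches
    }
    return {
        "status": "ok",
        "row_count": len(breaches),
        "breaches": breaches,
        "citations": citations,
    }


report = answer_breach_question(NOW)
```

In this toy data, INITECH shows up as a breach with an empty citation list. That is the kind of gap a reviewer wants surfaced rather than smoothed over.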

What “data readiness” actually means

“Our data isn’t ready for AI” sounds like an excuse until you break it down.

Freshness: the agent has to know when something was last updated and whether the refresh succeeded. A confident stale answer is worse than no answer, because somebody acts on it.

Shape: schema, types, descriptions, example values, so it isn’t inferring what status_cd means from a hopeful prior.

Quality: validate before the agent touches it, so dates are dates and duplicates get caught.

Lineage: it should be able to say which files, tables, chunks, and jobs produced an answer.

Operational state: “still running,” “finished with warnings,” and “died on a dimension mismatch” are three different outcomes it has to tell apart.

Permissions: what can be read, written, transformed, and exported belongs in the platform, not in a paragraph of prompt you’re hoping the model respects.
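The quality check is the easiest of these to sketch. The version below assumes a batch of trade rows with hypothetical trade_id and trade_date fields, and treats ISO dates as the only acceptable format. A real pipeline would carry many more rules, but the shape is the same: reject with a reason before the agent ever sees the row.

```python
from datetime import date


def validate_trades(rows):
    """Split rows into accepted rows and human-readable rejection reasons."""
    accepted, rejected = [], []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Dates must actually be dates.
        try:
            date.fromisoformat(row["trade_date"])
        except (KeyError, TypeError, ValueError):
            rejected.append(f"row {i}: trade_date is missing or not an ISO date")
            continue
        # Duplicates get caught, e.g. when an upstream job ran twice.
        if row["trade_id"] in seen_ids:
            rejected.append(f"row {i}: duplicate trade_id {row['trade_id']}")
            continue
        seen_ids.add(row["trade_id"])
        accepted.append(row)
    return accepted, rejected


batch = [
    {"trade_id": "T-1", "trade_date": "2024-03-29"},
    {"trade_id": "T-2", "trade_date": "03/29/2024"},  # wrong date format
    {"trade_id": "T-1", "trade_date": "2024-03-29"},  # replayed row
]
accepted, rejected = validate_trades(batch)
```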

I’ve never seen a conference talk about any of this. It’s most of what separates something you can run unattended from something you have to stand over.

The good platforms are going to look boring underneath

I’ve watched this pattern repeat across a couple of decades of infrastructure work: the visible layer gets the credit, the boring layer decides whether the thing holds up. Everyone points the camera at the model reasoning through a task. Ingestion, validation, transformation, routing, retrieval, audit, retry, governance — that’s what determines whether you can run it on a Tuesday without anyone watching.

So I’d rather we stopped grading agents only on how clever they look on stage. Can it tell “I don’t know” apart from “the source hasn’t refreshed yet”? Will it stop when validation fails instead of barreling ahead? Can someone reconstruct afterward exactly what it did? Those questions matter the moment money or compliance is on the line, which in my world is always.
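Those three questions can be turned into something a grader could check. The sketch below is one hypothetical way to do it: the Outcome names and the run_step function are mine, not part of any benchmark. The idea is that “no matching data” and “source not refreshed” are distinct return values, validation failure halts the step, and every decision lands in an audit list someone can read later.

```python
from enum import Enum


class Outcome(Enum):
    ANSWERED = "answered"
    NO_MATCHING_DATA = "no_matching_data"          # a genuine "I don't know"
    SOURCE_NOT_REFRESHED = "source_not_refreshed"  # the data isn't in yet
    VALIDATION_FAILED = "validation_failed"        # stop instead of barreling ahead


def run_step(refreshed, validation_errors, rows, audit):
    """Decide the outcome of one step and record each check in the audit trail."""
    audit.append(f"freshness_check refreshed={refreshed}")
    if not refreshed:
        return Outcome.SOURCE_NOT_REFRESHED

    audit.append(f"validation errors={len(validation_errors)}")
    if validation_errors:
        return Outcome.VALIDATION_FAILED

    audit.append(f"query rows={len(rows)}")
    return Outcome.ANSWERED if rows else Outcome.NO_MATCHING_DATA


audit = []
stale = run_step(False, [], [], audit)  # source hasn't refreshed yet
empty = run_step(True, [], [], audit)   # fresh and valid, but nothing matched
```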

What we’re building at Datris

Datris starts from a simple premise: an agent needs more than a set of endpoints. It needs a data platform it can operate — stand up a pipeline, ingest files and documents, validate what came in, route records into databases or vector stores, poll jobs for status, search RAG stores, query Postgres or Mongo, and get a real explanation when something fails, all through MCP tools.
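To illustrate what polling a job and getting a real explanation looks like from the agent’s side, here is a generic sketch. It is not the Datris API: FakeJobClient, get_status, the state names, and the error text are all hypothetical stand-ins, and a real loop would sleep or back off between polls.

```python
class FakeJobClient:
    """Stand-in job service that replays a scripted sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self, job_id):
        return next(self._statuses)


def poll_job(client, job_id, max_polls=10):
    """Poll until the job leaves the 'running' state or the poll budget runs out."""
    for _ in range(max_polls):
        status = client.get_status(job_id)
        if status["state"] != "running":
            return status  # terminal state, including any failure detail
    return {"state": "timeout", "detail": f"no terminal state after {max_polls} polls"}


client = FakeJobClient([
    {"state": "running"},
    {"state": "running"},
    {"state": "failed", "detail": "dimension mismatch: expected 1536, got 768"},
])
final = poll_job(client, "job-42")
```

The detail string is the point. “Failed” alone leaves the agent guessing, while a specific reason gives it something to act on or report.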

The point isn’t to dump every low-level detail onto the model. It’s to give the agent dependable handles — clear schemas, explicit validation, observable jobs, auditable results — so the model does the part it’s good at while the platform makes the data usable. The next phase of this won’t be won with one more impressive demo. It’ll be won by the infrastructure that lets agents do dull, repeatable, trustworthy work.

If you’re building agents that have to deal with real data — structured, unstructured, streaming, documents, whatever you’ve got — take a look at Datris Platform OSS on GitHub, or come find it at datris.ai.

Todd Fearn is the founder of Datris.ai. He’s spent 25+ years building AI and data infrastructure for financial services, including at Goldman Sachs, Bridgewater Associates, Deutsche Bank, Freddie Mac, and others.