I Stopped Waking Engineers Up at 3 AM. An Agent Lets Them Sleep

I’ve been building data pipelines since the mid-90s. Back then it was ETL tools, flat files, and a lot of custom code. The tooling is dramatically better now. The operational model isn’t.

Something breaks. An alert fires. A human reads it. The human fixes it.

That’s about to change — not incrementally, structurally.

AI agents are now good enough at tool use that they can operate data infrastructure, not just monitor it. Reroute flows. Adjust transformations. Quarantine bad records. Resume operations. Without waking anyone up at 3 AM.

This is autonomous data operations. And it’s closer to production-ready than most people think.

The Gap Between Observing and Fixing

Most mature data teams have excellent monitoring. They know within minutes when a pipeline fails, when data quality drops, when a schema changes unexpectedly. For those teams, observability is largely a solved problem.

The gap is between observing a problem and resolving it. That gap still requires a human — and humans are slow, expensive, and unavailable at 3 AM on a Sunday.

Closing that gap requires three things from an AI agent:

Semantic understanding of the failure — not just a stack trace
A set of operations it can invoke — actions against the pipeline
Enough context to make safe decisions — knowing when to auto-fix versus escalate

The first and third are LLM problems. The second is an infrastructure problem. Giving agents a clean, safe operational surface against your data infrastructure is what most platforms don’t yet provide.

What This Looks Like in Practice

A firm is ingesting reference data from a major vendor into their pricing engine. No agents involved — a homegrown Python script pulls the vendor file each morning, transforms it, and loads it into PostgreSQL. Built in-house five years ago. Has mostly worked since.

Then the vendor pushes a silent schema update. As part of a broader move toward multi-standard identifier support, they rename cusip to instrument_id and add a new settlement_status field to every record. From their perspective, a minor enhancement. From the script’s perspective, a field it has been relying on no longer exists in the payload. It throws a key error, writes nothing to the database, and exits. Risk models downstream keep running on yesterday’s data. Nobody knows yet.

Nobody notices until the morning risk meeting, when someone asks why three positions are showing yesterday’s corporate actions. An engineer gets pulled in, digs through logs, finds the stack trace, figures out it’s a vendor schema change, patches the script, and manually re-runs the failed batch. By the time the data is current, hours have passed — and that’s assuming someone responds quickly and the fix is straightforward.

That’s the old model. Silent failure, delayed discovery, manual triage. Every time.

Now run the same scenario — but the firm has migrated to an agent-driven data platform.

Homegrown script:

Script throws key error, exits
↓
Silent failure: nothing written to database
↓
Discovered hours later: risk meeting flags stale data
↓
Engineer patches and reruns: 3–4 hours elapsed

MCP agent:

upload_data( ) → 100% rejected
↓
Agent reads AI error: field "cusip" not found in payload
↓
Remediates in memory: renames field, flags new column
↓
Resolved, engineer notified: under 2 minutes elapsed

The agent’s job each morning is to pull the vendor file and call upload_data("corporate_actions_feed", <batch>) to push it through the pipeline. The platform returns a job token. The agent immediately calls get_job_status(<token>) to monitor the run.
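The submit-and-monitor step above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's client: only the tool names upload_data and get_job_status come from the scenario. The FakeTools stand-in, the status fields (state, rejected_pct, error), and the polling parameters are all assumptions made so the sketch runs on its own.

```python
import time
from typing import Callable


class FakeTools:
    """In-memory stand-in for the platform so the sketch runs anywhere.

    Hypothetical: the status fields (state, rejected_pct, error) are
    assumptions, not a documented response format.
    """

    def __init__(self, required_field: str = "cusip") -> None:
        self.required_field = required_field
        self.jobs: dict[str, dict] = {}

    def upload_data(self, pipeline: str, batch: list[dict]) -> str:
        # Reject every record that lacks the field the pipeline expects.
        rejected = [row for row in batch if self.required_field not in row]
        token = f"job-{len(self.jobs) + 1}"
        self.jobs[token] = {
            "pipeline": pipeline,
            "state": "failed" if rejected else "succeeded",
            "rejected_pct": 100.0 * len(rejected) / len(batch) if batch else 0.0,
            "error": (
                f'field "{self.required_field}" not found in payload'
                if rejected
                else None
            ),
        }
        return token

    def get_job_status(self, token: str) -> dict:
        return self.jobs[token]


def submit_and_wait(
    tools,
    pipeline: str,
    batch: list[dict],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Upload a batch, then poll its job token until a terminal state."""
    token = tools.upload_data(pipeline, batch)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = tools.get_job_status(token)
        if status.get("state") in {"succeeded", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {token} did not finish in {timeout_seconds}s")
        sleep(poll_seconds)


# Usage: a batch that still uses the vendor's new field name gets rejected.
status = submit_and_wait(
    FakeTools(), "corporate_actions_feed", [{"instrument_id": "037833100"}]
)
print(status["state"], status["rejected_pct"], status["error"])
```

The timeout matters more than it looks: an agent that polls forever on a stuck job is just a quieter version of the silent failure it was meant to replace.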

This morning, status comes back with 100% record rejection. The agent reads the attached AI error explanation: schema field cusip not found in payload. It already has the raw vendor file — it was the one that pulled it. It inspects the payload directly, sees that cusip has been renamed to instrument_id and a new settlement_status field has been added. Vendor schema change. Not a platform problem.
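The payload inspection described above boils down to a field diff plus a sanity check on values. Here is a rough sketch under stated assumptions: the expected field set, the sample records, and both helper names are made up for illustration, and the CUSIP check is deliberately simplified to "nine uppercase alphanumeric characters" without verifying the check digit. In practice the agent's own reasoning does this step; the code only shows the kind of evidence it works from.

```python
import re

# Simplified shape check: 9 uppercase alphanumeric characters.
# Real CUSIP validation also verifies the trailing check digit.
CUSIP_SHAPE = re.compile(r"[0-9A-Z]{9}")


def diff_fields(expected: set[str], records: list[dict]) -> dict[str, list[str]]:
    """Compare the fields the pipeline expects with the fields in the payload."""
    seen: set[str] = set()
    for record in records:
        seen.update(record)
    return {
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - expected),
    }


def rename_candidates(records: list[dict], unexpected: list[str]) -> list[str]:
    """Unexpected fields whose every value looks like a CUSIP."""
    return [
        field
        for field in unexpected
        if records
        and all(
            isinstance(record.get(field), str)
            and CUSIP_SHAPE.fullmatch(record[field]) is not None
            for record in records
        )
    ]


# Made-up sample records in the vendor's new layout.
payload = [
    {"instrument_id": "037833100", "action_type": "DIVIDEND",
     "settlement_status": "PENDING"},
    {"instrument_id": "594918104", "action_type": "SPLIT",
     "settlement_status": "SETTLED"},
]
expected_fields = {"cusip", "action_type"}  # hypothetical pipeline schema

drift = diff_fields(expected_fields, payload)
candidates = rename_candidates(payload, drift["unexpected"])
print(drift, candidates)
```

One missing field plus exactly one unexpected field carrying CUSIP-shaped values is strong evidence of a rename rather than a dropped column, which is the distinction that decides whether the agent can fix the batch or has to escalate.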

The agent remediates the data in memory: it renames instrument_id back to cusip to match the existing pipeline schema, drops settlement_status for now, and flags that field for human review. Then it calls upload_data("corporate_actions_feed", <corrected_batch>), which returns a new job token. get_job_status(<token>) on that token comes back clean. Zero failures. The database is current before anyone arrives at their desk.
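The in-memory remediation is simple enough to show directly. A minimal sketch, assuming records arrive as plain dictionaries; the function name and its return shape are hypothetical, chosen only to make the rename-and-drop step concrete.

```python
def remediate(
    records: list[dict],
    renames: dict[str, str],
    drop_fields: set[str],
) -> tuple[list[dict], list[str]]:
    """Return corrected copies of the records plus the fields that were dropped.

    The input records are never mutated, so the raw vendor payload stays
    available for audit or rollback.
    """
    corrected: list[dict] = []
    flagged: set[str] = set()
    for record in records:
        row = {}
        for key, value in record.items():
            if key in drop_fields:
                flagged.add(key)  # report it for human review, do not load it
                continue
            row[renames.get(key, key)] = value
        corrected.append(row)
    return corrected, sorted(flagged)


# Made-up sample record in the vendor's new layout.
raw_batch = [
    {"instrument_id": "037833100", "action_type": "DIVIDEND",
     "settlement_status": "PENDING"},
]

corrected_batch, needs_review = remediate(
    raw_batch,
    renames={"instrument_id": "cusip"},
    drop_fields={"settlement_status"},
)
print(corrected_batch, needs_review)
```

Building new rows instead of editing in place is a deliberate choice here: the original payload is what the engineer will want to see when reviewing what the agent changed.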

The on-call engineer gets a Slack message summarizing what happened and what changed, with a rollback link. Not a pager alert. Not a ticket. A summary of work already done.

Elapsed time from first failure to data current: under 2 minutes.

But the agent isn’t done. It knows the root cause — a permanent vendor schema change — and that the pipeline config still expects the old field names. If left as-is, tomorrow’s batch will fail the same way and the agent will fix it again. That’s not the goal.

So alongside the Slack summary, the engineer gets a second message: a recommended pipeline config update with the schema change pre-populated and a one-click approval link. The engineer reviews it, confirms it looks right, approves. The pipeline is updated. Tomorrow’s batch runs clean without any intervention at all.

That’s the full loop. Agent handles the immediate problem autonomously. Human approves the permanent fix on their own schedule. Nobody got woken up. Nothing fell through the cracks.

The Architecture That Makes This Possible

The key insight is simple: if your data platform exposes its operations as MCP tools, any LLM-powered agent can become an operator. No proprietary integration code. No custom glue. Just tool definitions, permissions, and an agent that knows how to use them.
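To make "tool definitions" concrete: an MCP tool is described by a name, a description, and a JSON Schema for its input. The sketch below writes two such descriptors as plain Python dictionaries. The tool names come from the scenario in this article; the parameter names (pipeline, records, token) and the descriptions are assumptions, and the real platform's schemas may differ. The helper is a hypothetical convenience, not part of any MCP library.

```python
# Tool descriptors in the shape MCP uses: name, description, inputSchema.
# Parameter names below are illustrative assumptions.
TOOLS = [
    {
        "name": "upload_data",
        "description": "Submit a batch of records to a named pipeline. "
                       "Returns a job token.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "pipeline": {"type": "string"},
                "records": {"type": "array", "items": {"type": "object"}},
            },
            "required": ["pipeline", "records"],
        },
    },
    {
        "name": "get_job_status",
        "description": "Look up the status of a previously submitted job.",
        "inputSchema": {
            "type": "object",
            "properties": {"token": {"type": "string"}},
            "required": ["token"],
        },
    },
]


def missing_arguments(tool_name: str, arguments: dict) -> list[str]:
    """List required arguments that a proposed tool call leaves out."""
    tool = next(t for t in TOOLS if t["name"] == tool_name)
    return [
        name
        for name in tool["inputSchema"]["required"]
        if name not in arguments
    ]


# Usage: a call that forgets the records is caught before it is sent.
print(missing_arguments("upload_data", {"pipeline": "corporate_actions_feed"}))
```

The descriptor is the whole integration surface. Which tools appear in that list, and which do not, is also the simplest permission model available.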

In this scenario, the agent doesn’t touch the pipeline config at all. It doesn’t need to. Because it’s the one driving ingestion, it already has the raw data. When something fails, it reads the error, inspects the payload it just submitted, remediates the data directly, and resubmits. The pipeline is unchanged. The fix lives in the agent’s reasoning loop, not in the infrastructure.

Agent pulls vendor file: scheduled each morning
↓
upload_data( ) → job token: pushes batch through Datris pipeline
↓
get_job_status( ) → 100% rejected: reads AI error, field "cusip" not found
↓
inspects raw payload: cusip → instrument_id, new settlement_status
↓
remediates in memory: renames field, flags new column for review
↓
upload_data( ) → corrected batch: resubmits to pipeline, job passes clean

The pipeline config stays off-limits to the agent for a reason. It encodes institutional knowledge: agreed schemas, validated rules, business logic. Not something you want rewritten on the fly. But remediating data before it enters the system? That’s exactly what agents should own.

The rule of thumb: reversible + low blast radius → auto-execute. Irreversible + broad impact → human approval. Everything else → agent recommends, human clicks. Build this into the system prompt and the tool permissions. Constraints should be structural, not conversational.
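One way to make that rule structural is to put it in code that sits between the agent and its tools. A minimal sketch, with the caveat that the class names, the three decision labels, and the way each example action is classified are my assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    RECOMMEND = "recommend"            # agent proposes, human clicks
    HUMAN_APPROVAL = "human_approval"  # human must review before anything runs


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    broad_impact: bool


def decide(action: Action) -> Decision:
    """Apply the rule of thumb to a proposed action."""
    if action.reversible and not action.broad_impact:
        return Decision.AUTO_EXECUTE
    if not action.reversible and action.broad_impact:
        return Decision.HUMAN_APPROVAL
    return Decision.RECOMMEND


# Example classifications (assumed for illustration).
resubmit = Action("resubmit_corrected_batch", reversible=True, broad_impact=False)
config_update = Action("update_pipeline_schema", reversible=True, broad_impact=True)
purge = Action("purge_historical_table", reversible=False, broad_impact=True)

for action in (resubmit, config_update, purge):
    print(action.name, "->", decide(action).value)
```

Under these assumed classifications, the resubmitted batch runs on its own while the schema update lands in the recommend bucket, which matches the scenario above: the agent fixed the data and the engineer approved the config change.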

Where This Is Heading

Within 18 months, I expect autonomous data ops to be a standard checkbox in enterprise data platform evaluations, the same way real-time streaming became table stakes around 2018. The teams building agent-operable infrastructure now will have a head start.

The data layer is becoming a first-class citizen in the AI agent stack. Not just a passive store agents query — an active system agents can configure, monitor, heal, and optimize. That shift is already underway.

If you’re building toward autonomous data operations, the Datris Platform is open source on GitHub and designed to be operated by both humans and agents. Docs at docs.datris.ai.

Todd Fearn is the founder of Datris.ai and has spent 30+ years building data infrastructure across financial services and enterprise — Goldman Sachs, Bridgewater Associates, Deutsche Bank, Salomon Brothers, and others.