A RAG chunker measuring characters while the embedding API counts tokens — the unit mismatch that silently kills pipelines on dense content

There’s a very specific failure that bites every team building a serious RAG pipeline, and it almost always shows up the same way. Things work great on a few sample PDFs. Then someone ingests a real document and the job dies with "input is too long for the requested model".

The instinct is to lower the chunk size. So you cut it in half. The job runs further this time, then dies again on a different chunk. You cut it in half again. Now retrieval quality drops because every chunk is the size of a tweet.

You’re fighting the wrong problem. Your chunker is measuring the wrong thing.

The unit mismatch

Almost every document chunker measures chunk size in characters. Every embedding provider enforces an input limit in tokens. The two correlate well on plain English prose, and they correlate terribly on everything else.

500 characters of an English paragraph is maybe 100 tokens. The same 500 characters of a financial-statement table can be several times that, because digits and punctuation break into short fragments. Base64, dense JSON, or a numeric grid can approach one token per character. Non-Latin scripts (Chinese, Japanese, Arabic) are often denser still: depending on the tokenizer, a single character can cost more than one token.

A perfectly tuned chunkSize: 500 works beautifully on the user manual you tested with and then explodes the day someone uploads a 10-Q. The chunker has no idea anything is wrong. The embedding API sees one chunk in a batch of 32 that’s over the model’s cap, rejects the whole batch, and the user gets back a stack trace.

This isn’t an OpenAI problem

Tempting to file this under “OpenAI is being picky.” It’s not. Every embedding provider has a cap, and the caps vary widely:

  • Some hosted models cap at 8,192 tokens.
  • Most BGE and sentence-transformers models cap at 512 or fewer.
  • Some Cohere embedding models cap at 512, and whether an oversized input errors or gets truncated depends on the truncation setting you send.
  • Some local models silently truncate instead of erroring, which is arguably the worst case: you don’t find out you lost the tail of half your documents until long after search quality has cratered.

A fix that only knows about OpenAI isn’t a fix. The platform has to know the cap for whatever provider you pointed it at, and protect every embedding call.

Catch it twice

There are exactly two places this can be caught: at the chunker, before the chunk exists, or at the embedding call, before the chunk goes on the wire. Most platforms do neither. A serious one does both.

The chunker is the primary defender. Tell it the token cap and it refuses to emit anything over the limit. Set up properly, the cap stops being something you think about.
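Here’s a minimal sketch of what a token-aware chunker could look like. Everything named here is an assumption for illustration: chunk_by_tokens is a hypothetical function, and rough_token_count is a crude stand-in that counts word runs and punctuation marks. In a real pipeline you’d pass in the tokenizer that matches your embedding model.

```python
import re
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Stand-in counter: word runs plus punctuation marks. Not a real tokenizer."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Pack whitespace-separated words into chunks of at most max_tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if count_tokens(candidate) <= max_tokens:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = word
        # An unbreakable run (a number, a base64 blob) can exceed the cap alone.
        # Peel off prefixes, halving the cut point until the prefix fits.
        # A single character is emitted as-is even if it counts over the cap.
        while count_tokens(current) > max_tokens and len(current) > 1:
            cut = len(current) // 2
            while cut > 1 and count_tokens(current[:cut]) > max_tokens:
                cut //= 2
            chunks.append(current[:cut])
            current = current[cut:]
    if current:
        chunks.append(current)
    return chunks
```

Calling chunk_by_tokens(text, max_tokens=400) hands back chunks measured in the unit the embedding API actually enforces. The sketch collapses whitespace and re-counts the growing chunk on every word, which is fine for a demo and too slow for large documents.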

But weird inputs still leak through. Pre-chunked documents. Custom taps that build their own chunks. A developer poking at the embedding endpoint directly from a notebook. Anything that puts text on the wire without going through your chunker. That’s what the server-side guard is for: a token check sitting in front of every embedding call, splitting or refusing oversized chunks before they hit the provider. You hope it never fires. When it does, the batch doesn’t die.
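A guard like that might be sketched as a thin wrapper around the embedding call. The names are hypothetical: guarded_embed, OversizedChunkError, and the embed_batch callable (standing in for whatever provider client you use) are assumptions, not any real SDK’s API. If no splitter is configured, the guard refuses; if one is, it splits and then double-checks the result.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


class OversizedChunkError(ValueError):
    """Raised when a chunk exceeds the cap and cannot be made to fit."""


def guarded_embed(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical provider call
    count_tokens: Callable[[str], int],
    max_tokens: int,
    split: Optional[Callable[[str, int], List[str]]] = None,
) -> List[Tuple[str, Vector]]:
    """Check every chunk against the cap before anything goes on the wire."""
    safe: List[str] = []
    for index, chunk in enumerate(chunks):
        tokens = count_tokens(chunk)
        if tokens <= max_tokens:
            safe.append(chunk)
            continue
        if split is None:
            raise OversizedChunkError(
                f"chunk {index} has {tokens} tokens; cap is {max_tokens}"
            )
        for piece in split(chunk, max_tokens):
            # Never trust the splitter blindly: re-check each piece.
            if count_tokens(piece) > max_tokens:
                raise OversizedChunkError(
                    f"chunk {index} still over the cap of {max_tokens} after splitting"
                )
            safe.append(piece)
    # Only now does text reach the provider, and every item fits.
    return list(zip(safe, embed_batch(safe)))
```

Returning (text, vector) pairs instead of bare vectors is a deliberate choice in this sketch: once the guard splits something, the output no longer lines up one-to-one with the input, and the caller needs to know what each vector belongs to.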

Split, don’t truncate

You’ve got three options when an oversized chunk shows up. You can truncate it, which is cheap but means you silently drop the tail. You can fail the job loudly — honest, but now the user has to come back and clean up. Or you can split the chunk into pieces that each fit under the cap, which takes a bit more code and loses nothing.

Split wins by default. Truncation is what most platforms do today and it’s the worst of the three because it looks like success — your batch goes through, you get embeddings back, no error anywhere. Three weeks later somebody asks a question that should match content from the back half of a chunk you silently chopped, and search returns nothing. You’ll never trace it.

When you split, each sub-chunk is independent in the vector store. Retrieval doesn’t care. The user doesn’t see a difference. The chunk count goes up, which shows up in the job log, and that’s it.
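One way to split without losing anything is to cut recursively at the whitespace nearest the midpoint until every piece fits. This is a sketch under assumptions: split_to_fit is a hypothetical helper, and count_tokens is whatever counter matches your embedding model.

```python
from typing import Callable, List


def split_to_fit(
    text: str, max_tokens: int, count_tokens: Callable[[str], int]
) -> List[str]:
    """Split text into pieces that each fit the cap, dropping no characters."""
    if count_tokens(text) <= max_tokens or len(text) <= 1:
        return [text]
    mid = len(text) // 2
    # Prefer the whitespace closest to the midpoint so words stay intact.
    before = text.rfind(" ", 0, mid + 1)
    after = text.find(" ", mid)
    boundaries = [i for i in (before, after) if i > 0]
    # No usable whitespace (base64, a long number): cut at the midpoint.
    cut = min(boundaries, key=lambda i: abs(i - mid)) if boundaries else mid
    return (
        split_to_fit(text[:cut], max_tokens, count_tokens)
        + split_to_fit(text[cut:], max_tokens, count_tokens)
    )
```

Concatenating the pieces gives back the original text exactly, which is the property truncation can’t offer. The tradeoff is that pieces after the first keep their leading space, and a cut can land mid-sentence; a production splitter would probably try sentence or row boundaries first.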

If the batch fails, name the chunk

Even with the chunker and the guard in place, embedding APIs sometimes return an error. A provider tightens a limit. A tokenizer changes under you. An input contains a character the encoder doesn’t handle. Something will eventually slip past.

What most platforms do at this point is bubble the raw provider error up to the user. "input too long" with no indication of which input. The user is staring at a batch of 32 documents wondering which one is the bad one.

Bare minimum: log the largest chunks in the failing batch with their token counts so the user has a place to start. Better: offer to retry the failed batch one chunk at a time, so one poison input doesn’t take down the other 31. That’s a tradeoff — N HTTP calls instead of one — but on the failure path it’s almost always worth it.
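Both halves of that advice fit in one small function. Treat this as a sketch: embed_with_isolation and the embed_batch callable are hypothetical names, and catching a bare Exception is a simplification, since each provider SDK raises its own error types.

```python
import logging
from typing import Callable, List, Optional, Sequence, Tuple

log = logging.getLogger("embedding")

Vector = List[float]
Failure = Tuple[int, int, str]  # (chunk index, token count, provider message)


def embed_with_isolation(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical provider call
    count_tokens: Callable[[str], int],
    top_n: int = 3,
) -> Tuple[List[Optional[Vector]], List[Failure]]:
    """Try the whole batch; on failure, name suspects and retry chunk by chunk."""
    try:
        return list(embed_batch(list(chunks))), []
    except Exception as batch_error:  # simplification: SDK error types vary
        log.warning("batch of %d failed: %s", len(chunks), batch_error)

    # Bare minimum: log the largest chunks so there is a place to start.
    sized = sorted(((count_tokens(c), i) for i, c in enumerate(chunks)), reverse=True)
    for tokens, index in sized[:top_n]:
        log.warning("suspect chunk %d: %d tokens", index, tokens)

    # Better: one call per chunk, so a poison input only takes itself down.
    vectors: List[Optional[Vector]] = [None] * len(chunks)
    failures: List[Failure] = []
    for index, chunk in enumerate(chunks):
        try:
            vectors[index] = embed_batch([chunk])[0]
        except Exception as error:
            failures.append((index, count_tokens(chunk), str(error)))
    return vectors, failures
```

The failures list carries the chunk index, its token count, and the provider’s message, which is enough to point a user at the exact input that needs attention while the rest of the batch lands in the vector store.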

Provider-agnostic by construction

The pattern — token-aware chunker, server-side guard, loud and specific errors — doesn’t care which provider you point it at. Same defenses work for OpenAI, Cohere, Voyage, BGE-M3 running locally, Nomic via Ollama, Mistral, whatever you swap in next year. Caps are per-model. The guard reads them from a table of defaults, applies a safety margin, and that’s the whole thing.
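The cap table itself can be very small. In this sketch the model names are placeholders rather than real model identifiers, and the 0.8 margin is an assumed default; effective_cap is a hypothetical helper. The numbers you ship should come from your providers’ documentation.

```python
from typing import Dict, Optional

# Placeholder entries: real keys would be the model identifiers you deploy.
DEFAULT_CAPS: Dict[str, int] = {
    "example-hosted-8k": 8192,
    "example-local-512": 512,
}

SAFETY_MARGIN = 0.8  # assumed default; leaves headroom for tokenizer drift


def effective_cap(
    model: str,
    overrides: Optional[Dict[str, int]] = None,
    margin: float = SAFETY_MARGIN,
) -> int:
    """Return the token cap the chunker and guard should enforce for a model."""
    if not 0 < margin <= 1:
        raise ValueError("margin must be in (0, 1]")
    caps = {**DEFAULT_CAPS, **(overrides or {})}
    if model not in caps:
        # Fail loudly instead of guessing a limit for an unknown model.
        raise KeyError(f"no token cap configured for model {model!r}")
    return int(caps[model] * margin)
```

Raising on an unknown model is a design choice: guessing a cap for a model you’ve never seen is how silent truncation sneaks back in. The overrides argument lets a user supply the limit for a model the defaults don’t cover.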

That matters if you’re running a platform that ingests user-defined documents through user-defined chunkers into user-chosen vector stores with user-chosen embedding models. Every degree of freedom is one more way for the mismatch to surface, and you can’t ask the user to know in advance which model has which limit.

Two things to do Monday

If you’re building a RAG pipeline:

  1. Stop tuning chunkSize in characters. If your platform supports a token-aware cap on the chunker, use it. Set it to roughly 80% of your embedding model’s input limit and let the chunker handle the rest.
  2. Stop accepting opaque embedding failures. When a batch fails you should see which chunk caused it, what its token count was, and what the cap is. Anything less is the platform failing you, not you failing the platform.

If you’re building a platform that does this for other people: none of these pieces are enough on their own. The chunker is the primary defense, the guard catches what gets past it, and the error message saves you when something still slips through. You need all three, and the work isn’t expensive once you’ve decided the problem is worth solving.


Todd Fearn is the founder of Datris, an open-source, agent-native data platform built on the Model Context Protocol. Before Datris, he spent 30+ years building data infrastructure at Goldman Sachs, Bridgewater, Deutsche Bank, and other financial institutions.