Every RAG tutorial starts the same way: load a PDF, split it into chunks, generate embeddings, upsert into a vector database, wire up a retrieval chain. It takes maybe 50 lines of Python to demo. It takes 500 to make it production-worthy. And it takes another thousand to handle the edge cases nobody mentions — duplicate documents, metadata management, chunking strategies that don’t butcher your content, swapping vector stores without rewriting everything.
I’ve been building data pipelines for 25 years. The pattern is always the same: something that takes an afternoon to prototype takes months to harden. RAG ingestion is no different. So when we built vector database support into Datris Platform, the goal was simple — make the production version as easy as the demo.
The Problem with DIY RAG Pipelines
Let me paint a picture. You’re a healthcare startup building a clinical decision support tool. Your doctors need to ask natural language questions against thousands of medical research papers, clinical guidelines, and drug interaction databases. The prototype works great with 50 PDFs in a Jupyter notebook.
Then reality hits:
- New papers arrive daily. You need automated ingestion.
- Some documents are 200-page PDFs, others are two-paragraph summaries. Your chunking strategy can’t be one-size-fits-all.
- You need metadata filtering — search only cardiology papers, or only documents published after 2024.
- The team wants to evaluate Qdrant vs. pgvector vs. Weaviate. Each one has a different API, different client library, different schema model.
- Compliance wants an audit trail of every document ingested, every chunk generated, every embedding created.
You started with a script. Now you need a platform.
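Each of those bullets is real work. Take metadata filtering: trivial to sketch, tedious to wire through an entire pipeline. A minimal illustration in plain Python — the record shape and field names here are assumptions for the sake of the example, not the schema of Datris or any particular vector store:

```python
# Sketch of metadata filtering over retrieved chunks.
# Record shape (text + metadata dict) is assumed for illustration only.

def filter_by_metadata(records, **criteria):
    """Keep only records whose metadata matches every criterion."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in criteria.items())
    ]

records = [
    {"text": "Beta-blocker dosing...",
     "metadata": {"department": "cardiology", "year": 2025}},
    {"text": "Insulin titration...",
     "metadata": {"department": "endocrinology", "year": 2024}},
]

cardiology = filter_by_metadata(records, department="cardiology")
```

In a real system this predicate has to be pushed down into the vector store's query language — and every store expresses it differently, which is part of why the script stops scaling.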
Config-Driven RAG Ingestion
Here’s what that same pipeline looks like in Datris. No Python. No custom code. Just JSON configuration:
{
  "name": "clinical_research_papers",
  "source": {
    "fileAttributes": {
      "unstructuredAttributes": {
        "fileExtension": "pdf",
        "preserveFilename": true
      }
    }
  },
  "destination": {
    "qdrant": {
      "collectionName": "clinical_research",
      "chunking": {
        "strategy": "recursive",
        "chunkSize": 500,
        "chunkOverlap": 50
      },
      "metadata": {
        "department": "cardiology",
        "documentType": "research-paper",
        "source": "pubmed"
      },
      "embeddingSecretName": "prod/embedding",
      "qdrantSecretName": "prod/qdrant"
    }
  }
}
That’s it. Upload a PDF through the API, the MCP server, or the UI — Datris extracts the text, chunks it with your chosen strategy, generates embeddings, and upserts everything into Qdrant with the metadata you specified. The collection is auto-created if it doesn’t exist.
Want to tune the chunking? Change chunkSize and chunkOverlap. Want to switch to a semantic chunking strategy? Change the strategy field. Want different metadata per document batch? Update the metadata object. Every change is a config change, not a code change.
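To make chunkSize and chunkOverlap concrete, here is roughly what those two knobs control — a simplified sliding-window, character-based sketch, not Datris's actual implementation (the real recursive strategy splits on separators like paragraphs and sentences before falling back to fixed sizes):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into chunks of at most `chunk_size` characters,
    where each chunk repeats the last `chunk_overlap` characters
    of the previous one so context isn't cut mid-thought."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1200 characters -> chunks starting at 0, 450, 900.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, chunk_size=500, chunk_overlap=50)
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from either side; too much of it just duplicates embeddings and inflates storage.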
Five Vector Databases, One Interface
This is where it gets interesting. Datris supports five vector database destinations out of the box:
- Qdrant — High-performance, purpose-built for similarity search at scale
- pgvector — Vector search inside your existing PostgreSQL. No new infrastructure.
- Chroma — Lightweight, single Docker container, great for development and smaller workloads
- Weaviate — Schema-aware with hybrid search (vector + keyword)
- Milvus — Distributed, built for billion-scale vector datasets
The configuration interface is identical across all five. Switch from Qdrant to pgvector? Replace destination.qdrant with destination.pgvector and add your connection details. That’s a five-minute migration, not a five-week one.
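Because the destination schema is uniform, the migration really is a key swap. If you manage pipeline configs programmatically, it looks something like this — the qdrant fields mirror the JSON above, while the pgvector secret field name is an assumption for illustration:

```python
import copy

qdrant_config = {
    "name": "clinical_research_papers",
    "destination": {
        "qdrant": {
            "collectionName": "clinical_research",
            "chunking": {"strategy": "recursive",
                         "chunkSize": 500, "chunkOverlap": 50},
            "qdrantSecretName": "prod/qdrant",
        }
    },
}

# Swap the destination key; chunking and collection settings carry over.
pgvector_config = copy.deepcopy(qdrant_config)
dest = pgvector_config["destination"].pop("qdrant")
dest.pop("qdrantSecretName", None)
dest["pgvectorSecretName"] = "prod/pgvector"  # assumed field name
pgvector_config["destination"]["pgvector"] = dest
```

The point is not the five lines of dict surgery — it's that nothing else in the pipeline has to change when the destination does.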
This matters more than it sounds. An e-commerce company I talked to recently started with Chroma for their product search prototype, realized they needed pgvector for production because their ops team already managed PostgreSQL, and then six months later evaluated Milvus when their catalog hit 10 million products. With Datris, each of those transitions is a config change.
Real-World Use Cases
Manufacturing — Equipment Manuals and SOPs. A factory floor team needs to search across thousands of equipment manuals, safety procedures, and maintenance logs. Ingest PDFs with metadata tags for equipment type, facility location, and document category. Technicians ask “What’s the torque spec for the Series 400 bearing assembly?” and get answers grounded in the actual manuals.
SaaS — Customer Support Knowledge Base. Support agents need instant access to product docs, past tickets, and troubleshooting guides. Ingest your documentation into a vector store with metadata for product version and topic. Combine with an LLM for a support copilot that actually knows your product — not just generic answers.
Energy — Regulatory Compliance. Utilities deal with thousands of pages of regulatory filings, environmental assessments, and compliance guidelines across jurisdictions. Tag documents by state, regulation type, and effective date. When a new regulation drops, your compliance team searches against the entire corpus in seconds.
Logistics — Contract and Shipment Documentation. Freight brokers and 3PLs juggle carrier contracts, bills of lading, and customs documentation across hundreds of shipments. Ingest everything with carrier name, route, and date metadata. When a dispute arises, pull the relevant contract clauses instantly.
Agent-Operated RAG
Here’s the part that gets me most excited. Because Datris has a built-in MCP server, AI agents can operate the entire RAG pipeline autonomously. An agent can:
- Register a new pipeline configuration
- Upload documents
- Trigger processing
- Run semantic searches against the resulting vector store
- Profile the ingested data
No human in the loop for routine ingestion. Your agent watches a document source — an S3 bucket, an email inbox, a Slack channel — picks up new files, and pushes them through the pipeline. The vector store stays current without anyone writing a cron job or a Lambda function.
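The watch-and-ingest loop is simple in shape. A minimal sketch with a pluggable source listing and ingest callback — the function names and polling approach are illustrative, not Datris's agent API:

```python
def sync_new_files(list_files, ingest, seen=None):
    """One polling pass: list the source, ingest anything unseen,
    and return the updated set of seen filenames."""
    seen = set() if seen is None else set(seen)
    for name in list_files():
        if name not in seen:
            ingest(name)  # e.g. upload the file to the pipeline endpoint
            seen.add(name)
    return seen

# Stubbed-out source and sink for illustration.
ingested = []
seen = sync_new_files(lambda: ["a.pdf", "b.pdf"], ingested.append)
seen = sync_new_files(lambda: ["a.pdf", "b.pdf", "c.pdf"],
                      ingested.append, seen)
```

In practice `list_files` is an S3 listing or inbox poll and `ingest` is an upload call the agent makes through the MCP server; the loop itself is all an agent needs to keep the vector store current.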
The chat-vector-store example in the repo demonstrates this end-to-end: ingest documents through Datris, then chat with them using any of the five supported vector stores.
Getting Started
Datris runs anywhere Docker does. Clone the repo, run docker compose up, and you’ve got the full platform — including the vector database destinations and the MCP server.
git clone https://github.com/datris/datris-platform-oss.git
cd datris-platform-oss
docker compose up -d
Create a pipeline config, upload a document, and you’ve got a searchable knowledge base. No frameworks to install, no embedding code to write, no vector DB client libraries to manage.
The gap between a RAG demo and a RAG system is mostly unglamorous infrastructure work — chunking, metadata, secret management, multi-destination support, observability. That’s exactly the kind of work a data platform should handle for you.
Stop building RAG plumbing. Start building the applications that use it.
Check out the Datris Platform on GitHub or visit datris.ai to learn more. Full documentation at docs.datris.ai.
Todd Fearn is the founder of Datris.ai and has spent 25+ years building data infrastructure and AI solutions across financial services, healthcare, and enterprise — from Goldman Sachs and Bridgewater Associates to Deutsche Bank, Freddie Mac, and beyond.