I Built a News Engine That Cites Every Claim

I Built a News Engine That Cites Every Claim

I kept running into the same small frustration. I’d ask an LLM about a news story and get back something confident and fluent. ChatGPT and Claude both cite sources now, which helps, but the citations are often murky: a link that only half-supports the claim, or one you have to open and read before you’d trust it. So I’d end up verifying on the internet myself anyway. And for news, which outlet said what matters as much as the facts, and that’s the part that stays fuzzy.

So I started a project to find out: could I build something that answers questions about Indian news coverage and shows its work every time? Every claim tied to a real article, or it doesn’t make it into the answer.

That turned into Samvad. These are my notes from the build: what worked, what caught me off guard, and what I’d change next time.

The shape I landed on: four agents that can each say no

“Answer from sources” isn’t one job. It’s several, and they pull against each other. When I tried to do all of it in a single prompt, the answers came out either ungrounded or so cautious they were useless.

Splitting it into a small pipeline of agents — each with one job and the power to reject — worked much better:

        user query


   ┌─────────────────┐   rate limit + input checks run FIRST,
   │      Guard      │   before any model is called
   └────────┬────────┘
            │  (rejects abuse / off-topic / injection)

   ┌─────────────────┐
   │    Retriever    │   vector search → rerank → top 8 articles
   └────────┬────────┘
            │  (no articles? stop here, honestly say so)

   ┌─────────────────┐
   │     Analyst     │   synthesises an answer, every claim cited
   └────────┬────────┘


   ┌─────────────────┐
   │     Critic      │   grades how grounded the answer is;
   └────────┬────────┘   sends it back once if it's too loose


        answer + sources

You can try the whole thing on a real query at ask.samvadhq.com — the sources panel is the point, so click into them.

The Guard: cheap checks before expensive ones

The thing I underestimated was how much of the work happens before the model ever runs. The Guard agent runs first and decides whether a query is even a legitimate news question. It rejects prompt-injection attempts (“ignore previous instructions…”), off-topic requests (coding help, recipes), and queries that smuggle a command inside a news-shaped question.

The rule that mattered most turned out to be a dull one: judge the query as a whole. A query can read like a fine news question in one clause and hide an instruction in the next. Reading it whole instead of phrase by phrase closed a lot of gaps.

And one design choice I’m glad I made early — the truly cheap checks (rate limits, basic input validation) run before the Guard model is ever called. If someone hammers the endpoint, the abuse never reaches a paid model. A rejected request costs me nothing. It felt obvious in hindsight, but I’d originally put the model first and the rate limit second, which is exactly backwards.

The honest caveat: the Guard is never really finished. It catches the tricks I’ve seen so far, but people are endlessly creative about smuggling instructions past a filter, and I fully expect attempts I haven’t thought of yet. I treat it as a moving target to keep hardening, not a box I’ve ticked.

The Retriever: getting the right eight articles

It all comes down to retrieval. If the right articles aren’t in front of the Analyst, no amount of clever prompting downstream saves the answer.

The flow itself is nothing fancy: embed the query, run a vector search scoped to a recent time window of the corpus (recent news is the whole point), pull the top 20 candidates, then rerank down to the 8 that actually matter. Twenty is wide enough to catch the right article. Eight is few enough that the Analyst can reason over all of them properly.

The wrinkle with news is recency. Most RAG demos run on a static corpus. News is a moving window, where “what’s the latest coverage” is a different question from “what’s the best semantic match.” So time is a first-class part of the query. Today the corpus is only a few days deep, but as it grows the plan is to expose real time filters: last 7 days, 30, 90, a year. Building the window into retrieval from the start made answers feel far more current.

If the Retriever comes back with nothing, the pipeline stops right there and says so plainly. I went back and forth on this, but a deterministic “I don’t have coverage on that” beats letting the model improvise from thin air — and it doesn’t charge the user for an answer it couldn’t actually give.

The Critic: letting the system grade its own homework

Going in, I trusted this part the least. It’s the one I rely on most now.

After the Analyst writes an answer, a separate Critic agent reads it back against the retrieved articles and estimates what fraction of the claims are ungrounded — asserted but not actually supported by a source. If more than 30% of the answer is floating free, the Critic rejects it and the pipeline tries once more.

A couple of things came out of it:

  1. A strict, boring rule beats a vague one. “Approve only if ungrounded fraction is at or below 0.30” is easy to reason about and easy to tune. My earlier, fuzzier “is this answer good?” prompt was useless — it approved almost everything.
  2. One retry is plenty. I expected to need a loop. In practice, if the first answer is too loose, the pipeline regenerates it once and serves that second attempt marked low-confidence — rather than keep spending tokens chasing a clean pass. Being honest about confidence beat chasing perfection.

What I’d do differently

A few honest notes-to-self:

  • The snippet isn’t the article. Right now reranking mostly works off titles and short descriptions. The richest signal is in the article body, and I’m fairly sure ranking on the full text would sharpen results. That’s the next thing I want to explore.
  • Prompts deserve version control like code does. Mine evolved by editing in place, which makes it hard to say why answer quality moved. Treating prompts as versioned artifacts with a fixed eval set is on my list.
  • Returning nothing is sometimes the right answer. The instinct is to always produce something. Resisting that, letting it say nothing when it has nothing, did more for trust than any single improvement to the answers themselves.

Why I’m sharing this

Mostly because building it taught me more than reading about RAG ever did, and writing it down forces me to understand what I actually learned. If you want to poke at the thing itself, it’s live at ask.samvadhq.com. One thing worth knowing: the corpus is just getting started, so it only goes back about five days right now. Ask it about something from the last few days of Indian news and you’ll get the best picture of what it can do. Check the sources it hands back, and I’d genuinely like to know where it holds up and where it doesn’t.

More notes from this build coming — next up, the boring Postgres trick I used to run the ingestion queue without a queue service.

Full size image