Skip to content

perf: reduce RAG latency by removing the second generation pass #114

@yjing86

Description

@yjing86

Reduce latency by replacing the two-stage RAG answer flow with a single-pass generation

While looking through the current RAG pipeline, I noticed that /ask is still doing two separate model generations for a single user request.

From app/ai.py, the current flow in RAGAgent.run() is roughly:

  1. generate an initial answer from the retrieved context
  2. send that answer back into the model again with child_prompt
  3. return the rewritten version as the final response

So even for one question, the model is queried twice.

I understand why this was originally useful: it makes the final answer more child-friendly. But in practice it also adds avoidable latency and doubles the generation work for the main RAG endpoint.

Why I think this should change

The current two-pass design has a few downsides:

  • higher response latency
  • more compute per request
  • worse UX on slower/local models
  • makes future streaming support harder, since the user still waits for a second full generation pass
  • increases complexity in the main request path for something that may be better handled in the prompt itself

Proposed change

Change the main RAG answer flow to single-pass generation.

Instead of:

retrieval -> answer generation -> answer simplification

use:

retrieval -> one final answer generation

The child-friendly tone/style can be folded into the main prompt so the model produces the final response directly in one pass.

Suggested implementation

In RAGAgent.run():

  • remove the second model call through self.child_prompt
  • update the primary prompt so it directly asks for:
    • accurate answer
    • child-friendly wording
    • concise and clear explanation
  • return the first generated answer as the final answer

So the main endpoint becomes:

  • retrieve relevant context
  • build one prompt
  • call the model once
  • return the answer

Alternative if maintainers want to preserve both behaviors

If the two-stage rewrite is still useful in some situations, I think it should at least be made optional, not the default.

For example:

  • default: single-pass answer
  • optional: child_friendly_rewrite=True or config-based second pass

That would keep the feature available without forcing the latency penalty on every request.

Why this is a better default

For the main API path, I think correctness + speed should come first.

If we want a child-friendly response, the cleaner approach is usually to ask for that in the main prompt rather than generating an answer and then asking the model to rewrite itself again.

That should:

  • reduce latency noticeably
  • reduce compute cost
  • simplify the request pipeline
  • make future streaming much easier

Acceptance criteria

  • RAGAgent.run() uses only one model generation by default
  • the final response still keeps the intended child-friendly tone
  • no second rewrite pass in the default /ask path
  • documentation updated if behavior changes

If this direction sounds reasonable, I’d be happy to work on it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions