Reduce latency by replacing the two-stage RAG answer flow with a single-pass generation
While looking through the current RAG pipeline, I noticed that /ask is still doing two separate model generations for a single user request.
From app/ai.py, the current flow in RAGAgent.run() is roughly:
- generate an initial answer from the retrieved context
- send that answer back into the model again with
child_prompt
- return the rewritten version as the final response
So even for one question, the model is queried twice.
I understand why this was originally useful: it makes the final answer more child-friendly. But in practice it also adds avoidable latency and doubles the generation work for the main RAG endpoint.
Why I think this should change
The current two-pass design has a few downsides:
- higher response latency
- more compute per request
- worse UX on slower/local models
- makes future streaming support harder, since the user still waits for a second full generation pass
- increases complexity in the main request path for something that may be better handled in the prompt itself
Proposed change
Change the main RAG answer flow to single-pass generation.
Instead of:
retrieval -> answer generation -> answer simplification
use:
retrieval -> one final answer generation
The child-friendly tone/style can be folded into the main prompt so the model produces the final response directly in one pass.
Suggested implementation
In RAGAgent.run():
- remove the second model call through
self.child_prompt
- update the primary prompt so it directly asks for:
- accurate answer
- child-friendly wording
- concise and clear explanation
- return the first generated answer as the final answer
So the main endpoint becomes:
- retrieve relevant context
- build one prompt
- call the model once
- return the answer
Alternative if maintainers want to preserve both behaviors
If the two-stage rewrite is still useful in some situations, I think it should at least be made optional, not the default.
For example:
- default: single-pass answer
- optional:
child_friendly_rewrite=True or config-based second pass
That would keep the feature available without forcing the latency penalty on every request.
Why this is a better default
For the main API path, I think correctness + speed should come first.
If we want a child-friendly response, the cleaner approach is usually to ask for that in the main prompt rather than generating an answer and then asking the model to rewrite itself again.
That should:
- reduce latency noticeably
- reduce compute cost
- simplify the request pipeline
- make future streaming much easier
Acceptance criteria
RAGAgent.run() uses only one model generation by default
- the final response still keeps the intended child-friendly tone
- no second rewrite pass in the default
/ask path
- documentation updated if behavior changes
If this direction sounds reasonable, I’d be happy to work on it.
Reduce latency by replacing the two-stage RAG answer flow with a single-pass generation
While looking through the current RAG pipeline, I noticed that
/askis still doing two separate model generations for a single user request.From
app/ai.py, the current flow inRAGAgent.run()is roughly:child_promptSo even for one question, the model is queried twice.
I understand why this was originally useful: it makes the final answer more child-friendly. But in practice it also adds avoidable latency and doubles the generation work for the main RAG endpoint.
Why I think this should change
The current two-pass design has a few downsides:
Proposed change
Change the main RAG answer flow to single-pass generation.
Instead of:
retrieval -> answer generation -> answer simplificationuse:
retrieval -> one final answer generationThe child-friendly tone/style can be folded into the main prompt so the model produces the final response directly in one pass.
Suggested implementation
In
RAGAgent.run():self.child_promptSo the main endpoint becomes:
Alternative if maintainers want to preserve both behaviors
If the two-stage rewrite is still useful in some situations, I think it should at least be made optional, not the default.
For example:
child_friendly_rewrite=Trueor config-based second passThat would keep the feature available without forcing the latency penalty on every request.
Why this is a better default
For the main API path, I think correctness + speed should come first.
If we want a child-friendly response, the cleaner approach is usually to ask for that in the main prompt rather than generating an answer and then asking the model to rewrite itself again.
That should:
Acceptance criteria
RAGAgent.run()uses only one model generation by default/askpathIf this direction sounds reasonable, I’d be happy to work on it.