perf: reduce RAG latency by removing the second generation pass

# Reduce latency by replacing the two-stage RAG answer flow with a single-pass generation

While looking through the current RAG pipeline, I noticed that `/ask` is still doing **two separate model generations** for a single user request.

From `app/ai.py`, the current flow in `RAGAgent.run()` is roughly:

1. generate an initial answer from the retrieved context
2. send that answer back into the model again with `child_prompt`
3. return the rewritten version as the final response

So even for one question, the model is queried twice.
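To make the call count concrete, here is a minimal sketch of the flow described above. It is an approximation, not the actual code in `app/ai.py`: `RecordingModel`, `two_pass_run`, and the prompt strings are hypothetical stand-ins, and the second prompt only plays the role of `child_prompt`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecordingModel:
    """Hypothetical stand-in for the real model client; records every prompt."""
    prompts: List[str] = field(default_factory=list)

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"generation {len(self.prompts)}"


def two_pass_run(model: RecordingModel, question: str, context: str) -> str:
    # Pass 1: draft answer from the retrieved context.
    draft = model.generate(f"Context:\n{context}\n\nQuestion: {question}")
    # Pass 2: rewrite the draft (stands in for the `child_prompt` step).
    return model.generate(f"Rewrite this answer for a child:\n{draft}")


model = RecordingModel()
final_answer = two_pass_run(model, "Why is the sky blue?", "Air scatters sunlight.")
print(len(model.prompts), final_answer)  # 2 generation 2
```

The second call cannot start until the first has finished, because its prompt contains the full draft.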

I understand why this was originally useful: the second pass makes the final answer more child-friendly. In practice, though, it adds avoidable latency and roughly doubles the generation work on the main RAG endpoint.

## Why I think this should change

The current two-pass design has a few downsides:

- higher response latency, since the two generations run one after the other
- more compute per request
- worse UX on slower or local models
- harder future streaming support, since the first pass only produces a draft and the user sees nothing until the second generation starts
- more complexity in the main request path for something that may be better handled in the prompt itself
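The streaming point can be illustrated with a toy example. This is a sketch under assumptions: `fake_token_stream` is a hypothetical word-by-word generator, not a real model API, and the texts are made up. It counts how many tokens must be generated before the user can see the first token of the final answer.

```python
from typing import Iterator, List


def fake_token_stream(text: str, log: List[str]) -> Iterator[str]:
    """Hypothetical streaming model: yields one word at a time and logs it."""
    for word in text.split():
        log.append(word)
        yield word


def tokens_before_first_output_single_pass() -> int:
    log: List[str] = []
    stream = fake_token_stream("the air scatters blue light", log)
    next(stream)  # the first generated token is already user-visible
    return len(log)


def tokens_before_first_output_two_pass() -> int:
    log: List[str] = []
    # The draft must be generated in full before the rewrite can begin.
    draft = " ".join(fake_token_stream("the air scatters blue light", log))
    rewrite = fake_token_stream(f"simple version of: {draft}", log)
    next(rewrite)  # only now does the user see anything
    return len(log)


single_cost = tokens_before_first_output_single_pass()
two_pass_cost = tokens_before_first_output_two_pass()
print(single_cost, two_pass_cost)  # 1 6
```

With two passes, time to first visible token grows with the length of the draft, which is why the second pass works against streaming.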

## Proposed change

Change the main RAG answer flow to **single-pass generation**.

Instead of:

`retrieval -> answer generation -> answer simplification`

use:

`retrieval -> one final answer generation`

The child-friendly tone/style can be folded into the main prompt so the model produces the final response directly in one pass.

## Suggested implementation

In `RAGAgent.run()`:

- remove the second model call through `self.child_prompt`
- update the primary prompt so it directly asks for:
  - an accurate answer
  - child-friendly wording
  - a concise, clear explanation
- return the first generated answer as the final answer

So the main endpoint becomes:

- retrieve relevant context
- build one prompt
- call the model once
- return the answer
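The four steps above could look roughly like this. It is a sketch, not a drop-in patch: `SINGLE_PASS_TEMPLATE`, `build_prompt`, `single_pass_run`, and the fake retriever and generator are hypothetical names, and the prompt wording is only a starting point that would need tuning against the real model.

```python
from typing import Callable, List, Sequence

# Hypothetical combined prompt: accuracy and child-friendly tone in one request.
SINGLE_PASS_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Be accurate, use simple words a child can understand, "
    "and keep the explanation short and clear.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    context = "\n".join(chunks)
    return SINGLE_PASS_TEMPLATE.format(context=context, question=question)


def single_pass_run(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str], str],
) -> str:
    chunks = retrieve(question)              # 1. retrieve relevant context
    prompt = build_prompt(question, chunks)  # 2. build one prompt
    return generate(prompt)                  # 3. call the model once, 4. return


# Fakes for demonstration only; the real retriever and model are not shown here.
seen_prompts: List[str] = []


def fake_retrieve(question: str) -> Sequence[str]:
    return ["Air scatters blue light more than red light."]


def fake_generate(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "The air bounces blue light around the most."


answer = single_pass_run("Why is the sky blue?", fake_retrieve, fake_generate)
```

Passing `retrieve` and `generate` as callables is a choice made for this sketch so that it runs without the real retriever or model.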

## Alternative if maintainers want to preserve both behaviors

If the two-stage rewrite is still useful in some situations, I think it should at least be made **optional**, not the default.

For example:

- default: single-pass answer
- optional: `child_friendly_rewrite=True` or config-based second pass

That would keep the feature available without forcing the latency penalty on every request.
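One possible shape for the opt-in variant is sketched below. The `run` signature and the `generate` callable are assumptions made for illustration; only the flag name `child_friendly_rewrite` comes from the proposal above.

```python
from typing import Callable, List


def run(
    question: str,
    context: str,
    generate: Callable[[str], str],
    child_friendly_rewrite: bool = False,
) -> str:
    answer = generate(f"Context:\n{context}\n\nQuestion: {question}")
    if child_friendly_rewrite:
        # Opt-in second pass; skipped unless the caller asks for it.
        answer = generate(f"Rewrite this answer for a child:\n{answer}")
    return answer


# Fake generator that records prompts, for demonstration only.
calls: List[str] = []


def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"generation {len(calls)}"


default_answer = run("Why is the sky blue?", "Air scatters sunlight.", fake_generate)
default_calls = len(calls)  # 1

calls.clear()
rewritten_answer = run(
    "Why is the sky blue?", "Air scatters sunlight.", fake_generate,
    child_friendly_rewrite=True,
)
opt_in_calls = len(calls)  # 2
```

The same switch could come from configuration instead of a keyword argument. Either way, the default path pays for one generation only.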

## Why this is a better default

For the main API path, I think correctness and speed should come first.

If we want a child-friendly response, the cleaner approach is usually to ask for that in the main prompt rather than generating an answer and then asking the model to rewrite its own output.

That should:

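- remove one full model generation from the latency of every request
- reduce compute cost
- simplify the request pipeline
- make future streaming easier, because the first generated tokens are already part of the final answer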

## Acceptance criteria

- `RAGAgent.run()` uses only one model generation by default
- the final response still keeps the intended child-friendly tone
- no second rewrite pass in the default `/ask` path
- documentation updated if behavior changes
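The call-count criteria could be covered by a small regression check along these lines. This is a sketch: `answer_once` is a hypothetical single-pass function standing in for the real `RAGAgent.run()`, and `check_single_generation` is an illustrative helper. It uses `unittest.mock.Mock` and its `call_count` attribute from the standard library.

```python
from typing import Callable
from unittest.mock import Mock


def answer_once(question: str, context: str, generate: Callable[[str], str]) -> str:
    """Hypothetical single-pass implementation under test."""
    return generate(
        f"Context:\n{context}\n\nQuestion: {question}\nUse child-friendly wording."
    )


def check_single_generation(run_fn: Callable[..., str]) -> None:
    generate = Mock(return_value="a short, friendly answer")
    result = run_fn("Why is the sky blue?", "Air scatters sunlight.", generate)
    # The default path must call the model exactly once.
    assert generate.call_count == 1, f"expected 1 generation, got {generate.call_count}"
    # The first generation is returned unchanged as the final answer.
    assert result == "a short, friendly answer"


check_single_generation(answer_once)
```

A check like this only covers the number of generations. Whether the answer keeps the intended child-friendly tone still needs a manual review or a separate evaluation.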

If this direction sounds reasonable, I’d be happy to work on it.


perf: reduce RAG latency by removing the second generation pass #114
