RAG answers are slow enough that a normal request-response API can make a product feel broken. Retrieval has to run, the prompt has to be assembled, the model has to generate, and only then does the user see anything. For Birr AI, that was the wrong interaction model.
The fix was to stream the answer. The chat endpoint returns text/event-stream, and the frontend can append tokens as they arrive. The backend still persists the final AI message, sources, and optional session title at the end of the stream, but the user is not stuck staring at a blank loading state.
I used Server-Sent Events instead of WebSockets because the communication pattern is one-way: the browser sends one question, then the server streams answer events back. WebSockets would have worked, but they would also have introduced connection state, extra infrastructure questions, and more frontend complexity. SSE matched the workflow.
The Django shape
The ask endpoint is an async DRF view using adrf. It validates the incoming message, fetches the ChatSession with async ORM calls, builds LangChain chat history, persists the human message, creates the RAG bot, then returns a StreamingHttpResponse.
Inside the response, an async generator yields JSON events:
data: {"token": "..."}
data: {"done": true, "content": "...", "source_urls": [...], "documents": [...]}
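The framing for those two events can be sketched with the standard library alone. This is a minimal illustration, not the project's code: format_sse and parse_sse are hypothetical helper names, and plain json.dumps stands in for whatever encoder the real view uses.

```python
import json


def format_sse(payload: dict) -> str:
    """Frame one JSON payload as a single Server-Sent Events message."""
    # json.dumps escapes newlines inside strings, so the payload always
    # fits on one "data:" line. The blank line terminates the event.
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(frame: str) -> dict:
    """Inverse of format_sse for a frame that holds a single data line."""
    line = frame.strip()
    if not line.startswith("data: "):
        raise ValueError("not a data frame")
    return json.loads(line[len("data: "):])


# Example payloads mirroring the two event shapes described above.
token_frame = format_sse({"token": "Hel"})
done_frame = format_sse(
    {"done": True, "content": "Hello", "source_urls": [], "documents": []}
)
```

Keeping every event to one data line means the client can split on blank lines and call JSON.parse on each chunk without reassembling multi-line payloads.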
The RAG service uses chain.astream(inputs) to receive model tokens. Each token is yielded immediately to the HTTP stream and also appended to an in-memory list. When generation completes, the service joins those parts, strips the first-turn title tag if present, merges structured and semantic sources, and emits one final done event.
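The accumulate-then-finalize flow looks roughly like the sketch below. Several things here are assumptions for illustration: fake_token_stream stands in for chain.astream(inputs), the <title>...</title> tag format is hypothetical (the real first-turn marker may differ), the session_title key is a guess at the payload field name, and merge_sources only shows one plausible way to combine two source lists.

```python
import asyncio
import re
from typing import AsyncIterator

# Hypothetical tag format; the real first-turn title marker may differ.
TITLE_TAG = re.compile(r"<title>(.*?)</title>\s*", re.DOTALL)


async def fake_token_stream() -> AsyncIterator[str]:
    """Stand-in for chain.astream(inputs): yields model tokens one by one."""
    for token in ["<title>Loan terms</title>", "Rates ", "vary."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


def merge_sources(structured: list[str], semantic: list[str]) -> list[str]:
    """Concatenate two URL lists, dropping duplicates but keeping order."""
    return list(dict.fromkeys(structured + semantic))


async def stream_answer(tokens, structured, semantic):
    parts = []
    async for token in tokens:
        parts.append(token)       # keep a copy for the final message
        yield {"token": token}    # forward to the client immediately
    full = "".join(parts)
    match = TITLE_TAG.search(full)
    title = match.group(1).strip() if match else None
    content = TITLE_TAG.sub("", full, count=1).strip()
    yield {
        "done": True,
        "content": content,
        "source_urls": merge_sources(structured, semantic),
        "session_title": title,   # assumed key name
    }


async def collect():
    stream = stream_answer(
        fake_token_stream(),
        ["https://a.example"],
        ["https://a.example", "https://b.example"],
    )
    return [event async for event in stream]


events = asyncio.run(collect())
```

Collecting tokens in a list and joining once at the end avoids repeated string concatenation, and it keeps the cleanup step (tag stripping, source merging) in a single place after generation finishes.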
That final event is important. Tokens are good for perceived speed, but the product still needs a durable message record. The view creates the AI ChatMessage only when the final event arrives, then includes the serialized message, source URLs, structured document payloads, and session title in the done payload.
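One way to picture the view-side half of that contract is a small relay that passes token events through untouched and persists only when the done event shows up. This is a sketch under assumptions: relay_and_persist and save_ai_message are hypothetical names, and an in-memory list stands in for the ChatMessage table and the async ORM call.

```python
import asyncio

saved = []  # in-memory stand-in for the ChatMessage table


async def save_ai_message(content, source_urls):
    """Stand-in for an async ORM create call."""
    record = {
        "id": len(saved) + 1,
        "role": "ai",
        "content": content,
        "source_urls": source_urls,
    }
    saved.append(record)
    return record


async def relay_and_persist(events, save):
    """Forward events; on the final one, persist first, then enrich it."""
    async for event in events:
        if event.get("done"):
            message = await save(event["content"], event.get("source_urls", []))
            # Copy instead of mutating the service's payload in place.
            event = {**event, "message": message}
        yield event


async def rag_events():
    """Stand-in for the RAG service's event stream."""
    yield {"token": "Hi"}
    yield {"done": True, "content": "Hi", "source_urls": []}


async def run():
    return [e async for e in relay_and_persist(rag_events(), save_ai_message)]


out = asyncio.run(run())
```

Because the write happens before the done event is yielded, the client never sees a final payload for a message that failed to save.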
Why this used Django well
The nice part is that most of the app stayed synchronous. Document browsing, saved documents, account views, and normal session CRUD are ordinary DRF endpoints. Async is used only where the request benefits from holding a connection open while awaiting model output.
That is a pattern I want to reuse: use async surgically. Django can serve normal CRUD cleanly with the synchronous stack, and ASGI can handle streaming where it actually matters. Adding async everywhere would have increased the surface area without improving the product.
There are also several small HTTP details that matter in practice. The response sets Cache-Control: no-cache so intermediaries do not cache a stream. It sets X-Accel-Buffering: no, which tells nginx not to buffer the response before passing it on. The payloads are JSON encoded with Django’s JSON encoder so dates and Django values serialize predictably.
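Those details fit in a few lines. The sketch below is standard-library only so it runs without a Django project: StandInEncoder is a rough, hypothetical substitute for django.core.serializers.json.DjangoJSONEncoder and covers only a few of the types the real encoder handles, and the constant names are made up for illustration.

```python
import datetime
import decimal
import json
import uuid

SSE_CONTENT_TYPE = "text/event-stream"
SSE_HEADERS = {
    "Cache-Control": "no-cache",   # do not cache the stream
    "X-Accel-Buffering": "no",     # ask nginx not to buffer the response
}
# In a Django view these could be passed along the lines of:
#   StreamingHttpResponse(gen, content_type=SSE_CONTENT_TYPE, headers=SSE_HEADERS)


class StandInEncoder(json.JSONEncoder):
    """Rough stdlib stand-in for DjangoJSONEncoder (assumption, not the real class)."""

    def default(self, o):
        if isinstance(o, (datetime.datetime, datetime.date)):
            return o.isoformat()
        if isinstance(o, (decimal.Decimal, uuid.UUID)):
            return str(o)
        return super().default(o)


def encode_event(payload: dict) -> str:
    """Serialize a payload that may contain dates or decimals into an SSE frame."""
    return "data: " + json.dumps(payload, cls=StandInEncoder) + "\n\n"


frame = encode_event(
    {"created_at": datetime.date(2024, 1, 2), "score": decimal.Decimal("0.5")}
)
```

Without a custom encoder, json.dumps raises TypeError on the first date or Decimal in a serialized message, which in a streaming response would surface mid-stream rather than as a clean HTTP error.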
The tradeoffs
SSE is simple, but it is not magic. The stream can drop. A proxy can buffer if configured badly. The browser cannot send messages over the same channel. Backpressure is also less explicit than with a queue-backed worker system.
For this project, those tradeoffs were acceptable. A dropped connection loses the live stream, but the human message was already persisted before the model call. A retry has enough history to answer again. The endpoint also sends a structured error event if the RAG chain fails, which gives the frontend a consistent failure shape.
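The error path can be sketched as a wrapper around the event stream. The error payload shape here ({"error": true, "detail": ...}) is an assumption, since only the existence of a structured error event is described above, and failing_chain is a made-up stand-in for a RAG chain that breaks partway through.

```python
import asyncio


async def with_error_event(events):
    """Forward events; if the source raises, emit one structured error event."""
    try:
        async for event in events:
            yield event
    except Exception as exc:
        # Hypothetical shape. A real endpoint would likely log the exception
        # and send a generic message rather than the raw exception text.
        # Note: asyncio.CancelledError is a BaseException and is not caught,
        # so client disconnects still cancel the stream normally.
        yield {"error": True, "detail": str(exc)}


async def failing_chain():
    """Stand-in for a RAG chain that fails after the first token."""
    yield {"token": "Par"}
    raise RuntimeError("model timeout")


async def run():
    return [e async for e in with_error_event(failing_chain())]


out = asyncio.run(run())
```

The frontend then has three shapes to handle (token, done, error) and can treat a stream that ends without done or error as a dropped connection.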
The bigger tradeoff is where to persist. Persisting the human message before the model call means failed AI responses can leave a user message without a matching assistant turn. I accepted that because it makes retries more honest: the system knows what the user asked. The alternative, waiting to persist until everything succeeds, would make failures look cleaner in the database but less useful for recovery.
The general lesson is that streaming is not just a frontend polish feature. It changes the backend contract. You need a token event, a final event, persistence timing, source timing, error events, and retry behavior. Django gives enough primitives to do this without turning the app into a realtime platform.