EvalBoard only becomes useful after runs accumulate. A single run tells you how one prompt and model behaved once. A dashboard tells you which models are reliable, which datasets are hard, how latency changes, and whether recent prompt changes improved anything.
The dashboard is powered by one endpoint: GET /api/runs/stats/?period=.... The backend filters completed runs by time window, computes several ORM aggregations, caches the result for 60 seconds, and returns a shape that the React/Recharts dashboard can render directly.
The endpoint returns summary totals, top models, score distribution, runs over time, latency over time, and top datasets. There is no Pandas layer, no manual data warehouse, and no raw SQL. For the scale of this project, Django’s ORM is enough.
Aggregation as product code
The stats module uses the standard ORM tools:
aggregate() computes the summary cards: total runs, average score, and total evaluated items.
values().annotate().order_by() groups runs by model, provider, date, and dataset.
Conditional aggregation computes the score distribution in one query using Case, When, and Count: perfect, partial, and failed results.
Relational lookups like dataset__name let the dashboard rank datasets without writing explicit joins.
This is one of Django’s strengths: the code reads like business logic while still pushing work to the database. The chart definitions on the frontend stay simple because the API already returns chart-ready groups.
Store aggregates or compute every time?
EvalBoard stores aggregate fields on the Run: average score, total items, failed items, and latency. That means the dashboard can aggregate over runs without recalculating every row result every time.
The tradeoff is denormalization. If row results are ever edited after completion, the stored aggregate fields could drift. In this app, run results are append-only historical records, so the tradeoff is worth it. Completion is the moment when aggregates become facts.
That is a general lesson: denormalization is dangerous when the underlying data keeps changing. It is reasonable when the event is complete and immutable. A completed eval run is closer to an invoice than a live counter.
Cache where the user repeats
The stats endpoint uses Django’s cache framework with a period-based key and a short TTL. Dashboard users often refresh or switch back to the same view, so caching smooths repeated reads without creating cache invalidation complexity.
The TTL is deliberately short. A newly completed run may take up to a minute to appear in cached stats, which is acceptable for an analytics dashboard. It would not be acceptable for the run detail page, where the user expects immediate row-level results. Different screens deserve different freshness guarantees.
The frontend boundary
The React dashboard renders charts, but it does not own analytics logic. It asks for a period and receives grouped data. That keeps business meaning in the backend, where model relationships and scoring definitions live.
This split is something I would keep in larger apps. Frontend chart libraries are good at rendering. They are not the best place to define what “top model” means or how failed scores are bucketed. Django is closer to the data and easier to test for those rules.
EvalBoard’s dashboard taught me that analytics does not have to start big. If the product question is clear, a few well-shaped ORM aggregations can turn ordinary application tables into useful operational insight.