Designing a structured system for LLM evaluation
What if reviewing AI outputs felt as smooth as checking a message? That was the idea.
Human Review Mode turns a technical, high-focus process into something that flows. Reviewers can compare, score, and comment without breaking rhythm. Every detail, from the side-by-side layout to color-coded feedback, was built to make judgment feel simple, not stressful.

Role
Product Designer
Scope
UX design, interface design, system logic mapping, and component design for scalability.
Timeline
2 weeks
Main Review Interface:
The layout was built to make human evaluation intuitive. Reviewers can compare model outputs with reference answers, score them, and leave structured feedback… all without losing flow.
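As a rough sketch of the data a single review might capture, assuming a simple record per output (the field names are illustrative, not the actual Autoblocks schema):

```typescript
// Illustrative only: one reviewer's evaluation of a single model output.
// Field names are assumptions, not the real data model.
interface ReviewComment {
  metric?: string;           // optional link to a rubric category
  highlightedText?: string;  // span the reviewer flagged in the output
  body: string;
}

interface ReviewEntry {
  outputId: string;               // the model output under review
  referenceId?: string;           // reference answer shown side by side
  scores: Record<string, number>; // one score per rubric metric, e.g. { clarity: 4 }
  comments: ReviewComment[];      // structured feedback tied to the output
  reviewerId: string;
  submittedAt?: string;           // set when the reviewer confirms submission
}
```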
Rubric Configuration:
This screen gives admins full control to define what “good” means. They can create custom metrics like hallucination, tone, or clarity and decide how each should be scored.
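A minimal sketch of how such a rubric could be stored, assuming a simple list of metric definitions (the names and scales are illustrative, not the product’s actual config format):

```typescript
// Hypothetical rubric configuration: each metric defines what is being
// judged and how it is scored. Not the actual Autoblocks format.
type ScaleType = 'binary' | 'likert' | 'numeric';

interface RubricMetric {
  name: string;         // e.g. "hallucination", "tone", "clarity"
  description: string;  // guidance shown to reviewers
  scale: ScaleType;
  min?: number;         // bounds for likert/numeric scales
  max?: number;
}

const rubric: RubricMetric[] = [
  { name: 'hallucination', description: 'Does the output invent facts?', scale: 'binary' },
  { name: 'tone', description: 'Is the tone right for the audience?', scale: 'likert', min: 1, max: 5 },
  { name: 'clarity', description: 'Is the answer easy to follow?', scale: 'likert', min: 1, max: 5 },
];
```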
Submission Flow:
The goal here was to make completion deliberate. The confirm modal ensures no work is lost or submitted by accident, a small safeguard that prevents wasted effort in large-scale review tasks.
Comments System:
Feedback needed to feel fluid. Reviewers can highlight, tag, and link comments directly to evaluation categories. The flow was kept consistent from hover to save to keep context intact.
Human Review Mode handled the qualitative side of model evaluation, but it needed data to measure against. That’s where the Experiments and Datasets system came in: a parallel interface designed for the analytical side of the same workflow. Together, they created a full evaluation loop, from structured data testing to human judgment.
Experiments & Datasets:
What if testing an AI model felt more like exploring than debugging? That was the goal with Experiments and Datasets.
Setup and Configuration
Laying the groundwork for meaningful evaluation.
What makes a good AI test isn’t just the numbers; it’s the setup. Every experiment starts with clean prompts, clear evaluators, and datasets that actually reflect reality.
Prompt setup screen:
I designed this interface so model parameters don’t feel overwhelming. The goal was to make exploration fast — adjust, run, and see results without breaking focus.
Evaluator screen:
Defining what “good” means for a model isn’t trivial. This screen lets teams build custom evaluators with clear instructions and field mappings, without writing code.
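One way to picture what that screen produces, as a hedged sketch (the field names and mapping keys are assumptions, not the real evaluator schema):

```typescript
// Illustrative evaluator definition: plain-language instructions plus
// field mappings that tell the evaluator where its inputs live.
interface EvaluatorConfig {
  name: string;
  instructions: string;   // what "good" means, written for the grader
  fieldMappings: {
    input: string;        // dataset column holding the prompt
    output: string;       // where the model's answer lives
    reference?: string;   // optional expected answer to compare against
  };
  scoreRange: { min: number; max: number };
}

const factuality: EvaluatorConfig = {
  name: 'factuality',
  instructions: 'Mark the output down if it contradicts the reference answer.',
  fieldMappings: { input: 'question', output: 'response', reference: 'expected_answer' },
  scoreRange: { min: 0, max: 1 },
};
```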
Dataset creation screen:
Data shapes behavior. This view helps teams create structured datasets — defining properties, defaults, and edge cases in a few clicks.
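A minimal sketch of what a dataset definition along those lines might look like (property names and types are illustrative):

```typescript
// Hypothetical dataset schema: properties, defaults, and a flag for
// edge cases. Not the actual Autoblocks dataset format.
interface DatasetProperty {
  name: string;
  type: 'string' | 'number' | 'boolean';
  required: boolean;
  defaultValue?: string | number | boolean;
}

interface DatasetDefinition {
  name: string;
  properties: DatasetProperty[];
}

const supportQuestions: DatasetDefinition = {
  name: 'support-questions',
  properties: [
    { name: 'question', type: 'string', required: true },
    { name: 'expected_answer', type: 'string', required: true },
    { name: 'is_edge_case', type: 'boolean', required: false, defaultValue: false },
  ],
};
```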
New test case screen:
Adding a new test case feels instant. I kept the fields minimal and keyboard-friendly, with the expected structure auto-previewed on the right.
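Continuing the dataset sketch above, a new test case would then be little more than one record matching those properties (the values here are made up):

```typescript
// One hypothetical test case for the dataset sketched earlier; the
// right-hand preview would simply render this object as it will be stored.
const newTestCase = {
  question: 'How do I reset my password?',
  expected_answer: 'Use the "Forgot password" link on the sign-in page.',
  is_edge_case: false,
};
```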
Running Experiments and Making Sense of Results
Where numbers start to tell stories.
All Experiments view:
A single place to see every experiment at a glance — clean, scannable, and built for pattern recognition. You can spot regressions and wins without clicking through pages.
Run Comparison:
This view makes comparisons visual. The box plots and metrics create a quick pulse check — how stable, how accurate, how much better the model really got.
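For intuition, each box plot condenses one run’s per-test-case scores into a five-number summary; here is a rough sketch of that calculation (hand-rolled for illustration, not necessarily how the product computes it):

```typescript
// Five-number summary a box plot is drawn from, computed over one
// run's per-test-case scores (assumes at least one score).
function boxPlotSummary(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const quantile = (q: number) => {
    const pos = (sorted.length - 1) * q;
    const lo = Math.floor(pos);
    const hi = Math.ceil(pos);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
  };
  return {
    min: sorted[0],
    q1: quantile(0.25),
    median: quantile(0.5),
    q3: quantile(0.75),
    max: sorted[sorted.length - 1],
  };
}

// Comparing two runs: a tighter box and a higher median read as more stable and more accurate.
// boxPlotSummary([0.6, 0.7, 0.7, 0.9]) vs. boxPlotSummary([0.8, 0.85, 0.9, 0.95])
```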
Compare Outputs:
Designed for those “what changed?” moments. You can see inputs, outputs, and scores side by side, then hover for deeper insights. It’s visual debugging made friendly.
flamingo-test-run + Human Review modal:
Every experiment can be explored in detail or sent for human feedback. I kept that flow connected — no export, no context switching. Just click, assign, and continue testing.
Autoblocks Loader Animation:
As with my other loader animations, I animated this one in After Effects and made sure it looped perfectly, then exported it as a Lottie file so it stays smooth, lightweight, and works everywhere.
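Dropping the exported file into the product is then only a few lines, for example with the lottie-web player (the element id and file path are placeholders):

```typescript
import lottie from 'lottie-web';

// Plays the exported loader JSON inside any container element.
lottie.loadAnimation({
  container: document.getElementById('loader')!,
  renderer: 'svg',  // vector rendering keeps it crisp at any size
  loop: true,       // the animation was authored to loop seamlessly
  autoplay: true,
  path: '/animations/autoblocks-loader.json',  // placeholder path to the Lottie export
});
```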
Mursalleen, 2026