EVALUATION SYSTEMS

AI Truth Guard LLM Evaluation Microservice

A dedicated evaluation layer that de-risks LLM deployments with measurable precision and recall metrics

Challenge

  • Multiple assistants and agents in production with no consistent eval framework
  • Quality decisions made by ad-hoc eyeballing of a handful of chats
  • No replayable test sets or central place for metrics across models, prompts, or versions
  • Hard to answer 'Is this new model actually safer/better than the old one?'

Solution

  • Built AI Truth Guard, a standalone FastAPI microservice that runs batch evaluations on stored LLM interactions and fresh test suites
  • Computes precision, recall, F1, and tool-selection metrics per scenario
  • Streams live progress via Server-Sent Events (SSE) for long-running evals
  • Persists runs, metrics, and configs in PostgreSQL for comparison over time
  • Integrated with existing assistants via a simple HTTP contract (prompts, contexts, expected behaviours)
  • Orchestrated evaluations with async task queues, keeping the API responsive while running heavy jobs in the background
  • Designed the service with hexagonal architecture so teams can plug in their own models, tools, and scoring logic
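To make the per-scenario metrics concrete, here is a minimal sketch of how tool-selection precision, recall, and F1 can be computed from expected versus predicted tool calls. The function name and the `None`-means-no-tool-call convention are illustrative assumptions, not the service's actual API:

```python
def tool_selection_metrics(expected, predicted):
    """Precision/recall/F1 for tool selection in one scenario.

    expected/predicted are parallel lists of tool names per interaction;
    None means no tool call was expected (or made). This convention is
    an assumption for the sketch, not Truth Guard's real contract.
    """
    # True positive: the model called exactly the expected tool.
    tp = sum(1 for e, p in zip(expected, predicted) if p is not None and p == e)
    # False positive: the model called a tool, but the wrong one (or one
    # where none was expected).
    fp = sum(1 for e, p in zip(expected, predicted) if p is not None and p != e)
    # False negative: a tool was expected but not correctly called.
    fn = sum(1 for e, p in zip(expected, predicted) if e is not None and p != e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Persisting these three numbers per scenario and per run is what makes "is the new model actually better?" answerable with a diff instead of a guess.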

Impact

  • Introduced a shared evaluation baseline across all LLM apps
  • Enabled safer rollouts: new models/prompts must pass Truth Guard before production
  • Gave product & engineering clear metrics to prioritise improvements
  • Turned evaluation from a side-quest into a reusable platform capability

Tech Stack

Core Technologies

  • Python
  • FastAPI
  • PostgreSQL
  • Async workers

Features

  • Server-Sent Events (SSE)
  • Evaluation harnesses
  • CI/CD-friendly API
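The SSE progress feed boils down to the wire format: one frame per update, each with an `event:` line, a `data:` line carrying JSON, and a blank line terminating the frame. A sketch of the generator side, where the event names and payload shape are assumptions for illustration:

```python
import json

def sse_progress_events(total, completed_iter):
    """Yield SSE frames reporting evaluation progress.

    completed_iter yields the running count of finished evaluations;
    each yielded string is one complete SSE frame ending in a blank line.
    Event names ('progress', 'done') are illustrative, not the service's
    actual schema.
    """
    for done in completed_iter:
        payload = json.dumps({"completed": done, "total": total})
        yield f"event: progress\ndata: {payload}\n\n"
    # Final frame tells the client the run is complete.
    yield "event: done\ndata: {}\n\n"
```

In FastAPI, a generator like this would typically be wrapped in a streaming response with the `text/event-stream` media type, so a browser or CI job can follow a long-running eval without polling.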

Architecture

  • Hexagonal architecture
  • Dependency injection (DI)
  • Task queues
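The hexagonal "plug in your own scoring logic" idea can be sketched as a port (an interface the core depends on) plus swappable adapters. `ScoringPort` and `ExactMatchScorer` are hypothetical names for illustration, not the service's real interfaces:

```python
from typing import Protocol

class ScoringPort(Protocol):
    """Port: any scoring strategy the evaluation core can call."""
    def score(self, expected: str, actual: str) -> float: ...

class ExactMatchScorer:
    """Adapter: a deliberately trivial exact-match implementation.
    Teams would swap in their own (e.g. an LLM-as-judge adapter)."""
    def score(self, expected: str, actual: str) -> float:
        return 1.0 if expected.strip() == actual.strip() else 0.0

def run_eval(cases: list[tuple[str, str]], scorer: ScoringPort) -> float:
    """Core logic depends only on the port, never on a concrete scorer."""
    if not cases:
        return 0.0
    return sum(scorer.score(e, a) for e, a in cases) / len(cases)
```

Because the core only sees the port, a team's custom model, tool, or judge slips in via DI without touching the evaluation engine itself.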
