EVALUATION SYSTEMS

AI Truth Guard LLM Evaluation Microservice

A dedicated evaluation layer that de-risks LLM deployments with measurable precision and recall metrics

Challenge

  • Multiple assistants and agents in production with no consistent eval framework
  • Quality decisions made by ad-hoc eyeballing of a handful of chats
  • No replayable test sets or central place for metrics across models, prompts, or versions
  • Hard to answer 'Is this new model actually safer/better than the old one?'

Solution

  • Built AI Truth Guard, a standalone FastAPI microservice that runs batch evaluations on stored LLM interactions and fresh test suites
  • Computes precision, recall, F1, and tool-selection metrics per scenario
  • Streams live progress via Server-Sent Events (SSE) for long-running evals
  • Persists runs, metrics, and configs in PostgreSQL for comparison over time
  • Integrated with existing assistants via a simple HTTP contract (prompts, contexts, expected behaviours)
  • Orchestrated evaluations with async task queues, keeping the API responsive while running heavy jobs in the background
  • Designed the service with hexagonal architecture so teams can plug in their own models, tools, and scoring logic
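To make the per-scenario metrics concrete, here is a minimal sketch of how tool-selection precision, recall, and F1 can be computed from expected versus predicted tool calls. The function name and the `None`-means-no-tool-call convention are illustrative assumptions, not the service's actual API:

```python
def tool_selection_metrics(expected, predicted):
    """Precision/recall/F1 for tool selection in one scenario.

    expected/predicted are parallel lists of tool names per interaction;
    None means no tool call was expected (or made). This convention is
    an assumption for the sketch, not Truth Guard's real contract.
    """
    # True positive: the model called exactly the expected tool.
    tp = sum(1 for e, p in zip(expected, predicted) if p is not None and p == e)
    # False positive: the model called a tool, but the wrong one (or one
    # where none was expected).
    fp = sum(1 for e, p in zip(expected, predicted) if p is not None and p != e)
    # False negative: a tool was expected but not correctly called.
    fn = sum(1 for e, p in zip(expected, predicted) if e is not None and p != e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Persisting these three numbers per scenario and per run is what makes "is the new model actually better?" answerable with a diff instead of a guess.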

Impact

  • Introduced a shared evaluation baseline across all LLM apps
  • Enabled safer rollouts: new models/prompts must pass Truth Guard before production
  • Gave product & engineering clear metrics to prioritise improvements
  • Turned evaluation from a side-quest into a reusable platform capability

Tech Stack

Core Technologies

  • Python
  • FastAPI
  • PostgreSQL
  • Async workers

Features

  • Server-Sent Events (SSE)
  • Evaluation harnesses
  • CI/CD-friendly API
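The SSE progress feed boils down to the wire format: one frame per update, each with an `event:` line, a `data:` line carrying JSON, and a blank line terminating the frame. A sketch of the generator side, where the event names and payload shape are assumptions for illustration:

```python
import json

def sse_progress_events(total, completed_iter):
    """Yield SSE frames reporting evaluation progress.

    completed_iter yields the running count of finished evaluations;
    each yielded string is one complete SSE frame ending in a blank line.
    Event names ('progress', 'done') are illustrative, not the service's
    actual schema.
    """
    for done in completed_iter:
        payload = json.dumps({"completed": done, "total": total})
        yield f"event: progress\ndata: {payload}\n\n"
    # Final frame tells the client the run is complete.
    yield "event: done\ndata: {}\n\n"
```

In FastAPI, a generator like this would typically be wrapped in a streaming response with the `text/event-stream` media type, so a browser or CI job can follow a long-running eval without polling.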

Architecture

  • Hexagonal architecture
  • Dependency injection (DI)
  • Task queues
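The hexagonal "plug in your own scoring logic" idea can be sketched as a port (an interface the core depends on) plus swappable adapters. `ScoringPort` and `ExactMatchScorer` are hypothetical names for illustration, not the service's real interfaces:

```python
from typing import Protocol

class ScoringPort(Protocol):
    """Port: any scoring strategy the evaluation core can call."""
    def score(self, expected: str, actual: str) -> float: ...

class ExactMatchScorer:
    """Adapter: a deliberately trivial exact-match implementation.
    Teams would swap in their own (e.g. an LLM-as-judge adapter)."""
    def score(self, expected: str, actual: str) -> float:
        return 1.0 if expected.strip() == actual.strip() else 0.0

def run_eval(cases: list[tuple[str, str]], scorer: ScoringPort) -> float:
    """Core logic depends only on the port, never on a concrete scorer."""
    if not cases:
        return 0.0
    return sum(scorer.score(e, a) for e, a in cases) / len(cases)
```

Because the core only sees the port, a team's custom model, tool, or judge slips in via DI without touching the evaluation engine itself.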
