A fast-scaling Quantum AI SaaS provider is hiring an AI Evaluation Data Scientist to help shape the future of Generative AI systems. In this role, you’ll design and lead evaluation frameworks for Agentic AI and RAG systems, ensuring real-world reliability, reasoning quality, and user success before deployment. You’ll work across teams to turn evaluation insights into measurable product improvements.
This role is offered on an initial fixed-term contract until the end of June 2026, with the option to extend thereafter.
What You’ll Do
- Lead evaluation strategy for Agentic AI and RAG systems — defining metrics, success criteria, and real-world test cases.
- Build reproducible evaluation pipelines (datasets, configs, automated runs) to track progress over time.
- Develop and refine frameworks that go beyond benchmarks to assess reasoning, grounding, and robustness.
- Create and curate high-quality datasets (synthetic, adversarial, real-world).
- Implement LLM-as-a-judge evaluations aligned with human feedback.
- Analyze failures, identify root causes, and drive continuous system improvement.
- Partner with ML teams to close the loop between evaluation, data creation, and model refinement.
What You’ll Bring
- MSc / PhD in CS, ML, Data Science, Engineering, or a related field.
- 3+ years (mid-level) or 5+ years (senior) in applied AI / ML, with hands-on production experience.
- Strong background in evaluating LLMs, RAG, or multi-agent systems.
- Proficiency in Python, Docker, Git, and ML frameworks (PyTorch, HuggingFace, LangGraph, LlamaIndex, etc.).
- Experience with data curation, synthetic data generation, and cloud environments (AWS preferred).
- Excellent communication skills and a passion for building reliable, intelligent AI systems.