We are seeking a highly skilled Evaluation Engineer to design and lead the evaluation strategy for our Agentic AI and Retrieval-Augmented Generation (RAG) systems.
In this role, you’ll translate complex customer workflows into measurable success metrics, ensuring our systems deliver reliable, explainable, and high-performing results across real-world applications.
Responsibilities
You will design and execute rigorous evaluation frameworks that measure reasoning, factual accuracy, reliability, and user success across diverse problem domains. This includes building reproducible evaluation pipelines with datasets, test suites, and automated tracking of regressions and improvements.
You’ll work closely with ML specialists and engineers to develop task-based, multi-step evaluations that reflect real-world system behavior—spanning retrieval, planning, memory, and tool usage—and inform continuous improvement.
Your work will also involve curating and generating high-quality datasets, implementing LLM-as-a-judge methods calibrated with human feedback, and conducting deep error analyses to identify and classify failure modes.
You’ll partner across teams to ensure evaluations align with production metrics such as latency, cost, and reliability, and you’ll contribute to high engineering standards through clear documentation, code reviews, and mentorship.
Qualifications
Master’s or Ph.D. in Computer Science, Machine Learning, Data Science, or a related technical field.
3+ years (mid-level) or 5+ years (senior) of experience in applied AI/ML, ideally with production-deployed systems.
Proven expertise in designing evaluation methodologies for ML systems, especially in LLMs, RAG, or multi-agent architectures.
Experience creating and curating datasets, including synthetic and adversarial data, for evaluation and training.
Strong proficiency in Python, with hands-on experience using frameworks such as PyTorch, Hugging Face, LangGraph, LlamaIndex, and related ML/agentic toolkits.
Familiarity with cloud environments (preferably AWS) and good software engineering practices (Git, Docker, reproducible ML pipelines).
Excellent analytical, problem-solving, and communication skills, with the ability to turn ambiguity into structured, data-driven evaluation approaches.
Fluent in English.
If you are motivated by advancing the frontiers of intelligent systems and have the experience to design rigorous, real-world evaluations for cutting-edge AI technologies, we invite you to apply now or email your CV to nk@eu-recruit.com.
By applying to this role you understand that we may collect your personal data and store and process it on our systems. For more information please see our Privacy Notice (https://eu-recruit.com/about-us/privacy-notice/).
In accordance with local employment laws, applicants must have current, valid authorisation to work in Spain at the time of application. We are unable to sponsor employment visas for this role. Applications from individuals without existing work authorisation for Spain cannot be considered.
AI Engineer • Marbella, Andalusia, Spain