Founded in 2019, our client had grown into one of Europe’s most recognized deep-tech scale-ups, backed by major global strategic investors and EU innovation funds.
Their quantum and AI technologies had already transformed how enterprise clients built and deployed intelligent systems — achieving up to 95% model compression and 50–80% inference cost reduction.
The company was recognized by CB Insights (2023 & 2025) as one of the Top 100 most promising AI companies globally, often described as a “quantum–AI unicorn in the making.”
Role Highlights
The AI Evaluation Data Scientist was responsible for:
- Designing and leading evaluation strategies for Agentic AI and RAG systems, translating complex workflows into measurable performance metrics.
- Developing multi-step task-based evaluations to capture reasoning quality, factual accuracy, and end-user success in real-world scenarios.
- Building reproducible evaluation pipelines with automated test suites, dataset tracking, and performance versioning (a minimal versioning sketch follows this list).
- Curating and generating synthetic and adversarial datasets to strengthen system robustness.
- Implementing LLM-as-a-judge frameworks aligned with human feedback (a minimal judging sketch follows this list).
- Conducting error analysis and ablations to identify reasoning gaps, hallucinations, and tool-use failures.
- Collaborating with ML engineers to create a continuous data flywheel linking evaluation outcomes to product improvements.
- Defining and monitoring operational metrics such as latency, reliability, and cost to meet production standards.
- Maintaining high standards in engineering, documentation, and reproducibility.
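To illustrate the dataset tracking and performance versioning mentioned above, here is a minimal sketch; the `eval_runs.jsonl` log name and the helper functions are hypothetical, not the company's actual tooling. The idea is simply that every reported metric is tied to a content hash of the exact dataset it was computed on:

```python
import datetime
import hashlib
import json
import pathlib


def dataset_fingerprint(path: pathlib.Path) -> str:
    """Content hash tying every reported metric to an exact dataset version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]


def record_run(dataset: pathlib.Path, metric: str, value: float,
               log: str = "eval_runs.jsonl") -> None:
    """Append one evaluation result, with dataset hash and timestamp, to a log."""
    entry = {
        "dataset": dataset.name,
        "dataset_sha256": dataset_fingerprint(dataset),
        "metric": metric,
        "value": value,
        "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(log, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

With a log like this, a regression check is a one-liner: compare the latest entry for a given metric and dataset hash against a tracked baseline before shipping a change.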
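And a minimal sketch of the LLM-as-a-judge pattern, under stated assumptions: the `call_llm` callable (any wrapper mapping a prompt string to a model's text response) and the prompt wording are placeholders, and in practice the judge's scores would be calibrated against human ratings rather than trusted as-is:

```python
import json
import statistics
from typing import Callable

# Hypothetical client: any callable mapping a prompt string to the judge
# model's raw text response (e.g. a thin wrapper around a chat API).
LLMCallable = Callable[[str], str]

JUDGE_PROMPT = """You are an impartial evaluator. Grade the ANSWER to the
QUESTION against the REFERENCE for factual accuracy on a 1-5 scale
(1 = wrong, 5 = fully correct). Reply with JSON only:
{{"score": <int>, "rationale": "<one sentence>"}}

QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}"""


def judge_answer(call_llm: LLMCallable, question: str,
                 reference: str, answer: str) -> dict:
    """Grade one answer with the judge model and parse its JSON verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question,
                                       reference=reference, answer=answer))
    verdict = json.loads(raw)            # fail loudly on malformed output
    if not 1 <= verdict["score"] <= 5:   # guard against out-of-range scores
        raise ValueError(f"score out of range: {verdict['score']}")
    return verdict


def mean_judge_score(call_llm: LLMCallable, rows: list[dict]) -> float:
    """Average judge score over rows of {question, reference, answer}."""
    return statistics.mean(judge_answer(call_llm, **row)["score"]
                           for row in rows)
```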
Candidate Profile
- Master’s or Ph.D. in Computer Science, Machine Learning, Physics, Engineering, or a related field.
- 3+ years (mid-level) or 5+ years (senior) of experience in Data Science, ML Engineering, or Research roles on applied AI/ML projects.
- Proven experience designing and implementing evaluation methodologies for machine learning or Generative AI systems.
- Hands-on experience with LLMs, RAG pipelines, and agentic architectures.
- Proficiency in Python, Git, Docker, and major ML frameworks (PyTorch, HuggingFace, LangGraph, LlamaIndex).
- Familiarity with cloud environments (AWS preferred).
- Excellent communication skills and fluency in English.
Preferred
- Ph.D. in a relevant technical discipline.
- Experience with synthetic data generation, adversarial testing, and multi-agent evaluation frameworks.
- Strong background in LLM error analysis and reliability testing.
- Open-source contributions or publications related to AI evaluation.
- Fluency in Spanish.
Contract Details
- Location: Madrid or Barcelona
- Type: Fixed-term (until June 2026)
- Work Model: Hybrid (3 days onsite, 2 remote)
- Seniority: Associate
- Department: Technical
Compensation and Benefits
- Competitive salary package.
- Signing and retention bonuses.
- Relocation support where applicable.
- Flexible working hours and equal pay guarantee.
- Inclusive, international, and innovation-driven environment.