Agent Reliability – AI & Analytics Engineer

Ho Chi Minh City

IT

Full-time


We’re looking for an Agent Reliability – AI & Analytics Engineer to take full ownership of reliability and trust for AI agents running in production. Our agents directly influence real user decisions, so correctness, safety, and consistency are critical every single day.

In this startup environment, quality is not split across QA, safety, evaluation, or reliability teams—you are the single point of ownership for agent quality in the real world. You’ll define what “good” looks like, build the systems that measure and enforce it, and continuously close the loop from production failures back into fixes.

This is a hands-on, high-impact role for engineers who want to focus deeply on agent correctness, trust, and reliability at scale.


Core Responsibilities

  • Own end-to-end agent quality and correctness in production, from defining success criteria to enforcing and improving them continuously.

  • Design, build, and maintain automated evaluation systems, including golden datasets, regression tests, rule-based checks, and LLM-as-judge approaches (see the illustrative sketch after this list).

  • Implement and operate production monitoring and observability for agent quality, failures, drift, and regressions.

  • Define and enforce safety guardrails across prompts, tools, permissions, and escalation mechanisms.

  • Establish fast feedback loops by turning real-world failures into tests, fixes, and long-term reliability improvements.
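
To give a concrete sense of the evaluation work described above, here is a minimal sketch of a golden-dataset regression check in Python. It is illustrative only: the agent entry point (run_agent), the dataset format, and the 95% pass threshold are assumptions made for this sketch, not a description of our actual stack.

    # Illustrative golden-dataset regression check (hypothetical agent and data).
    GOLDEN_CASES = [
        # Each case pairs an input with rule-based expectations on the output.
        {"input": "What is our refund window?",
         "must_contain": ["30 days"], "must_not_contain": ["lifetime guarantee"]},
        {"input": "Cancel my subscription",
         "must_contain": ["confirm"], "must_not_contain": []},
    ]

    PASS_THRESHOLD = 0.95  # hypothetical release gate


    def run_agent(user_input: str) -> str:
        """Placeholder for the production agent call (stubbed for this sketch)."""
        canned = {
            "What is our refund window?": "Refunds are accepted within 30 days of purchase.",
            "Cancel my subscription": "Please confirm you want to cancel your subscription.",
        }
        return canned.get(user_input, "")


    def passes_rules(output: str, case: dict) -> bool:
        """Rule-based checks: required and forbidden substrings."""
        text = output.lower()
        ok_required = all(s.lower() in text for s in case["must_contain"])
        ok_forbidden = all(s.lower() not in text for s in case["must_not_contain"])
        return ok_required and ok_forbidden


    def main() -> None:
        passed = sum(passes_rules(run_agent(c["input"]), c) for c in GOLDEN_CASES)
        pass_rate = passed / len(GOLDEN_CASES)
        print(f"pass rate: {pass_rate:.0%} ({passed}/{len(GOLDEN_CASES)})")
        if pass_rate < PASS_THRESHOLD:
            raise SystemExit("Regression detected: pass rate below threshold")


    if __name__ == "__main__":
        main()

In practice, a harness like this runs on every prompt or tool change, and production failures are folded back into the golden set so the same regression is never missed twice.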


Required Qualifications

  • 4+ years of experience in AI engineering, data/analytics engineering, ML systems, or reliability-focused roles.

  • Strong intuition for quality metrics, failure modes, regression testing, and reliability in probabilistic systems.

  • Hands-on experience with Python for evaluation pipelines, automation, and data analysis.

  • Experience working close to production systems, including logs, metrics, dashboards, and incident response.

  • Ability to translate real-world workflows into clear success criteria, KPIs, and test cases.

  • Strong communication skills, with the ability to explain quality trade-offs to engineers, product teams, and leadership.


Preferred Qualifications

  • Experience building or evaluating LLM-powered or agentic systems (prompting, RAG, tool use, memory).

  • Familiarity with LLM evaluation techniques, including offline evaluations, trajectory analysis, and LLM-as-judge methods (a rough sketch of this style of evaluation follows this list).

  • Experience with agent observability, such as tracing, sampling, confidence scoring, or grounding signals.

  • Background in data systems or analytics platforms where correctness and trust are essential.

  • Comfort working in ambiguity and iterating toward clarity using data, examples, and experimentation.
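
As a rough illustration of the offline, LLM-as-judge style of evaluation mentioned above, the sketch below scores logged agent answers with a judge model. The judge call is stubbed out, and the rubric, 1-to-5 scale, and acceptance bar are assumptions for this sketch rather than a prescribed setup.

    # Illustrative offline LLM-as-judge evaluation (judge call is stubbed).
    JUDGE_PROMPT = (
        "Rate the agent answer from 1 (unusable) to 5 (fully correct and safe) "
        "for the given question. Reply with a single integer.\n"
        "Question: {question}\nAnswer: {answer}"
    )

    ACCEPT_SCORE = 4  # hypothetical minimum acceptable judge score


    def call_judge_model(prompt: str) -> str:
        """Placeholder for a real judge-model API call."""
        return "5"


    def judge_answer(question: str, answer: str) -> int:
        """Ask the judge model for a 1-5 score and parse it defensively."""
        raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
        try:
            return max(1, min(5, int(raw.strip())))
        except ValueError:
            return 1  # treat unparseable judgments as failures


    def acceptance_rate(logged_turns: list[dict]) -> float:
        """Fraction of logged question/answer pairs the judge accepts."""
        if not logged_turns:
            return 0.0
        accepted = sum(judge_answer(t["question"], t["answer"]) >= ACCEPT_SCORE
                       for t in logged_turns)
        return accepted / len(logged_turns)


    if __name__ == "__main__":
        sample = [{"question": "What is our refund window?",
                   "answer": "Refunds are accepted within 30 days of purchase."}]
        print(f"acceptance rate: {acceptance_rate(sample):.0%}")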


The Mindset We Value

  • You treat AI agents as probabilistic systems, not deterministic software.

  • You expect drift and long-tail failures—and proactively design systems to detect them early.

  • You default to measurement over intuition.

  • You feel personal ownership over user trust, not just test coverage or metrics.
