We’re building LLM-powered analytics agents that help users extract reliable insights from our data platform (Databricks). These agents go far beyond natural-language interfaces: they must reason over data, generate accurate queries, interpret results correctly, and surface insights that business users can trust.
In this role, you’ll design and operate systems that allow LLMs to interact safely and reliably with production data, while navigating real-world constraints such as data volume, complex schemas, governance, and performance requirements.
Key Responsibilities
Architect & Scale Core Systems
- Lead the design, implementation, and optimization of scalable, fault-tolerant backend services and non-blocking APIs using Python (FastAPI).
- Ensure systems handle high throughput with low latency and predictable performance.
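For illustration, a minimal stdlib-only sketch of the non-blocking model behind an async FastAPI service: I/O-bound work (database or LLM calls, simulated here with asyncio.sleep) runs concurrently on the event loop instead of blocking a thread per request. Function names are hypothetical.

```python
import asyncio

async def fetch_insight(query_id: int) -> dict:
    # Stand-in for an awaitable database or LLM API call.
    await asyncio.sleep(0.01)
    return {"query_id": query_id, "status": "ok"}

async def handle_batch(query_ids: list) -> list:
    # gather() schedules all coroutines at once, so total latency is
    # roughly one call's latency, not the sum of all calls.
    return await asyncio.gather(*(fetch_insight(q) for q in query_ids))

results = asyncio.run(handle_batch([1, 2, 3]))
```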
End-to-End Product Delivery
- Own the full feature lifecycle, from requirements and prototyping through testing, deployment, and iteration.
- Collaborate closely with product managers, designers, and DevOps to deliver production-ready solutions.
Asynchronous & Event-Driven Systems
- Design and maintain event-driven, asynchronous pipelines (e.g., message queues, pub/sub) to support real-time ingestion, processing, and integrations.
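A toy sketch of the event-driven pattern described above, using an in-process asyncio.Queue as a stand-in for a broker such as Kafka or SQS: a producer publishes events while a consumer processes them concurrently. Names are illustrative, not a specific library's API.

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    # Publish a few events, then a sentinel marking end-of-stream.
    for event_id in range(3):
        await queue.put({"event": event_id})
    await queue.put(None)

async def consumer(queue: asyncio.Queue) -> list:
    processed = []
    while True:
        event = await queue.get()
        if event is None:
            break
        processed.append(event["event"])
    return processed

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    # Producer and consumer run concurrently on the event loop.
    _, processed = await asyncio.gather(producer(queue), consumer(queue))
    return processed

processed = asyncio.run(main())
```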
Performance & Reliability
- Establish best practices for monitoring, observability, and automated testing.
- Identify bottlenecks and apply strategies such as caching, connection pooling, backpressure, and rate limiting to maintain 99.9%+ uptime.
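One of the strategies named above, sketched concretely: a token-bucket rate limiter. This is a hand-rolled illustration with hypothetical names, not any particular library's implementation.

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=2)
# A burst of 3 requests against a capacity of 2: the third is rejected.
decisions = [bucket.allow() for _ in range(3)]
```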
Observability & Monitoring
- Build and operate end-to-end observability using Prometheus for metrics and Grafana for visualization and alerting.
AI, Agents & RAG Systems
- Architect Retrieval-Augmented Generation (RAG) pipelines and intelligent agents that solve real business problems.
- Integrate vector databases, LLM APIs (e.g., OpenAI), and custom prompting or reasoning frameworks.
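A toy sketch of the retrieval step in such a RAG pipeline: rank stored chunks by cosine similarity to a query embedding, then feed the top match into the LLM prompt. The embeddings here are tiny hand-made vectors; in practice they would come from an embedding model and a vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document chunks with toy 3-dimensional embeddings.
corpus = {
    "revenue table schema": [0.9, 0.1, 0.0],
    "holiday calendar doc": [0.1, 0.9, 0.0],
}
query_embedding = [0.8, 0.2, 0.0]  # e.g., "what drives quarterly revenue?"

# Retrieve the best-matching chunk to ground the LLM's answer.
best_chunk = max(corpus, key=lambda k: cosine(corpus[k], query_embedding))
```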
Leadership & Collaboration
- Provide technical leadership and mentorship to mid-level engineers.
- Promote strong engineering standards and collaborate across teams to continuously improve system quality.
Required Qualifications
- 5+ years of software engineering experience building and operating production-grade, scalable systems.
- Deep expertise in Python and FastAPI, including async programming, concurrency models, and service profiling/tuning.
- Strong API design experience with REST, GraphQL, or gRPC, and a solid understanding of HTTP/2, WebSockets, and protocol performance.
- Hands-on experience with cloud infrastructure (AWS, GCP, or Azure), Docker, Kubernetes/Helm, and Infrastructure-as-Code (Terraform, CloudFormation).
- Strong background in data storage and caching, including relational databases (PostgreSQL, MySQL) and NoSQL systems (MongoDB, Redis, DynamoDB).
- Experience building asynchronous systems, including message brokers (Kafka, RabbitMQ, AWS SQS), task queues (Celery, Lambda), or stream-processing frameworks.
Preferred Qualifications
- RAG & Agent Systems: Hands-on experience designing and deploying Retrieval-Augmented Generation (RAG) pipelines, working with vector databases (e.g., Pinecone, Weaviate), and building autonomous or semi-autonomous agents using LangChain or similar orchestration frameworks.
- AI / ML Collaboration: Familiarity with embedding-based retrieval, prompt design, chaining techniques, and orchestrating multi-LLM workflows to support complex reasoning and analytics use cases.
- Observability & Security: Strong understanding of logging, distributed tracing, and metrics using tools such as OpenTelemetry, Prometheus, and Grafana, along with solid knowledge of security best practices including OAuth2, JWT-based authentication, and data encryption.
- Leadership & Communication: Excellent written and verbal English communication skills, with the ability to influence technical decisions, align stakeholders, and collaborate effectively across engineering, product, and data teams.