We are looking for a hands-on, high-ownership DevOps Engineer to own, scale, and evolve the core platform that powers our entire R&D and engineering ecosystem. In this role, you will architect infrastructure running everything from distributed AI training rigs and high-throughput inference serving to our day-to-day internal engineering tools. You will work deep in the stack across multi-cloud environments, enterprise Kubernetes networks, and high-performance bare metal.
Key Responsibilities
- Enterprise Kubernetes Orchestration: Scale and operate multi-cluster Kubernetes environments across public clouds (GCP, AWS) and on-premises infrastructure, handling complex control plane operations, node lifecycles, and advanced autoscaling via KEDA and HPA.
- Hybrid Cloud Architecture: Design, implement, and maintain secure hub-and-spoke and multi-AZ network topologies, balancing public cloud resources with bare-metal on-premises fabrics.
- AI/ML Infrastructure Management: Optimize and manage our high-density inference platform, leveraging vLLM, AIBrix, and specialized autoscaling across a distributed fleet of NVIDIA GPUs.
- GitOps & Continuous Delivery: Own the end-to-end CI/CD and GitOps lifecycle, driving secure multi-stage container builds, image optimization, and progressive delivery patterns using ArgoCD or FluxCD.
- Unified Observability: Maintain a single-pane-of-glass observability ecosystem built on the Grafana LGTM stack (Loki, Tempo, Mimir) with Pyroscope for continuous profiling, while actively pushing toward agent-assisted SRE workflows.
- Security & Identity Lifecycle: Harden platform security by integrating central IdPs (Keycloak, Google Workspace) via OIDC/SAML, enforcing robust RBAC, and managing enterprise secrets.
- Data & Compute Automation: Support distributed database and messaging platforms (PostgreSQL HA, Kafka, Redis, OpenSearch) alongside self-service training infrastructure and RunPod burst capacity.
What We Are Looking For
- Production Kubernetes Expertise: Deep, hands-on understanding of workloads, CNI networking, CSI storage plugins, and advanced event-driven autoscaling. Self-managed or bare-metal K8s experience is a major plus.
- Design-Level Networking: Proven ability to engineer real-world network topologies, securely managing private clusters, firewalls, load balancers, and complex routing tables.
- GitOps & Container-Native Mindset: Expert-level execution of Docker multi-stage builds, layer-cache optimization, and declarative GitOps delivery pipelines.
- Full-Stack Observability: Practical experience standing up and tuning metrics, logs, traces, and alert routing from scratch in high-scale environments.
- Identity & Access Controls: Immediate instinct to wire platform applications into centralized access directories rather than managing localized service accounts.
- Linux & IaC Foundations: Excellent Linux systems administration proficiency paired with modern Infrastructure as Code tools (Terraform, Terragrunt, or Pulumi).
Bonus Technical Points
- Production experience running OpenStack services (Nova, Neutron, Cinder) or KVM virtualization.
- Familiarity with distributed open-source storage architectures like Ceph or Rook-Ceph.
- Deep understanding of LLM inference internals (PagedAttention, continuous batching, tensor parallelism).