We are looking for an experienced DevOps/SRE Engineer to architect, build, and maintain a resilient, scalable, and highly available infrastructure for our company’s AI-powered EdTech platform. You will play a key role in ensuring fast, reliable, and high-quality delivery across our entire technology stack:

Flutter mobile applications (iOS & Android)
FastAPI backend services
GPU-accelerated ML/LLM workloads
Next.js web portal with a Firebase backend

This position requires deep expertise in CI/CD, MLOps, cloud-native infrastructure, and high-traffic production systems to maintain operational excellence and platform reliability.

Key Responsibilities

I. CI/CD & Delivery Gating Systems

Build and maintain full-stack CI/CD pipelines for Flutter mobile apps and FastAPI services.
Implement quality gates at every stage (commit, build, test, deploy), including security checks, linting, performance tests, contract tests, and E2E tests.
Enable immutable infrastructure and zero-downtime deployments with fast rollback strategies.
Manage multi-environment release flows (Dev, Stage, Prod) with proper versioning and change control.
Optimize pipelines to support large-scale, parallel deployments and artifact distribution.

II. Testing & Quality Automation

Work closely with engineering teams to integrate automated testing across unit, integration, contract (FastAPI/OpenAPI), and end-to-end (Flutter) layers.
Implement load, performance, and stress testing frameworks for backend and LLM services.
Provision ephemeral, automated test environments for QA and feature validation.

III. MLOps & GPU Resource Management

Manage GPU workloads on serverless or cloud GPU platforms (Modal, AWS, GCP), covering scaling, scheduling, and monitoring.
Define autoscaling strategies and resource quotas for ML/LLM inference within Kubernetes.
Integrate ML lifecycle orchestration for model training, tuning, and serving.

IV. Core Infrastructure & Scalability

Architect and operate highly available Kubernetes clusters with multi-region redundancy.
Implement autoscaling using HPA/VPA and custom runtime metrics.
Ensure system resilience through chaos engineering, disaster recovery (DR) drills, and multi-zone deployments.
Manage observability systems (Prometheus, Grafana, ELK, OpenTelemetry) for monitoring latency, GPU usage, WebSockets, Celery queues, and app performance.
Build and maintain cloud infrastructure using IaC tools like Terraform or CloudFormation.

V. High-Traffic Systems Optimization

Optimize FastAPI services for high concurrency and WebSocket workloads.
Implement efficient load balancing, caching, and routing strategies to minimize latency.
Apply rate-limiting, throttling, and backpressure for LLM-heavy API endpoints.

VI. Security & Compliance

Enforce security best practices: encryption, secrets management, data protection.
Ensure compliance with GDPR, HIPAA, and other relevant standards.
Conduct vulnerability scans and penetration testing, and prepare incident-response workflows.

VII. Web & Mobile Platform Support

Support and maintain the Next.js + Firebase web portal.
Collaborate with mobile engineers on app performance, crash monitoring, and automated release pipelines.

Required Skills & Qualifications

5+ years of experience in a DevOps or SRE role managing full-stack, high-traffic, cloud-native production systems.
Strong expertise in Kubernetes cluster architecture and operations, especially for ML workloads and mobile backend services.
Proven experience building CI/CD pipelines for Flutter and REST/web backend services.
Hands-on experience implementing quality gates and integrating automated testing at scale.
Experience managing GPU-accelerated ML/LLM workloads.
Solid knowledge of Python/FastAPI systems, WebSocket handling, and Celery queues.
Proficiency with IaC (Terraform) and multi-cloud environments (AWS, GCP, Azure).
Deep understanding of observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
Strong knowledge of cloud security, compliance, and disaster recovery practices.

Preferred Qualifications

Experience with mobile CI/CD tools such as Codemagic, Fastlane, or Shorebird.
Familiarity with microservices communication (gRPC, message queues).
CKA or relevant cloud certifications.
Experience scaling LLM endpoints and GPU-heavy ML infrastructure.
Understanding of serverless and event-driven architectures.

DevOps / Site Reliability Engineer (SRE)