We are looking for an experienced DevOps/SRE Engineer to architect, build, and maintain a resilient, scalable, and highly available infrastructure for our company’s AI-powered EdTech platform. You will play a key role in ensuring fast, reliable, and high-quality delivery across our entire technology stack:
- Flutter mobile applications (iOS & Android)
- FastAPI backend services
- GPU-accelerated ML/LLM workloads
- Next.js web portal with a Firebase backend
This position requires deep expertise in CI/CD, MLOps, cloud-native infrastructure, and high-traffic production systems to maintain operational excellence and platform reliability.
Key Responsibilities
I. CI/CD & Delivery Gating Systems
- Build and maintain full-stack CI/CD pipelines for Flutter mobile apps and FastAPI services.
- Implement quality gates at every stage (commit, build, test, deploy), including security checks, linting, performance tests, contract tests, and E2E tests.
- Enable immutable infrastructure and zero-downtime deployments with fast rollback strategies.
- Manage multi-environment release flows (Dev, Stage, Prod) with proper versioning and change control.
- Optimize pipelines to support large-scale, parallel deployments and artifact distribution.
II. Testing & Quality Automation
- Work closely with engineering teams to integrate automated testing across unit, integration, contract (FastAPI/OpenAPI), and end-to-end (Flutter) layers.
- Implement load, performance, and stress testing frameworks for backend and LLM services.
- Provision ephemeral, automated test environments for QA and feature validation.
III. MLOps & GPU Resource Management
- Manage GPU workloads on serverless or cloud GPU platforms (Modal, AWS, GCP), covering scaling, scheduling, and monitoring.
- Define autoscaling strategies and resource quotas for ML/LLM inference within Kubernetes.
- Integrate ML lifecycle orchestration for model training, tuning, and serving.
IV. Core Infrastructure & Scalability
- Architect and operate highly available Kubernetes clusters with multi-region redundancy.
- Implement autoscaling using HPA/VPA and custom runtime metrics.
- Ensure system resilience with chaos engineering, DR drills, and multi-zone deployments.
- Manage observability systems (Prometheus, Grafana, ELK, OpenTelemetry) for monitoring latency, GPU usage, WebSockets, Celery queues, and app performance.
- Build and maintain cloud infrastructure using IaC tools like Terraform or CloudFormation.
V. High-Traffic Systems Optimization
- Optimize FastAPI services for high concurrency and WebSocket workloads.
- Implement efficient load balancing, caching, and routing strategies to minimize latency.
- Apply rate-limiting, throttling, and backpressure for LLM-heavy API endpoints.
VI. Security & Compliance
- Enforce security best practices: encryption, secrets management, and data protection.
- Ensure compliance with GDPR, HIPAA, and other relevant standards.
- Conduct vulnerability scans and penetration testing, and prepare incident-response workflows.
VII. Web & Mobile Platform Support
- Support and maintain the Next.js + Firebase web portal.
- Collaborate with mobile engineers on app performance, crash monitoring, and automated release pipelines.
Required Skills & Qualifications
- 5+ years in a DevOps or SRE role managing full-stack, high-traffic, cloud-native production systems.
- Strong expertise in Kubernetes cluster architecture and operations, especially for ML and mobile workloads.
- Proven experience building CI/CD pipelines for Flutter and REST/web backend services.
- Hands-on experience implementing quality gates and integrating automated testing at scale.
- Experience managing GPU-accelerated ML/LLM workloads.
- Solid knowledge of Python/FastAPI systems, WebSocket handling, and Celery queues.
- Proficiency with IaC (Terraform) and multi-cloud environments (AWS, GCP, Azure).
- Deep understanding of observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
- Strong knowledge of cloud security, compliance, and disaster recovery practices.
Preferred Qualifications
- Experience with mobile CI/CD tools such as Codemagic, Fastlane, or Shorebird.
- Familiarity with microservices communication (gRPC, message queues).
- CKA or relevant cloud certifications.
- Experience scaling LLM endpoints and GPU-heavy ML infrastructure.
- Understanding of serverless and event-driven architectures.