DevOps / Site Reliability Engineer (SRE)

Ho Chi Minh

IT

Full-time

We are looking for an experienced DevOps Engineer / SRE to architect, build, and maintain resilient, scalable, and highly available infrastructure for our company’s AI-powered EdTech platform. You will play a key role in ensuring fast, reliable, and high-quality delivery across our entire technology stack:

  • Flutter mobile applications (iOS & Android)

  • FastAPI backend services

  • GPU-accelerated ML/LLM workloads

  • Next.js web portal with a Firebase backend

This position requires deep expertise in CI/CD, MLOps, cloud-native infrastructure, and high-traffic production systems to maintain operational excellence and platform reliability.


Key Responsibilities

I. CI/CD & Delivery Gating Systems

  • Build and maintain full-stack CI/CD pipelines for Flutter mobile apps and FastAPI services.

  • Implement quality gates at every stage (commit, build, test, deploy), including security checks, linting, performance tests, contract tests, and E2E tests.

  • Enable immutable infrastructure and zero-downtime deployments with fast rollback strategies.

  • Manage multi-environment release flows (Dev, Stage, Prod) with proper versioning and change control.

  • Optimize pipelines to support large-scale, parallel deployments and artifact distribution.

II. Testing & Quality Automation

  • Work closely with engineering teams to integrate automated testing across unit, integration, contract (FastAPI/OpenAPI), and end-to-end (Flutter) layers.

  • Implement load, performance, and stress testing frameworks for backend and LLM services.

  • Provision ephemeral, automated test environments for QA and feature validation.

III. MLOps & GPU Resource Management

  • Manage GPU workloads on serverless or cloud GPU platforms (Modal, AWS, GCP), covering scaling, scheduling, and monitoring.

  • Define autoscaling strategies and resource quotas for ML/LLM inference within Kubernetes.

  • Integrate ML lifecycle orchestration for model training, tuning, and serving.

IV. Core Infrastructure & Scalability

  • Architect and operate highly available Kubernetes clusters with multi-region redundancy.

  • Implement autoscaling using HPA/VPA and custom runtime metrics.

  • Ensure system resilience with chaos engineering, DR drills, and multi-zone deployments.

  • Manage observability systems (Prometheus, Grafana, ELK, OpenTelemetry) for monitoring latency, GPU usage, WebSockets, Celery queues, and app performance.

  • Build and maintain cloud infrastructure using IaC tools such as Terraform or CloudFormation.

V. High-Traffic Systems Optimization

  • Optimize FastAPI services for high concurrency and WebSocket workloads.

  • Implement efficient load balancing, caching, and routing strategies to minimize latency.

  • Apply rate-limiting, throttling, and backpressure for LLM-heavy API endpoints.

VI. Security & Compliance

  • Enforce security best practices: encryption, secrets management, and data protection.

  • Ensure compliance with GDPR, HIPAA, and other relevant regulations and standards.

  • Conduct vulnerability scans and penetration tests, and prepare incident-response workflows.

VII. Web & Mobile Platform Support

  • Support and maintain the Next.js + Firebase web portal.

  • Collaborate with mobile engineers on app performance, crash monitoring, and automated release pipelines.


Required Skills & Qualifications

  • 5+ years in a DevOps or SRE role managing full-stack, high-traffic, cloud-native production systems.

  • Strong expertise in Kubernetes cluster architecture and operations, especially for ML and mobile workloads.

  • Proven experience building CI/CD pipelines for Flutter and REST/web backend services.

  • Hands-on experience implementing quality gates and integrating automated testing at scale.

  • Experience managing GPU-accelerated ML/LLM workloads.

  • Solid knowledge of Python/FastAPI systems, WebSocket handling, and Celery queues.

  • Proficiency with IaC (Terraform) and multi-cloud environments (AWS, GCP, Azure).

  • Deep understanding of observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).

  • Strong knowledge of cloud security, compliance, and disaster recovery practices.


Preferred Qualifications

  • Experience with mobile CI/CD tools such as Codemagic, Fastlane, and Shorebird.

  • Familiarity with microservices communication (gRPC, message queues).

  • CKA or relevant cloud certifications.

  • Experience scaling LLM endpoints and GPU-heavy ML infrastructure.

  • Understanding of serverless and event-driven architectures.
