Overview
We are looking for a Senior DevOps / Site Reliability Expert ready to own the reliability and operational maturity of a production AI platform.
You will be the engineering foundation that keeps agentic content workflows running at scale, ensuring services are observable, deployments are automated, and infrastructure is reproducible. Working closely with AI and full-stack engineers, you will shape how the team builds, ships, and operates, with particular focus on the reliability challenges unique to LLM workloads: cost, latency, and non-deterministic failure modes. This is a role for someone who takes pride in building systems that others depend on.
Responsibilities
- Own the reliability, scalability, and performance of the platform's services running on Azure Container Apps and AWS ECS
- Build and maintain CI/CD pipelines (GitHub Actions) for automated build, test, and deployment across multiple microservices, including Docker image management, registries, and deployment config
- Implement and manage infrastructure as code (Terraform, Bicep, or ARM) across Azure and AWS
- Set up and maintain observability — monitoring, alerting, logging, and dashboards (New Relic, Langfuse, CloudWatch)
- Manage Azure Service Bus, Blob Storage, Key Vault, and Container Apps configurations
- Ensure security best practices — secret management, image scanning, vulnerability remediation
- Implement auto-scaling, load balancing, and cost optimisation for AI workloads
- Support incident response and establish runbooks for production services
- Collaborate with AI engineers to optimise LLM API usage, token costs, and latency
Requirements
- 3+ years of SRE, DevOps, or platform engineering experience
- Hands-on expertise in Azure (Container Apps, Service Bus, Key Vault, Blob Storage) and in Azure OpenAI resource management
- Proficiency in infrastructure as code using Terraform, Bicep, or ARM templates
- Background in CI/CD pipeline design and maintenance (GitHub Actions preferred)
- Skills in Docker and container orchestration, with Kubernetes experience as a strong plus
- Competency in monitoring and observability with New Relic or an equivalent tool; LLM observability experience is a plus
- Understanding of security practices, including secret management, vulnerability scanning, and image hardening
- Capability to script in Python or Bash for automation
- Daily hands-on use of AI development tools (Cursor, Claude Code, Copilot)
- Excellent command of written and spoken English (B2+ level)
We offer
- Delivering innovative solutions to industry leaders, making a global impact
- Enjoyable working environment, whether it is the vibrant office or the comfort of your home
- Opportunity to work abroad for up to two months per year
- Relocation opportunities within our offices in 55+ countries
- Corporate and social events
- Leadership development, career advising, soft skills and well-being programs
- Certifications, including GCP, Azure and AWS
- Unlimited access to LinkedIn Learning and Udemy
- Free English classes with certified teachers
- Participation in the Employee Stock Purchase Plan
- Monetary bonuses for engaging in the referral program
- Comprehensive medical & family care package
- Four trust days per year for personal needs
- Discounts for fitness clubs
- Benefits package (hotels, restaurants, stores and services)
Nice to have
- Familiarity with Amazon Web Services (ECS, S3, Aurora, CloudWatch)
Skills
- DevOps
- Docker
- GitHub Actions
- Kubernetes
- Microsoft Azure
- Terraform
- Amazon Web Services