Overview

We are looking for a skilled Senior Site Reliability Engineer (SRE) to join our distributed team. In this role, you will be responsible for maintaining and improving the reliability, resiliency, scalability, and performance of our onboarded systems. You will provide on-call support, manage production incidents, and drive continuous improvements to our systems and processes while collaborating closely with engineering and operations teams.
Client:
Our client is a large online retailer with yearly revenue of £1 billion.
Project Overview:

Responsibilities:
  • Maintain and enhance reliability, resiliency, scalability, and performance of onboarded systems.
  • Provide on-call support: diagnose, mitigate, fix, and escalate production incidents in a timely manner.
  • Lead incident follow-ups, root cause analysis, and preventive actions to minimize recurrence.
  • Implement customer-centric approaches to align system reliability with user experience.
  • Ensure systems have appropriate SLIs, monitoring, and alerting to meet agreed SLOs.
  • Identify critical system components requiring enhanced availability in partnership with engineering and operations.
  • Design and roll out strategies, tooling, and processes to improve system stability and performance.
  • Develop and maintain CI/CD pipelines for seamless deployment and releases.
  • Automate repetitive and manual tasks to reduce toil and increase operational efficiency.
  • Participate in system architecture discussions focused on reliability and reducing maintenance complexity.
Nice To Have:
  • Experience with concurrency in Java
  • Python knowledge
  • Dependency conflict resolution experience
  • Terraform knowledge
  • Experience with CloudFormation
  • Knowledge of GCP
  • Knowledge of BigQuery
  • Understanding of core SRE concepts (SLIs, SLOs, etc.)
  • Knowledge of reliability patterns (Circuit breaker, Retry, etc.)
Required Qualifications:
  • Senior/expert-level Java with a strong background in Spring Boot
  • Experience with shell scripts
  • Working experience with Docker including creation/modification of Docker images
  • Maven and Gradle experience
  • Understanding of AWS ECS
  • Experience working with core AWS services (SNS, SQS, Kinesis, RDS, DynamoDB, S3, ElastiCache)
  • Experience with GitLab
  • Experience troubleshooting and fixing bugs in distributed cloud environments
  • Experience with OpenSearch/Kibana
  • Understanding of metrics and tracing
  • Knowledge of Prometheus and Grafana
  • Readiness to be part of a 24/7 rota