Overview

Responsibilities:

  • Keep all customer-facing systems running smoothly by applying sound engineering principles, operational discipline, and mature automation
  • Be on a shift rotation to respond to availability and performance incidents and provide first-line support
  • Use your shift to hunt for potential issues and prevent incidents by debugging production issues across all levels of the application stack
  • Improve the software deployment process to make it as reliable as possible while engaging with software developers
  • Continuously enhance monitoring and alerting, focusing on symptoms rather than on outages
  • Participate in post-incident reviews, document findings, and automate self-healing jobs to reduce MTTR
  • Report to the Site Reliability Manager

Required Qualifications:

  • Familiarity with the Linux shell for administration and troubleshooting
  • Familiarity with configuration management systems such as Chef, Ansible, and Puppet
  • Programming skills: Ruby, Go, Python
  • Experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies
  • Ability to use GitLab
  • Familiarity with monitoring tools such as ELK and Grafana, application performance monitoring tools, and packet trace analysis tools (e.g. Wireshark)
  • Hunting mentality for system uptime and performance: exploring edge cases, failure modes, behaviors, and specific implementations
  • At least 7 years of experience in IT Infrastructure or software development
  • BA/BS in Computer Science, Engineering, or a related technology field