Overview
At BBG Group, we are committed to building and maintaining high-performance, reliable infrastructure to support mission-critical systems 24/7. Our team ensures the stability, scalability, and security that underpin our technology operations.
We are looking for a Site Reliability Engineer (System Administrator) to join our team. This role requires expertise in virtualization, containerization, monitoring, and incident response. If you are passionate about system reliability and thrive in a fast-paced environment, we’d love to hear from you!
Responsibilities:
WHAT YOU WILL DO
Ensure 24/7 System Reliability
- Monitor and maintain the stability of critical infrastructure during assigned shifts.
- Analyze and optimize the performance of databases (MySQL, PostgreSQL, MongoDB).
- Administer and manage virtual machines using KVM and Proxmox.
- Deploy and support containerized applications using Docker and Kubernetes.
- Track and analyze system health using monitoring tools (Nagios, Zabbix).
- Process and evaluate logs with the ELK Stack (Elasticsearch, Logstash, Kibana).
- Respond to and resolve infrastructure incidents, minimizing system downtime.
- Maintain and update incident logs to enhance troubleshooting efficiency.
- Ensure continuous operation of critical services and applications.
- Implement and maintain backup and recovery strategies for databases and systems.
- Identify and address performance bottlenecks to improve efficiency.
- Work closely with engineering teams to refine monitoring and incident response workflows.
- Document best practices and provide recommendations for automation and efficiency.
- Share knowledge with team members, fostering continuous improvement.
Required Qualifications:
SKILLS TO DO YOUR JOB EFFICIENTLY
Technical Expertise & System Administration
- Strong experience with virtualization (KVM, Proxmox) and containerization (Docker, Kubernetes).
- Hands-on expertise with monitoring and logging tools (Nagios, Zabbix, ELK Stack).
- Knowledge of database administration, including performance analysis and recovery.
- Solid understanding of network technologies and Linux/Unix system administration.
- Ability to diagnose and resolve system failures under high-pressure conditions.
- Strong troubleshooting skills to handle real-time incidents and service outages.
- Experience in developing and maintaining disaster recovery plans.
- Strong attention to detail, a sense of responsibility, and a team-oriented mindset.
- Ability to multitask, prioritize incidents, and work effectively in a fast-paced environment.
- Willingness to work 12-hour shift schedules (day/night) to support 24/7 operations.
Additional Information:
Location: Yerevan (Hybrid)
Contact: +374 41 100029
Telegram - @achevardanian
Email - [email protected]
Please clearly mention that you heard about this job opportunity on https://ijob.am.