Digitain LLC is looking for a Site Reliability Engineer Team Lead.

  • Transform the existing monitoring team into an effective SRE team, emphasizing the identification and resolution of application bottlenecks.
  • Lead, mentor, and guide the team, setting clear objectives and standards and cultivating a culture of continuous improvement.
  • Work closely with the DevOps team to align strategies, tools, and processes for optimal efficiency, reliability, and performance.
  • Partner with development teams to design systems that are scalable, reliable, efficient, and instrumented so that application bottlenecks can be pinpointed.
  • Develop and maintain tools for system health monitoring, disaster recovery, performance tuning, and bottleneck analysis.
  • Implement and manage comprehensive monitoring and tracing strategies using tools like Prometheus, Grafana, ELK Stack, Datadog, AppDynamics (for APM), Zabbix, and Jaeger.
  • Oversee the integration and optimization of CDN technologies, particularly Cloudflare, to enhance content delivery and web performance.
  • Lead incident management processes, conduct in-depth post-mortem analyses, and focus on identifying and resolving application bottlenecks.
  • Define, track, and report on SRE team metrics, service level objectives (SLOs), service level indicators (SLIs), and application performance metrics.
  • Manage on-call and shift rotations, ensuring timely resolution of production issues and application bottlenecks.
  • Promote best practices in reliability engineering, CDN management, tracing, and bottleneck analysis across the organization.
  • Participate in architecture reviews, focusing on scalability, stability, and the effective integration of CDN technologies, monitoring and tracing tools, and bottleneck analysis capabilities.
  • Oversee capacity planning, conduct system audits, and ensure comprehensive monitoring, tracing, and analysis of application performance and bottlenecks.
  • Stay abreast of industry trends and emerging technologies to foster innovation within the team.

Required Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in site reliability engineering, systems administration, DevOps, or related roles.
  • Proven experience in leading and managing technical teams, particularly in transitioning or building teams.
  • Expertise in Windows/Linux administration and scripting languages (e.g., Python, Bash).
  • Proficiency in cloud services (Azure, Oracle Cloud) and container orchestration tools (Kubernetes).
  • In-depth knowledge of infrastructure automation tools (Terraform, Ansible).
  • Solid understanding of network protocols and services (TCP/IP, HTTP, DNS).
  • Experience with a variety of monitoring tools (Prometheus, Grafana, ELK Stack, APM, Zabbix) and tracing tools like Jaeger.
  • Proven ability to identify and resolve application bottlenecks.
  • Knowledge and experience in managing CDN technologies, specifically Cloudflare.
  • Exceptional problem-solving, analytical, communication, and interpersonal skills.
  • Demonstrated ability to work collaboratively with DevOps and other technical teams.