Accountabilities:
Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent the team in senior management briefings; produce dashboards and progress reports.
Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.
Education & Experience:
7+ years of experience with data structures/algorithms, software development in two or more programming languages, and operating and maintaining platforms, including 3+ years in a DevOps or SRE role.
Experience working in computing, distributed systems, storage, or networking.
Expertise in designing, analysing, and troubleshooting large-scale distributed systems.
Ability to debug and optimize code and to automate routine tasks.
Systematic problem-solving approach, coupled with effective verbal and written communication skills.
Strong communication capability, able to articulate technical issues in terms of business risk and opportunity.
Knowledge of the technical aspects of cloud computing, data centres, networks and virtual infrastructure.
Strong analytical and problem-solving skills, along with knowledge of TSM processes and tools.