Azure Cloud Site Reliability Engineer (SRE)
Raas Infotek
Job Title: Azure Cloud Site Reliability Engineer (SRE)
Location: Calgary, AB (Onsite)
Role Overview: The SRE is responsible for the reliability, availability, and performance of Azure/AWS PaaS and IaaS workloads. The role bridges development and operations, focusing on building automated systems that prevent failures, managing incident response, and optimizing cloud costs.
Key Responsibilities
• System Reliability & Monitoring: Design, implement, and maintain comprehensive monitoring and alerting using tools such as Azure Monitor, AWS CloudWatch, Application Insights, and Log Analytics.
• Automation & Toil Reduction: Automate repetitive manual operations (toil) such as environment provisioning, system patching, and scaling. Use IaC tools like Terraform and Ansible to manage infrastructure.
• Incident Response & Management: Actively manage incident response, root cause analysis (RCA), and post-mortem investigations to improve system reliability and minimize mean time to resolution (MTTR).
• Cloud SRE Agent Integration: Deploy and configure Cloud SRE Agent to automate incident investigation, execute remediation steps (restart, scale, rollback), and manage routine tasks.
• Capacity Planning & Scalability: Analyze usage patterns to optimize cloud resources, ensuring high availability and performance while managing costs via Azure Cost Management.
• CI/CD & DevOps Collaboration: Integrate automation workflows into CI/CD pipelines (e.g., GitHub Actions or Azure Pipelines) to ensure reliable deployments.
Required Skills & Qualifications
• Cloud Platforms: Expert knowledge of Microsoft Azure infrastructure services (Compute, Storage, Networking, AKS).
• Scripting & Programming: Proficiency in Python, Bash, or PowerShell for building automation tools.
• Infrastructure as Code (IaC): Extensive experience with Terraform and ARM templates/Bicep.
• Observability Tools: Experience with Azure Monitor, Grafana, Prometheus, or Datadog.
• Containers & Orchestration: Solid understanding of Kubernetes/AKS (Azure Kubernetes Service).
• Operating Systems: Proficient in Windows/Linux environments.
• Azure certification is a plus.
• Exposure to multi-cloud environments is a must.
Typical "Day in the Life" Activities
1. Reviewing Service Level Objectives (SLOs) and error budgets.
2. Refining auto-scaling rules for Kubernetes clusters based on traffic trends.
3. Working with developers to review service architecture and ensure fault tolerance.
4. Configuring AI-driven alert suppression to reduce alert fatigue.
5. Creating Azure Dashboards to visualize key performance indicators (KPIs).
Thanks & Regards,
Trayambkeshwer Dwivedi (Trayam), Sr. Technical Recruiter
Raas Infotek Corporation
262 Chapman Road, Suite 105A, Newark, DE 19702
Direct number: 3022869764 | 132
Text Now: (424) 222 7980
Email: ***email_hidden***