Azure Cloud Site Reliability Engineer (SRE)

Raas Infotek

Job Title: Azure Cloud Site Reliability Engineer (SRE)

Location: Calgary, AB (Onsite)

Role Overview: The SRE will be responsible for the reliability, availability, and performance of Azure/AWS PaaS and IaaS workloads. The role bridges the gap between development and operations, with a focus on building automated systems that prevent failures, managing incident response, and optimizing cloud costs.

Key Responsibilities

• System Reliability & Monitoring: Design, implement, and maintain comprehensive monitoring and alerting using tools such as Azure Monitor, AWS CloudWatch, Application Insights, and Log Analytics.

• Automation & Toil Reduction: Automate repetitive manual operations (toil) such as environment provisioning, system patching, and scaling. Use IaC tools like Terraform and Ansible to manage infrastructure.

• Incident Response & Management: Lead incident response, root cause analysis (RCA), and post-mortem investigations to improve system reliability and minimize mean time to resolution (MTTR).

• Cloud SRE Agent Integration: Deploy and configure Cloud SRE Agent to automate incident investigation, execute remediation steps (restart, scale, rollback), and manage routine tasks.

• Capacity Planning & Scalability: Analyze usage patterns to optimize cloud resources, ensuring high availability and performance while managing costs via Azure Cost Management.

• CI/CD & DevOps Collaboration: Integrate automation workflows into CI/CD pipelines (e.g., GitHub Actions or Azure Pipelines) to ensure reliable deployments.

Required Skills & Qualifications

• Cloud Platforms: Expert knowledge of Microsoft Azure infrastructure services (Compute, Storage, Networking, AKS).

• Scripting & Programming: Proficiency in Python, Bash, or PowerShell for building automation tools.

• Infrastructure as Code (IaC): Extensive experience with Terraform and ARM templates/Bicep.

• Observability Tools: Experience with Azure Monitor, Grafana, Prometheus, or Datadog.

• Containers & Orchestration: Solid understanding of Kubernetes/AKS (Azure Kubernetes Service).

• Operating Systems: Proficiency in Windows and Linux environments.

• Azure certification is a plus.

• Exposure to multi-cloud environments is a must.

Typical "Day in the Life" Activities

1. Reviewing Service Level Objectives (SLOs) and error budgets.

2. Refining auto-scaling rules for Kubernetes clusters based on traffic trends.

3. Working with developers to review service architecture and ensure fault tolerance.

4. Configuring AI-driven alert suppression to reduce alert fatigue.

5. Creating Azure Dashboards to visualize key performance indicators (KPIs).
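
For candidates less familiar with item 1, an error budget is simply the amount of unreliability an SLO permits over a given window. The short Python sketch below illustrates the arithmetic only; the function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values or tooling specified for this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available" (assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# Hypothetical example: a 99.9% SLO over 30 days with 10.8 minutes of downtime so far.
print(f"{error_budget_minutes(0.999, 30):.1f} minutes allowed")   # 43.2 minutes allowed
print(f"{budget_remaining(0.999, 30, 10.8):.0%} of budget left")  # 75% of budget left
```

In practice, a review of this kind compares the remaining budget against planned releases: a nearly exhausted budget is usually a signal to slow down risky changes.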

Thanks & Regards,

Trayambkeshwer Dwivedi (Trayam), Sr. Technical Recruiter

Raas Infotek Corporation

262 Chapman Road, Suite 105A, Newark, DE 19702

Direct number: 3022869764 | 132

Text Now: (424) 222 7980

Email: ***email_hidden***