Site Reliability Engineer (SRE)

Job Summary

We are hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of critical cloud infrastructure and applications across Azure or AWS environments.

Key Responsibilities

Build and manage monitoring systems, dashboards, and alerts
Define SLIs and SLOs for production systems and track compliance against them
Develop automation scripts to reduce manual operations
Troubleshoot incidents and lead root cause analysis
Partner with development teams to improve application resilience
Optimize performance and system availability

Qualifications

4+ years in DevOps or SRE roles, including hands-on experience with Azure or AWS
Proficiency in scripting (Python, Go, Shell) and infrastructure automation
Experience with monitoring tools (Datadog, Prometheus, Grafana, ELK)
Understanding of containerization and orchestration (Docker, Kubernetes)
Experience with incident management and postmortem processes

Other Details

Job Type: W2 or Contract (C2C or 1099)
Duration: 12 months with possible extensions
Location: Hybrid
Clearance: Security clearance eligibility preferred
Compensation: Depends on experience (DOE)