Job Summary
We are hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of critical cloud infrastructure and applications across Azure or AWS environments.
Key Responsibilities
- Build and manage monitoring systems, dashboards, and alerts
- Define and enforce SLOs/SLIs for production systems
- Develop automation scripts to reduce manual operations
- Troubleshoot incidents and lead root cause analysis
- Partner with development teams to improve application resilience
- Optimize system performance and availability
Qualifications
- 4+ years in DevOps or SRE roles, including hands-on experience with Azure or AWS
- Proficiency in scripting (Python, Go, Shell) and infrastructure automation
- Experience with monitoring tools (Datadog, Prometheus, Grafana, ELK)
- Understanding of containerization and orchestration (Docker, Kubernetes)
- Experience with incident management and postmortem processes
Other Details
- Job Type: W2 or Contract (C2C or 1099)
- Duration: 12 months with possible extensions
- Location: Hybrid
- Clearance: Eligibility for a security clearance preferred
- Compensation: Depends on experience (DOE)