Site Reliability Engineer (SRE)

Job Summary

We are hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of critical cloud infrastructure and applications across Azure or AWS environments.

Key Responsibilities

  • Build and manage monitoring systems, dashboards, and alerts
  • Define SLIs and enforce SLOs for production systems
  • Develop automation scripts to reduce manual operations
  • Troubleshoot incidents and lead root cause analysis
  • Partner with development teams to improve application resilience
  • Optimize system performance and availability

Qualifications

  • 4+ years of experience in DevOps or SRE roles, including hands-on work with Azure or AWS
  • Proficiency in scripting (Python, Go, Shell) and infrastructure automation
  • Experience with monitoring tools (Datadog, Prometheus, Grafana, ELK)
  • Understanding of containerization and orchestration (Docker, Kubernetes)
  • Experience with incident management and postmortem processes

Other Details

  • Job Type: W2 or Contract (C2C or 1099)
  • Duration: 12 months with possible extensions
  • Location: Hybrid
  • Clearance: Security clearance eligibility preferred
  • Compensation: Depends on experience (DOE)