Site Reliability Engineer

HyperVerge · Bengaluru, Karnataka, India

Full-time · Senior · Posted 1 month ago

Role Overview
We are looking for an SRE who doesn't just "maintain" systems but builds them. You won't be stuck in a traditional support loop; instead, you will focus on the reliability, scalability, and automation of our cloud-native ecosystem. The ideal candidate has a developer-first mindset, using code to solve infrastructure bottlenecks and keep our AWS and Kubernetes environments rock-solid.

Key Responsibilities
Infrastructure as Code (IaC): Design and deploy scalable AWS environments using Terraform, CloudFormation, or Pulumi. No manual clicks.
Kubernetes Orchestration: Manage and optimize EKS (Elastic Kubernetes Service) clusters, including ingress controllers, service meshes, and autoscaling.
Reliability Engineering: Implement self-healing infrastructure by writing automation scripts in Python or Go.
Observability: Build deep-visibility dashboards and alerting systems using Prometheus, Grafana, or Datadog to catch issues before they reach users.
CI/CD Mastery: Own and optimize deployment pipelines (Jenkins, GitLab CI, or GitHub Actions) to ensure zero-downtime releases.
Security & Compliance: Ensure the infrastructure follows the principle of least privilege, using AWS IAM and network security best practices.

Technical Requirements
AWS Expertise: 3+ years of hands-on experience with core services (EC2, S3, RDS, Lambda, VPC, IAM).
Containerization: Strong experience with Docker and production-grade Kubernetes management.
Scripting/Coding: Proficiency in Python or Go for building internal tools and automating repetitive tasks.
Linux Internals: Strong command of Linux/Unix administration, networking (TCP/IP, DNS), and system-level troubleshooting.
Configuration Management: Experience with Ansible, Chef, or Puppet to maintain system state.