Sr. Site Reliability Engineer (SRE principles, SLIs, SLOs, AWS, error budgets, observability, debugging, Linux & Windows, C#, .NET)

Vertafore · Hyderabad, Telangana, India

Full-time · Senior · Posted 1 month ago

Vertafore is a leading technology company whose innovative software solutions are advancing the insurance industry. Our suite of products provides solutions to our customers that help them better manage their business, boost their productivity and efficiencies, and lower costs while strengthening relationships.

Our mission is to move InsurTech forward by putting people at the heart of the industry. We are leading the way with product innovation, technology partnerships, and a focus on customer success.

Our fast-paced and collaborative environment inspires us to create, think, and challenge each other in ways that make our solutions and our teams better.

We are headquartered in Denver, Colorado, with offices across the U.S., Canada, and India.

We are seeking a Senior Site Reliability Engineer to own the reliability, scalability, performance, and operational integrity of critical production services. This role is accountable for the full service lifecycle, from design and deployment readiness through production operations, incident response, and continuous improvement. Reliability is a core engineering responsibility, requiring strong software engineering skills and autonomous operation across AWS, hybrid data centers, and customer-hosted environments.

ROLES AND RESPONSIBILITIES

·       Own production services end to end, with accountability for reliability, availability, scalability, performance, and operational health.

·       Define and manage SLIs and SLOs, using error budgets to guide delivery decisions.

·       Influence service and system design to improve fault tolerance, observability, and operational sustainability.

·       Debug complex production issues across application code, services, and infrastructure using software engineering practices.

·       Perform root cause analysis using logs, metrics, traces, and code-level investigation.

·       Build automation and self-healing mechanisms to prevent repeat failures. 

·       Execute production changes (patching, certificate management, software releases) with safety, automation, and observability.

·       Design and operate production observability aligned to service health and customer impact.

·       Lead and participate in incident response for high-severity events.

·       Collaborate with engineering, product, architecture, and operations teams.

·       Operate with autonomy and sound judgment in reliability decisions.
