Site Reliability Engineer - Azure (4-8 yrs)

PhonePe · Bengaluru, Karnataka, India

Full-time · Senior · Posted 12 days ago

About PhonePe Limited:

PhonePe is headquartered in India. Its flagship product, the PhonePe digital
payments app, was launched in August 2016. As of April 2025, PhonePe has over 60
crore (600 million) registered users and a digital payments acceptance network
spread across over 4 crore (40 million) merchants. PhonePe also processes over
33 crore (330 million) transactions daily, with an Annualized Total Payment
Value (TPV) of over INR 150 lakh crore.

PhonePe’s portfolio of businesses includes the distribution of financial
products (Insurance, Lending, and Wealth) as well as new consumer tech
businesses in India (Pincode, a hyperlocal e-commerce platform, and Indus
AppStore, a localized app store for the Android ecosystem). These are aligned
with the company’s vision to offer every Indian an equal opportunity to
accelerate their progress by unlocking the flow of money and access to services.

Culture:

At PhonePe, we go the extra mile to make sure you can bring your best self to
work, every day! And that starts with creating the right environment for you. We
empower people and trust them to do the right thing. Here, you own your work
from start to finish, right from day one. PhonePe-rs solve complex problems and
execute quickly, often building frameworks from scratch. If you’re excited by
the idea of building platforms that touch millions, ideating with some of the
best minds in the country, and executing on your dreams with purpose and speed,
join us!

Summary

We are seeking a highly motivated and experienced Site Reliability Engineer
(SRE) to manage, scale, and ensure the high availability of our core
infrastructure. This role requires deep expertise in cloud services, automation,
monitoring, and complex networking to support a high-volume, mission-critical
environment.

Key Responsibilities

* Cloud & Infrastructure: Configure, maintain, and manage services and packages
on Ubuntu Virtual Machines in Azure. Design and manage Azure components for
log storage, management, alerting, and monitoring.
* Networking & Connectivity: Configure and maintain complex network components
including Azure Firewall, Route Tables, Virtual Network Gateways, and
ExpressRoute. Establish and manage IPsec and ExpressRoute connectivity with
external environments. Manage routing, troubleshoot connectivity issues, and
support network component migrations with minimal downtime.
* Automation & IaC: Drive automation for all business-as-usual (BAU) tasks
using Terraform, SaltStack, Ansible, and scripting languages. Write new
Terraform code for infrastructure components.
* Database & Data Management: Set up and manage high-availability services like
MySQL and Aerospike. Implement database replication across regions, manage
migrations, and ensure data sync. Handle backups of databases, logs, and
configurations.
* Monitoring & Observability: Implement and manage monitoring (e.g.,
Prometheus, VictoriaMetrics, Riemann) and centralized logging (Loki)
solutions, with visualization on Grafana. Troubleshoot performance and system
issues at the OS, platform, or application level.
* Security & Compliance: Manage firewalls and integrate platform and VM-level
services with the SOC. Collaborate with Infosec teams to evaluate and fix
security vulnerabilities.
* Capacity & Performance: Conduct proactive capacity planning. Manage critical
infrastructure components like Nginx, HAProxy, Docker, and RabbitMQ (RMQ).
* Incident Management & DR: Participate in an on-call rotation. Structure and
lead incident response, Root Cause Analysis (RCA), and post-mortem creation.
Set up and support planning and execution of DR sites and failovers.

Required Technical Expertise

* Cloud Platform (Microsoft Azure):
  * Core Services: Deep, hands-on experience with Microsoft Azure components,
  including Virtual Machines (Ubuntu/Linux), Azure Storage Accounts,
  Cosmos DB, and Azure Data Explorer (ADX).
  * Networking: Expert-level knowledge in configuring and managing complex
  Azure networking components: Azure Firewall, Azure Route Tables, Virtual
  Network Gateways, Azure ExpressRoute, and Azure Private DNS. Must be
  proficient in setting up and troubleshooting routing using protocols like
  BGP with on-prem DCs and managing network component migrations with
  minimal downtime.
  * Security/Compliance: Experience integrating platform and VM-level services
  with the Security Operations Center (SOC) and collaborating with Infosec
  teams on vulnerability evaluation and remediation.

* Operating Systems & Scripting:
  * OS: Expert proficiency in Linux environments, specifically Ubuntu, for
  system administration, service configuration, and performance
  troubleshooting at the OS level.
  * High-Level Language: Deep expertise in at least one high-level language
  (Python, Go, or Java) for writing automation, services, and tooling.
  * Shell Scripting: Shell scripting (Bash) mastery is essential for
  day-to-day operational tasks and automation.

* Monitoring, Observability & Logging:
  * Monitoring: Extensive experience implementing and maintaining modern
  monitoring systems such as Prometheus, VictoriaMetrics, and Riemann.
  * Logging: Proficiency with centralized log management using Loki for log
  ingestion, enrichment, lifecycle management, and providing a search/view
  platform.
  * Visualization: Expertise in creating and managing dashboards for
  visualization and alerting using Grafana.

* Configuration Management & IaC (Infrastructure as Code):
  * IaC: Mastery of Terraform for writing new component configurations and
  building automation for BAU (Business As Usual) tasks.
  * Configuration Management: Strong experience with configuration management
  tools like SaltStack (or Ansible) for automated deployment and
  configuration of services on VMs.

* Databases & Data Stores:
  * High-Availability Data Stores: Hands-on experience setting up, managing,
  and scaling high-availability databases like MySQL and Aerospike.
  * Time-Series/Search: Familiarity with Elasticsearch and time-series
  databases like InfluxDB.
  * Replication/DR: Expertise in database replication between different
  regions, managing database migrations, setting up circular replication,
  and ensuring data sync during system and network issues.

* Core Infrastructure Services:
  * Web/Proxy: Expert management of critical infrastructure components like
  Nginx and HAProxy, including proxy management, endpoint addition, header
  configuration, and writing rewrite rules.
  * Messaging/Container: Experience with messaging queues like RMQ (RabbitMQ)
  and containerization technology like Docker.
  * Networking Services: Deep knowledge of DNS and other core network
  protocols.

Essential Soft Skills & Qualifications

* Ownership and Accountability: A proactive approach to identifying and solving
infrastructure challenges before they impact service availability.
* Communication: Excellent written and verbal skills for documenting
procedures, creating runbooks, and communicating with technical and
non-technical stakeholders.
* Mentorship: (For senior roles) Ability to mentor junior engineers and promote
SRE best practices across the organization.
* SLO/SLI Management: Experience defining, monitoring, and meeting Service
Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical
services.
* Toil Reduction: A commitment to measuring and actively reducing operational
toil through automation (e.g., using SRE's Toil Reduction framework).
* Cost Optimization: Experience identifying and implementing cloud resource
optimization and cost-saving measures within the Azure environment.

PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract
Roles)

* Insurance Benefits - Medical Insurance, Critical Illness Insurance,
Accidental Insurance, Life Insurance
* Wellness Program - Employee Assistance Program, Onsite Medical Center,
Emergency Support System
* Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption
Assistance Program, Day-care Support Program
* Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel
Policy
* Retirement Benefits - Employee PF Contribution, Flexible PF Contribution,
Gratuity, NPS, Leave Encashment 
* Other Benefits - Higher Education Assistance, Car Lease, Salary Advance
Policy

Our inclusive culture promotes individual expression, creativity, innovation,
and achievement, and in turn helps us better understand and serve our customers.
We see ourselves as a place for intellectual curiosity, ideas, and debates,
where diverse perspectives lead to deeper understanding and better quality
results. PhonePe is an equal opportunity employer and is committed to treating
all its employees and job applicants equally, regardless of gender, sexual
preference, religion, race, color, or disability. If you have a disability or
special need that requires assistance or reasonable accommodation during the
application and hiring process, including support for the interview or
onboarding process, please fill out this form.
[https://docs.google.com/forms/d/e/1FAIpQLSc-ETchy2LsQ_DjlNXOUcGfI182DnA533YZTLfVw5TkJH-Stw/viewform]

Read more about PhonePe on our blog [https://www.phonepe.com/blog/].

Life at PhonePe [https://www.phonepe.com/blog/life-at-phonepe/]

PhonePe in the news [https://www.phonepe.com/press/]
