ML Infrastructure Service Reliability Engineer - Apple Services Engineering

Apple · Bengaluru, Karnataka, India

Full-time · Senior · Posted 19 days ago

At Apple, we don’t just build products — we create transformative experiences
that have reshaped entire industries. Our innovation is driven by the diversity
of our people and their ideas, inspiring everything we do. Imagine the impact
you could make. Join Apple and help us leave the world better than we found it.

The ML Infrastructure team is responsible for managing Apple’s largest ML
compute platform, as well as a multi-cloud storage abstraction and caching
platform, which together support critical machine learning training workloads
that power user-facing
features across the Apple ecosystem. Operating across both first-party and
third-party cloud environments brings complex and unique challenges. As a Site
Reliability Engineer (SRE) on the ML Infrastructure team, you’ll be expected to
address these challenges through a strong foundation in cloud object storage,
data analysis, automation, and collaboration, along with advanced expertise in
Kubernetes.
Our team oversees the full infrastructure stack — from low-level nodes to the
complete network architecture — ensuring our platform remains highly available,
resilient, and efficient at scale.

DESCRIPTION

We are seeking an experienced Software and Systems Engineer to join our dynamic
team. This role demands a proactive mindset, technical excellence, and a
collaborative spirit. The ideal candidate will demonstrate:

- Strong critical thinking and a high degree of individual accountability
- Effective communication and collaboration skills
- A genuine passion for Infrastructure as a Service (IaaS)
- A commitment to automation and operational efficiency
- Ownership of projects from design through delivery
- A solutions-oriented approach, coupled with the ability to gain alignment on
  technical direction
- Consistent and timely execution of design implementations aligned with
  project objectives
- The ability to provide constructive technical feedback, fostering team-wide
  growth and continuous improvement

MINIMUM QUALIFICATIONS

- 5+ years of experience building, operating, and scaling a large application
  in a private, public, or hybrid cloud environment
- Deep expertise in Kubernetes, with hands-on experience using platforms such
  as Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS)
- Proficiency in designing, developing, and releasing code in languages such as
  Python, Go, or Rust
- Practical experience with object storage technologies such as Amazon S3 or
  Google Cloud Storage (GCS)
- Strong background in network design and in troubleshooting complex networking
  issues in both public and private cloud infrastructures
- Solid understanding of Linux internals, standard networking protocols, and
  distributed systems architecture

PREFERRED QUALIFICATIONS

- Proven drive to automate manual operations and enhance processes through
  continuous iteration
- Strong understanding of best practices for deploying large-scale, distributed
  applications
- Hands-on experience managing diverse system environments using configuration
  management tools or software delivery platforms such as Spinnaker, Helm, or
  Flux
- Demonstrated expertise in deploying, supporting, and monitoring both new and
  existing services, platforms, and application stacks
- Solid familiarity with container orchestration and management using
  Kubernetes