AI Solutions and Platforms Operations Engineer

PepsiCo · Telangana, India

Full-time · Mid-Senior level · Posted 12 days ago

Overview

The AI Observability Engineer (Agentic Frameworks & AI Agent Operations Center Developer) builds and operationalizes agentic AI solutions using modern orchestration frameworks and contributes to an AI Agent Operations Center that enables safe, reliable, and observable agent behavior at scale. This role focuses on developing agent workflows (planning, tool execution, memory, and RAG), integrating guardrails and evaluations, and delivering operational capabilities such as run management, telemetry, and incident triage for production agents.

Responsibilities

AI Agent Operations Center (70%)
Build “operations center” capabilities for agent runtime management: agent registry, versioning, deployment tracking, and run histories
Enable operational workflows such as incident triage, replay/debug runs, trace correlation, and root-cause analysis across agent steps
Implement operational dashboards and views for agent health: success rate, latency, tool failure rate, cost per run, and loop detection
Instrument agent flows end-to-end using OpenTelemetry (or equivalent), enabling correlation across prompts, tool calls, retrieval, and responses
Implement semantic conventions and tagging standards (agent name/version, tool name, model provider, environment, tenant/app)
Partner with SRE/observability teams to ensure production-grade monitoring, alerting, and operational readiness
Collaboration with Teams (10%)
Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains.
Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability.
Integration & Deployment (10%)
Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
Drive best practices for secure, scalable, and cost-effective agent deployments
Continuous Learning (10%)
Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents.
Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions.
Decision-Making Autonomy: Moderate – Significant autonomy in the technical aspects of AI model development and implementation, working under the strategic direction provided by the Senior AI Solutions leads.
Supervision Required: Moderate – Operates with general guidance from the Senior AI Solutions leads, with regular updates for alignment and support.
Complexity of Role: High – The role requires managing complex AI/ML projects, working with large datasets, and ensuring successful integration with existing systems while maintaining scalability.
Cross-Functional Interactions: Yes – Regular interaction with Data Science, Engineering, IT, digital products, and business stakeholders to ensure effective AI solution observability.

Qualifications

Key Skills/Experience Required
Minimum Qualifications:
Education: Bachelor’s in Computer Science, AI/ML, Data Science, or a related field.
Experience: 3–5+ years of software engineering experience; 1+ years building and observing AI/ML or GenAI applications preferred
Required Expertise:
Hands-on experience with agentic frameworks (CrewAI, LangChain, Semantic Kernel, AutoGen, or similar)
Proficiency in Python (primary) and familiarity with APIs/microservices patterns
Strong experience with RAG patterns (embeddings, vector search, retrieval evaluation, chunking strategies)
Experience with cloud environments (Azure/AWS/GCP) and containerized deployments (Kubernetes/AKS/EKS)
Familiarity with observability fundamentals (logs/metrics/traces) and production troubleshooting
Experience building internal developer platforms or operational consoles (agent registry, run tracking, dashboards)
Familiarity with OpenTelemetry, distributed tracing, and telemetry pipelines
Experience with Azure AI Search / vector databases, prompt/version management, and evaluation frameworks
Knowledge of Responsible AI practices: data handling, safety guardrails, audit trails, and redaction strategies
FinOps exposure: token/GPU cost optimization and chargeback/showback reporting
Differentiating Competencies Required
Technical Proficiency: Agent orchestration design (planning, tool execution, memory, RAG); strong engineering discipline (testing, versioning, CI/CD, automation); operational mindset (reliability, debuggability, and incident response support).
Problem-Solving: Ability to translate business challenges into technical solutions.
Collaboration Skills: Effective at working within cross-functional teams.
Agility: Flexibility to adapt to changing requirements and new technologies.
Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.