UiPath

Principal Site Reliability Engineer

Location

Tokyo

Type

Full Time

Level

Principal

Role

Site Reliability Engineer

Posted

Jan 22, 2026

The role

Summary

UiPath is seeking a Principal Site Reliability Engineer to lead incident response and reliability initiatives for its Japan region. The role requires 7+ years of SRE experience and expertise in distributed systems, Kubernetes, and cloud infrastructure. It focuses on incident command, service reliability, and automation in support of UiPath's agentic automation platform.

What you'll do

Incident Command Leadership: Act as the primary Incident Commander for high-stakes technical events, establishing command and control while orchestrating cross-functional response efforts
Live Site Troubleshooting: Serve as a key escalation point for complex issues, diagnosing grey failures and resolving service disruptions using a deep understanding of service topology
Executive Communication: Deliver real-time executive briefings during incidents, translating technical details into business impact and recovery timelines for leadership
Post-Incident Analysis: Lead thorough retrospectives and root cause analyses, driving implementation of automated self-healing solutions to prevent recurring issues
Observability Implementation: Define and track service health through SLIs and SLOs, implementing proactive monitoring and early-warning alert systems
Automation Development: Design and implement automation to reduce manual intervention during incidents and eliminate repetitive operational tasks
Service Resilience Testing: Test service behavior under load including degradation modes, scaling characteristics, and dependency failures
Architectural Partnership: Collaborate with development teams to champion high availability practices and promote reliability best practices
Team Mentorship: Mentor engineers and raise overall incident response and reliability maturity across the organization
Regional SRE Leadership: Act as Japan regional owner for SRE standards while maintaining alignment with UiPath's Global SRE organization

What we look for

Technical

SRE Experience: 7+ years in SRE, Cloud Operations, or a related technical field, with at least 3 years in a lead responder or command-oriented role
Programming Proficiency: Strong proficiency in Python or Go for automation and tooling development
Distributed Systems Knowledge: Holistic understanding of distributed systems, Kubernetes, and cloud infrastructure, preferably Azure
Observability Stack: Deep experience with Prometheus/Grafana, OpenTelemetry, or equivalent third-party observability platforms
Incident Response Skills: Skills in analyzing system artifacts, network data, and performance dashboards to identify root causes of service failures
On-call Availability: Willingness to participate in on-call rotation as Incident Commander for high-severity issues

Education

Technical Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience in SRE/Infrastructure roles

Experience

Command Presence: Demonstrated ability to remain calm and decisive under pressure while leading diverse stakeholders through technical crisis situations
Cross-functional Leadership: Experience leading technical conversations across compute, network, storage, and database teams to successful outcomes
Language Skills: Strong English proficiency for global team communication combined with Japanese proficiency for local stakeholder communication

Skills

Required skills

Incident Command: Proven ability to lead incident response as primary commander during high-stakes technical events
Python/Go Programming: Strong proficiency in Python or Go for developing automation tools and SRE solutions
Kubernetes: Deep understanding of container orchestration for managing distributed services at scale
Azure Cloud Platform: Experience with Azure cloud infrastructure and services for large-scale distributed systems
Observability Tools: Expertise with Prometheus/Grafana, OpenTelemetry, or equivalent monitoring and observability stacks
System Troubleshooting: Advanced skills in analyzing system artifacts and performance data to diagnose complex service failures

Nice to have

Incident Command System (ICS): Familiarity with structured command frameworks used in crisis management situations
LLM Operations: Experience using LLMs or AI-driven systems for reliability and capacity challenges in GPU-heavy environments
AI Tooling: Experience championing AI tools and LLM-powered agents to improve SRE operations and reduce toil
Self-healing Infrastructure: Proven history building automated remediation systems using Terraform, Azure Service Operator, or equivalent solutions
Event-driven Architecture: Experience with event-driven remediation and building automated response systems

Compensation & benefits

Benefits

Flexible Work Arrangements

Role allows flexibility in when and where work gets done depending on business needs

Diverse and Inclusive Workplace

Equal opportunities regardless of age, race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status

Reasonable Accommodations

UiPath provides reasonable accommodations for candidates on request and respects applicants' privacy rights

Professional Development

Opportunity to work with cutting-edge agentic automation technology and mentor other engineers


Interview process

  1. Initial Application Review: Applications assessed on a rolling basis with no fixed deadline, focusing on SRE experience and incident command capabilities
  2. Phone/Video Screening: Initial conversation to discuss SRE background, incident response experience, and technical proficiency with required technologies
  3. Technical Assessment: Deep dive into distributed systems knowledge, troubleshooting scenarios, and hands-on experience with observability tools
  4. Incident Simulation: Practical exercise demonstrating incident command skills, crisis communication, and technical problem-solving under pressure
  5. Panel Interview: Cross-functional interview with engineering leaders and SRE team members to assess leadership and collaboration skills
  6. Executive Presentation: Present incident response strategy or technical solution to senior leadership, demonstrating communication skills for executive audiences
