Cloud and DevOps
Platform Reliability Engineer
Remote - US
Full-time
Senior
Posted February 2, 2026
About this Position
Build client-critical software at KeY2Moon
At KeY2Moon Solutions, you will work on real client problems that affect revenue, operations, and customer experience. We combine agency speed with engineering discipline, so people who join us get broad ownership and measurable impact.
Direct exposure to product, architecture, and client decision-making
A digital subscription business is scaling quickly, but release windows trigger recurring incidents and rollback-heavy weekends for the internal team.
Their current pipeline was assembled in phases and lacks guardrails. We need a pragmatic engineer who can improve reliability without freezing product delivery.
You will redesign delivery controls, observability, and incident workflows so teams can ship often without breaking production.
Engagement Stack
AWS
Kubernetes
Terraform
GitHub Actions
Datadog
Responsibilities
• Rework release flow using GitHub Actions, Terraform, and Kubernetes rollout controls that match real failure patterns
• Improve incident readiness through better service ownership, Datadog/Sentry observability, and runbook quality
• Set practical reliability KPIs, drawn from AWS infrastructure, deployment, and error telemetry, that engineering and product can track together
• Coach client squads on operational discipline, on-call readiness, and post-incident follow-through
Requirements
• You have improved unstable pipelines in high-pressure environments using AWS, Kubernetes, and Infrastructure as Code
• You can define reliability controls that teams adopt because they are practical for daily delivery, not just policy-compliant
• You are strong at production troubleshooting across infra, application, and CI/CD layers with clear incident communication
• You can convert recurring outage patterns into a preventive engineering backlog with measurable reliability outcomes
Nice to have
• Experience in subscription or payment-heavy systems where uptime directly affects revenue
• Experience running blameless postmortems with cross-functional technical and business teams
• Experience mentoring product engineers in reliability fundamentals and release safety practices
Hiring process
1. Intro call with talent team (30 minutes)
2. Practical role interview focused on recent project work (60-90 minutes)
3. Final panel on collaboration, ownership, and client communication
4. Offer discussion and onboarding plan


