KeY2Moon Solutions

Cloud and DevOps

Platform Reliability Engineer

Remote - US

Full-time

Senior

Posted February 2, 2026

About this Position

Build client-critical software at KeY2Moon

At KeY2Moon Solutions, you will work on real client problems that affect revenue, operations, and customer experience. We combine agency speed with engineering discipline, so people who join us get broad ownership and measurable impact.

You will have direct exposure to product, architecture, and client decision-making.

A digital subscription business is scaling quickly, but release windows trigger recurring incidents and rollback-heavy weekends for the internal team.

Their current pipeline was assembled in phases and lacks guardrails. We need a pragmatic engineer who can improve reliability without freezing product delivery.

You will redesign delivery controls, observability, and incident workflows so teams can ship often without breaking production.

Engagement Stack

AWS

Kubernetes

Terraform

GitHub Actions

Datadog

Responsibilities

• Rework release flow using GitHub Actions, Terraform, and Kubernetes rollout controls that match real failure patterns

• Improve incident readiness through clearer service ownership, Datadog/Sentry observability, and higher-quality runbooks

• Set practical reliability KPIs from AWS infrastructure, deployment, and error telemetry that engineering and product can track together

• Coach client squads on operational discipline, on-call readiness, and post-incident follow-through

Requirements

• You have improved unstable pipelines in high-pressure environments using AWS, Kubernetes, and Infrastructure as Code

• You can define reliability controls that teams adopt because they are practical for daily delivery, not just policy-compliant

• You are strong at production troubleshooting across infra, application, and CI/CD layers with clear incident communication

• You can convert recurring outage patterns into a preventive engineering backlog with measurable reliability outcomes

Nice to have

• Experience in subscription or payment-heavy systems where uptime directly affects revenue

• Experience running blameless postmortems with cross-functional technical and business teams

• Experience mentoring product engineers in reliability fundamentals and release safety practices

Hiring process

1. Intro call with talent team (30 minutes)

2. Practical role interview focused on recent project work (60-90 minutes)

3. Final panel on collaboration, ownership, and client communication

4. Offer discussion and onboarding plan