Infracloud
About the Role
As a Production Engineer, you’ll be part of our newly formed global Production Engineering (Prod Eng) team — the bridge between Technical Support and Engineering. Your mission is to accelerate the resolution of complex, customer-impacting issues, reduce escalation times, and contribute to the reliability and resilience of our products.
You’ll work closely with SREs, Product Engineers, and Technical Support, identifying recurring issues, performing root cause analysis, and driving automation that eliminates manual toil.
What You’ll Do
- Own Tier 3 technical escalations from Technical Support and ensure rapid resolution.
- Investigate, triage, and mitigate incidents, ensuring accountability and timely communication.
- Conduct trend and root-cause analysis to identify recurring issues, bug patterns, and product gaps.
- Read and interpret application code to isolate, reproduce, and diagnose complex technical problems.
- Collaborate with Support and Product Engineering to drive systemic improvements and long-term fixes.
- Contribute to the creation and maintenance of runbooks, escalation workflows, and troubleshooting guides.
- Partner with cross-functional teams to improve monitoring, logging, and alerting for production systems.
- Automate repetitive tasks and build tools to improve team efficiency.
- Participate in on-call rotations as part of a 24×7 follow-the-sun model.
What You’ll Bring
- 3–5 years of experience in Production Engineering, Technical Support (Tier 3), SRE, or similar roles in a SaaS or enterprise software environment.
- Strong understanding of incident management, troubleshooting, and root cause analysis.
- Ability to read and understand code (Go preferred) to debug issues, analyze stack traces, and collaborate effectively with developers.
- Proficiency with ServiceNow, Jira, Azure DevOps, or equivalent tools.
- Familiarity with monitoring and observability platforms (Grafana, Prometheus, Splunk, etc.).
- Hands-on experience with cloud platforms such as Azure, AWS, or GCP.
- Basic scripting or automation skills (e.g., Python, PowerShell, or Bash).
- Strong communication and cross-functional collaboration skills.
- Data-driven mindset with a focus on efficiency, metrics, and continuous improvement.
Nice to Have
- Experience working in globally distributed, follow-the-sun teams.
- Exposure to AI or automation for incident triage or resolution.
- Experience contributing to DevOps or SRE practices.
- Prior experience with backup, recovery, or data management products.