DevOps Engineer

About the Company

DataRobot provides AI solutions that maximize business impact while minimizing risk. Its platform and applications integrate into core business processes, enabling teams to develop, deploy, and manage AI at scale. Organizations worldwide rely on DataRobot for predictive and generative AI that is secure, reliable, and aligned with business needs today and in the future.

About the Role

The Site Reliability Engineer (SRE) ensures the stability, scalability, and reliability of production environments. This role involves troubleshooting, debugging, systems management, software deployments, and automating operational tasks to deliver a seamless experience for DataRobot customers. SREs contribute to designing tools and practices that improve observability, performance, and system availability.

Responsibilities

  • Maintain continuous operation of production environments and respond promptly to alerts.
  • Ensure compliance with Service Level Agreements (SLAs) for customer-facing systems.
  • Deploy new features and updates to production environments safely and efficiently.
  • Automate routine operational tasks to improve reliability and scalability.
  • Collaborate with cross-functional teams to enhance monitoring, alerting, and incident response practices.
  • Analyze and resolve performance, networking, and application issues proactively.
  • Contribute to the design and enhancement of SRE tools and processes for system observability.

Required Skills

  • Strong experience with Linux/UNIX systems (Ubuntu, Red Hat, or similar).
  • Proficiency with container orchestration using Kubernetes.
  • Infrastructure-as-Code knowledge using Terraform or CloudFormation.
  • Configuration management experience with Ansible.
  • Familiarity with databases and messaging systems: MongoDB, RabbitMQ, Postgres, Redis.
  • Observability and monitoring tools experience: ELK stack, ClickHouse, Grafana.
  • Cloud expertise in AWS, GCP, or Azure.
  • Programming/scripting skills in Python or Bash.
  • Solid understanding of networking protocols and components: TCP/IP, SMTP, HTTP, DNS, and load balancers.
  • Experience in network and application performance troubleshooting using tools like netcat, Wireshark, or Fiddler.
  • Version control and artifact management: GitHub, Artifactory.
  • Knowledge of application performance monitoring principles.

Preferred Qualifications

  • Bachelor’s degree in Computer Science, Management Information Systems, or a related field.
  • 1–2 years of experience in Linux/Unix system fundamentals, cloud services, networking, storage, and database administration.
  • Strong communication skills and ability to collaborate effectively with distributed teams.
  • Demonstrated ability to improve operational processes and enhance system reliability.

For additional information and the full job description, visit our official website.