Senior Engineer, ML Training Platforms

  • CoreWeave Europe
  • Apr 23, 2024
Full time

Job Description

The Training Platforms Team plays a key role in CoreWeave's customer journey as the team responsible for supporting and innovating the interfaces our customers use to schedule and drive their machine learning workloads. Initially supporting our Slurm platform on CoreWeave's Kubernetes-native infrastructure, this team has a mandate to keep their fingers on the pulse of the machine learning community and provide commoditised solutions targeted to their needs that reduce friction and increase the efficiency and reliability of consuming CoreWeave's world-class GPU cloud.

We are seeking a Senior Engineer to join the Training Platforms Team and help us build the interfaces of consumption that CoreWeave's customers need in order to be successful. This is a new team, based in London but with the opportunity to work fully remotely. You will be involved in the early stages of CoreWeave's expansion into Europe, and will work on the full gamut of rewarding challenges that come with the business of building a cloud in a communicative, supportive, and high-performing environment. As a member of the Training Platforms Team, you will have the opportunity to:

  • Identify and implement scalable and fault-tolerant interfaces for consuming GPU resources that are responsive to the needs and practices of the ML community.
  • Investigate new frameworks and ensure that CoreWeave is able to support customers with the latest cutting-edge techniques in ML training.
  • Create test plans, deployment automation, dashboards, alerts, and insights into our product's operations as well as participate in the Training Platforms on-call rotation.
  • Develop as a technical leader, leveraging your experience and your existing knowledge of Slurm and/or Kubernetes to help mentor and support junior engineers.
  • Grow, change, invest in your teammates, be invested-in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.

The base pay for this position ranges from £80,000 to £120,000. Pay is based on a number of factors including job-related knowledge, skills, and experience. This position requires participation in a rotating on-call schedule.

Wondering if you're a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams - even if you aren't a 100% skill or experience match. Here are some qualities we've found compatible with our team. If a portion of this resonates with you, we'd love to talk.

  • You have four or more years of experience in the software engineering industry with a specialisation in developing and troubleshooting distributed systems in production and at scale.
  • You have a drive to learn and grow in a rapidly evolving technology space and have interest or experience in some of the current core technologies supported by the team such as Slurm.
  • You are comfortable with the idea of using Go as your primary programming language and are capable of navigating a Linux operating environment.
  • You have experience using Kubernetes, with a solid understanding of its major components and ingress/service meshes.
  • You can transform problems into elastic architectures, decompose them into achievable tasks, and socialise both to your teammates.
  • You're interested in reliability engineering concepts such as the different types of testing, progressive deployments, error budgets, the role of observability, and fault-tolerant design.
  • You're excited about being part of a team of diverse perspectives and backgrounds that believe in tackling challenges, growing hand in hand, and winning together.