KubeCon: Enabling HPC and ML Workloads with the Latest Kubernetes Job Features

On April 21, LLNL computer scientist Vanessa Sochat and Google software engineer Michał Woźniak presented “Enabling HPC and ML Workloads with the Latest Kubernetes Job Features” at KubeCon. View the slides and watch the video. The abstract follows:

In this talk, we present the new features in the Kubernetes Job API and how they can be used to meet the challenges of running distributed batch/AI/HPC workloads at scale, based on real-world experiences from DeepMind and the Flux Operator from Lawrence Livermore National Laboratory. We showcase the Indexed Jobs feature by presenting its production use. First, we demonstrate how it simplifies running parallel workloads that require pod-to-pod communication, including distributed machine learning examples based on its use by DeepMind. Next, we demonstrate the orchestration of HPC workloads using the Flux Operator. Here, we create a “Mini Cluster” within Kubernetes built on top of an indexed job, providing a rich ecosystem for orchestration of batch workloads, related user interfaces, and APIs. We also discuss the challenge of handling pod failures for long-running workloads. We show how Pod Failure Policy can be used to continue job execution despite numerous pod disruptions (caused by events such as node maintenance or preemption), yet reduce costs by avoiding unnecessary pod retries when there are software bugs.
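
For readers unfamiliar with the two features named in the abstract, the following is a minimal sketch of what a Job combining them might look like, written as a Python dictionary that mirrors a batch/v1 Job manifest. It is not taken from the talk: the job name, container name, image, worker count, retry limit, and the exit code 42 (standing in for "a software bug that should not be retried") are all illustrative assumptions. The field names (completionMode, podFailurePolicy, onPodConditions, onExitCodes) follow the Kubernetes Job API.

```python
import json


def build_indexed_job(name: str, image: str, workers: int) -> dict:
    """Return a batch/v1 Job manifest (as a plain dict) that uses
    Indexed completion mode and a pod failure policy.

    All concrete values here are hypothetical and chosen for illustration.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # One pod per index, all running at once.
            "completions": workers,
            "parallelism": workers,
            # Indexed mode gives each pod a stable index, exposed to the
            # container as the JOB_COMPLETION_INDEX environment variable.
            "completionMode": "Indexed",
            # Retries allowed for failures that the rules below do not match.
            "backoffLimit": 3,
            "podFailurePolicy": {
                "rules": [
                    {
                        # Disruptions such as node drain or preemption:
                        # do not count them against backoffLimit.
                        "action": "Ignore",
                        "onPodConditions": [
                            {"type": "DisruptionTarget", "status": "True"}
                        ],
                    },
                    {
                        # Hypothetical exit code treated as a non-retriable
                        # bug: fail the whole Job instead of retrying.
                        "action": "FailJob",
                        "onExitCodes": {
                            "containerName": "worker",
                            "operator": "In",
                            "values": [42],
                        },
                    },
                ]
            },
            "template": {
                "spec": {
                    # A pod failure policy requires restartPolicy: Never.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "command": [
                                "sh",
                                "-c",
                                "echo worker $JOB_COMPLETION_INDEX",
                            ],
                        }
                    ],
                }
            },
        },
    }


manifest = build_indexed_job("demo-job", "busybox:1.36", workers=4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML. Whether the pod failure policy is honored depends on the cluster's Kubernetes version and enabled feature gates, so treat this as a starting point rather than a tested deployment.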