Task-Based HPC Scheduler for Scalable, Cross-Platform Software

Ariel Sherman

Overview

When writing complex high-performance computing (HPC) applications, much of the coding and debugging time goes not to defining the problem physics but to balancing computations across multiple heterogeneous devices, handling data communication, managing distributed memory systems, and providing fault tolerance. The resulting programs are often barely readable, because the details of the work being performed are obscured by the hardware-specific setup and communication code that dominates the codebase. Even worse, the code that balances computation, manages data communication, and provides fault tolerance is re-implemented in each piece of an application, even though it performs the same job in every one of them. Relying on such hardware-specific code makes software harder to maintain and upgrade, and it hinders porting to new hardware platforms as they become available. The time spent improving, modifying, or debugging these device-specific code paths and duplicated code sections could be better spent improving kernel performance or adding new features.

Project Details

The goal of this project is to develop a set of tools that makes efficient HPC software for hybrid computing environments easier to write and maintain, and that improves the performance and scalability of existing and future applications. This will be done through a task-based scheduling framework for heterogeneous devices. Users will develop device-specific implementations of their computational kernels and define a task graph that describes how those kernels depend on one another. The automated tools will then schedule the tasks for execution on whichever devices are available at runtime.
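
The workflow described above can be illustrated with a small Python sketch. It is not the project's actual interface. The kernel names, the KERNELS and GRAPH tables, the run function, and the "prefer GPU, fall back to CPU" rule are all assumptions made for illustration, and the "gpu" implementation is a host-side stand-in rather than real device code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical device-specific implementations of one kernel ("scale").
def scale_cpu(xs):
    return [2.0 * x for x in xs]

def scale_gpu(xs):
    # Stand-in for a GPU kernel; it computes the same result on the host.
    return [2.0 * x for x in xs]

# Task name -> {device name -> implementation}. Names are illustrative only.
KERNELS = {
    "load": {"cpu": lambda: [1.0, 2.0, 3.0]},
    "scale": {"cpu": scale_cpu, "gpu": scale_gpu},
    "total": {"cpu": sum},
}

# Task graph: task name -> set of tasks whose results it consumes.
GRAPH = {"load": set(), "scale": {"load"}, "total": {"scale"}}

def run(graph, kernels, available, preference=("gpu", "cpu")):
    """Run tasks in dependency order, picking a device for each at runtime."""
    results, placement = {}, {}
    for task in TopologicalSorter(graph).static_order():
        # Assumed policy: first preferred device that is both available
        # and has an implementation of this task.
        device = next(
            d for d in preference if d in available and d in kernels[task]
        )
        args = [results[dep] for dep in sorted(graph[task])]
        results[task] = kernels[task][device](*args)
        placement[task] = device
    return results, placement

if __name__ == "__main__":
    results, placement = run(GRAPH, KERNELS, available={"cpu", "gpu"})
    print(placement)
    print(results["total"])
```

The user-facing pieces here are only the kernels and the graph; the same two tables run unchanged when the set of available devices is {"cpu"} alone, which is the portability property the framework is aiming for.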

Results and Impact

To address the problem of separating physical science from computing science, we are developing a solution that decouples the problem definition from the platform-specific implementation details. We are demonstrating a prototype of this technology that expresses a problem as a series of tasks and data dependencies and hands it off to a managed runtime, which efficiently partitions and schedules the tasks for execution. This approach can yield shorter, clearer, more powerful code that performs better than hand-tuned alternatives.
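
As a rough illustration of expressing a problem as tasks plus data dependencies, the Python sketch below derives the task ordering from the data each task reads and writes, then groups tasks into "waves" whose members do not depend on one another and could therefore run concurrently. The task names, the TASKS table, and the build_graph and waves helpers are hypothetical. The sketch also assumes each data item is written by exactly one task, and it does not model the prototype's partitioning or device placement.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks, each declaring the data items it reads and writes.
TASKS = {
    "init_grid":  {"reads": set(),             "writes": {"grid"}},
    "init_state": {"reads": set(),             "writes": {"state"}},
    "advect":     {"reads": {"grid", "state"}, "writes": {"flux"}},
    "diagnose":   {"reads": {"state"},         "writes": {"report"}},
    "update":     {"reads": {"state", "flux"}, "writes": {"new_state"}},
}

def build_graph(tasks):
    """Map each task to the set of tasks that produce the data it reads."""
    # Assumption: every data item has exactly one producing task.
    producer = {}
    for name, task in tasks.items():
        for item in task["writes"]:
            producer[item] = name
    return {
        name: {producer[item] for item in task["reads"]}
        for name, task in tasks.items()
    }

def waves(graph):
    """Group tasks into successive sets with no dependencies among them."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    out = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable listing
        out.append(ready)
        sorter.done(*ready)
    return out

if __name__ == "__main__":
    for i, wave in enumerate(waves(build_graph(TASKS))):
        print(i, wave)
```

Because the ordering is derived from the declared reads and writes, the problem description never states when or where a task runs; that decision is left entirely to the runtime.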

Why HPC Matters

While potentially relevant to a wide range of NASA applications, this work focuses on the Goddard Earth Observing System (GEOS) climate modeling software run on the Discover supercomputer at NASA’s Goddard Space Flight Center and the FUN3D computational fluid dynamics package developed at NASA’s Langley Research Center. Without NASA supercomputing resources, these tools could not be used for important scientific work in the areas of climate research, rocket design, and Mars lander analysis.