Abstract
HPC systems are shared among many users, so managing their resources and scheduling compute jobs is a central task on these clusters. Scheduling also makes it possible to control the workload and energy consumption of an HPC system. A Digital Twin of an HPC cluster can aid the scheduling process by providing energy measurements of the system and by predicting scheduling decisions with a simulation. For real-world use, the Digital Twin must be integrated with the scheduler. One possible use case is energy limitation as part of a demand response process between the HPC operator and the energy supplier. This paper therefore introduces a plugin for Slurm, an open-source scheduler, that implements a scheduling algorithm for time-dependent cluster-wide power capping. It uses a node energy model to predict the energy consumption of jobs and can start jobs at different frequencies to stay below the configured power limit. The plugin interfaces with the Digital Twin, which provides energy measurements for the compute nodes, to track the system power consumption in real time and update the power limits if necessary. The plugin is tested on a cluster and compared against a scheduling simulation of the algorithm. The analysis compares the power profiles of the simulation and the real system as well as the allocation of jobs over time. Differences in the execution and the power trace are analysed and discussed.
DOI: 10.15439/2025F0376
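
The power-capping idea summarised in the abstract can be illustrated with a minimal sketch. It is not the plugin's implementation: the cubic node power model, the constants `IDLE_W` and `DYNAMIC_W`, the frequency list, and the helpers `node_power`, `power_limit`, and `pick_frequency` are all hypothetical and chosen only to show how a time-dependent limit and a node energy model might combine to select a start frequency for a job.

```python
from dataclasses import dataclass

# Hypothetical node power model (illustrative values, not measured data):
# idle power plus a dynamic part that scales with the cube of the
# relative clock frequency.
IDLE_W = 100.0
DYNAMIC_W = 200.0
MAX_FREQ_GHZ = 2.4
FREQS_GHZ = (2.4, 1.8, 1.2)  # candidates, tried from fastest to slowest


@dataclass
class Job:
    name: str
    nodes: int


def node_power(freq_ghz: float) -> float:
    """Predicted power in watts of one busy node at the given frequency."""
    return IDLE_W + DYNAMIC_W * (freq_ghz / MAX_FREQ_GHZ) ** 3


def power_limit(t: float, caps: list[tuple[float, float]]) -> float:
    """Cluster-wide limit in watts at time t.

    caps holds (start_time, watts) pairs sorted by start_time; the latest
    entry that has already started applies. Before the first entry there
    is no limit.
    """
    limit = float("inf")
    for start, watts in caps:
        if start <= t:
            limit = watts
    return limit


def pick_frequency(job: Job, current_power_w: float, limit_w: float):
    """Return the fastest frequency that keeps the cluster under the limit.

    current_power_w is the measured cluster power (here assumed to come from
    an external monitoring source). Idle nodes are already included in that
    measurement, so only the increase above idle is added per node.
    Returns None if the job does not fit at any frequency.
    """
    for freq in FREQS_GHZ:
        added = job.nodes * (node_power(freq) - IDLE_W)
        if current_power_w + added <= limit_w:
            return freq
    return None
```

For example, with a schedule `caps = [(0.0, 1000.0), (100.0, 500.0)]` and a measured cluster power of 400 W, a two-node job would start at full frequency before `t = 100` and at the lowest frequency afterwards, since only that setting keeps the predicted total under the tighter limit.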