15 Years of XDMoD

By Megan Johnson, NCSA
The XDMoD logo.

Most people would like to work more efficiently, especially when the time saved helps them make the most of their resources. The same principle applies to the high-performance computing (HPC) resources allocated through the U.S. National Science Foundation (NSF) ACCESS program.

Everyone who is part of the program, from researchers to HPC center administrators, strives to find ways to be more efficient. HPC administrators work to ensure the systems they support are operating efficiently, researchers want to maximize their allocations and job throughput, and the NSF expects both providers and users to optimize resource use to fully realize the value of its investment in cyberinfrastructure. That’s where XDMoD (XD Metrics on Demand) comes in: a widely used tool that offers detailed metrics on how supercomputing resources are being used. It supports HPC centers, the NSF and researchers – particularly those using ACCESS – by providing insights that can lead to improved job performance and more efficient resource utilization.

XDMoD was born out of a desire to have near-real-time utilization metrics on HPC system use. In 2006, Tom Furlani, then Director of the Center for Computational Research (CCR) at the University at Buffalo, a flagship institution within the State University of New York (SUNY) system, was frustrated by how difficult it was to answer simple questions about the Center's HPC resources: Who is running jobs, and what is their utilization over a quarter or a year? What is the utilization by department or decanal unit? What are the average wait times, and how are they changing? While this data could be obtained by parsing the system log files, doing so was laborious and time-consuming and had to be repeated every time the questions were asked. Enter Andrew Bruno, a very talented programmer at CCR, who developed UBMoD (UB Metrics on Demand), the precursor to XDMoD and a remarkable tool in its own right.

In 2009, the NSF wanted a similar auditing capability for its large investment in HPC resources and issued a competitive solicitation for the development of a framework to monitor and measure those resources. The Center for Computational Research at the University at Buffalo responded to the solicitation, and the rest is history. UBMoD formed the basis for today's ACCESS XDMoD, which is also available as an open-source version, Open XDMoD, widely deployed throughout the U.S.

Tom Furlani, PI for ACCESS Metrics

Since that time, XDMoD has grown in scope and purpose. Many new features have been added over the years. Here is Tom Furlani, PI for ACCESS Metrics, answering some questions about how XDMoD has evolved and what the team plans to do next:

XDMoD started with simple utilization metrics, but we soon added the ability to measure system- and job-level performance to better understand the workloads running on the systems. That, in turn, helps inform future upgrades and allows staff and researchers to optimize individual job throughput.

Most recently, we have added the data analytics framework, which opens the rich repository of utilization and performance data in the XDMoD data warehouse to analysis with Jupyter notebooks, Python and R. Interested parties can use their favorite analytical tools rather than being bound by the analytical capabilities the XDMoD interface provides.
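For a sense of what that looks like in practice, the framework is exposed through the open-source xdmod-data Python package, which can be called from a Jupyter notebook. The sketch below is a minimal, illustrative use of it, not a definitive recipe: the portal URL, date range and metric label are assumptions, and the package expects an XDMoD API token (typically supplied via the XDMOD_API_TOKEN environment variable), so check the current documentation before running it.

```python
# Minimal sketch: pull aggregate job data from ACCESS XDMoD into a
# pandas DataFrame using the xdmod-data package (Data Analytics Framework).
# Assumes an XDMoD API token is available in the XDMOD_API_TOKEN
# environment variable; the portal URL and metric label are examples
# and may differ from your deployment.
from xdmod_data.warehouse import DataWarehouse

with DataWarehouse('https://xdmod.access-ci.org') as dw:
    df = dw.get_data(
        duration=('2024-01-01', '2024-03-31'),  # start and end dates
        realm='Jobs',                           # job accounting realm
        metric='CPU Hours: Total',              # example metric label
    )

print(df.head())  # time-indexed DataFrame; analyze or plot with pandas
```

From there, the data can be joined, aggregated or plotted with whatever tools the analyst already prefers.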

Just the wide range of application performance on the CI resources. Some applications are highly tuned to the computing environment and run very efficiently, to the benefit of the researcher, while other applications have not yet been fully optimized. In some cases, this is due to the rapidly changing nature of the computing systems, some of which require substantial tuning to achieve high performance.

For me, as a former center director, the job- and application-level performance metrics are the most interesting. Computational scientists and application developers can use the detailed performance data to help tune their applications to the computing environment and, in so doing, improve their overall job throughput. Occasionally, you come across a user who has forgotten to set a flag in their job submission script that would greatly improve the job's performance. Those are simple issues to detect, but fixing them can dramatically reduce the time to science.

We will soon implement full integration with ACCESS OnDemand, so that XDMoD metrics on a researcher's individual jobs appear directly in their ACCESS OnDemand interface. With a simple click on any completed job, they will be able to bring up detailed performance metrics through ACCESS XDMoD.

The XDMoD staff regularly provide tutorials and demonstrations. You can contact them here to schedule a demo of XDMoD or Open XDMoD for your group.

ACCESS users can launch ACCESS XDMoD to start improving their usage here.

You can learn more about Open XDMoD and download it for yourself here. Read more about XDMoD here: What Gets Measured Gets Improved.
