What Gets Measured Gets Improved

By ACCESS Team

Most people want to be efficient – to work better, faster and more cost-effectively. Avoiding wasted time and using resources carefully leaves you with more time and resources at your disposal.

The same can be said for use of computational resources. Administrators of HPC centers want to make sure their resources are being used as efficiently as possible. Researchers utilizing HPC resources want to be efficient to get maximal use of their allocation as well as improve their job throughput. And of course, the National Science Foundation (NSF) wants both the computational centers and the resource users to make optimal use of these resources so that their investment in cyberinfrastructure realizes its full potential. Enter XDMoD. 

In broad terms, XDMoD (XD Metrics on Demand) is a tool that provides information (metrics) to a wide range of people about how supercomputers and other computing resources are being used. It’s widely used by HPC centers across the country, the NSF, and users of HPC resources. For researchers utilizing ACCESS, the tool can be used to glean valuable insights into job performance and usage, which can ultimately help you improve resource use and job throughput.   

XDMoD and ACCESS Users

There are a number of ways researchers utilizing ACCESS resources can leverage XDMoD to better understand the performance of the jobs they’re running, and in so doing, possibly improve the performance of subsequent jobs. For example, ACCESS users can view detailed job-level performance data for every job through XDMoD’s “Job Viewer” capability.

To demonstrate this, consider the plot below, which depicts CPU user percent versus time for an ACCESS job on Bridges. A CPU can be in various modes: wait/idle mode, system mode, or user mode. Ideally, researchers want CPUs to be accomplishing work, which happens when the CPU is in user mode, so the Job Viewer shows you the percentage of time that the CPU spends in user mode. The closer that number is to 100%, the better. In the job shown below, the user requested 24 cores but the job used just one of them at a time. This means that, on average, the CPUs spent just 4% of their time in user mode doing useful work. The user could have requested a single core, run the job in the same amount of time, and used 1/24th as much of their allocation.

This screen capture of the XDMoD Job Viewer depicts CPU user percent versus time for an ACCESS job on Bridges. The user requested 24 cores, and the job swaps back and forth among them, using just one at a time. (Each colored line represents an individual CPU.) Therefore, on average, the CPUs spend just 4% of the time in user mode, as shown in the upper left-hand CPU User “traffic light” chart (green=good, yellow=fair, red=poor), meaning the job was not run efficiently.
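
To make the arithmetic concrete, here is a back-of-the-envelope sketch of the same calculation in Python. The core counts and walltime are hypothetical values chosen for illustration, not figures taken from XDMoD:

# Rough sketch of the efficiency math described above (hypothetical numbers).
requested_cores = 24        # cores requested for the job
active_cores = 1            # only one core is ever in user mode at any instant

# Average CPU user percent across all requested cores.
avg_cpu_user_percent = 100.0 * active_cores / requested_cores
print(f"Average CPU user: {avg_cpu_user_percent:.1f}%")   # about 4%

# Core-hours charged against the allocation for an assumed 10-hour run.
walltime_hours = 10
charged = requested_cores * walltime_hours    # 240 core-hours charged
needed = active_cores * walltime_hours        # 10 core-hours actually needed
print(f"Charged {charged} core-hours; about {needed} were actually needed.")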

Another cool feature of the Job Viewer is the plot of memory usage, which lets researchers see how much memory was used during a job. If, for instance, a job failed and you didn’t understand why, you could use XDMoD’s Job Viewer to check that job’s memory usage. If it reached 100%, that’s likely why the job failed, and resubmitting it with a larger memory request may be all that’s needed. For example, in the job below, the memory headroom (that is, the amount of memory available beyond what was used) is zero, meaning the job hit its memory limit.

This screen capture of the XDMoD Job Viewer depicts a job that ran out of memory and crashed, which is indicated by the color-coded Memory Headroom “traffic light” at the top.
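
As a rough illustration of what zero headroom means, the short Python sketch below computes memory headroom from a requested memory size and sampled peak usage. The numbers are hypothetical, and the calculation is simplified relative to whatever XDMoD reports internally:

# Simplified sketch of the memory-headroom idea (hypothetical values).
requested_memory_gb = 128.0

# Peak resident memory sampled over the life of the job.
sampled_usage_gb = [40.0, 95.0, 120.0, 128.0]

peak_gb = max(sampled_usage_gb)
headroom_gb = requested_memory_gb - peak_gb
headroom_pct = 100.0 * headroom_gb / requested_memory_gb

print(f"Peak usage: {peak_gb} GB of {requested_memory_gb} GB requested")
print(f"Memory headroom: {headroom_gb} GB ({headroom_pct:.0f}%)")

if headroom_gb <= 0:
    # Zero headroom is the signature of the out-of-memory failure shown above;
    # resubmitting with a larger memory request is the usual fix.
    print("Job hit its memory limit - request more memory and resubmit.")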

XDMoD also makes it simple for users and PIs to track their ACCESS allocation usage. Just log on to ACCESS XDMoD and click on the “Allocations” tab to see all of your allocations, how much has been used and how much remains.

This screen capture of XDMoD depicts how a user can view all their allocations to easily track usage.

XDMoD and HPC Centers

This is the first page of a report generated on the utilization of Bridges over the last quarter of 2022. These types of reports are helpful for XDMoD stakeholders and can be automatically generated and sent at frequencies ranging from daily to annually.

Beyond ACCESS users, XDMoD is a helpful tool for a number of other stakeholders, including center directors and the staff responsible for running HPC facilities. XDMoD can be used to help ensure that HPC systems are operating efficiently, as well as to provide quantitative metrics on their use, including the number of jobs, average job size, areas of research supported, applications run, average wait times, and many, many others.

Many of these metrics are critical for determining how well the resource is meeting researchers’ needs, and they also provide insight into the design of future systems that will better serve users. For example, these metrics can shed light on questions such as whether the per-node memory is sufficient to support the application mix running on the resource and whether the GPUs are being used effectively. Much of this information is available in XDMoD with a few keystrokes.

In addition, once a set of metrics of interest to a particular stakeholder has been determined, XDMoD’s report generator can be used to send one-time or periodic reports to that stakeholder, whether they are a researcher, staff support specialist, center director or NSF program manager. For example, see the illustration above, which shows the first page of a report on the utilization of Bridges over the last quarter of 2022. Any metric available in XDMoD can be incorporated into a report and sent once, or automatically updated and sent daily, weekly, monthly, quarterly, semi-annually or annually.

XDMoD and the National Computing Landscape

According to the National Science Board, the federal government is the largest funder of academic research and development, or R&D. When it comes to funding academic research computing, the largest funding source is the National Science Foundation. In 2020, the NSF’s Office of Advanced Cyberinfrastructure (OAC) spent over $150 million on cyberinfrastructure alone (see page 7 in their FY 2022 Budget Request to Congress). 

When NSF decides to fund a new machine in the cyberinfrastructure ecosystem, it needs to understand how existing machines are being used and what type of machine will best suit researchers’ needs in the near future. If GPU usage is on the rise, for example, that can indicate the need to deploy a new system designed to support GPU-intensive research. After years of gathering data on supercomputers across the country, XDMoD has amassed a tremendous database of usage characteristics. In fact, the XDMoD team has carried out detailed workload analyses in the past, including analyses of Blue Waters and of the XSEDE allocatable resources, to help NSF assess the national landscape and inform the deployment of future systems.

History of XDMoD

Tom Furlani, former Director of the University at Buffalo Center for Computational Research.
Andrew Bruno, who developed the precursor to XDMoD.

So, perhaps you’re wondering how XDMoD got its start. 

The answer is a modest beginning born of a desire to have near real-time metrics on HPC system use. In 2006, Tom Furlani, who was then Director of the University at Buffalo Center for Computational Research (CCR), was frustrated by his inability to easily determine the utilization of the Center’s HPC resources – simple questions such as who is running jobs, how much are individual users running per quarter or per year, what is the utilization by department or decanal unit, what are the average wait times … the list goes on and on. While this data could be obtained by parsing the system log files, doing so was laborious and time-consuming, and it had to be repeated each time answers to these questions were desired. Enter a very talented CCR programmer named Andrew Bruno, who developed UBMoD (UB Metrics on Demand), the precursor to XDMoD and a remarkable tool in its own right – indeed, it is still used locally at UB.

In 2009, the NSF wanted a similar auditing capability for its large investment in HPC resources and put out a competitive solicitation for the development of a framework to facilitate the monitoring and measurement of those resources. The Center for Computational Research at the University at Buffalo responded to the solicitation, and the rest is history: UBMoD formed the basis for today’s ACCESS XDMoD and its open-source counterpart, Open XDMoD.

XDMoD Today

XDMoD is currently used by all of the advanced cyberinfrastructure resources within the ACCESS program. There is also an open-source version, Open XDMoD, designed to provide similar capabilities to academic and industrial HPC centers. Open XDMoD has been downloaded thousands of times and is known to be employed by at least 400 users worldwide.

Tutorials and demonstrations are regularly provided by the XDMoD staff. You can contact them here to schedule a demo of XDMoD or Open XDMoD for your group. 

ACCESS users can launch ACCESS XDMoD here to start improving their usage.

You can learn more about Open XDMoD and download it for yourself here.

Bob DeLeon, ACCESS Metrics Co-PI and Project Manager, contributed to this story.


Project Details

Institution: ACCESS
University: SUNY Buffalo
Funding Agency: National Science Foundation

Open XDMoD is funded by National Science Foundation grants ACI 1025159 and ACI 1445806. The ACCESS program is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
