
You may have read about how artificial intelligence applications like ChatGPT were trained by ingesting as much freely available data on the internet as possible. It’s no secret that this has caused significant concern in research circles about how unfiltered data can affect AI-generated answers. But accuracy isn’t the only concern when it comes to training AI on “all the data.” The amount of resources it takes to ingest so much data to build the large language model (LLM) that forms the “brain” of one of these chatbots is enormous. Researchers who wish to explore LLMs, test and refine training algorithms or create project-specific chatbots for domain-specific research usually don’t have the same resources as companies like OpenAI or Google. Researchers at the University of Illinois Urbana-Champaign (U. of I.) are using their U.S. National Science Foundation ACCESS allocation to address these issues in AI training.
Ishika Agarwal, a doctoral candidate at U. of I., is part of a team of researchers who used NCSA’s Delta supercomputer to develop and test a new framework for training AI called DELIFT (Data Efficient Language model Instruction Fine-Tuning). Last year, the team presented their work on DELIFT at the International Conference on Learning Representations (ICLR) in Singapore.
I used Delta GPUs to help me run the experiments for my paper. It was really easy to use and required minimal setup, which helped me get results fast.
–Ishika Agarwal, University of Illinois Urbana-Champaign
DELIFT is designed to train AI more efficiently – meaning faster and cheaper – by being “smarter” about what data is used in the training process. “DELIFT tries to reduce the amount of data that we use to train large language models because training is computationally intensive,” explained Agarwal. “It calculates the interactions between data points in a dataset to ensure that there aren’t redundant, noisy or conflicting samples. All this ensures that your dataset is the smallest and most informative to train your model on.”
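To make that idea concrete, here is a rough sketch – not the team’s actual code – of the kind of pairwise “interaction” measurement Agarwal describes: score how much one sample, used as an in-context example, reduces a model’s loss on another sample. The `placeholder_loss` function below is an invented word-overlap stand-in so the snippet runs on its own; in practice, that score would come from the language model being fine-tuned.

```python
# Illustrative sketch only: a toy pairwise-utility matrix, where utility[i, j]
# estimates how much sample i helps the model handle sample j when i is shown
# as an in-context example. A real implementation would query an actual LLM's
# loss; a word-overlap heuristic stands in here so the script is self-contained.
import numpy as np


def placeholder_loss(example, target):
    """Stand-in for a language-model loss on `target`, optionally conditioned
    on `example`. Lower means the model finds `target` easier to predict."""
    if example is None:
        return 1.0  # fixed unconditioned loss, for simplicity
    overlap = len(set(example.split()) & set(target.split()))
    return max(0.0, 1.0 - 0.2 * overlap)


def pairwise_utility(samples):
    """utility[i, j] = drop in loss on sample j when sample i is used as an
    in-context example. Identical rows point to redundant samples; near-zero
    rows point to unhelpful ones."""
    n = len(samples)
    utility = np.zeros((n, n))
    for j, target in enumerate(samples):
        base = placeholder_loss(None, target)
        for i, example in enumerate(samples):
            if i != j:
                utility[i, j] = base - placeholder_loss(example, target)
    return utility


if __name__ == "__main__":
    data = [
        "one plus one equals two",
        "one plus one equals two",             # redundant duplicate
        "two plus two equals four",
        "force equals mass times acceleration",
    ]
    print(pairwise_utility(data).round(2))
```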
One problem with training AI on a mass of random data is that data doesn’t exist in a vacuum. There’s a lot of context around any specific piece of information. Much of the data on the internet, for instance, is redundant, and using all of it to train an LLM wastes resources. Say you want to train an LLM to understand that 1+1=2. At some point, the model no longer needs more examples to learn the answer, even though there are likely millions of pages online stating that 1+1=2.
Some data also needs to be “unlocked” before other data can be understood. Simple math eventually builds to complex math, and in order for an LLM to understand complex physics equations, it would need to be trained on the simple stuff first.

“Understanding the dynamics of data is difficult because data can interact with each other in many ways,” said Agarwal. “DELIFT provides an intuitive way to understand and measure the dynamics of your data for various fine-tuning tasks.”
DELIFT makes LLM training much more efficient by ensuring that the data used is both necessary and of high quality. Think of what DELIFT does as similar to getting a hint to solve a brain teaser – the hint could help you connect the dots to the answer. DELIFT does something similar by testing “samples” of data to see whether they help the model solve other problems. As a result, models trained on DELIFT’s smaller, curated datasets can occasionally outperform models trained on 100% of a dataset.
“More data is not always good because data has to be high quality. If we train our language model on garbage, it’s only going to output garbage,” said Agarwal. “That’s why we see data selection methods often outperforming models that are trained on 100% of data. In data, ‘noise’ can refer to bad quality samples (mis-labeled, improper formatting, confusing characters, wrong answers, etc.), and ‘redundancy’ refers to duplicated samples that contain the same information (a model does not need to see both samples to learn the information).”
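The other half of the recipe is picking a small subset that still covers the whole dataset. The snippet below is a hypothetical follow-on to the sketch above, using a facility-location-style greedy rule – one common approach to this kind of budgeted selection, offered here as an illustration rather than as DELIFT’s exact objective: repeatedly add the sample that offers the largest new gain across the dataset, so duplicates and uninformative rows stop contributing and are left out.

```python
# Hypothetical continuation of the utility-matrix sketch above: greedy subset
# selection. Each target sample is "covered" by the best utility offered by the
# already-selected subset; at every step we add the candidate with the largest
# marginal gain in total coverage. A duplicate adds little once its twin is in,
# so it naturally falls outside a small budget.
import numpy as np


def greedy_select(utility, budget):
    """Pick up to `budget` row indices of `utility` that maximize, for every
    target column, the best help it receives from the selected rows."""
    n = utility.shape[0]
    covered = np.zeros(n)               # best utility each target gets so far
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain of adding each candidate row.
        gains = np.maximum(utility, covered).sum(axis=1) - covered.sum()
        if selected:
            gains[selected] = -np.inf   # never pick the same sample twice
        best = int(np.argmax(gains))
        if gains[best] <= 0:            # nothing informative left to add
            break
        selected.append(best)
        covered = np.maximum(covered, utility[best])
    return selected


if __name__ == "__main__":
    # Toy matrix: samples 0 and 1 are near-duplicates, sample 2 is distinct.
    # Here the diagonal marks each sample as covering itself.
    toy = np.array([
        [1.0, 0.9, 0.1],
        [0.9, 1.0, 0.1],
        [0.2, 0.2, 1.0],
    ])
    print(greedy_select(toy, budget=2))  # -> [0, 2]: the duplicate is skipped
```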
The ACCESS program and Agarwal’s project share some goals. DELIFT will make it easier for more researchers and educators to train AI, even if they don’t have access to robust HPC resources, and democratizing access to compute resources is a defining aspect of ACCESS. It’s fitting, then, that ACCESS supports this research: DELIFT is designed to help researchers make the most of limited computing resources, and the ACCESS program can provide the HPC resources they’d need to try the DELIFT framework for themselves, at no cost. To find out more about how HPC resources could help power your research, you can get started with ACCESS here. It’s easy to apply for an allocation, and we have a number of tools available to help you discover all the resources the program has to offer.
You can read more about this research in the original article here: Less is Sometimes More.
Resource Provider Institution(s): National Center for Supercomputing Applications (NCSA)
Resources Used: Delta
Affiliations: University of Illinois Urbana-Champaign
Funding Agency: NSF
Grant or Allocation Number(s): CIS240550 and CIS260246
The science story featured here was enabled by the U.S. National Science Foundation’s ACCESS program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
