Mimic Score and Grad-Mimic Framework Improve AI Training Data Selection
The Utility of Training Data: A New Approach to Evaluation and Selection
Foundation models, which underpin numerous AI applications, are trained on massive datasets often scraped from the internet. These datasets frequently contain erroneous data, biases, and irrelevant content that can impair model performance. Selecting the right data is therefore crucial for successful training. Traditional data selection methods often rely on human heuristics, downstream evaluation datasets, or specialized evaluation models; however, these approaches can overlook the actual utility of the data for the training process.
A new study, published on arXiv, presents an innovative approach to evaluating data utility: the "Mimic Score". This metric uses a pre-trained reference model as a guide to assess how suitable a data sample is for training a new model. The Mimic Score measures how well the gradient induced by a sample aligns with the vector pointing from the new model's current weights toward the reference model in weight space. Samples whose gradients deviate from this direction are classified as less valuable and can be filtered out.
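Conceptually, this alignment can be sketched as a cosine similarity over flattened parameter vectors. The following is a minimal illustration under that assumption; the paper's exact normalization may differ.

```python
import numpy as np

def mimic_score(sample_grad, current_weights, reference_weights):
    """Cosine alignment between a sample's gradient step and the
    direction from the current model toward the reference model."""
    direction = reference_weights - current_weights  # target direction in weight space
    step = -sample_grad  # a gradient-descent step moves against the gradient
    return float(np.dot(step, direction)
                 / (np.linalg.norm(step) * np.linalg.norm(direction) + 1e-12))
```

A sample whose gradient step points toward the reference model scores near +1; one pointing away scores negative and is a candidate for filtering.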
Grad-Mimic: A Framework for Automated Data Selection
Building upon the Mimic Score, the researchers developed the "Grad-Mimic" framework. This framework automates the identification and prioritization of useful data samples in two phases:
Phase 1: During training, Grad-Mimic prioritizes samples according to their Mimic Scores, steering the model toward the reference model's region of weight space.
Phase 2: After training, Grad-Mimic aggregates the per-sample utility estimates collected across training steps into an ensemble filter that automates data selection.
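The two phases can be sketched roughly as follows. The function names, the score-to-weight mapping, and the mean aggregation are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def phase1_weights(step_scores):
    # Phase 1 (illustrative): turn a batch's Mimic Scores into non-negative
    # sample weights so that well-aligned samples dominate the update.
    w = np.clip(step_scores, 0.0, None)
    total = w.sum()
    return w / total if total > 0 else np.full_like(w, 1.0 / len(w))

def phase2_filter(scores_per_step, keep_fraction=0.7):
    # Phase 2 (illustrative): aggregate each sample's scores across training
    # steps (here a simple mean) and keep the top fraction as the filter.
    utility = np.asarray(scores_per_step).mean(axis=0)
    threshold = np.quantile(utility, 1.0 - keep_fraction)
    return utility >= threshold
```

In this sketch, `phase2_filter` returns a boolean mask over the dataset that can be reused to select data for subsequent training runs.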
Empirical Results and Advantages
The effectiveness of Grad-Mimic was confirmed in several experiments. In a controlled scenario using synthetic data with varying levels of injected label noise, Grad-Mimic accurately identified the mislabeled samples and provided a reliable estimate of the overall dataset quality.
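The intuition behind the label-noise result can be reproduced in a toy setting: for a one-parameter linear model at initialization, label-flipped samples produce gradients that point away from the reference weights and therefore receive negative alignment scores. This toy construction is my own, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
w_ref = 2.0          # reference model's weight (stands in for a well-trained model)
w_cur = 0.0          # new model at initialization
x = rng.normal(size=100)
y = w_ref * x        # clean labels from the reference model
flipped = np.zeros(100, dtype=bool)
flipped[:20] = True
y[flipped] *= -1     # inject label noise by flipping 20 labels

# Per-sample squared-loss gradient at w_cur: d/dw (w*x - y)^2 / 2 = (w_cur*x - y)*x
grads = (w_cur * x - y) * x
direction = w_ref - w_cur
scores = -grads * direction  # 1-D alignment score (unnormalized)

# Clean samples score positive; mislabeled samples align against the
# reference direction and score negative, so a simple threshold finds them.
```

Thresholding `scores` at zero separates the flipped samples from the clean ones exactly in this idealized case.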
Grad-Mimic also achieved convincing results in more complex scenarios with large multimodal datasets from the internet. When training CLIP models on the DataComp dataset, using Mimic Scores led to performance improvements. The dataset refined with Grad-Mimic yielded higher model performance than human-designed filters. Furthermore, Mimic Scores complement CLIP-score-based filters by removing less valuable samples, improving model performance with less data.
The advantages of Grad-Mimic can be summarized as follows:
New metric for data quality assessment: The Mimic Score offers a new approach to evaluating data utility.
Automated data selection framework: Grad-Mimic automates the selection of useful data samples.
Dataset quality assessment: Mimic Scores and the associated filters improve existing filtering strategies and enable a reliable estimation of dataset quality.
These research results underscore the importance of intelligent data selection for training AI models. The Mimic Score and the Grad-Mimic framework offer promising tools for evaluating training-data quality and optimizing model development, which is of particular interest to companies like Mindverse that specialize in AI-based content creation and customized AI solutions.