Autonomous Data Selection with Language Models for Mathematical Texts

Tsinghua University, BUPT, Shanghai Qizhi Institute

Abstract

To improve language models’ proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or classifiers trained on human-annotated data, our approach, Autonomous Data Selection (AutoDS), utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter language model on our curated dataset, achieving substantial improvements in downstream performance on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks while using orders of magnitude fewer tokens than previous continual pretraining works. Our method delivers a 2x increase in pretraining token efficiency compared to state-of-the-art baselines, underscoring its potential for enhancing models’ mathematical reasoning capabilities. The AutoMathText dataset is available at this link.

Comparison

Autonomous Data Selection

While the ML research community and the world at large still largely rely on manually selecting quality data, we opt for autonomous data selection (AutoDS), an approach conceptually aligned with RLAIF. In addition, the meta-prompting technique in essence replaces traditional alignment techniques such as DPO: it works with the base model directly through its instruction-following capability.
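
Below is a minimal sketch of such a meta-prompted zero-shot verifier built on Hugging Face Transformers. The meta-prompt wording, the choice of checkpoint, the single yes/no question, and the 0.6 keep-threshold are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch of meta-prompted zero-shot verification with a base causal LM.
# The prompt wording, checkpoint, and threshold below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any base causal LM can stand in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

META_PROMPT = (
    "You are an expert evaluating web text for mathematical pretraining.\n"
    "Does the following text contain high-quality mathematical content?\n"
    "Answer with YES or NO.\n\n"
    "Text:\n{text}\n\n"
    "Answer:"
)

@torch.no_grad()
def lm_score(text: str) -> float:
    """Softmax probability of 'YES' vs. 'NO' at the answer position."""
    inputs = tokenizer(META_PROMPT.format(text=text), return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]
    # Use the first sub-token of each answer word (a simplification).
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    pair = torch.stack([next_token_logits[yes_id], next_token_logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()

# Keep a document only if the verifier is sufficiently confident in it.
if lm_score("Let f(x) = x^2. Then f'(x) = 2x by the power rule.") > 0.6:
    print("keep")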


Introduction

Our contributions are three-fold:

1. We showcase the efficacy of leveraging base language models with meta-prompts for zero-shot verification using a straightforward score function derived from logits. Our method, Autonomous Data Selection (AutoDS), advances beyond traditional alignment strategies such as SFT and RLHF without relying on human-annotated data, facilitating autonomous content evaluation.

2. We address the shortage of labeled high-quality mathematical training resources by introducing the open-source AutoMathText dataset, curated by applying the AutoDS verifier at scale (an illustrative filtering loop follows this list). This comprehensive dataset is designed to enrich AI model training with mathematical content, thereby enhancing performance on math-intensive tasks.

3. Through empirical evidence, we demonstrate the effectiveness of our methodology by continually pretraining a 7B-parameter Mistral language model on the AutoMathText dataset. Our results show substantial improvements in downstream performance on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks with a 2x gain in pretraining token efficiency, underscoring the practical benefits of our approach for mathematical reasoning tasks.
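
As referenced in contribution 2, here is an illustrative filtering loop that applies the lm_score verifier from the sketch above to a raw corpus. The JSONL layout with a "text" field, the file names, and the fixed threshold are assumptions made for illustration, not details taken from the paper.

# Illustrative corpus filtering using the lm_score verifier defined above.
# Assumptions: the raw corpus is JSONL with a "text" field per line, and a
# fixed keep-threshold of 0.6; neither detail comes from the paper.
import json

def select_corpus(in_path: str, out_path: str, threshold: float = 0.6) -> None:
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            if lm_score(doc["text"]) > threshold:  # keep only high-scoring documents
                fout.write(json.dumps(doc) + "\n")

select_corpus("raw_math_corpus.jsonl", "selected_math_corpus.jsonl")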

Language Models as Zero-shot Verifiers
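
The LM-Score referenced below is the logit-derived score mentioned in the abstract and in contribution 1. As a hedged reconstruction (the paper's exact notation may differ): given a meta-prompted yes/no question about a text x, let z_YES(x) and z_NO(x) be the logits the base model assigns to the answer tokens "YES" and "NO"; the score is the softmax probability of "YES",

\mathrm{LM\text{-}Score}(x) = \frac{\exp\!\big(z_{\mathrm{YES}}(x)\big)}{\exp\!\big(z_{\mathrm{YES}}(x)\big) + \exp\!\big(z_{\mathrm{NO}}(x)\big)}.

Scores from multiple meta-prompt questions (for instance, one on mathematical relevance and one on educational value) can be combined, e.g. multiplicatively, into a single selection score.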

[Figure: LM-Score distribution]

Citation

Please cite the paper and star this repo if you find AutoMathText or AutoDS interesting or useful. Thanks!

@inproceedings{zhang2024automathtext,
  title={AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts},
  author={Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew Chi-Chih},
  booktitle={ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models},
  year={2024},
  url={https://openreview.net/forum?id=bBF077z8LF}
}