AutoMathText — Autonomous Data Selection for Mathematical Texts

Overview

We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot “generative classifiers” to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model’s logits to decide whether a passage is mathematically informative and educational.

Integrated into a continual-pretraining pipeline, AutoDS substantially boosts downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods: roughly a 2× improvement in pretraining token efficiency over strong baselines. We release the curated AutoMathText dataset to facilitate research in automated domain-specific data curation.

Figure: Pretraining token efficiency versus baselines. AutoDS reaches strong math-benchmark accuracy with roughly half the pretraining tokens of competing data-selection baselines.

Autonomous Data Selection

Language models as zero-shot verifiers

Rather than training a separate classifier, AutoDS prompts a base language model and reads its logits to produce a continuous LM-Score for each passage. The score is an annotation-free signal of the passage's mathematical quality and educational value.

Figure: LM-Score from language-model logits. A base model acts as a zero-shot generative classifier: the LM-Score is read directly from its logits, with no labels or fine-tuning.
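
As a rough illustration of reading a score off logits, the sketch below assumes the model is asked a yes/no question about a passage, and that the score is the softmax probability of the "YES" answer token against the "NO" answer token. The prompt format, the choice of answer tokens, and the function name `lm_score` are assumptions made for illustration; this overview does not specify them.

```python
import math


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the logits of two answer tokens.

    Assumption: logit_yes and logit_no are the base model's next-token
    logits for "YES" and "NO" after a yes/no prompt about the passage.
    Returns a value in [0, 1]; higher means the model leans toward "YES".
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Under this assumption, equal logits give a score of 0.5, and a "YES" logit two units above the "NO" logit gives a score of about 0.88. Because only the difference between the two logits matters, the score is continuous and needs no labels or fine-tuning.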

LM-Score distribution

Figure: Distribution of LM-Scores across the corpus. The LM-Score cleanly separates mathematically informative passages from low-quality text across the corpus.
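
Once every passage carries a score, curation can be as simple as keeping the passages above a cutoff. The sketch below shows one way this might look. The function name `select_passages` and the default threshold of 0.6 are hypothetical, chosen only for illustration; this overview does not state the cutoff actually used.

```python
from typing import Iterable, List, Tuple


def select_passages(
    scored: Iterable[Tuple[str, float]],
    threshold: float = 0.6,  # hypothetical cutoff, not taken from the source
) -> List[str]:
    """Keep passages whose LM-Score is at or above the threshold.

    `scored` yields (passage_text, lm_score) pairs with scores in [0, 1].
    The original order of the passages is preserved.
    """
    return [text for text, score in scored if score >= threshold]
```

Raising the threshold trades corpus size for a higher expected quality per token, which is the lever a token-efficiency comparison would vary.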

Citation

If you use AutoMathText or AutoDS, please cite the paper and star the repo:

@article{zhang2025autonomous,
  title   = {Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts},
  author  = {Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C},
  journal = {The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 Findings)},
  year    = {2025}
}