We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot “generative classifiers” to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model’s logits to decide whether a passage is mathematically informative and educational.
Integrated into a continual-pretraining pipeline, AutoDS substantially boosts downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods, achieving roughly a 2× improvement in pretraining token efficiency over strong baselines. We release the curated AutoMathText dataset to facilitate research in automated domain-specific data curation.
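
To make the logit-based decision concrete, the following is a minimal sketch of how a zero-shot "generative classifier" score could be computed. It assumes the base model is prompted with a yes/no question about a passage and that the logits of the two answer tokens are then compared. The function names, the two-token formulation, and the 0.6 threshold are illustrative assumptions, not details taken from the text above.

```python
import math
from typing import List, Tuple


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the 'yes' and 'no' answer-token logits.

    Returns the probability mass assigned to 'yes' when only these
    two tokens are considered (an assumed scoring rule).
    """
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_passages(
    scored: List[Tuple[str, float, float]],
    threshold: float = 0.6,  # hypothetical cutoff, chosen only for illustration
) -> List[str]:
    """Keep passages whose score meets the threshold.

    Each item in `scored` is (passage_text, logit_yes, logit_no), where the
    logits are assumed to come from a single forward pass of the base model.
    """
    return [text for text, ly, ln in scored if lm_score(ly, ln) >= threshold]


if __name__ == "__main__":
    examples = [
        ("Proof that the square root of 2 is irrational.", 3.1, 0.4),
        ("Top ten celebrity diets of the year.", -1.2, 2.5),
    ]
    print(select_passages(examples))
```

Because the score is continuous rather than a hard label, a pipeline built this way could also rank passages or sweep the threshold to trade dataset size against quality.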


