Datasets

Pretraining Effectively on S2ORC

The peS2o dataset is a collection of ~40M open-access academic papers that have been cleaned, filtered, and formatted for pretraining language models. It is derived from the Semantic Scholar Open Research Corpus (Lo et al., 2020), or S2ORC.


Dataset Provider
External

Dataset Status
Released
