A closer look at BookCorpus, the text dataset that helps train large language models for Google, OpenAI, Amazon, and others

BookCorpus has helped train at least thirty influential language models (including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort), according to HuggingFace. But what exactly is inside BookCorpus? This is the research question that Nicholas Vincent and I ask in a new working paper that attempts to address some of the…