SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
Saved on 2024-08-19 [19954 edays] via digitalcorpora.org
Modified 2024-08-19 [19954 edays]
data
This corpus contains nearly 8 million PDFs gathered from across the web in July/August of 2021. The PDF files were initially identified by Common Crawl as part of their July/August 2021 crawl (identified as CC-MAIN-2021-31) and subsequently updated and collated as part of the DARPA SafeDocs program.