Hello, I’m new to datasets and would like to see whether it’s possible to filter a dataset from Hugging Face before downloading it.

Hello everyone. I’m currently trying to find a reasonably complete corpus of text that is entirely public domain or under a free software / free culture license: something like a bundle of Wikipedia, Stack Overflow, Project Gutenberg, and maybe some GitHub repositories for good measure. I found that RedPajama is painfully close to that, but not quite:

- It includes the Common Crawl and C4 datasets, which are decidedly not completely open-source.
- It includes the arXiv dataset, which might work for my purposes, but it contains both openly licensed and proprietary-licensed papers, so it would need filtering before I proceed.
- It had to drop the Project Gutenberg parser because of issues with it accidentally fetching copyrighted content (!!)

So, what I would like to do with RedPajama is:

1. Fetch Wikipedia as usual, but also add other Wikimedia projects like Wikinews and Wiktionary, plus languages other than English, for completeness (as we’re ditching C4).
2. Fetch more of the Stack Overflow data to compensate for the lack of C4.
3. Fix the Gutenberg parser so it can actually download the public-domain books from there; alternatively, download the Wikibooks dataset instead.
4. Filter the arXiv dataset to remove anything not under a public-domain, CC-BY, or CC-BY-SA license, preferably before downloading each individual paper.
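Step 4 above boils down to a license allowlist check. A minimal sketch, assuming each arXiv record exposes its license as a Creative Commons URL in a `license` metadata field — both the field name and the URL format are assumptions, so check them against the actual RedPajama arXiv record schema:

```python
# Substrings identifying acceptable Creative Commons license URLs.
# These markers are assumptions about how the license field is formatted.
ALLOWED_LICENSE_MARKERS = (
    "publicdomain",      # CC0 / public-domain dedication URLs
    "licenses/by/",      # CC-BY, any version
    "licenses/by-sa/",   # CC-BY-SA, any version
)

def is_freely_licensed(record: dict) -> bool:
    """Return True if the record's license matches the allowlist.

    Note: "licenses/by/" does not accidentally match CC-BY-NC or
    CC-BY-SA URLs, because the slash right after "by" rules them out;
    CC-BY-SA is admitted by its own marker.
    """
    license_str = (record.get("license") or "").lower()
    return any(marker in license_str for marker in ALLOWED_LICENSE_MARKERS)
```

Once the field name is confirmed, the same predicate can be passed directly to the `datasets` library's `.filter(...)`.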

Is it possible to do all that with a Hugging Face `datasets` script, or do I need to do some manual pruning after downloading the entire RedPajama dataset instead?
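On the “filter before downloading everything” part: the `datasets` library’s streaming mode may be enough on its own. `load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv", streaming=True)` returns an `IterableDataset` whose `.filter(...)` is applied lazily as records arrive, so rejected papers are never materialized on disk (note this still streams each record’s bytes over the network before the predicate runs — it avoids storing, not transferring; the `arxiv` config name is from memory, so verify it on the dataset card). A dependency-free sketch of that lazy-filter shape, using a plain generator in place of the streamed dataset and fabricated records:

```python
from itertools import islice

def lazy_filter(records, predicate):
    """Yield only records passing predicate, one at a time,
    mimicking IterableDataset.filter over a streamed corpus."""
    return (r for r in records if predicate(r))

def cc_only(record):
    """Keep CC-BY / CC-BY-SA records; field name and values are assumptions."""
    lic = (record.get("license") or "").lower()
    return "licenses/by/" in lic or "licenses/by-sa/" in lic

# Stand-in for a streamed arXiv split (fabricated example records).
fake_stream = iter([
    {"id": "a", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"id": "b", "license": "proprietary"},
    {"id": "c", "license": "http://creativecommons.org/licenses/by-sa/4.0/"},
])

# islice caps how much of the stream is consumed, useful while testing.
kept = list(islice(lazy_filter(fake_stream, cc_only), 10))
```

Because everything is a generator, nothing is pulled from the stream until `kept` is built, which is the same behavior that makes streaming filters cheap on a terabyte-scale corpus.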

submitted by /u/csolisr
