Hey fellow dataset enthusiasts!
I’ve developed a robust public data collection engine that’s been quietly amassing an impressive dataset, and I’m curious about its potential applications and demand.
The Dataset
– Scale: Over 2 billion data points, with 10 million added per day (about 3.65 billion per year at our current rate)
– Sources: 6000+ diverse and challenging public social media sources (X, Reddit, Bluesky, YouTube, Mastodon, Lemmy, TradingView, bitcointalk, jeuxvideo.com, etc.)
– Collection: Near real-time capture
– Rich: Structured and annotated with translation, emotions, sentiment, top_keywords, topics
We are a small, emerging startup, and of course I'm not trying to do self-promotion, so I won't post the link/name (PM me for that).
I was thinking of opening datasets on Hugging Face. I could publish several, in various forms, and wanted to know what this community would be most interested in.
Possibilities are:
– A full slice of 1 day of data, with all annotations/attributes
– A sampled set of 1 source (for example X dataset, Reddit dataset) up to like 10 million items
– etc.
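To make the two options above a bit more concrete, here is a rough Python sketch of how each cut could be produced from a stream of records. It is only an illustration, not our actual pipeline: the field names `source` and `created_at` are made up for the example, and the reservoir sampling (Algorithm R) is just one reasonable way to get a uniform sample without loading everything in memory.

```python
import random
from datetime import timezone


def day_slice(records, day):
    """Keep every record whose UTC timestamp falls on the given calendar day.

    Assumes each record is a dict with a timezone-aware `created_at`
    datetime (hypothetical field name).
    """
    return [
        r for r in records
        if r["created_at"].astimezone(timezone.utc).date() == day
    ]


def sample_source(records, source, k, seed=0):
    """Uniform sample of up to k records from one source (Algorithm R).

    Works in a single pass over any iterable, so the full dataset never
    has to fit in memory. `source` is a hypothetical field name.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of matching records seen so far
    for r in records:
        if r["source"] != source:
            continue
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k matching records.
            reservoir.append(r)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = r
    return reservoir
```

For example, `day_slice(records, date(2024, 1, 2))` would give the full one-day slice, and `sample_source(records, "reddit", 10_000_000)` a capped single-source sample.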
What would be interesting to you all? We want to make a genuine gift to the Open Source community, especially since Twitter/X shut down its free API & locked out 99.99% of OSINT/researchers.
submitted by /u/askolein