Hey fellow dataset enthusiasts!

I’ve developed a robust public data collection engine that’s been quietly amassing an impressive dataset, and I’m curious about its potential applications and demand.

The Dataset

– Scale: Over 2 billion data points, with 10 million added per day (which works out to roughly 3.65 billion per year at our current rate)

– Sources: 6,000+ diverse and challenging public social media sources (X, Reddit, Bluesky, YouTube, Mastodon, Lemmy, TradingView, Bitcointalk, jeuxvideo.com, etc.)

– Collection: Near real-time capture

– Rich: Structured and annotated with translation, emotions, sentiment, top_keywords, and topics
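To make the "structured and annotated" part concrete, here is a minimal sketch of what one record might look like. Only the annotation names (translation, emotions, sentiment, top_keywords, topics) come from the description above; the other fields (`source`, `text`, `collected_at`), the value types, and the JSON Lines serialization are my assumptions for illustration, not the actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class AnnotatedPost:
    # Assumed fields (not stated in the post): where and when the item was captured.
    source: str          # e.g. "reddit"
    text: str            # original text as collected
    collected_at: str    # ISO 8601 timestamp
    # Annotation names taken from the post; the types are assumptions.
    translation: str = ""
    sentiment: float = 0.0
    emotions: Dict[str, float] = field(default_factory=dict)
    top_keywords: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


# A made-up example record, serialized as one JSON line.
example = AnnotatedPost(
    source="reddit",
    text="Bonjour tout le monde",
    collected_at="2024-01-01T00:00:00Z",
    translation="Hello everyone",
    sentiment=0.6,
    emotions={"joy": 0.7},
    top_keywords=["bonjour"],
    topics=["greetings"],
)
line = json.dumps(asdict(example), ensure_ascii=False)
print(line)
```

One JSON object per line would keep a daily slice easy to stream, but that is a design guess on my part rather than a description of the real export.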

We are a small, emerging startup, and I’m not trying to self-promote, so I won’t post the link or name here (PM me for that).

I was thinking of publishing datasets on Hugging Face. I could release several, in various forms, and I’d like to know which ones this community would be most interested in.

Possibilities are:

– A full slice of 1 day of data, with all annotations/attributes

– A sampled set from 1 source (for example, an X dataset or a Reddit dataset), up to about 10 million items

– etc.
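For the per-source option, a capped sample could be drawn in one pass over the stream with reservoir sampling, so the full source never has to sit in memory. This is only a sketch under assumptions: the `sample_source` helper, the dict-shaped records, and the `source` key are hypothetical and not how the collection engine actually works.

```python
import random
from typing import Any, Dict, Iterable, List


def sample_source(
    records: Iterable[Dict[str, Any]],
    source: str,
    k: int,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Uniformly sample up to k records from one source (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[Dict[str, Any]] = []
    seen = 0  # number of matching records encountered so far
    for rec in records:
        if rec.get("source") != source:
            continue
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k matching records.
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = rec
    return reservoir


# Toy stream: 100 Reddit items and 50 X items (made-up data).
stream = [{"source": "reddit", "id": i} for i in range(100)]
stream += [{"source": "x", "id": i} for i in range(50)]
subset = sample_source(stream, "x", k=10)
print(len(subset))
```

With a cap of 10 million, the same function would return every item when a source has fewer than that, and a uniform sample otherwise.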

What would be interesting to you all? We want to make a genuine gift to the open-source community, especially since Twitter/X shut down its free API and locked out 99.99% of OSINT practitioners and researchers.

submitted by /u/askolein

Billion Social Media Posts Datasets / Sample – Discussion
