I recently created my first public dataset focused on cryptocurrency sentiment analysis and Bitcoin market forecasting. The dataset contains around 20,000 Reddit posts collected from major crypto communities between 2017 and 2025 using PRAW (the Python Reddit API Wrapper).
It includes:
- Reddit post metadata
- Cleaned text features
- Crypto-enhanced VADER sentiment
- Custom FinBERT sentiment scores
- Bitcoin prices and returns
- Binary BTC movement labels for 1h, 6h, 12h, and 24h horizons
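
For anyone wondering how binary movement labels like these can be derived, here is a minimal sketch. It is not necessarily how this dataset was built: it assumes hourly BTC prices sorted by time, uses the first price at or after the post timestamp as the entry price, and marks a label as 1 when the forward return over the horizon is strictly positive. The names `label_post` and `HORIZONS_H` are made up for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Horizons in hours, matching the 1h/6h/12h/24h labels listed above.
HORIZONS_H = (1, 6, 12, 24)


def label_post(post_time, times, prices, horizons=HORIZONS_H):
    """Return {horizon_hours: (forward_return, label)} for one post.

    Assumptions (illustrative, not taken from the dataset):
    - `times` is a sorted list of datetimes, `prices` the matching BTC prices.
    - The entry price is the first price at or after `post_time`, so no
      information from before the post is mixed into the label.
    - label = 1 if the forward return is > 0, else 0.
    - If a price is missing (horizon runs past the data), both values are None.
    """

    def price_at_or_after(t):
        i = bisect_left(times, t)
        return prices[i] if i < len(times) else None

    entry = price_at_or_after(post_time)
    labels = {}
    for h in horizons:
        future = price_at_or_after(post_time + timedelta(hours=h))
        if entry is None or future is None:
            labels[h] = (None, None)
            continue
        forward_return = future / entry - 1.0
        labels[h] = (forward_return, int(forward_return > 0))
    return labels


if __name__ == "__main__":
    # Synthetic hourly prices, purely for demonstration.
    start = datetime(2024, 1, 1)
    times = [start + timedelta(hours=i) for i in range(48)]
    prices = [40000.0 + 50.0 * i for i in range(48)]
    print(label_post(datetime(2024, 1, 1, 0, 30), times, prices))
```

Returning `None` for horizons that run past the end of the price series keeps incomplete rows from being silently labeled as 0, which would otherwise bias the class balance toward "down".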
The dataset was built for financial NLP, sentiment analysis, and forecasting research. I am still learning dataset engineering and would appreciate feedback, suggestions, or ideas for improvement.
submitted by /u/Cyclo_Studios