Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

[self-promotion] Giving Back To The Datasets Community With Some Free Data!

Hey guys,

I just wanted to share our project called Potarix (https://potarix.com/). It’s an AI-powered web scraping/data extraction tool that can pull data from any website. You can use it at https://app.potarix.com.

I wanted to give back to this community, so we’ve given everyone who signs up $5 of credits. Scraping each page uses $0.10 of your credits, and you are not charged for unsuccessful scrapes! That should let you get data from 50 web pages.

So far, we’ve used this project (with some added features) to help clients:

– Scrape betting data from the NFL, NBA, and NCAA
– Scrape all the Google reviews for each business in San Francisco
– Scrape business contact information on Google Maps for every single business in the Houston area

Looking ahead, we’ve built some things in-house that we’d love to add to the SaaS platform shortly. We’ve built functionality to click, type, scroll, etc. on the page. AI also tends to be wrong sometimes, so we created a tweakable script in the backend to control the agent’s actions; that way you’re in control and can bring the script to 100% accuracy. We’ve also seen people struggling to build infrastructure for their large-scale scraping projects, so we want to let folks set up parallelization autonomously and choose the infrastructure for their project, so everything is scraped as quickly and efficiently as possible from the SaaS.
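As a rough illustration of the kind of parallelized scraping we have in mind (the URLs and parsing logic below are hypothetical placeholders, not the Potarix API), a minimal Python sketch could look like this:

```python
# Minimal sketch of parallelized scraping with a thread pool.
# The URL list and parsing logic are hypothetical placeholders, not the Potarix API.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup

URLS = [f"https://example.com/listing?page={i}" for i in range(1, 51)]

def scrape_page(url: str) -> dict:
    """Fetch a single page and pull out a few fields."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {"url": url, "title": soup.title.string if soup.title else None}

results, failures = [], []
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(scrape_page, u): u for u in URLS}
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception:
            failures.append(futures[fut])  # retry or log unsuccessful scrapes

print(f"{len(results)} pages scraped, {len(failures)} failed")
```

The idea is that the SaaS would pick the worker count and infrastructure for you instead of you hard-coding values like max_workers.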

If any of these future features sound interesting, feel free to book some time, and we can discuss how we can help you with these now!

submitted by /u/youngkilog

Multi-Source Rich Social Media Dataset – A Full Month Of Global Chatter!

Hey, data enthusiasts and web scraping aficionados!
We’re thrilled to share a massive new social media dataset that just dropped on Hugging Face! 🚀

Access the Data:

👉Exorde Social Media One Month 2024

What’s Inside?

– Scale: 270 million posts collected over one month (Nov 14 – Dec 13, 2024)
– Methodology: total sampling of the web, statistical capture of all topics
– Sources: 6,000+ platforms including Reddit, Twitter, BlueSky, YouTube, Mastodon, Lemmy, and more
– Rich annotations: original text, metadata, emotions, sentiment, top keywords, and themes
– Multi-language: covers 122 languages with translated keywords
– Unique features: English top keywords, allowing super-quick statistics and trend/time-series analytics!
– Source: at Exorde Labs, we are processing ~4 billion posts per year, or 10–12 million every 24 hrs.
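If you want to peek at the data without downloading the full dump, the Hugging Face datasets library supports streaming; the repository ID below is a placeholder, so substitute the actual name from the link above:

```python
# Stream the first few records without downloading the full 270M-post dump.
# "Exorde/exorde-social-media-one-month-2024" is a placeholder ID; use the
# actual repository name from the Hugging Face link above.
from datasets import load_dataset

ds = load_dataset("Exorde/exorde-social-media-one-month-2024",
                  split="train", streaming=True)

for i, post in enumerate(ds):
    print(post)  # original text, metadata, sentiment, top keywords, ...
    if i >= 4:
        break
```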

Why This Dataset Rocks

This is a goldmine for:

– Trend analysis across platforms
– Sentiment/emotion research (algo trading, OSINT, disinfo detection)
– NLP at scale (language models, embeddings, clustering)
– Studying information spread & cross-platform discourse
– Detecting emerging memes/topics
– Building ML models for text classification

Whether you’re a startup, data scientist, ML engineer, or just a curious dev, this dataset has something for everyone. It’s perfect for both serious research and fun side projects. Do you have questions or cool ideas for using the data? Drop them below.

We’re processing over 300 million items monthly at Exorde Labs—and we’re excited to support open research with this Xmas gift 🎁. Let us know your ideas or questions below—let’s build something awesome together!

Happy data crunching!

Exorde Labs Team – A unique network of smart nodes collecting data like never before

submitted by /u/Exorde_Mathias

Looking For Fraud Detection Datasets

I am writing a book chapter on fraud detection using machine learning. I found that most of the current research is rather hard to apply for someone actually building models: every paper likes to highlight the lack of good datasets, but no one provides a collection of good datasets that readers of the paper can actually use.

I think that if I include some good datasets for people to train their models on in my chapter, then that will be a very good contribution from my side.

Do you know any good datasets that are used for this, or where I can look for such datasets?

I am honestly clueless when it comes to collecting and finding good datasets for industry-grade applications, and I will be really grateful for any help that I get 🙏🙏

submitted by /u/mystic-aditya

NFL Data Help For Expected Hypothetical Completion Probability

Currently trying to predict the 2025 Super Bowl winner for a college final presentation. I’m trying to use Expected Hypothetical Completion Probability (EHCP) from the 2019 Big Data Bowl by seeing which teams best optimize their playbook for EHCP, and whether there is a correlation between that and how often they win/complete passes, but I’m having trouble finding a data source.

The EHCP metric requires two main types of data:

1. Play-by-Play Data:

Includes high-level information like down, distance, time remaining, score differential, and whether the pass was completed.

2. Player Tracking Data:

Tracks the location of players and the ball during each play.

Key elements:

– Receiver and defender positions
– Ball location during the pass
– Receiver separation, speed, and direction

I was directed to pff.com and https://nextgenstats.nfl.com/ so far, but I am having trouble finding complete datasets for exactly what I need. Anything helps, so please let me know!
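For reference, a minimal pandas sketch of the kind of join and feature preparation EHCP needs once both tables are in hand (all column names here are hypothetical placeholders and will differ by source):

```python
# Hypothetical sketch of joining play-by-play and tracking data to build
# EHCP-style features; column names are placeholders and will differ by source.
import numpy as np
import pandas as pd

pbp = pd.read_csv("play_by_play.csv")    # down, distance, score_diff, complete_pass, ...
tracking = pd.read_csv("tracking.csv")   # game_id, play_id, frame_id, player positions

# Keep the frame at pass arrival and compute receiver-defender separation.
arrival = tracking[tracking["event"] == "pass_arrived"]
features = arrival.assign(
    separation=np.hypot(arrival["receiver_x"] - arrival["defender_x"],
                        arrival["receiver_y"] - arrival["defender_y"])
)

ehcp_input = pbp.merge(
    features[["game_id", "play_id", "separation", "receiver_speed"]],
    on=["game_id", "play_id"], how="inner",
)
print(ehcp_input.head())
```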

submitted by /u/B2_CROPFARMER

Institutional Data Initiative Plans To Release A Dataset “5 Times That Of Books3” In Early 2025

https://institutionaldatainitiative.org/

https://www.wired.com/story/harvard-ai-training-dataset-openai-microsoft/

Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright… with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries… In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line.

submitted by /u/furrypony2718

Looking For Additional US National Pollutant & Animal Movement Datasets

Looking to do some analyses on animal movement in relation to pollutants and anthropogenic landscape features. I have a few datasets/sites collected already, but I’m wondering if I’m missing anything. In particular, I’m looking for higher-resolution data on lead, other cognition-impairing or mutagenic substances, and rodenticides.

Datasets below in case they’re of use to anyone:

Animal Movement:

Movebank: https://www.movebank.org/cms/movebank-main

Animal Telemetry Network: https://portal.atn.ioos.us/#map

Pollutants:

Enviroatlas: https://enviroatlas.epa.gov/enviroatlas/interactivemap/

Uranium mines: https://andthewest.stanford.edu/2020/uranium-mine-sites-in-the-united-states/

Oil Refineries: https://atlas.eia.gov/datasets/eia::petroleum-refineries-1/explore?location=33.922439%2C-118.375771%2C10.55

Superfund sites: https://www.epa.gov/superfund/search-superfund-sites-where-you-live

PFAS: https://www.ewg.org/interactive-maps/pfas_contamination/map/

Heavy Metals: https://www.sciencedirect.com/science/article/pii/S0048969724011112

ATTAINS water inventory: https://www.epa.gov/waterdata/get-data-access-public-attains-data
NATA /AQS air quality: https://aqs.epa.gov/aqsweb/documents/data_api.html#annual
Toxic release: https://www.epa.gov/toxics-release-inventory-tri-program
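In case it helps anyone combining these, a typical first step is a spatial join between movement fixes and pollutant site locations; here is a minimal GeoPandas sketch (file names and the projected CRS are assumptions for illustration):

```python
# Minimal sketch: attach the nearest pollutant site (and distance) to each
# animal GPS fix. File names and the projected CRS are illustrative assumptions.
import geopandas as gpd
import pandas as pd

fixes = pd.read_csv("movebank_fixes.csv")  # columns: animal_id, lon, lat, timestamp
fixes = gpd.GeoDataFrame(
    fixes, geometry=gpd.points_from_xy(fixes["lon"], fixes["lat"]), crs="EPSG:4326"
)
sites = gpd.read_file("superfund_sites.geojson").to_crs("EPSG:5070")

joined = gpd.sjoin_nearest(
    fixes.to_crs("EPSG:5070"), sites, how="left", distance_col="dist_m"
)
print(joined[["animal_id", "timestamp", "dist_m"]].head())
```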

submitted by /u/latrans_canis_

Can We Automate Data Quality Assessment Process For Small Datasets?

Recently, my friend and I have been thinking of working on a side project (for our portfolios) to automate data quality assessment for the small tabular datasets you often find on Kaggle.

We acknowledge that such a tool can’t be 100% accurate, but it can definitely help both non-technical and technical people get started with their datasets. We aim to have a platform where the user uploads a dataset, the system identifies anomalies, and it gives the user suggestions for different ways to fix each anomaly (e.g. imputing a missing value, fixing an email that doesn’t follow the email pattern, etc.).

I would love to discuss the project further and get your thoughts on it. We have been researching similar projects and found Cocoon: it proceeds column by column, and for each column it has a series of anomalies to fix using an LLM. We want to use statistical methods for numerical columns instead, and use an LLM only when it’s needed. Can anyone help?
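To make the idea concrete, here is a rough sketch of the kind of rule-based statistical checks we have in mind per column before ever falling back to an LLM (the thresholds and email regex are just illustrative assumptions):

```python
# Rough sketch of per-column checks: missing values, IQR outliers for numeric
# columns, and a regex check for email-like columns. Thresholds and the email
# pattern are illustrative assumptions, not a finished spec.
import pandas as pd

EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def assess_column(s: pd.Series) -> dict:
    report = {"column": s.name, "missing_pct": float(s.isna().mean() * 100)}
    if pd.api.types.is_numeric_dtype(s):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
        report["outlier_count"] = int(outliers.count())
        report["suggestion"] = "consider median imputation or winsorizing"
    elif "email" in str(s.name).lower():
        valid = s.dropna().astype(str).str.match(EMAIL_PATTERN)
        report["invalid_emails"] = int((~valid).sum())
        report["suggestion"] = "flag rows with malformed addresses"
    return report

df = pd.read_csv("some_kaggle_dataset.csv")
print([assess_column(df[c]) for c in df.columns])
```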

submitted by /u/Better_Resource_4765

Is Anyone Aware Of Any Country-wide, Detailed And Multi-topic Attitude And Behavior Polls?

As the title states, I’m looking for some country-wide datasets which cover topics like people’s views and behaviors concerning technology, the environment, and beyond, in a detailed way. What I’m looking for goes a little more in-depth than most national/international polls — for example, the European Social Survey will also cover niche topics, but will usually only ask a question or two about them.

The UK Household Longitudinal Study is an excellent example, but I’m wondering if these kinds of datasets exist for other countries, or even across countries. The Gallup World Poll also seems to cover these topics in a multi-country context, but is behind a paywall.

Any recommendations would be greatly appreciated!

submitted by /u/oliveheron

Words That Do Not Convey The Subject Of A Sentence

Hi all! I’m building an application that automatically quizzes you on textual datasets! So far things are working brilliantly, but I’m running into an issue: I want to remove words that are “uninteresting” for quizzing. My exact problem is that I don’t know how to describe these words, so I don’t know what to look up. I’ll show an example instead.

“The mitochondria is the powerhouse of the cell”

If I had a simple fill-in-the-blanks question, I would want to avoid blanking “the”, “is”, and “of”, as that would make for a very boring quiz question. I’m not a linguist, but from my rudimentary knowledge I don’t know of a linguistic term that covers these words, since in the general case they aren’t just prepositions, for example.
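For reference, words like these are usually called stop words (or function words), and most NLP libraries ship a list of them; here is a minimal sketch using NLTK’s English stop-word list (the blank-everything-else heuristic is just an illustrative assumption):

```python
# Minimal sketch: blank only words that are not in NLTK's English stop-word
# list. The choice to blank every non-stop word is a simplifying assumption.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def blank_candidates(sentence: str) -> list[str]:
    """Return the words worth blanking for a fill-in-the-blank question."""
    return [w for w in sentence.split() if w.lower().strip(".,") not in STOP]

print(blank_candidates("The mitochondria is the powerhouse of the cell"))
# -> ['mitochondria', 'powerhouse', 'cell']
```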

Best case, someone already knows a dataset of words that I can use, but I would really appreciate any help for even what to look up on this topic.

I hope this is appropriate to ask here, else, forgive me and I’ll happily take recommendations for where else to ask!

Many thanks

submitted by /u/langers8

Billion Social Media Posts Dataset / Sample – Discussion

Hey fellow datasets enthusiasts!

I’ve developed a robust public data collection engine that’s been quietly amassing an impressive dataset, and I’m curious about its potential applications and demand.

The Dataset

– Scale: over 2 billion data points, with 10 million added per day (4 billion per year at our current rate)
– Sources: diverse and challenging public social media sources (X, Reddit, BlueSky, YouTube, Mastodon, Lemmy, TradingView, bitcointalk, jeuxvideo.com, etc.) (6,000+ sources)
– Collection: near real-time capture
– Rich: structured and annotated with translation, emotions, sentiment, top keywords, and topics

We are a small, emerging startup, and of course I’m not trying to do self-promotion, so I won’t write the link/name (PM me for that).

I was thinking of opening up datasets on Hugging Face. I could publish several, in various forms, and I wanted to know what this community would be most interested in.

Possibilities are:

– A full slice of 1 day of data, with all annotated/attributes

– A sampled set of 1 source (for example X dataset, Reddit dataset) up to like 10 million items

– etc.

What would be interesting to you all? We want to make a genuine gift to the open-source community, especially since Twitter/X shut down its free API and locked out 99.99% of OSINT analysts and researchers.

submitted by /u/askolein

[self-promotion] Introducing My Newegg & Glovo Scrapers On Apify

Heyo!

I’m a Computer Science MSc student with a recent interest in web scraping and data automation. Over the past few years, I’ve honed my skills in backend development and web scraping, and I’m excited to share two Apify Actors I’ve developed to help you build comprehensive datasets effortlessly.

🔍 What I Built:

– Newegg Scraper (Newegg Scraper on Apify)
  Features: Extracts detailed product information, pricing, customer reviews, and category listings from Newegg.
  Use Cases: Ideal for creating datasets for market analysis, price tracking, and competitive research in the electronics and e-commerce sectors.
– Glovo Scraper (Glovo Scraper on Apify)
  Features: Gathers comprehensive restaurant data, including names, addresses, delivery fees, promotions, and menu items from Glovo.
  Use Cases: Perfect for building datasets related to food delivery services, local restaurant analysis, and market trend tracking.
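For anyone who prefers pulling results programmatically rather than through the Apify console, the official apify-client package works roughly like this (the actor ID and input fields below are placeholders; check each actor’s input schema for the real ones):

```python
# Rough sketch of calling an Apify actor and reading its dataset with the
# official apify-client. The actor ID and run_input keys are placeholders;
# see the actor's input schema on Apify for the real fields.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("username/newegg-scraper").call(
    run_input={"searchTerm": "rtx 4070", "maxItems": 100}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```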

Why These Scrapers?

Building high-quality datasets can be time-consuming and technically challenging. These scrapers are designed to simplify the data collection process, providing you with structured and ready-to-use data for your projects. Whether you’re conducting research, developing machine learning models, or performing business intelligence, these tools can save you valuable time.

Seeking Your Feedback:

I’m eager to hear your thoughts! If you have any suggestions for improvements, additional features you’d like to see, or feedback on your experience using these scrapers, please let me know. Your insights are invaluable in making these tools even better for the community.

Thank you for your time, and happy data hoarding! 🗄️✨

submitted by /u/Rorisjack

Data Provenance: What Solutions Are You Using, If Any?

Hello everyone,

I’m curious about how people in this community are handling data provenance. For those unfamiliar, data provenance is about tracking the origins and transformations of data throughout its lifecycle.

Are you currently using any tools or methods to track the provenance of your datasets? If yes, what solutions are you using? Are they custom-built or off-the-shelf? If not, do you see a need for such tools in your work? What features would you consider essential in a data provenance solution?
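As a concrete example of what a minimal home-grown approach might look like (purely illustrative, not a recommendation of any particular tool), a sidecar provenance record could be as simple as:

```python
# Purely illustrative sketch of a sidecar provenance record: a content hash,
# the source, and the list of transformations applied, written next to the file.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def record_provenance(path: str, source: str, transformations: list[str]) -> None:
    data = Path(path).read_bytes()
    record = {
        "file": path,
        "sha256": hashlib.sha256(data).hexdigest(),
        "source": source,
        "transformations": transformations,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(path + ".provenance.json").write_text(json.dumps(record, indent=2))

record_provenance(
    "cleaned_survey.csv",
    source="https://example.org/raw_survey.csv",
    transformations=["dropped rows with missing age", "normalized country names"],
)
```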

submitted by /u/crtahlin

Retail Electricity Prices In PJM And ISO-NE Operation Regions

I am trying to decompose retail electricity prices into their components (transmission costs, fuel costs, etc.) and discuss the determinants of retail energy prices in these two markets. My overarching goal is to explain the reason(s) behind the different energy costs faced by retail customers across the US. These two regions have the most similar markets among those with organized capacity markets (although correct me if I am wrong). These regions have consistently high prices, but what explains this discrepancy compared to the rest of the country? Locational Marginal Prices would also work.

Any advice is greatly appreciated. Thanks in advance!

submitted by /u/capricious_scales

Final Year Project In Data Analytics

Hi all,

I am currently a Malaysian student in my final year, with my FYP pending. I am studying computer science, specialising in Data Analytics. I’ll need to do the standard data pre-processing, visualisation, model building, etc. However, it is mandatory to include one of the SDG goals in my overall project.

I just need some advice on which potential topics I could go into, as I keep overthinking every topic and am struggling to settle on one. And if anyone could help me find some good datasets to go with the topic, that would be very much appreciated.

Thanks to anyone who takes time to read this!

submitted by /u/Shadow_Wing210