Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Any Sources For Recipe Databases That Can Be Used Commercially With Actual Database Licensing?

Can anyone point me towards actual recipe database(s), not API services, that permit commercial use?

I’m looking to do a project, with a view to eventual commercial implementation, based around ingredient/recipe matching. I am aware that online recipe matching is a crowded field, with many web services already offering simple recipe matching. I have a couple of specific angles that make my idea different. I don’t want to go into them here, but I have not seen anyone else doing them.

There are also many recipe API services with, of course, tiered pricing, rate limiting, and so on. The fundamental problem with using third-party recipe APIs is that, cost aside, it’s essentially impossible to query outside of the search parameters they already provide. I am not interested in putting together my own clone of what’s fundamentally a widely and freely available turnkey service. If my thing is no different, then I see no point.

For my project to work, I need to be able to access a recipe database directly, not just run queries that someone else already thought of through their API. I would be happy to self-host this, but I have to get the data from somewhere. Is anyone able to suggest sources for actual database access, either to query against directly or to clone for self-hosting? So far, everything I have found is one of three things: non-commercial only, with no other licensing option presented; datasets that people have scraped and posted on Kaggle; or things that aren’t actually recipe databases, e.g. Nutritionix.

Thanks

submitted by /u/SquiffSquiff

[REQUEST] Reliable Football (Soccer) Data API (Live Scores + Player & Club Stats)

Looking for a reliable and frequently updated football data API that covers: Premier League, Serie A, La Liga, Bundesliga, Ligue 1, and EFL Championship.

What I need

  • Competitions: EPL, Serie A, La Liga, Bundesliga, Ligue 1, EFL Championship
  • Data types:
    • Live: match scores, ongoing results, live match events (goals, cards, substitutions, etc.)
    • Recent: updated league tables and standings (within minutes of change)
    • Player stats: appearances, minutes, goals, assists, xG/xA if available
    • Club stats: team form, possession, shots, xG/xGA, PPDA, etc.
    • Historical: access to past seasons (preferably 2010/11 → present)
  • Update frequency: Real-time or near real-time (<1-min delay preferred)
  • Format: JSON REST API or GraphQL, with good documentation
  • Licensing: Open or paid (just needs clear usage rights and stable uptime)

Bonus

  • Webhooks or push updates for live events
  • Consistent player/club IDs across seasons
  • Advanced metrics (xG models, passing maps, pressure events)

If you know any trusted APIs or data providers, please share:

  • Link
  • Coverage (competitions + seasons)
  • Update frequency
  • Known limitations
  • Pricing/licence details

Thanks in advance. I’ll compile and share the best options for others looking for up-to-date football data.

submitted by /u/isolba9

Dearly Departed Datasets. Federal Datasets That We Have Lost, Are Losing, Or Have Had Recent Alterations. America’s Essential Data

Two websites are tracking deletions of, changes to, or reduced accessibility of Federal datasets.

America’s Essential Data
America’s Essential Data is a collaborative effort dedicated to documenting the value that data produced by the federal government provides for American lives and livelihoods. This effort supports federal agency implementation of the bipartisan Evidence Act of 2018, which requires that agencies prioritize data that deeply impact the public.

https://fas.org/publication/deleted-federal-datasets/

They identified three types of data decedents. Examples are below, but visit the Dearly Departed Dataset Graveyard at EssentialData.US for a more complete tally and relevant links.

  1. Terminated datasets. These are data that used to be collected and published on a regular basis (for example, every year) and will no longer be collected. When an agency terminates a collection, historical data are usually still available on federal websites. This includes the well-publicized terminations of USDA’s Current Population Survey Food Security Supplement, and EPA’s Greenhouse Gas Reporting Program, as well as the less-publicized demise of SAMHSA’s Drug Abuse Warning Network (DAWN). Meanwhile, the Community Resilience Estimates Equity Supplement that identified neighborhoods most socially vulnerable to disasters has both been terminated and pulled from the Census Bureau’s website.
  2. Removed variables. With some datasets, agencies have taken out specific data columns, generally to remove variables not aligned with Administration priorities. That includes Race/Ethnicity (OPM’s Fedscope data on the federal workforce) and Gender Identity (DOJ’s National Crime Victimization Survey, the Bureau of Prisons’ Inmate Statistics, and many more datasets across agencies).
  3. Discontinued tools. Digital tools can help a broader audience of Americans make use of federal datasets. Departed tools include EPA’s Environmental Justice Screening and Mapping tool – known to friends as “EJ Screen” – which shined a light on communities overburdened by environmental harms, and also Homeland Infrastructure Foundation-Level Data (HIFLD) Open, a digital go-bag of ~300 critical infrastructure datasets from across federal agencies relied on by emergency managers around the country.

submitted by /u/Slight-Fix9564

Should I Upload My Skin Condition Dataset To Kaggle For Others To Use?

Hi everyone,
I’ve been working on a skin condition detection project using CNNs, with 5 classes — Wrinkles, Hyperpigmentation, Blackheads, Acne, and Open Pores.
I’ve collected around 3,000 images per class from various open sources and uploaded them to Google Drive for model training.

Now that I’ve trained and saved my model weights, I’m planning to delete the dataset from Drive to save space. But since I worked really hard to collect and clean it, I don’t want it to go to waste.

Can I upload the dataset to Kaggle Datasets for free and reference it in my GitHub project for future users?
Or is there a better alternative for sharing it publicly with proper licensing and access?

Any advice or experience sharing datasets like this would be super helpful.

submitted by /u/Plane_Race_840

Where Can I Find Or Download The OpenDNS (Cisco Umbrella) Domain Tagging Dataset?

Hey everyone,

I’m working on a small project related to website characterization and categorization — basically classifying domains into types like E-commerce, News, Social Media, Adult, etc.

I’ve heard that OpenDNS (now Cisco Umbrella) has a large Domain Tagging dataset where domains are categorized by the community. I’d love to use it (or even a subset) as part of my training or benchmarking data.

However, I can’t find any public dataset download or API endpoint that provides the full tagged domain list — only individual lookups or some small sample lists.

Does anyone know of:

  • a public mirror, dump, or archive of the OpenDNS domain tagging data?
  • or a similar open alternative dataset with website categories that can be used for machine learning/research purposes?

I’ve already checked the official OpenDNS community site and Cisco forums, but I didn’t see a bulk export option.
Any pointers, mirrors, or even partial exports would be amazing.

Thanks in advance!

OpenDNS Link: https://community.opendns.com/domaintagging/

submitted by /u/mrjohndoe42069

Looking For Solar Panel Defect Dataset With Bounding Box Annotations (RGB / IR / EL)

I’m working on a computer vision project for solar panel defect detection and localization. Specifically, I need datasets where defects are annotated with bounding boxes so the model can learn to detect where the problem is, not just classify the image as faulty or normal. I want to download the data and work locally, and I don’t want to use any online platforms for training.

submitted by /u/Successful-Life8510

[Help] Retrieving Fuel Station Brand Names (Enseignes) Without Scraping

Hello everyone,

I’m developing a mobile app (Expo / React Native + Flask backend) that displays fuel station prices.

I already consume the official dataset Prix des carburants en temps réel, available on data.gouv.fr, which provides station IDs, addresses, GPS coordinates, and prices.

Problem: this feed does not consistently include the stations’ commercial name (the “enseigne”, i.e. the brand), e.g. TotalEnergies, Leclerc, Intermarché, Carrefour Market…

I’m looking for a legal and durable solution, without scraping, to associate each station with its brand.
The goal is to display in the app:

  • the station name,
  • its full address,
  • the up-to-date fuel prices.

  • Is there an official dataset (CSV / JSON / API) that links station identifiers (id, adresse, cp, ville) to their brand / commercial name? → If so, could you give the exact link or the name of the dataset?

  • If this dataset is not public:

    • do you know which body / contact (DGEC, Ministry, etc.) manages the data?
    • and how to ask them for permission to reuse the “enseigne” fields?
  • Do you know of a legal alternative source (for example regional open data, INSEE, or professional databases) for obtaining the corresponding brands?

  • On the technical side: would you recommend preloading these mappings server-side (e.g. a SQLite table or an imported CSV) to avoid excessive calls or client-side scraping?

  • Finally, if anyone has already merged these data (via ID, address, or geolocation), I would be very interested in:

    • an example mapping (a few lines of anonymized CSV),
    • or a reliable matching method to reproduce.

Constraints

  • No scraping of the official site (prix-carburants.gouv.fr)
  • The app will be published on the App Store / Play Store, so the source must be official, public, and reusable (open license).

Example of what I need:

I would like to end up with a data structure like this:

{ "id_station": "12345678", "enseigne": "TotalEnergies", "adresse": "4 Rue Étienne Kernours", "ville": "Douarnenez", "prix_gazole": 1.622, "prix_sp98": 1.739 }
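To make the geolocation matching idea concrete, here is a minimal sketch, not a vetted method. It assumes a second, legally reusable source that lists brand names with coordinates. That source, the `lat`/`lon` keys, the 50 m threshold, and the sample coordinates are all hypothetical placeholders.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def attach_brand(stations, brand_points, max_dist_m=50.0):
    """Copy each station and add 'enseigne' from the nearest brand point.

    A station with no brand point within max_dist_m gets enseigne=None.
    """
    out = []
    for s in stations:
        best, best_d = None, max_dist_m
        for b in brand_points:
            d = haversine_m(s["lat"], s["lon"], b["lat"], b["lon"])
            if d <= best_d:
                best, best_d = b, d
        out.append({**s, "enseigne": best["enseigne"] if best else None})
    return out

# Made-up coordinates, for illustration only.
stations = [{"id_station": "12345678", "lat": 48.0930, "lon": -4.3280, "prix_gazole": 1.622}]
brand_points = [
    {"enseigne": "TotalEnergies", "lat": 48.0931, "lon": -4.3281},
    {"enseigne": "Carrefour Market", "lat": 48.1000, "lon": -4.3000},
]
matched = attach_brand(stations, brand_points)
print(matched)
```

The nested loop is fine for a quick check on a few thousand stations. A spatial index would be the usual next step, and a distance threshold alone can mislabel stations that sit next to each other, so address comparison may still be needed as a tie-breaker.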

Thanks in advance for any help, leads, or contacts!

Best regards,

Tom

submitted by /u/OpenApartment1246

[Dataset] UK Parliamentary Interest Groups (“APPGs”)

All-Party Parliamentary Groups (APPGs) are informal cross-party groups within the UK Parliament. APPGs exist to examine particular topics or causes, for example, small modular reactors, blood cancer, and Saudi Arabia.

While APPGs can provide useful forums for bringing together stakeholders and advancing policy discussions, there have been instances of impropriety, and the groups have faced criticism for potential conflicts of interest and undue influence from external bodies.

I have pulled data from Parliament’s register of APPGs (individual webpages / single PDF) into a JSON object for easy interrogation. Each APPG entry lists a chair, a secretariat, sources of funding, and so on.

How many APPGs are there on cancer; which political party chairs the most APPGs; how many donations do they receive?
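As a rough illustration of how questions like these could be answered once the JSON is loaded, here is a small sketch. The field names (`name`, `chair`, `party`) and the party labels are hypothetical placeholders, not the actual schema or contents of the dataset.

```python
import json
from collections import Counter

# Hypothetical records: field names and party labels are placeholders,
# not the actual schema or contents of the Kaggle dataset.
SAMPLE_JSON = """
[
  {"name": "Blood Cancer", "chair": {"name": "Member A", "party": "Party X"}},
  {"name": "Small Modular Reactors", "chair": {"name": "Member B", "party": "Party Y"}},
  {"name": "Saudi Arabia", "chair": {"name": "Member C", "party": "Party X"}}
]
"""
groups = json.loads(SAMPLE_JSON)

def groups_matching(groups, keyword):
    """Names of groups whose title contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [g["name"] for g in groups if kw in g["name"].lower()]

def chairs_by_party(groups):
    """Count how many groups each party chairs."""
    return Counter(g["chair"]["party"] for g in groups)

print(groups_matching(groups, "cancer"))
print(chairs_by_party(groups).most_common(1))
```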

Click HERE to view the dataset on Kaggle.

submitted by /u/2AEP

Need A Messy Dataset For A Class I’m In, Where Can I Go To Get One?

I’m in college right now and I need an “unclean/untidy” dataset, one that has a bunch of missing values, poor formatting, duplicate entries, etc. Is there a website I can go to that provides data like this? I hope to get into the renewable energy field, so data covering that topic would be exactly what I’m looking for, but any website that has this sort of thing would help me.

Thanks in advance

submitted by /u/timedoesnotwait

To Everyone In The Datasets Community, I Would Like To Give An Update

My name is Jason Baumgartner and I am the founder of Pushshift. I have been dealing with some health issues, but hopefully my eye surgery will be coming up soon. I developed PSCs (posterior subcapular cataracts) from late-onset diabetes.

I have been working lately to bring more amazing APIs and tools to the research community, including making available a large number of datasets containing YouTube data and many other social media datasets.

Currently I have collected around 15 billion YouTube comments and billions of YouTube channel and video metadata records.

My goal, once my surgery is completed and my eyes heal, is to get back into the community and invite others who love data to work with all this data.

I greatly appreciate everyone who donates or spreads the word about my gofundme.

I will be providing updates over time, but if you want to reach out to me, please use the email in my Reddit profile (the gmail one).

I want to thank all of the datasets moderators for assisting me during this challenging period in my life.

I am very excited to get back into the saddle and pursue my biggest passion – data science and datasets.

I no longer control the Pushshift domain, but I will be sharing a new name soon and letting everyone know what’s been happening over the past 2 years.

Thanks again and I will try to respond to as many emails as possible.

You can find the link to my gofundme in my Reddit profile or my post in /r/pushshift.

Feel free to ask questions in this post and I will try to answer as soon as possible. Also, if you have any questions about specific social media data that you are interested in, I would be happy to clarify what data I currently have and what is on the roadmap in the future. It would be very helpful to see what data sources people are interested in!

submitted by /u/Stuck_In_the_Matrix

Tideon AI Makes Analyzing Excel Datasets 5x Faster — Try It Free

If you work with Excel files regularly, I wanted to share something that’s been a game-changer for me: Tideon AI — an AI-powered platform that lets you chat with your datasets instantly.

Instead of manually digging through spreadsheets, you can:

  • Upload Excel files and ask questions in plain English
  • Get instant insights without writing formulas

Would love to hear if this helps anyone here streamline their workflow!

Link: https://tideon.ai

submitted by /u/Narrow_Ground1495

Made My First Dataset! Ca. 100 Scanned Pages Of Books From 1910-1920, Serbian Cyrillic. Kaggle And HF

Hi everyone, first time building a dataset. This is a v0.1, about 100 scans of book pages (both single and double-page per scan). The books are in the public domain. The intended use is for anyone looking to do image-to-text software work.

The scans are in .jpg format, along with a PDF of the whole collection.

I have also included 2 .txt files:

1) A “raw” (i.e. not corrected for hallucinations, artifacts, etc.) .txt file for anyone looking to do a check. The file is in Markdown.

2) A “corrected” .txt file, where the hallucinations, artifacts, errors, etc. were manually corrected. This file is in .txt, not Markdown.
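For anyone who wants to do that check, one possible approach is sketched below using Python’s standard `difflib`. The sample strings are invented for illustration. In practice you would read the two .txt files from the dataset, and the Markdown markup in the raw file would count as differences unless it is stripped first.

```python
import difflib

def ocr_similarity(raw_text, corrected_text):
    """Similarity ratio in [0, 1] between raw OCR output and the corrected text."""
    matcher = difflib.SequenceMatcher(None, raw_text, corrected_text, autojunk=False)
    return matcher.ratio()

def changed_spans(raw_text, corrected_text):
    """List (operation, raw_fragment, corrected_fragment) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, raw_text, corrected_text, autojunk=False)
    return [
        (tag, raw_text[i1:i2], corrected_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Invented example: a Latin "u" misread in place of the Cyrillic "и".
raw = "Књuга"
corrected = "Књига"
print(ocr_similarity(raw, corrected))
print(changed_spans(raw, corrected))
```

Character-level comparison like this gets slow on very long files, so comparing page by page is probably more practical than diffing the whole collection at once.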

Looking for feedback if this is useful, how to make a dataset like this better, etc.

Kaggle: https://www.kaggle.com/datasets/booksofjeremiah/serbian-cyrillic-script-printed

Huggingface: https://huggingface.co/datasets/Books-of-Jeremiah/raw-OCR-serbian-cyrillic

Any feedback on whether the set is useful for other use cases or how it can be made better is appreciated!

submitted by /u/Books_Of_Jeremiah

[P] Training Better LLMs With 30% Less Data – Entropy-Based Data Distillation

I’ve been experimenting with data-efficient LLM training as part of a project I’m calling Oren, focused on entropy-based dataset filtering.

The philosophy behind this emerged from knowledge distillation pipelines, where student models basically inherit the same limitations of intelligence as the teacher models have. Thus, the goal of Oren is to change LLM training completely – from the current frontier approach of rapidly upscaling in compute and GPU hours to a new strategy: optimizing training datasets for smaller, smarter models.

The experimentation setup: two identical 100M-parameter language models.

  • Model A: trained on 700M raw tokens
  • Model B: trained on the top 70% of samples (500M tokens) selected via entropy-based filtering

Result: Model B matched Model A in performance, while using 30% less data, time, and compute. No architecture or hyperparameter changes.
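The post does not spell out how the entropy score is computed, so the following is only a toy sketch of the general idea, not Oren’s actual method. It assumes whitespace tokenization, Shannon entropy of each sample’s token distribution, and that higher-entropy samples are the ones kept. All three are assumptions, and the function names are made up.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution of one sample."""
    n = len(tokens)
    if n == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def keep_top_percent(samples, keep_percent=70):
    """Keep the highest-entropy samples (assumption: higher entropy is better)."""
    ranked = sorted(samples, key=lambda s: token_entropy(s.split()), reverse=True)
    k = (len(ranked) * keep_percent) // 100  # integer arithmetic avoids float rounding
    return ranked[:k]

samples = ["a a a a", "a b c d", "a a b b", "a b a a"]
print(keep_top_percent(samples, keep_percent=50))
```

A real pipeline would more likely score samples with a model (for example per-token predictive entropy or loss) than with raw token frequencies; the linked repository is the authoritative source for what was actually done.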

Open-source models:

🤗 Model A – Raw (700M tokens)

🤗 Model B – Filtered (500M tokens)

Full documentation:

👾GitHub Repository

I’d love feedback, especially on how to generalize this into a reusable pipeline that can be applied directly to LLMs before training and/or fine-tuning. I’m currently thinking of a multi-agent system, with each agent being an SLM trained on a subdomain (e.g., coding, math, science), each with its own scoring metrics. I would also love to hear from anyone here who has tried entropy- or loss-based filtering and possibly even scaled it.

submitted by /u/Jolly-Act9349