StormGPT: AI-Powered Environmental Visualization Dataset (NOAA/NASA/USGS Integration)

I’ve been developing an AI-based project called StormGPT, which generates environmental visualizations using real data from NOAA, NASA, USGS, EPA, and FEMA.

The dataset includes:

Hurricane and flood impact maps
3D climate visualizations
Tsunami and rainfall simulations
Feature catalog (.xlsx) for geospatial AI analysis

I’d welcome any feedback or collaboration ideas from data scientists, analysts, and environmental researchers.

— Daniel Guzman

submitted by /u/storm-intel

Are There Existing Metadata Standards For Icon/vector Datasets Used In ML Or Technical Workflows?

Hi everyone,

I’ve been working on cleaning and organizing a set of visual assets (icons, small diagrams, SVG symbols) for my own ML/technical projects, and I noticed that most existing icon libraries don’t really follow a shared metadata structure.

What I’ve seen is that metadata usually focuses on keywords for visual search, but rarely includes things like:

• consistent semantic categories
• usage-context descriptions
• relationships between symbols
• cross-library taxonomy alignment

Before I go deeper into structuring my own set, I’m trying to understand whether this is already a solved problem or if I’m missing an existing standard.

So I’d love to know:

1. Are there known datasets or standards that define semantic/structured metadata for visual symbols?
2. Do people typically create their own taxonomies internally?
3. Is unified metadata across icon sources something practitioners actually find useful?

Not promoting anything, just trying to avoid reinventing the wheel and understand current practice.

Any insights appreciated 🙏

submitted by /u/XdotX78

Is Orion-MSP Actually Robust Across Heterogeneous Tabular Distributions?

I’ve been looking into Orion-MSP, which uses multi-scale sparse attention and Perceiver-style memory to enable tabular in-context learning. It claims to generalize across diverse datasets, but I’m skeptical.

Some questions:

Does multi-scale attention help when dataset feature spaces are mismatched?
Is the Perceiver-memory robust to shifts in feature distribution or sparsity?
What kind of datasets would actually benefit from this architecture?

If anyone has seen examples of tabular models holding up across wildly different dataset structures, I’d love to hear about it.

(Links can be shared in the comments.)

submitted by /u/Dan27138

The Most Complete Python Code Big ⭕ Time Complexity Dataset

Hi folks,

I built a little classifier that predicts the time complexity of Python code in big O notation. In the process, I collected all the data I could find: a pre-existing dataset plus data I scraped from other sources and cleaned myself. I thought this might be useful to someone.

Data sources:

You can find the data in my repo: ~/data/data folder

Repo link: https://github.com/komaksym/biggitybiggityO

If you find this useful, I’d appreciate a star on the repo.

submitted by /u/Financial-Grass4819

Measuring AI Ability To Complete Long Tasks

The data is linked in the article, but it’s also available at https://metr.org/assets/benchmark_results.yaml

submitted by /u/cavedave

4 Examples Of When You Really Need Model Distillation (and How To Try It Yourself)

Hi everyone, I’m part of the Nebius Token Factory team and wanted to share some insights from our recent post on model distillation with compute (full article here).

We highlighted 4 concrete scenarios where distillation makes a big difference:

High-latency inference: When your large models are slow to respond in production, distillation lets you train a smaller student model that retains most of the teacher’s accuracy but runs much faster.
Cost-sensitive deployments: Big models are expensive to run at scale. Distilled models cut compute requirements dramatically, saving money without sacrificing quality.
Edge or embedded devices: If you want to run AI on mobile devices, IoT, or constrained hardware, distillation compresses the model so it fits into memory and compute limits.
Rapid experimentation / A/B testing: Training smaller distilled models allows you to quickly iterate on experiments or deploy multiple variants, since they are much cheaper and faster to run.
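
For readers new to distillation, the idea behind the four scenarios above can be sketched in a few lines. This is a minimal illustration of the standard soft-target loss (a temperature-scaled KL divergence between teacher and student outputs), not a description of how Nebius Token Factory implements it. The function names, the example logits, and the temperature value are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual scaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits: one student that roughly agrees with the teacher, one that does not.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 4.0, 1.0]

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student whose outputs track the teacher's gets a loss near zero, while a student that disagrees gets a larger one. Training the smaller model then amounts to minimizing this quantity, usually alongside the ordinary task loss.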

How we do it at Nebius Token Factory:

Efficient workflow to distill large teacher models into leaner students.
GPU-powered training for fast experimentation.
Production-ready endpoints to serve distilled models with low latency.
Significant cost savings for inference workloads.

If you want to try this out yourself, you can test Token Factory with the credits available after registration — it’s a hands-on way to see distillation in action. We’d love your feedback on how it works in real scenarios, what’s smooth, and what could be improved.

https://tokenfactory.nebius.com/

submitted by /u/FarPercentage6591

I’m Doing A Nutrition Degree And An Academic Report On Caffeinated Beverages! I Would Love If You Could Share Your Experiences And Insights As Coffee And Caffeinated Beverage Consumers. It Is Anonymous And Takes 1-2mins. Thank You! :)

Caffeine Consumption 🙂

submitted by /u/Routine-Hedgehog-245

How To Create Dataset From Engineering Drawing Pdf For YOLO Algorithms?

Any help in this direction would be highly appreciated. I also need to web-scrape the PDFs.

submitted by /u/Fragrant-Bit-7373

A Resource We Built For Founders Who Want Clearer Weekly Insights From Their Data

Lots of founders I know spend a few hours each week digging through Stripe, PostHog, GA4, Linear, GitHub, support emails, and whatever else they use. The goal is always the same: figure out what changed, what mattered, and what deserves attention next.

The trouble is that dashboards rarely answer those questions on their own. You still have to hunt for patterns, compare cohorts, validate hunches, and connect signals across different tools.

We built Counsel to serve as a resource that handles that weekly work for you.

You connect your stack, and once a week it scans your product usage, billing, shipping velocity, support signals, and engagement data. Instead of generic summaries, it tries to surface things like:

Activation or retention issues caused by a specific step or behavior
Cohorts that suddenly perform better or worse
Features with strong engagement but weak long term value
Churn that clusters around a particular frustration pattern

You get a short brief that tells you what changed, why it matters, and what to pay attention to next. No new dashboards to learn, no complicated setup.

We’re privately piloting this with early stage B2C SaaS teams. If you want to try it or see how the system analyzes your funnel, here’s the link: calendly.com/aarush-yadav/30min

If you want the prompt structure, integration checklist, or agent design we used to build it as a resource for your own projects, I can share that too.

My post complies with the rules.

submitted by /u/No_Purpose9658

Google Trending Searches Dataset (2001-2024)

Introducing the Google-trending-words dataset: a compilation of 2784 trending Google searches from 2001-2024.

This dataset captures search trends in 93 categories, and is perfect for analyzing cultural shifts, predicting future trends, and understanding how global events shape online behavior!

submitted by /u/Ok_Employee_6418

Jeffrey_Epstein's_file_little_black

submitted by /u/Neptun_11

Looking For A Prolog Dataset

submitted by /u/cavedave

Fight Detection Datasets Material Issue

I have a project that involves using AI to detect fights in schools, universities, and dorms. However, I can't find enough material on this. Could you please recommend datasets that include fights (not boxing or hockey)?

submitted by /u/Ecstatic-Turnip6389

US Traffic AADT With State Level Data

Anyone know of a free source of USA traffic… the federal one is light on and the states are a big hodgepodge!

submitted by /u/nattyandthecoffee

Exercise Dataset With Video Demonstrations - MuscleWiki API

submitted by /u/brave_w0ts0n

Cleaned + Structured The Nov 2025 Epstein Email Dump Into A Single JSONL (9966 Entries) + Semantic Explorer [HuggingFace]

A few days after the Nov 12th 2025 Epstein email dump went public, I pulled all the individual text files together, cleaned them, removed duplicates, and converted everything into a single standardized .jsonl dataset.

No PDFs, no images — this is text-only. The raw dump wasn’t structured: filenames were random, topics weren’t grouped, and keyword search barely worked. Names weren’t consistent, related passages didn’t use the same vocabulary, and there was no way to browse by theme.

So I built a structured version:

merged everything into one JSONL file
each line = one JSON object (9966 total entries)
cleaned formatting + removed noise
chunked text properly
grouped the dataset into clusters (topic-based)
added BM25 keyword search
added simple topic-term extraction
added entity search
made a lightweight explorer UI on HuggingFace

🔗 HuggingFace explorer + dataset:

https://huggingface.co/spaces/cjc0013/epstein-semantic-explorer

JSONL structure (one entry per line):

{"id": 123, "cluster": 47, "text": "…"}

What you can do in the explorer:

Browse clusters by topic
Run BM25 keyword search
Search entities (names/places/orgs)
View cluster summaries
See top terms
Upload your own JSONL to reuse the explorer for any dataset
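
Because each line is a standalone JSON object with `id`, `cluster`, and `text` fields, a file in this format can be loaded and grouped by cluster with the Python standard library alone. The following is a minimal sketch; the sample lines and the helper names are made up for illustration and are not taken from the dataset.

```python
import io
import json
from collections import defaultdict


def load_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def group_by_cluster(entries):
    """Map each cluster id to the list of entries assigned to it."""
    clusters = defaultdict(list)
    for entry in entries:
        clusters[entry["cluster"]].append(entry)
    return dict(clusters)


# Hypothetical sample standing in for the real file. With a downloaded copy,
# pass an open file handle (opened with encoding="utf-8") instead.
sample = io.StringIO(
    '{"id": 1, "cluster": 47, "text": "first placeholder passage"}\n'
    '{"id": 2, "cluster": 47, "text": "second placeholder passage"}\n'
    '{"id": 3, "cluster": 5, "text": "third placeholder passage"}\n'
)

entries = load_jsonl(sample)
clusters = group_by_cluster(entries)
for cluster_id, items in sorted(clusters.items()):
    print(cluster_id, [item["id"] for item in items])
```

Reading line by line keeps memory use predictable, and a malformed line fails on its own rather than invalidating the whole file, which is the main practical advantage of JSONL over a single large JSON array.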

This is not commentary — just a structured dataset + tools for anyone who wants to analyze the dump more efficiently.

Please let me know if you encounter any errors. I will answer any questions about the dataset’s construction.

submitted by /u/Either_Pound1986

Public Dataset For European Cancer Statistics

Hey there! I’m wondering if there is a publicly available dataset on cancer statistics among European nations, similar to SEER in the US. Thanks!

submitted by /u/Stud_Muffin15

Looking For A Dataset With A Count Response Variable For Poisson Regression

Hello, I’m looking for a dataset with a count response variable to apply Poisson regression models. I found the well-known Bike Sharing dataset, but it has been used by many people, so I ruled it out. While searching, I found another dataset, the Seoul Bike Sharing Demand dataset. It’s better in the sense that it hasn’t been used as much, but it’s not as good as the first one.

So I have the following question: could someone share a dataset suitable for Poisson regression, i.e., one with a count response variable that can be used as the dependent variable in the model? It doesn’t need to be related to bike sharing, but if it is, that would be even better for me.
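
For context, Poisson regression models a count response y through log E[y | x] = b0 + b1 * x. Below is a minimal sketch of fitting that model by Newton's method using only the standard library. The tiny dataset and the function name are illustrative assumptions, and in practice a statistics library would be the usual choice.

```python
import math


def fit_poisson(xs, ys, iterations=25):
    """Fit log E[y] = b0 + b1 * x by Newton's method."""
    b0 = math.log(sum(ys) / len(ys))  # start at the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Entries of the negated Hessian (the Fisher information).
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up counts that grow with x.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 2, 2, 4, 6, 9]

b0, b1 = fit_poisson(xs, ys)
fitted = [math.exp(b0 + b1 * x) for x in xs]
print(b0, b1)
print(fitted)
```

Any dataset whose response is a non-negative integer count (rentals per hour, visits per day, defects per batch) fits this setup. A useful sanity check on a fitted model with an intercept is that the fitted means sum to the observed total.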

submitted by /u/Yaguil23

If You’re Dealing With Data Scarcity Or Privacy Bottlenecks, Tell Me Your Use Case.

submitted by /u/Quirky-Ad-3072

20,000 Epstein Files In A Single Text File Available To Download (~100 MB)

I’ve processed all the text and image files (~25,000 document pages/emails) within the individual folders released last Friday into a two-column text file. I used Google’s Tesseract OCR library to convert the JPG images to text.

You can download it here: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K

For each document, I’ve included the full path to the original Google Drive folder from the House Oversight Committee so you can link and verify contents. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).

submitted by /u/tensonaut

Looking For Examples Of DevOps-related LLM Failures (building A Small Dataset)

submitted by /u/apinference

[Dataset] [30 Trillion Tokens] “HPLT 3.0: Very Large-Scale Multilingual Resources For LLM And MT. Mono- And Bi-lingual Data, Multilingual Evaluation, And Pre-Trained Models”, Oepen Et Al. 2025

Dataset(s): https://hplt-project.org/datasets/v3.0

Paper: https://arxiv.org/abs/2511.01066

submitted by /u/RecmacfonD

[OC] 100 Million Domains Ranked By Authority – Free Dataset (1.7GB, Monthly Updates)

I’ve built a dataset of 100 million domains ranked by web authority and am releasing it publicly under the MIT license.

Dataset: https://github.com/WebsiteLaunches/top-100-million-domains

Stats:

- 100M domains ranked by authority
- Updated monthly (last: Nov 15, 2025)
- MIT licensed (free for any use)
- Multiple size tiers: 1K, 10K, 100K, 1M, 10M, 100M
- CSV format, simple ranked lists

Methodology: Rankings based on Common Crawl web graph analysis, domain age, traffic patterns, and site quality metrics from Website Launches data. Domains ordered from highest to lowest authority.

Potential uses:

- ML training data for domain/web classification
- SEO and competitive research
- Web graph analysis
- Domain investment research
- Large-scale web studies

Free and open. Feedback welcome.

submitted by /u/antiochIst

What’s The Hardest Part Of Turning Scraped Data Into Something Reusable?

I’ve been building datasets from retail and job sites for a while. The hardest part isn’t crawling; it’s standardizing. Product specs, company names, job levels: nothing matches cleanly. Even after cleaning, every new source breaks the schema again. For those who publish datasets: how do you maintain consistency without rewriting your schema every month?
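
One common way to approach this is to keep the canonical schema fixed and give each source its own small field mapping, so that a new source adds a mapping instead of changing the schema. The following is a minimal sketch of that idea, not a complete answer to the question; the field names, source names, and suffix list are all hypothetical.

```python
import re

# Canonical schema: every published record has exactly these keys.
CANONICAL_FIELDS = ("company", "title", "level")

# Per-source mappings from raw field names to canonical ones (hypothetical).
SOURCE_MAPPINGS = {
    "site_a": {"employer": "company", "job_title": "title", "seniority": "level"},
    "site_b": {"org_name": "company", "position": "title", "grade": "level"},
}

# Trailing legal suffixes to strip from company names (illustrative list).
LEGAL_SUFFIXES = re.compile(r"[,\s]+(inc|llc|ltd|corp)\.?$", re.IGNORECASE)


def normalize_company(name):
    """Trim whitespace and strip a trailing legal suffix such as 'Inc.'."""
    return LEGAL_SUFFIXES.sub("", name.strip())


def to_canonical(source, raw):
    """Rename a raw record's fields; canonical fields with no value stay None."""
    mapping = SOURCE_MAPPINGS[source]
    record = dict.fromkeys(CANONICAL_FIELDS)
    for raw_key, value in raw.items():
        if raw_key in mapping:
            record[mapping[raw_key]] = value
    if record["company"]:
        record["company"] = normalize_company(record["company"])
    return record


print(to_canonical("site_a", {"employer": "Acme, Inc.", "job_title": "Data Engineer"}))
print(to_canonical("site_b", {"org_name": " Acme LLC ", "position": "Analyst", "grade": "Senior"}))
```

Fields a source does not provide stay `None` rather than being dropped, so downstream consumers always see the same keys regardless of which source a record came from.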

submitted by /u/Vivid_Stock5288

Supply Chain/Logistics Data Set Needed

I’m working on creating a BI business geared specifically towards small supply chain businesses, but I need access to real-world supply chain databases to create some examples and practice on. I would love some guidance on this!

submitted by /u/DiabeticDays

Any Bulk Image Prompt Datasets? Instead Of Storing The Image, I Want To Store The Prompt As A Form Of Compression.

Bring-your-own model; regenerations won’t be pixel-perfect, and that’s OK.

submitted by /u/fukijama

#DDoSecrets Has Released 121 GB Of Epstein Files

submitted by /u/cavedave

Urgent Request For A Dataset That Includes Virtual Webinar Invitations

Please let me know if you have any questions!

submitted by /u/archubbuck

Courier News Created A Searchable Database With All 20,000 Files From Epstein’s Estate

submitted by /u/cavedave

Questions For A Paper I’m Writing For School

I’m in a sex and gender class for school, and we have to interview a number of people for a paper and see how people’s perspectives differ based on their backgrounds. If you feel comfortable sharing a bit about yourself and answering any or all of these questions, I would greatly appreciate it. I will also message you if I quote you in my paper!

SLO 1: Define sex, gender, and gender identity and explain the relationship between these concepts.

How are the concepts of sex, gender, and gender identity defined in psychology and sociology, how do they relate to each other and why do you think these terms are misunderstood?
Is it possible to be rid of gendered stereotypes, something that has occurred for centuries? How do we as a society have an impact on this negative perception?
What does gender mean to you personally, and how do you think your experiences have shaped that understanding?
Can you describe how you understand the differences between sex, gender, and gender identity, and how these aspects of identity have influenced your experiences or the way you see others?
How do you think understanding the difference between sex and gender can help promote inclusion and equality? How do you think not understanding it affects a public or professional setting?

submitted by /u/lil_bag_a_fritos