Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Exploring The Public “Epstein Files” Dataset Using A Log Analytics Engine (Interactive Demo)

I’ve been experimenting with different ways to explore large text corpora, and ended up trying something a bit unusual.

I took the public “Epstein Files” dataset (~25k documents/emails released as part of a House Oversight Committee dump) and ingested all of it into a log analytics platform (LogZilla). Each document is treated like a log event with metadata tags (Doc Year, Doc Month, People, Orgs, Locations, Themes, Content Flags, etc).

The idea was to see whether a log/event engine could be used as a sort of structured document explorer. It turns out it works surprisingly well: dashboards, top-K breakdowns, entity co-occurrence, temporal patterns, and AI-assisted summaries all become easy to generate once everything is normalized.
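For a sense of what “treating a document like a log event” looks like in practice, here’s a minimal sketch of the kind of normalized record each document becomes. Field names follow the tags described above, but this is illustrative only, not the actual ingestion code, and the entity extraction step is stubbed out.

    # Hypothetical sketch: turn one document into a tagged "event" record.
    from datetime import date

    def document_to_event(doc_id: str, text: str, doc_date: date,
                          people: list[str], orgs: list[str],
                          locations: list[str], themes: list[str]) -> dict:
        return {
            "id": doc_id,
            "message": text[:512],              # truncated body stands in for the "log message"
            "tags": {
                "Doc Year": doc_date.year,
                "Doc Month": doc_date.month,
                "Doc Day": doc_date.day,
                "People": people,
                "Orgs": orgs,
                "Locations": locations,
                "Themes": themes,
            },
        }

    event = document_to_event("doc-00042", "Example email body ...",
                              date(2004, 7, 9), ["Person A"], ["Org X"],
                              ["New York"], ["travel"])

Once every document is in this shape, the top-K breakdowns and co-occurrence views come almost for free from the tag fields.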

If anyone wants to explore the dataset through this interface, here’s the temporary demo instance:

https://epstein.bro-do-you-even-log.com
login: reddit / reddit

A few notes for anyone trying it:

  • Set the time filter to “Last 7 Days.”
    I ingested the dataset a few days ago, so “Today” won’t return anything. Actual document dates are stored in the Doc Year/Month/Day tags.
  • It’s a test box and may be reset daily, so don’t rely on persistence.
  • The AI component won’t answer explicit or graphic queries, but it handles general analytical prompts (patterns, tag combinations, temporal comparisons, clustering, etc).
  • This isn’t a production environment; dashboards or queries may break if a lot of people hit it at once.

Some of the patterns it surfaced:

  • unusual “Friday” concentration in documents tagged with travel
  • entity co-occurrence clusters across people/locations/themes
  • shifts in terminology across document years
  • small but interesting gaps in metadata density in certain periods
  • relationships that only emerge when combining multiple tag fields

This is not connected to LogZilla (the company) in any way — just a personal experiment in treating a document corpus as a log stream to see what kind of structure falls out.

If anyone here works with document data, embeddings, search layers, metadata tagging, etc., I’d be curious what would happen if I threw that kind of corpus in there too.

Also, I don’t know how the system will respond to hundreds of the same user logged in, so expect some weirdness. And please be kind, it’s just a test box.

submitted by /u/meccaleccahimeccahi
[link] [comments]

[Synthetic] Created A 3-Million-Instance Dataset To Equip ML Models To Trade Better In Black Swan Events.

So I recently wrapped up a project where I trained an RL model on 3 years of synthetic stock data, and it generated 45% returns overall when backtested on real market data.

I decided to push it a little further and include black swan events. The dataset I used is too big for Kaggle, but the second dataset is available here.

I’m working on a smaller version of the model to release soon, but in the meantime I’m looking for feedback on the dataset construction.
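For anyone wondering what “include black swan events” means concretely: the general idea is rare, large shocks layered on top of an otherwise ordinary synthetic return series. Here’s a rough sketch of that shape (illustrative only, not the exact generator behind the dataset):

    import numpy as np

    def synthetic_prices_with_shocks(n_days=750, s0=100.0, mu=0.0003,
                                     sigma=0.015, shock_prob=0.002,
                                     shock_scale=0.15, seed=0):
        """Geometric-Brownian-style daily returns with rare large negative jumps."""
        rng = np.random.default_rng(seed)
        returns = rng.normal(mu, sigma, n_days)                        # ordinary daily returns
        shocks = rng.random(n_days) < shock_prob                       # rare "black swan" days
        returns[shocks] -= rng.exponential(shock_scale, shocks.sum())  # large drawdowns
        return s0 * np.exp(np.cumsum(returns))                         # price path

    prices = synthetic_prices_with_shocks()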

submitted by /u/Legitimate_Monk_318
[link] [comments]

Discussion About Creating Structured, AI-ready Data/knowledge Datasets For AI Tools, Workflows, …

I’m working on a project that turns raw, unstructured data into structured, AI-ready datasets, which can then be used by AI tools or queried directly.

What I’m trying to understand is how everyone else handles unstructured data to make it ”understandable”, with enough context that AI tools can actually work with it.

Also, what are your current setbacks and pain points when creating these datasets?

Where do you currently store your data? On local devices, or already in a cloud-based solution?

What would it take for you to trust your data/knowledge to a platform that helps you structure it and make it AI-ready?

If you could, would you monetize it, or keep it private for your own use only?

If there were a marketplace with different datasets available, would you consider buying access to them?

When it comes to LLMs, are there specific ones you’d use?

I’m not trying to promote or sell anything, just trying to understand how the community here thinks about datasets, data/knowledge, and so on.

submitted by /u/Udbovc
[link] [comments]

We Built A Synthetic Proteomics Engine That Expands Real Datasets Without Breaking The Biology. Sharing Some Validation Results

Hey, let me start with the core problem: proteomics datasets, especially the exosome datasets used in cancer research, are often small, expensive to produce, and hard to share. Because of that, a lot of analysis and ML work ends up limited by sample size instead of ideas.

At Synarch Labs we kept running into this issue, so we built something practical: a synthetic proteomics engine that can expand real datasets while keeping the underlying biology intact. The model learns the structure of the original samples and generates new ones that follow the same statistical and biological behavior.

We tested it on a breast cancer exosome dataset (PXD038553). The original data had just twenty samples across control, tumor, and metastasis groups. We expanded it roughly fifteen-fold and ran several checks to see whether the synthetic data still behaved like the real data.

Global patterns held up. Log-intensity distributions matched closely. Quantile-quantile plots stayed near the identity line even when jumping from twenty to three hundred samples. Group proportions stayed stable, which matters when a dataset is already slightly imbalanced.

We then looked at deeper structure. Variance profiles were nearly identical between original and synthetic data. Group means followed the identity line with very small deviations. Kolmogorov–Smirnov tests showed that most protein-level distributions stayed within acceptable similarity ranges. We added a few example proteins so people can see how the density curves look side by side.
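If you want to run the same kind of check on your own data, the per-protein Kolmogorov–Smirnov comparison is only a few lines. This is a generic sketch rather than our exact pipeline, and it assumes real and synthetic intensity matrices with samples as rows and matching protein columns:

    import numpy as np
    from scipy.stats import ks_2samp

    def ks_pass_rate(real: np.ndarray, synthetic: np.ndarray, alpha=0.05) -> float:
        """Fraction of proteins whose real vs. synthetic intensity
        distributions are not significantly different by a two-sample KS test."""
        n_pass = 0
        for j in range(real.shape[1]):
            _, p = ks_2samp(real[:, j], synthetic[:, j])
            if p >= alpha:              # fail to reject "same distribution"
                n_pass += 1
        return n_pass / real.shape[1]

    # e.g. real: 20 samples x 3,000 proteins; synthetic: 300 x 3,000
    # print(ks_pass_rate(real_matrix, synthetic_matrix))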

After that, we checked biological consistency. Control, tumor, and metastasis groups preserved their original signatures even after augmentation. The overall shapes of their distributions remained realistic, and the synthetic samples stayed within biological ranges instead of drifting into weird or noisy patterns.

Synthetic proteomics like this can help when datasets are too small for proper analysis but researchers still need more data for exploration, reproducibility checks, or early ML experiments. It also avoids patient-level privacy issues while keeping the biological signal intact.

We’re sharing these results to get feedback from people who work in proteomics, exosomes, omics ML, or synthetic data. If there’s interest, we can share a small synthetic subset for testing. We’re still refining the approach, so critiques and suggestions are welcome.

submitted by /u/Odd-Disk-975
[link] [comments]

I Scraped And Cleaned 50,000+ Career Discussion Threads From R/AskEngineers And R/EngineeringStudents. Here Is The Tool I Used.

I couldn’t find a good dataset that mapped the “Skills Gap” between university and industry, so I built a local scraper to create one.

The Data:

  • Volume: ~52,000 threads.
  • Fields: Title, Body, Top Comments, Sentiment.
  • Focus: Keywords relating to “Exams” vs “Workplace Tools”.

I built the extractor (ORION) to run locally so I wouldn’t get IP banned. It uses requests and smart rate-limiting.
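ORION itself is linked below; as a bare-bones illustration of the requests-plus-rate-limiting idea (not the actual ORION code, and the User-Agent string is just a placeholder):

    import time
    import requests

    session = requests.Session()
    session.headers.update({"User-Agent": "research-scraper/0.1 (contact: you@example.com)"})

    def fetch_json(url: str, min_interval: float = 2.0, retries: int = 3):
        """GET a JSON endpoint, pausing between calls and backing off on 429s."""
        for attempt in range(retries):
            resp = session.get(url, timeout=30)
            if resp.status_code == 429:                 # rate limited: exponential backoff
                time.sleep(min_interval * (2 ** attempt))
                continue
            resp.raise_for_status()
            time.sleep(min_interval)                    # polite fixed delay between requests
            return resp.json()
        raise RuntimeError(f"giving up on {url}")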

You can grab the tool and the extraction logic here: https://mrweeb0.github.io/ORION-tool-showcase/

Feel free to fork it if you want to scrape other career subreddits (like Nursing or CS).

submitted by /u/No-Associate-6068
[link] [comments]

What’s The Best Way To Capture Change Over Time In Scraped Data?

I’m working on a dataset of daily price movements across thousands of products.
The data’s clean but flat. Without a timeline, it’s hard to analyze trends. I’ve tried storing deltas, snapshots, and event logs; each one adds bloat. What’s your preferred model for time-aware datasets? Versioned tables? Append-only logs? Or something hybrid that stays queryable without eating storage?
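To make the question concrete, the append-only shape I mean looks roughly like this, with deltas derived at query time instead of stored (sketch only; table and column names are made up, and the window function needs SQLite 3.25+):

    import sqlite3

    conn = sqlite3.connect("prices.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS price_obs (
            product_id TEXT NOT NULL,
            obs_date   TEXT NOT NULL,   -- ISO date of the scrape
            price      REAL NOT NULL,
            PRIMARY KEY (product_id, obs_date)
        )
    """)

    # Deltas become a query, not a second dataset to maintain.
    daily_deltas = conn.execute("""
        SELECT product_id, obs_date,
               price - LAG(price) OVER (
                   PARTITION BY product_id ORDER BY obs_date
               ) AS delta
        FROM price_obs
    """).fetchall()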

submitted by /u/Vivid_Stock5288
[link] [comments]

I Can Generate Unlimited, World-class Synthetic Datasets On Demand – 100% Custom, Cleaner Than Most Real-world Data, Any Domain

Throwaway for obvious reasons, but I’ve spent the last 18 months quietly perfecting a pipeline that spits out synthetic data that consistently beats public benchmarks and even most private datasets in quality. What I can do right now (literally same-day delivery in most cases):

  • Any domain: medical (EHR, radiology reports, MIMIC-like), legal, financial (LOBs, transactions, KYC), code, multilingual text, tabular, time-series, images + captions, instruction-following, agent trajectories, you name it
  • Scale: 10k–10M+ samples, whatever you need

submitted by /u/Quirky-Ad-3072
[link] [comments]

StormGPT — AI-Powered Environmental Visualization Dataset (NOAA/NASA/USGS Integration)

I’ve been developing an AI-based project called StormGPT, which generates environmental visualizations using real data from NOAA, NASA, USGS, EPA, and FEMA.

The dataset includes:

  • Hurricane and flood impact maps
  • 3D climate visualizations
  • Tsunami and rainfall simulations
  • Feature catalog (.xlsx) for geospatial AI analysis

I’d welcome any feedback or collaboration ideas from data scientists, analysts, and environmental researchers.

— Daniel Guzman

submitted by /u/storm-intel
[link] [comments]

Are There Existing Metadata Standards For Icon/vector Datasets Used In ML Or Technical Workflows?

Hi everyone,

I’ve been working on cleaning and organizing a set of visual assets (icons, small diagrams, SVG symbols) for my own ML/technical projects, and I noticed that most existing icon libraries don’t really follow a shared metadata structure.

What I’ve seen is that metadata usually focuses on keywords for visual search, but rarely includes things like:

  • consistent semantic categories
  • usage-context descriptions
  • relationships between symbols
  • cross-library taxonomy alignment
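As a concrete straw man of the kind of entry I have in mind (field names are placeholders of mine, not any existing standard):

    # Straw-man metadata record for one icon; field names are illustrative only.
    icon_entry = {
        "id": "arrow-right-circle",
        "keywords": ["arrow", "next", "forward"],        # what most libraries already provide
        "semantic_category": "navigation/direction",     # consistent category path
        "usage_context": "advance to the next step in a flow",
        "related_symbols": ["arrow-left-circle", "chevron-right"],
        "taxonomy_mappings": {                           # cross-library alignment
            "material-symbols": "arrow_circle_right",
            "font-awesome": "circle-arrow-right",
        },
    }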

Before I go deeper into structuring my own set, I’m trying to understand whether this is already a solved problem or if I’m missing an existing standard.

So I’d love to know:

  1. Are there known datasets or standards that define semantic/structured metadata for visual symbols?
  2. Do people typically create their own taxonomies internally?
  3. Is unified metadata across icon sources something practitioners actually find useful?

Not promoting anything — just trying to avoid reinventing the wheel and understand current practice.

Any insights appreciated 🙏

submitted by /u/XdotX78
[link] [comments]

Is Orion-MSP Actually Robust Across Heterogeneous Tabular Distributions?

I’ve been looking into Orion-MSP, which uses multi-scale sparse attention and Perceiver-style memory to enable tabular in-context learning. It claims to generalize across diverse datasets, but I’m skeptical.

Some questions:

  • Does multi-scale attention help when dataset feature spaces are mismatched?
  • Is the Perceiver-memory robust to shifts in feature distribution or sparsity?
  • What kind of datasets would actually benefit from this architecture?

If anyone has seen examples of tabular models holding up across wildly different dataset structures, I’d love to hear about it.

(Links can be shared in the comments.)

submitted by /u/Dan27138
[link] [comments]

The Most Complete Python Code Big ⭕ Time Complexity Dataset

Hi folks,

I built a little classifier that classifies Python code time complexity in big O notation, and in the process I collected all the data I could find, which consists of a pre-existing dataset plus data I scraped from other sources and cleaned myself. Thought this might be useful for someone.
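Unrelated to the classifier itself, but if you want a trivial baseline to sanity-check against on this data, loop-nesting depth from the AST is a common starting point. Rough sketch only; real complexity obviously depends on far more than syntax:

    import ast

    def max_loop_depth(source: str) -> int:
        """Crude baseline feature: deepest nesting of for/while loops."""
        def depth(node, current=0):
            best = current
            for child in ast.iter_child_nodes(node):
                inc = 1 if isinstance(child, (ast.For, ast.AsyncFor, ast.While)) else 0
                best = max(best, depth(child, current + inc))
            return best
        return depth(ast.parse(source))

    print(max_loop_depth("for i in range(n):\n    for j in range(n):\n        total += 1"))  # 2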

Data sources:

You can find the data in my repo: ~/data/data folder

Repo link: https://github.com/komaksym/biggitybiggityO

If you find this useful, I’d appreciate a star on the repo.

submitted by /u/Financial-Grass4819
[link] [comments]

4 Examples Of When You Really Need Model Distillation (and How To Try It Yourself)

Hi everyone, I’m part of the Nebius Token Factory team and wanted to share some insights from our recent post on model distillation with compute (full article here).

We highlighted 4 concrete scenarios where distillation makes a big difference:

  1. High-latency inference: When your large models are slow to respond in production, distillation lets you train a smaller student model that retains most of the teacher’s accuracy but runs much faster.
  2. Cost-sensitive deployments: Big models are expensive to run at scale. Distilled models cut compute requirements dramatically, saving money without sacrificing quality.
  3. Edge or embedded devices: If you want to run AI on mobile devices, IoT, or constrained hardware, distillation compresses the model so it fits into memory and compute limits.
  4. Rapid experimentation / A/B testing: Training smaller distilled models allows you to quickly iterate on experiments or deploy multiple variants, since they are much cheaper and faster to run.
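For context, the core training objective behind most distillation setups blends the student’s usual hard-label loss with a temperature-softened KL term against the teacher’s logits. Here’s a generic PyTorch-style sketch, illustrative only rather than Token Factory’s implementation:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Standard KD loss: alpha * soft-target KL + (1 - alpha) * hard-label CE."""
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce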

How we do it at Nebius Token Factory:

  • Efficient workflow to distill large teacher models into leaner students.
  • GPU-powered training for fast experimentation.
  • Production-ready endpoints to serve distilled models with low latency.
  • Significant cost savings for inference workloads.

If you want to try this out yourself, you can test Token Factory with the credits available after registration — it’s a hands-on way to see distillation in action. We’d love your feedback on how it works in real scenarios, what’s smooth, and what could be improved.

https://tokenfactory.nebius.com/

submitted by /u/FarPercentage6591
[link] [comments]

A Resource We Built For Founders Who Want Clearer Weekly Insights From Their Data

Lots of founders I know spend a few hours each week digging through Stripe, PostHog, GA4, Linear, GitHub, support emails, and whatever else they use. The goal is always the same: figure out what changed, what mattered, and what deserves attention next.

The trouble is that dashboards rarely answer those questions on their own. You still have to hunt for patterns, compare cohorts, validate hunches, and connect signals across different tools.

We built Counsel to serve as a resource that handles that weekly work for you.

You connect your stack, and once a week it scans your product usage, billing, shipping velocity, support signals, and engagement data. Instead of generic summaries, it tries to surface things like:

  • Activation or retention issues caused by a specific step or behavior
  • Cohorts that suddenly perform better or worse
  • Features with strong engagement but weak long term value
  • Churn that clusters around a particular frustration pattern

You get a short brief that tells you what changed, why it matters, and what to pay attention to next. No new dashboards to learn, no complicated setup.

We’re privately piloting this with early-stage B2C SaaS teams. If you want to try it or see how the system analyzes your funnel, here’s the link: calendly.com/aarush-yadav/30min

If you want the prompt structure, integration checklist, or agent design we used to build it as a resource for your own projects, I can share that too.

My post complies with the rules.

submitted by /u/No_Purpose9658
[link] [comments]