Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Per-asset LoRA Adapters For Financial News Sentiment — Dataset Pipeline, Labeling Methodology, And What’s Going On HuggingFace

Where are the domain-specific LoRA fine-tunes for financial sentiment analysis — one adapter per asset (OIL, GOLD, COFFEE, BTC, EUR/USD, etc.)?

The problem: no asset-specific labeled dataset exists. Generic FinBERT doesn’t know that “OPEC cuts production” is bullish for oil (less supply = price rises). So I built one.

The pipeline:

~17,500 headlines collected across 35+ securities from RSS, Google News, GDELT, YouTube transcripts, and FMP.

Claude Haiku pre-labels everything with asset-specific context (known inversions, price drivers). Humans review and override.

Why per-asset matters:

Because standard sentiment models like FinBERT treat “Fed raises rates” as bearish across the board.

Or “rising dollar boosts USD index to 3-month high”: FinBERT says bullish, but in the actual gold market this is bearish.

Or “OPEC increases production”: is that good for your oil futures?
• FinBERT sees “increases”, “production up” → bullish (more output = growth = good)
• Actual oil market → bearish (more supply = price drops)

Labeling methodology:

• 4 classes: bullish / bearish / neutral / irrelevant (per asset, not generic)
• AI seed labels → human consensus → LoRA training data
• Target: ~500 human consensus labels per security before fine-tuning
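
The seed label → human consensus step above could be sketched roughly as follows. This is a minimal illustration that assumes a simple majority rule; the function names, the `min_votes` and `min_agreement` thresholds, and the sample votes are hypothetical and not taken from the actual pipeline.

```python
from collections import Counter

LABELS = {"bullish", "bearish", "neutral", "irrelevant"}
TARGET_PER_ASSET = 500  # consensus labels wanted per security before fine-tuning


def consensus_label(votes, min_votes=2, min_agreement=2 / 3):
    """Return the agreed label, or None if reviewers have not reached consensus.

    min_votes and min_agreement are assumed thresholds, not the project's rule.
    """
    votes = [v for v in votes if v in LABELS]
    if len(votes) < min_votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


def ready_for_training(consensus_by_asset, target=TARGET_PER_ASSET):
    """List the assets that have reached the consensus-label target."""
    return sorted(a for a, labels in consensus_by_asset.items() if len(labels) >= target)


# Hypothetical human votes for (asset, headline) pairs.
human_votes = {
    ("OIL", "OPEC increases production"): ["bearish", "bearish", "neutral"],
    ("GOLD", "Dollar index hits 3-month high"): ["bearish", "bullish"],
}
consensus = {key: consensus_label(v) for key, v in human_votes.items()}
```

Headlines that come back as `None` would simply stay in the review queue until more reviewers have voted.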

What’s going on HuggingFace:

• Inversion catalog already live: polibert/sentimentwiki-catalog
• Labeled dataset + LoRA adapters: uploading as each security hits threshold
• First uploads: OIL, GOLD, EUR/USD (most labeled)

Data sources that actually work (and a few that don’t):

Works: OilPrice RSS, FXStreet, CoinDesk, GDELT, YouTube (Bloomberg/Reuters/Kitco), FMP (only paid one)
Doesn’t: S&P Global Platts (paywalled), USDA AMS (PDFs only), ICO coffee (Cloudflare-blocked)

If you work in financial NLP and want to contribute labels or suggest assets: sentimentwiki.io (http://sentimentwiki.io/) — contributions welcome

submitted by /u/Poli-Bert

How Do You Search Violations In Bulk In The NOLA OneStop App?

I’m trying to look up multiple property violations at once using the NOLA OneStop website/app, but I can’t find a way to run a bulk search. Right now it seems like I have to check each address individually. Is there a way to search or export violations in bulk (for multiple addresses or properties) on NOLA OneStop? Or is there another tool or dataset people use for this?

submitted by /u/tshuntln1

I Created A Dataset To Make RAG Training Easy.

The more diversity that can be shared at this level, the easier it will be for independent developers to continue to help push the frontiers of what is possible in LLM development.

This dataset is free to use in your projects. Please upvote. Your support means a lot!

It contains 312,000 records for training consistent subject/question/answer classification behavior, built from Wikipedia and retaining the source link structure. Ideal for NLP RAG/TriviaQA-style benchmarks.

https://huggingface.co/datasets/CJJones/Wikipedia_RAG_QA_Classification

submitted by /u/No-Cash-9530

[Showcase] Structuring 2,170+ TCM Herbs Into JSON: Challenges In Data Normalization

Hi everyone, I’ve spent the last few months digitizing and structuring a database of 2,170+ traditional medicinal herbs. The biggest challenge wasn’t just translation, but mapping biochemical compounds (like Astragaloside IV) to qualitative properties (Nature/Taste) in a way that modern systems can process.

Technical Breakdown:

  • Nomenclature: Cross-referenced English, Latin, and Hanzi.
  • Safety Data: Structured toxicity levels and contraindications.
  • Structure: Validated JSON, optimized for knowledge graphs.
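
For readers thinking about the schema question, here is one way such a record might be shaped. This is only a sketch: the field names, the validation rule, and the example values are illustrative assumptions, not the actual schema or data from this database.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Compound:
    name: str
    functions: list = field(default_factory=list)  # therapeutic functions linked to it


@dataclass
class Herb:
    english: str
    latin: str
    hanzi: str
    nature: str    # qualitative property, e.g. "warm" or "cold"
    tastes: list   # qualitative property, e.g. ["sweet"]
    toxicity: str  # e.g. "none", "low", "high"
    contraindications: list = field(default_factory=list)
    compounds: list = field(default_factory=list)


REQUIRED = ("english", "latin", "hanzi", "nature", "tastes", "toxicity")


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {k}" for k in REQUIRED if not record.get(k)]
    for compound in record.get("compounds", []):
        if not compound.get("name"):
            problems.append("compound without a name")
    return problems


# Illustrative values only, not copied from the dataset.
herb = Herb(
    english="Astragalus root", latin="Astragalus membranaceus", hanzi="黄芪",
    nature="warm", tastes=["sweet"], toxicity="none",
    compounds=[Compound("Astragaloside IV", ["placeholder function"])],
)
record = json.loads(json.dumps(asdict(herb), ensure_ascii=False))
```

Nesting the functions under each compound keeps the compound → function mapping explicit, which makes it straightforward to turn records into knowledge-graph edges later.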

I’ve put together a substantive summary and a 50-herb sample for anyone interested in the data schema or herbal research. If anyone wants the documentation and the sample file, please message me 🥺 It’s free.

I’d love to get your thoughts on the schema design, especially regarding the mapping of chemical compounds to therapeutic functions.

submitted by /u/Desperate_Spirit_576

How To Split A Dataset Into 2 To Check For Generalization Over Memorization?

I want to make sure that a neural network generalizes rather than memorizes.

I have one dataset, a collection of social media chats. Would it be sufficient to split it chronologically to create two datasets?

Or does something more need to be done, such as splitting it so that the usernames and channel names mentioned differ between the two?

Basically, I only have one dataset, but I want to make two out of it: one for supervised training of the model and the other to check how well the model performs.
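
Both ideas from the question can be combined. The sketch below is a rough illustration, assuming each chat record is a dict with hypothetical `timestamp` and `user` keys: it splits chronologically first, then builds a stricter test set containing only users never seen in training.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.8):
    """Sort by timestamp and cut once, so every test message is newer than training."""
    ordered = sorted(messages, key=lambda m: m["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def drop_seen_groups(train, test, key="user"):
    """Keep only test messages whose user (or channel) never appears in training."""
    seen = {m[key] for m in train}
    return [m for m in test if m[key] not in seen]


# Hypothetical chat records: ten messages, one per day.
users = ["ann", "bob", "ann", "cat", "bob", "ann", "cat", "bob", "dan", "ann"]
messages = [
    {"timestamp": datetime(2024, 1, day + 1), "user": user, "text": f"message {day}"}
    for day, user in enumerate(users)
]

train, test = chronological_split(messages)
strict_test = drop_seen_groups(train, test)
```

If the model scores much worse on `strict_test` than on `test`, that gap suggests it has picked up user-specific patterns rather than something that generalizes.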

submitted by /u/Calm_Maybe_4639

My Friend Didn’t Know There Was A Simpler Way To Clean A CSV. So I Built One.

A few months ago I was sitting with my friend who’s doing his data science degree. He had a CSV file, maybe 500 rows, and just needed to clean it before running his model -> remove duplicates, fix some inconsistent date formats, that kind of thing.

He opened Power BI because that’s genuinely what his college taught him. It worked, but it took 20 minutes for something that felt like it should take 2.

I realized the problem wasn’t him. There just aren’t many tools that sit between “write pandas code” and “open a full BI suite” for basic data cleaning. That gap is what I wanted to fill.

So I built DatumInt. Drop in a CSV or Excel file, it runs entirely in your browser, nothing goes to a server.

It auto-detects what’s wrong – duplicates, encoding issues, messy date formats, empty columns – gives you a health score and fixes everything in one click.
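
For comparison, the kind of checks described here can be sketched in a few lines of standard-library Python. This is not how DatumInt works internally; the accepted date formats (including the day-first reading of `05/01/2024`), the column name, and the report fields are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")  # assumed input formats


def normalize_date(value):
    """Return the date as ISO 8601, or the original text if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value


def clean(csv_text, date_column="date"):
    """Drop empty columns, normalize dates, then remove exact duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return [], {"duplicates": 0, "empty_columns": []}
    # Columns in which every row is blank.
    empty = [c for c in rows[0] if all(not (r[c] or "").strip() for r in rows)]
    cleaned, seen = [], set()
    for row in rows:
        row = {k: v for k, v in row.items() if k not in empty}
        row[date_column] = normalize_date(row[date_column])
        key = tuple(row.values())
        if key not in seen:  # duplicates are detected after normalization
            seen.add(key)
            cleaned.append(row)
    return cleaned, {"duplicates": len(rows) - len(cleaned), "empty_columns": empty}


raw = 'name,date,notes\nAda,2024-01-05,\nAda,05/01/2024,\nBob,"Jan 7, 2024",\n'
rows, report = clean(raw)
```

Normalizing dates before deduplicating matters: the two Ada rows only become identical once both dates are in the same format.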

No code. No heavy software. No signup. Still early and actively improving it.

Curious what data quality issues you hit most often – what would make a tool like this actually useful to you?

(Disclosure: I’m the developer of this tool)

submitted by /u/PriorNervous1031

Best Dataset For A First Excel Portfolio Project?

Hi everyone
I’m self-teaching data analytics and just wrapped up my Excel training. Before diving into SQL, I want to build a solid, hands-on project to serve as my very first portfolio piece and my first professional LinkedIn post. I want to build something that stands out to hiring managers and has a long-lasting, evergreen appeal. What datasets do you highly recommend for someone aiming for a data or financial analysis role? Are there specific datasets—like sales, finance, or operations—that never go out of style and perfectly showcase data cleaning, complex formulas, and dashboarding? I’d love your advice on where to find the best fit for a strong, impactful first project!

Thanks in advance

submitted by /u/Living-Bass1565

Extracting Structured Datasets From Public-record Websites

A lot of public-record sites contain useful people data (phones, address history, relatives), but the data is locked inside messy HTML pages.

I experimented with building a pipeline that extracts those pages and converts them into structured fields automatically.

The interesting part wasn’t scraping — it was normalizing inconsistent formats across records.

Curious if anyone else here builds pipelines for turning messy web sources into structured datasets.

https://bgcheck.vercel.app/

submitted by /u/Aggressive_Cut7433

Open-source Tool For Schema-driven Synthetic Data Generation For Testing Data Pipelines

Testing data pipelines with realistic data is something I’ve struggled with in several projects. In many environments, we can’t use production data because of privacy constraints, and small handcrafted datasets rarely capture the complexity of real schemas (relationships, constraints, distributions, etc.).

I’ve been experimenting with a schema-driven approach to synthetic data generation and wanted to get feedback from others working on data engineering systems.

The idea is to treat the schema as the source of truth and attach generation rules to it. From that, you can generate datasets that mirror the structure of production systems while remaining reproducible.

Some of the design ideas I’ve been exploring:

• define tables, columns, and relationships in a schema definition

• attach generation rules per column (faker, uuid, sequence, range, weighted choices, etc.)

• validate schemas before generating data

• generate datasets with a run manifest that records configuration and schema version

• track lineage so datasets can be reproduced later
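
To make these design ideas concrete, here is a small standard-library sketch of the general approach. It is not the data-forge API: the schema format, the rule names, and the manifest fields are all hypothetical.

```python
import random
import uuid

# Hypothetical schema format: table -> {"rows": n, "columns": {name: rule}}.
SCHEMA = {
    "version": "1",
    "tables": {
        "customers": {"rows": 3, "columns": {
            "id": {"type": "sequence", "start": 1},
            "tier": {"type": "choice", "values": ["free", "pro"], "weights": [0.8, 0.2]},
        }},
        "orders": {"rows": 5, "columns": {
            "id": {"type": "uuid"},
            "customer_id": {"type": "ref", "table": "customers", "column": "id"},
            "amount": {"type": "range", "min": 5, "max": 500},
        }},
    },
}


def validate(schema):
    """Check that every ref points at a table defined earlier in the schema."""
    errors, defined = [], set()
    for table, spec in schema["tables"].items():
        for col, rule in spec["columns"].items():
            if rule["type"] == "ref" and rule["table"] not in defined:
                errors.append(f"{table}.{col}: unknown table {rule['table']}")
        defined.add(table)
    return errors


def generate(schema, seed):
    """Generate rows per table and a manifest that records how to reproduce them."""
    rng = random.Random(seed)  # one seeded RNG makes the run reproducible
    data = {}
    for table, spec in schema["tables"].items():
        rows = []
        for i in range(spec["rows"]):
            row = {}
            for col, rule in spec["columns"].items():
                kind = rule["type"]
                if kind == "sequence":
                    row[col] = rule["start"] + i
                elif kind == "choice":
                    row[col] = rng.choices(rule["values"], weights=rule["weights"])[0]
                elif kind == "range":
                    row[col] = rng.randint(rule["min"], rule["max"])
                elif kind == "uuid":
                    row[col] = str(uuid.UUID(int=rng.getrandbits(128), version=4))
                elif kind == "ref":
                    row[col] = rng.choice(data[rule["table"]])[rule["column"]]
            rows.append(row)
        data[table] = rows
    manifest = {"schema_version": schema["version"], "seed": seed,
                "row_counts": {t: len(r) for t, r in data.items()}}
    return data, manifest
```

Calling `generate(SCHEMA, seed=42)` twice returns identical data, which is what lets the manifest (schema version plus seed) stand in for the dataset when tracking lineage.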

I built a small open-source tool around this idea while experimenting with the approach.

Tech stack is fairly straightforward:

Python (FastAPI) for the backend and a small React/Next.js UI for editing schemas and running generation jobs.

If you’ve worked on similar problems, I’m curious about a few things:

• How do you currently generate realistic test data for pipelines?

• Do you rely on anonymised production data, synthetic data, or fixtures?

• What features would you expect from a synthetic data tool used in data engineering workflows?

Repo for reference if anyone wants to look at the implementation:

https://github.com/ojasshukla01/data-forge

submitted by /u/Business-Quantity-15

Free Cross-Lingual Acoustic Feature Database For Tabular ML And Emotion Recognition

I have a free-to-use macro-prosody sample pack covering 7 languages for the community to play with. I’d love feedback. It contains no audio, only voice telemetry for 7 languages, normalized and graded. It should help with building emotive TTS, benchmarking less common languages, cross-linguistic comparison, etc.

90+ languages available for possible licensing.

https://huggingface.co/datasets/vadette/macro_prosody_sample_set

This pack was selected to span typologically distinct language families and speech types:

Korean is a language isolate with phrase-final focus marking and complex mora timing — a useful contrast to the stress-timed Indo-Aryan languages.

Hindi is the largest corpus here and provides strong statistical power for Indo-Aryan prosody baselines.

Hebrew is a VSO Semitic language with root-and-pattern morphology; the high metadata coverage makes it useful for demographic-stratified analyses.

Manx is a Celtic revival language with a tiny native speaker community. The 98% PRISTINE rate reflects the controlled recording conditions of motivated community contributors.

Tzeltal is a Mayan language with ergative-absolutive alignment and a distinctive tonal register system. It is rarely represented in acoustic datasets.

Maguindanao (SPS2) is spontaneous speech from a Philippine Austronesian language. The T2-heavy distribution reflects the naturalistic recording conditions of the SPS2 corpus.

Lasi (SPS2) is a Sindhi variety spoken in Balochistan. Shorter median clip duration (3.4s vs 5–6s for CV24 languages) reflects the spontaneous speech format.

submitted by /u/Wooden_Leek_7258

Building A Multi-turn, Time-aware Personal Diary AI Dataset For RLVR Training — Looking For Ideas On Scenario Design And Rubric Construction [serious]

Hey everyone,

I’m working on designing a training dataset aimed at fixing one of the quieter but genuinely frustrating failure modes in current LLMs: the fact that models have essentially no sense of time passing between conversations.

Specifically, I’m building a multi-turn, time-aware personal diary RLVR dataset — the idea being that someone uses an AI as a personal journal companion over multiple days, and the model is supposed to track the evolution of their life, relationships, and emotional state across entries without being explicitly reminded of everything that came before.

Current models are surprisingly bad at this in ways that feel obvious once you notice them. Thought this community might have strong opinions on both the scenario design side and the rubric side, so wanted to crowdsource some thinking.

submitted by /u/Over_Valuable_12

What If There Was An Extensive Relationship Compatibility Questionnaire (details In The First Comment) That Is Meant To Work As A Preemptive And Predictive Diagnostic Report For Frictions In Relationships?

Hi everyone,

I’ve been studying relationship dynamics and friction points for a research proposal recently. While going through a lot of material and patterns around where couples struggle, I realized something interesting.

Many relationship issues aren’t sudden. They slowly build over time through misunderstandings, mismatched expectations, or different ways of handling stress and conflict.

While looking into this, I started working on something that’s basically ‘a very detailed relationship questionnaire’. Both partners would answer it separately, and the idea is to generate a kind of predictive and preemptive diagnostic report for the relationship.

The goal isn’t to judge the relationship or tell people whether they should stay together or not. It’s more about identifying things like:

• areas where partners naturally align
• possible friction points
• differences in expectations or emotional needs
• places where misunderstandings could happen later

So couples can talk about these things earlier, instead of discovering them years down the road.

I’ll be honest about something too. I’ve never really been blessed with what many of you have here. A stable relationship with someone you care about is a pretty beautiful thing, and in some ways I’m a little jealous of it.

So this is partly curiosity and partly a hope that maybe tools like this could help people keep what they already have strong.

I wanted to ask people who are actually in relationships:

  1. Would you and your partner try something like this?

  2. Would you want to see the results if it pointed out possible future friction points?

  3. Is there something you wish you had understood earlier about your partner?

Just genuinely curious about how couples would feel about something like this.

(Questionnaire would be completely anonymous.)

submitted by /u/Additional_Fee1673

Reliable B2B Data Provider For Lead Generation (Verified Contacts & Decision-Makers)

Hi everyone,

I run a research team that helps lead generation agencies, sales teams, and B2B companies find accurate contact data for outreach and prospecting. If you’re doing cold email, LinkedIn outreach, or sales prospecting, we can help you with:

• Verified B2B contact databases
• Decision-maker contact numbers
• Professional email addresses
• Industry-specific prospect lists
• Targeted company databases (any industry, any region)
• Custom lead lists based on your exact ICP

We focus on quality over bulk, so the goal is to give you usable contacts that actually help you book meetings and generate leads.

This works well for:

• Lead generation agencies
• SDR teams
• Recruitment firms
• SaaS companies
• Marketing agencies
• B2B founders doing outbound

If you need targeted contacts for a specific industry, country, or job title, feel free to comment or send me a DM.

Happy to share more details and see if we can help.

Thanks!

submitted by /u/HelicopterNo8935

Butterflies & Moths Of Austria – Fine-grained Lepidoptera Dataset

I repackaged the Butterflies & Moths of Austria dataset to make it easier to use in ML workflows.

The dataset contains 541,677 images of 185 butterfly and moth species recorded in Austria, making it potentially useful for:

  • biodiversity ML
  • species classification
  • computer vision research

Hugging Face dataset:
https://huggingface.co/datasets/birder-project/butterflies-moths-austria

Original dataset (Figshare):
https://figshare.com/s/e79493adf7d26352f0c7

Credit to the original dataset creators and contributors 🙌
This Hugging Face version mainly reorganizes the data to make it easier to load and work with in ML pipelines (ImageFolder format).
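
For anyone unfamiliar with the ImageFolder convention, the idea is one subdirectory per class, with the directory name serving as the label. The sketch below indexes such a tree using only the standard library; the species directory names and the flat layout in the demo are placeholders, so check the dataset card for the real structure.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_image_folder(root):
    """Map an ImageFolder-style tree (root/<class name>/<image>) to labeled samples."""
    root = Path(root)
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).rglob("*"))
        if path.suffix.lower() in IMAGE_SUFFIXES
    ]
    return samples, class_to_idx


# Demo on a throwaway directory with placeholder species folders and empty files.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"Pieris_rapae": ["a.jpg", "b.jpg"], "Vanessa_atalanta": ["c.jpg"]}
    for species, files in layout.items():
        (Path(tmp) / species).mkdir()
        for filename in files:
            (Path(tmp) / species / filename).touch()
    samples, class_to_idx = index_image_folder(tmp)
    labels = [label for _, label in samples]
```

Sorting the class names before assigning indices keeps the label mapping stable across machines, which is also what most ImageFolder-style loaders do.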

submitted by /u/hassonofer

What Companies Provide Automated Web Scraping Of News Websites?

I don’t want to build scrapers, so I see two options.

  1. Scraped news APIs and aggregators: these platforms crawl millions of sources daily and serve you clean, structured data. Example: Webz.io, an enterprise-grade provider that scrapes millions of news sites, blogs, and forums daily. They provide highly granular filtering and historical data.
  2. Custom web scraping and AI extraction infrastructure: go for this if you need to scrape niche, heavily protected sites or extract highly specific data points. Example: Forage AI, which sits at the intersection of custom web scraping and AI-powered data pipelines and caters heavily to enterprises and AI developers.

As a non-engineer these are the two options I can think of, open for suggestions.

submitted by /u/3iraven22

Starting A Small Project Exploring MIMIC-IV.

As a cardiology resident interested in clinical AI, my goal is to better understand how real ICU data can be used for predictive modeling. Current focus:

• dataset exploration
• variable understanding
• data cleaning

Currently in the dataset exploration and cleaning phase. MIMIC is incredibly rich: thousands of ICU stays and hundreds of clinical variables — but turning raw hospital data into something usable for ML is not trivial.

My goal is simple: learn how clinical data can be transformed into predictive models for patient outcomes. Curious to hear from others who have worked with MIMIC or clinical ML.

submitted by /u/FrequentViolinist672

Customer Funnel Datasets Suggestion.

Hello. I have been looking for datasets for customer funnel analysis (SQL-based). I want to use this project to show my proficiency in data cleaning and analysis in SQL, so I believe a dataset with null and duplicate values would be really effective. Any suggestions or resources?

submitted by /u/xudling_pong23

Make Your AI Assistant Behave, Not Just Sound Smart

Most AI assistants fail for a simple reason:
they were never trained for real product behavior.

We built DinoDS to fix that.

DinoDS is a production-grade training suite for teams building AI assistants that need to:

• respond in a consistent tone
• follow strict output formats
• make better decisions about when to answer vs retrieve
• produce reliable structured outputs

Instead of generic data, DinoDS focuses on behavioral training for real AI workflows.

If you’re building serious AI products and want your models to behave reliably in production, let’s talk.

DM me if you want access.

submitted by /u/JayPatel24_

Has Anyone Used ThorData To Skip The Web Scraping Phase? Found Some Solid Structured Data For E-commerce/socials.

Recently I was working on a market research project and frankly, I was getting exhausted spending 80% of my time just maintaining web scrapers. Dealing with rotating residential proxies, CAPTCHAs, and sites constantly changing their DOM structure (looking at you, Amazon and TikTok) is a massive headache when you just want to get to the actual data analysis.

While looking for alternatives to building scrapers from scratch, I stumbled across a platform called Thordata (thordata.com/products/datasets). I spent some time digging into their docs and catalog, and it seems pretty interesting from an engineering/analytics standpoint.

Basically, they handle the extraction and structuring from heavy anti-bot sites and serve it up ready to use. A few things that stood out to me:

  • Coverage: They have a pretty heavy focus on e-commerce (Amazon, Walmart, Shopee) and social media (TikTok, X, Instagram). They also have B2B stuff like LinkedIn and Crunchbase.
  • Delivery formats: This is what caught my eye. You can either get static datasets (good for training models or backtesting), or use their APIs to pull live data if you’re building a dashboard or tracking real-time prices/trends.
  • Cleanliness: The data fields (like product specs, reviews, social metrics) are already parsed into clean JSON/CSV, so it skips the whole regex/parsing step.

For me, the main appeal is just outsourcing the infrastructure pain. Not having to manage headless browsers or pay a premium for proxy networks just to get reliable e-commerce data is a huge time saver.

Has anyone here actually used them in a production environment? I’m curious to know:

  1. How is the API latency if you are using it for live feeds?
  2. How quickly do they update their schemas when these big platforms push major UI/backend updates?

Would love to hear your thoughts, or if you guys have other go-to alternatives for these specific sites (aside from just building it yourself). Cheers.

submitted by /u/Mammoth-Dress-7368