Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Looking For Data Sets For Academic Research Paper

Hi guys, I am currently working on my academic research paper and would like to ask where I can get datasets of AI-generated human faces (image or video is fine), either open source or paid. I have already looked on Hugging Face and GitHub but am having a hard time finding datasets. Thank you, I hope you have time to help.

submitted by /u/Daegushi

Is There A Market For Digitalized Non-Digital Assets?

I have some old books, receipts, invoices, posters, etc. in different languages, the kind of material you can’t find on the internet. I planned to turn them into digital assets such as CSV or JSON files, maybe Excel too, but I doubt they would make a dime without a company behind them. In summary: can I, as one person, make money on online sites with enough of those old documents? If yes, where? Thank you in advance for your help.

submitted by /u/GreenDeafth_21

Are Behavioral Aspects Affecting Your Gains?

Studies show that many traders lose their gains not because of poor strategies, but due to unnoticed behavioral patterns.

While most traders focus on macro and microeconomic indicators, technical analysis, and stock tips, losses often stem from panic selling, overtrading, and behavioral biases that go unnoticed during trading.

To address this, I built a solution that helps identify whether these behavioral traits are affecting your performance, and pinpoints the root cause.

It’s completely free and requires no signup.

submitted by /u/TrySoggy3955

Intermediate Project Including Data Analysis

Hi everyone,

I’m looking for ideas and direction from experienced folks for a uni project built on open data. The goal is to create a public-facing service that doesn’t really exist yet (or is clearly missing), and deliver a realistic prototype within a student timeline.

If you have experience in civic tech / open data projects and can help orient me, I’d really appreciate:

  • ideas for high-impact problems worth tackling,
  • suggestions on datasets that are actually workable,
  • and how you would validate impact (basic metrics / evaluation).

I’m open to many domains (mobility, environment, public spending, health, education, safety, etc.), as long as it’s powered by open data and results in a useful public service (search, comparison, alerts, maps, dashboards, scoring, etc.).

Thanks for any guidance!

submitted by /u/ddummas01

Web UI Dataset: Screenshot And Code Of Modern Websites With Details Of Web Frameworks And Box Bounds For All Viewports (Desktop, Mobile, Tablet)

Built a dataset of 10,000+ real screenshots and code of modern websites with details of styling, framework used, and box bounds for all viewports (Desktop, mobile, tablet).

I fine-tuned QWEN 2.5 VL-7B-Instruct with this dataset and ran it on DesignBench (An LLM Web UI benchmark), and the model showed improvements in the pixel similarity score of generated websites!

submitted by /u/Ok_Employee_6418

[Question] Temporal Sequence Dataset Management

I have a temporal sequence dataset, but it is scattered across many small groups. How can I manage the dataset while keeping the temporal sequence?

Here is my case: let’s say I have a total of 100 frames scattered across 4 groups of the same size. Each group is a temporal sequence, but the groups come from different times and are not continuous with each other. Two groups are used for training, one for validation, and one for testing. Is it fine for my NN to learn from this dataset? What are the drawbacks compared with 100 continuous temporal frames and the usual 80%/10%/10% train/val/test split?
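
For comparison, here is a minimal sketch of the two splitting schemes described in the question. It assumes each frame is just an integer index and each group is a list already in time order; the function names and the index values are made up for illustration.

```python
def split_by_group(groups, n_train=2, n_val=1):
    """Assign whole groups to train/val/test so no sequence is cut in half.

    `groups` is a list of sequences, each already in time order.
    """
    train = groups[:n_train]
    val = groups[n_train:n_train + n_val]
    test = groups[n_train + n_val:]
    return train, val, test


def split_chronologically(frames, train_frac=0.8, val_frac=0.1):
    """Cut one continuous sequence into train/val/test without shuffling."""
    n = len(frames)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return (frames[:n_train],
            frames[n_train:n_train + n_val],
            frames[n_train + n_val:])


# 100 frames in 4 non-contiguous groups of 25 (start indices are illustrative).
groups = [list(range(start, start + 25)) for start in (0, 100, 200, 300)]
train, val, test = split_by_group(groups)

# 100 continuous frames with the usual 80/10/10 split.
c_train, c_val, c_test = split_chronologically(list(range(100)))
```

With the group-wise split the proportions become 50/25/25 instead of 80/10/10, and any input window built for the network should stay inside a single group rather than span the gap between two groups.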

submitted by /u/FineSand3810

How Can I Find Data For Financial Research

I’m planning to conduct research on banks in Asia, but I’m struggling to find reliable data sources beyond standard financial indicators (e.g., assets, liabilities, equity). Could anyone advise where I can obtain or purchase datasets for metrics such as FinTech adoption/digital maturity and ESG performance, especially for less-covered markets like Vietnam?

submitted by /u/icantevenhaveaname

The Wise&Ethical Fast Analyst – AhamData

Spent the weekend refining AhamData’s analysis models with feedback from colleagues (special thanks to @Zayne). Even with test data, the new features are worth exploring:

– Data Quality Report (0–100 integrity scoring)

– Descriptive Statistics Engine (auto column classification)

– Transparent Methodology Panel (see PCA, Shapley, regression, clustering in real time)

– Sample Datasets Catalog (25 curated datasets across domains)

www.ahamdata.com

Still a bit slow with larger datasets — more backend work is needed to scale performance and avoid applying sampling logic when size exceeds model capabilities. Feedback & feature requests welcome!

submitted by /u/Jerusari8

Vehicle Categories – Need Source For Data

Hi I’m a developer working on a project, not sure if this is the right place, but thought I’d ask.

This project has a core business feature where pricing is tied to a vehicle’s category. That way the user can price out packages accordingly based on vehicle type.

Here is where the problems begin. I usually use the NHTSA for vehicle data: it’s public, fast, and free, but it’s not complete enough. It returns ambiguous ‘types’ like ‘mpv,bus,truck,car’ rather than sedan, SUV, exotic, etc.

I then tried the EPA fuel economy dataset, as it had 12,000 rows and was in CSV format for easy parsing. But this also proved too incomplete: it was missing newer vehicles such as 2024 3/4-ton trucks, among others.

For speed, I made my own sort of ‘source of truth’ table in my database, seeded by a populate job, but I still need a clean, reliable data source to feed that job. I can get by with the NHTSA data for the time being, but a more complete solution is necessary for scale.
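
As a rough illustration of the ‘source of truth’ idea, the sketch below prefers a curated table and only falls back to the coarse source types when a vehicle is unknown. Everything in it is an assumption: the table contents, the category names, and the `pricing_category` helper are hypothetical and not taken from any NHTSA or EPA schema.

```python
# Hypothetical curated table: (make, model) -> pricing category.
# In practice this would be the seeded 'source of truth' table in the database.
OVERRIDES = {
    ("toyota", "camry"): "sedan",
    ("ford", "f-250"): "truck",
}

# Fallback from the coarse types quoted in the post to a default category.
COARSE_TO_CATEGORY = {
    "car": "sedan",
    "mpv": "suv",
    "truck": "truck",
    "bus": "oversize",
}


def pricing_category(make, model, coarse_type):
    """Return a pricing category, preferring the curated table."""
    key = (make.strip().lower(), model.strip().lower())
    if key in OVERRIDES:
        return OVERRIDES[key]
    # Unknown coarse types are flagged for manual review rather than guessed.
    return COARSE_TO_CATEGORY.get(coarse_type.strip().lower(), "needs_review")
```

Flagging unknowns instead of guessing keeps pricing errors visible, and the curated table can grow as reviewed vehicles are added.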

submitted by /u/Square-Display555

New [Synthetic] Oklahoma Precision Ag Dataset (50K Rows) – Calibrated For Yield Prediction, Irrigation & Pest Modeling

Hey r/PrecisionAg,

I just released a new hyper-realistic synthetic dataset specifically built for Oklahoma conditions using real Mesonet weather patterns and USDA crop statistics.

Dataset details:

  • 50,000 daily sensor + yield records
  • Crops: Winter Wheat (50%), Cotton (20%), Grain Sorghum (15%), Soybeans (15%)
  • 15 real Oklahoma counties
  • 18 columns including: soil moisture, NDVI, NPK levels, soil pH, temperature, humidity, rainfall, solar radiation, wind speed, irrigation, pest pressure (High/Med/Low), weather events (drought/heatwave), growth stage, and yield-loss-risk labels

It’s 100% synthetic (no scraping or real farm data), so it’s completely legal and privacy-safe for commercial use or AI training.

I created it because I saw how hard it is to get clean, regionally accurate tabular/sensor data for precision ag models. Thought it might be useful for anyone working on yield forecasting, irrigation optimization, pest risk, or drought modeling in the Midwest/South Plains.

Full dataset is available here:
https://datasetking.gumroad.com/l/ok-precision-ag

Happy to answer any questions or take feedback. More regional versions (California, Texas, etc.) are in the works.

Thanks for looking!

DataSetKing

submitted by /u/datasetking

UEBA: User And Entity Behavior Analytics

[SELF-PROMOTION]
Inspired by the chaotic currency exploits in Rainbow Six Siege in late 2025, this project explores User & Entity Behavior Analytics (UEBA) to detect insider and outsider threats.

Faced with the challenge of inaccessible real-world logs and complex datasets like CMU-CERT, I developed a simple, synthetic custom-built dataset designed to simulate realistic corporate environments. A key feature of this project is the inclusion of “gray area” activities (actions that mimic malicious patterns but are actually benign) to challenge the model’s accuracy and better reflect the nuance of real-world cybersecurity.

  • Origin: Sparked by the “total anarchy” of the 2025 R6 Siege security scandal.
  • The Problem: Existing datasets like CMU-CERT are often too complex for entry-level projects, while others are too simplistic to be useful.
  • The Solution: A synthesized dataset bridging the gap between theory and practice.
  • Technical Focus: Moving beyond “black and white” detection by incorporating deceptive gray-area data points.

Access the dataset on Kaggle: https://www.kaggle.com/datasets/prajwalnayakat/ueba-insider-threat-and-attack-detection

Let me know if it’s faulty in any way.

submitted by /u/Puzzleheaded_boi_63

[self-promotion][Paid] Scraped 6,600 AI Tools Across 3 Major Directories Into Clean CSVs

Been using web scrapers for competitive research and kept going back to the same data, so I cleaned it up properly.

Three files:

– Futurepedia: 1,221 tools. Ratings, review counts, pros/cons, feature breakdowns, social links.

– TAAFT (There’s An AI For That): 2,896 tools. Same rich fields, one of the most complete AI directories out there.

– TopAI: 2,500 tools. Names, URLs, descriptions, categories, pricing models.

Standard CSV. Opens in Excel, Sheets, pandas, whatever.

Useful for market research, competitive mapping, writing roundups, or just having a flat filterable list of AI companies with URLs and categories.

Scraped early 2026. 7 bucks. Reddit seems to auto-filter Gumroad links so DM me for the link, or search ‘krisco65 gumroad AI tools dataset’.

submitted by /u/krisco65

Any Dataset Of 100% Human HTTP Requests?

Hi, I’m doing a master’s thesis on telling bots apart from humans based on their HTTP requests using machine learning. Right now I have a working prototype based on traffic logs from my university and from honeypots. However, we’re a little limited on human data and fear it wouldn’t be representative of the broader web. Are there any datasets with guaranteed human requests? Preferably containing fields such as the User-Agent, status, protocol version, response size, and URI.
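
To make the wanted fields concrete, here is a small sketch that pulls them out of one log line. It assumes the logs are in the common Apache/NGINX "combined" format, which the post does not state; the sample line and the `parse_line` helper are invented for illustration.

```python
import re

# Assumed layout: combined log format
# ip ident user [time] "method uri protocol" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up example line using a documentation-only IP address.
sample = ('203.0.113.7 - - [10/Oct/2025:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"-" "Mozilla/5.0 (X11; Linux x86_64)"')
record = parse_line(sample)
```

Any public dataset that exposes these same columns could then be merged with the university and honeypot logs without changing the feature pipeline.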

Thank you.

submitted by /u/Bottled_Up_DarkPeace

Looking For Coffee Bean Image Dataset With CQI Scores, Does One Exist?

Hey everyone, I’m working on a coffee quality assessment project and trying to find a dataset that combines bean images with CQI scores. The Kaggle CQI database is great for scores but has no images, and the image datasets I found (USK-Coffee, HuggingFace grading) have no verified cup scores.

Has anyone come across a dataset that has both? Or have you found a way to bridge this gap in your own projects?

Even a normal CQI dataset with a substantial number of data points would be great.

Any help appreciated!

submitted by /u/hitchhiker08

[self-promotion] CRED-1: Open Dataset Of 2,672 Domains Scored For Credibility (CC BY 4.0, Zenodo DOI)

We just released CRED-1, an open dataset scoring 2,672 domains for credibility. It combines two established media watchdog sources (OpenSources.co and Iffy.news) and enriches them with four automated signals:

  • Tranco web rank (popularity/reach)
  • RDAP domain age
  • Google Fact Check Tools API (claim counts)
  • Google Safe Browsing API (malware/phishing flags)

Each domain gets a composite credibility score (0-1) based on a weighted model. The dataset is available as both a compact JSON and a full CSV with all enrichment fields.
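
For readers unfamiliar with weighted composite scores, the sketch below shows the general shape of such a model. It is not the CRED-1 formula: the signal names, the weights, and the assumption that every signal is already normalized to the 0-1 range are all hypothetical.

```python
def composite_score(signals, weights):
    """Weighted average of signals that are already normalized to [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(weights[name] * signals[name] for name in weights) / total_weight
    # Clamp to guard against rounding just outside the range.
    return min(1.0, max(0.0, score))


# Hypothetical weights and signal values for a single domain.
weights = {"source_label": 0.5, "rank": 0.2, "domain_age": 0.2, "safe_browsing": 0.1}
signals = {"source_label": 0.0, "rank": 0.5, "domain_age": 1.0, "safe_browsing": 1.0}
score = composite_score(signals, weights)
```

Dividing by the total weight keeps the result in the 0-1 range even if the weights do not sum exactly to one.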

Use cases: misinformation research, browser extensions, content moderation, media literacy tools, training data for credibility classifiers.

Key stats:

  • 2,672 domains across 5 categories (fake, unreliable, conspiracy, satire, other)
  • 704 matched in Tranco Top 1M
  • 67 domains with Google Fact Check claims
  • Score range: 0.000 to 0.962

License: CC BY 4.0
DOI: 10.5281/zenodo.18769460
GitHub: https://github.com/aloth/cred-1

Paper submitted to Data in Brief (Elsevier) and available on arXiv.

Happy to answer questions about the methodology or scoring model.

submitted by /u/bit3py

Building A Synthetic Dataset, Can You Help?

I built a pipeline to detect a bunch of “signals” inside generated conversations, and my first real extraction eval was brutal: macro F1 came in at 29.7% against the 85% bar I’d set, and everything collapsed. My first instinct was “my detector is trash,” but the real problem was that I’d mashed three different failure modes into one score.

  1. The spec was wrong. One label wasn’t expected in any call type, so true positives were literally impossible. That guarantees an F1 of 0.
  2. The regex layer was confused. Some patterns were way too broad and others too narrow, so some mentions were phrased in ways the patterns never caught.
  3. My contrast eval was too rigid. It was flagging pairs as “inconsistent” when the overall outcome stayed the same but small events drifted a bit… which is often totally fine.
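
The first failure mode is easy to reproduce in isolation. The sketch below uses made-up counts (not the numbers from this eval) to show how one label that can never produce a true positive drags the macro average down.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_label_counts):
    """Unweighted mean of per-label F1 scores."""
    scores = [f1_score(*counts) for counts in per_label_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts per label.
counts = {
    "label_a": (8, 2, 2),
    "label_b": (6, 4, 4),
    "impossible": (0, 3, 0),  # never expected by the spec, so tp is always 0
}
with_dead_label = macro_f1(counts)
without_dead_label = macro_f1(
    {name: c for name, c in counts.items() if name != "impossible"}
)
```

Because macro F1 weights every label equally, a single unreachable label costs as much as a completely broken detector for a real one.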

So instead of touching the model immediately, I fixed the evals first. For contrast sets, I moved from an all-or-nothing rule to something closer to constraint satisfaction. That alone took contrast from 65% → 93.3%: role swaps stopped getting punished for small event drift, and signal flips started checking the direction of change instead of demanding a perfect structural match.
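
A relaxed contrast check along these lines might look like the sketch below. The function names, the `max_drift` tolerance, and the way events are compared are assumptions for illustration, not the actual eval code.

```python
def flip_is_consistent(score_before, score_after, expected_direction, min_delta=0.0):
    """Signal flip: check only the direction of change, not the exact structure.

    expected_direction is +1 if the signal should go up, -1 if it should go down.
    """
    delta = score_after - score_before
    return delta * expected_direction > min_delta


def swap_is_consistent(events_a, events_b, outcome_a, outcome_b, max_drift=1):
    """Role swap: the outcome must match, but a little event drift is tolerated."""
    # Count events that appear in only one of the two conversations.
    drift = len(set(events_a) ^ set(events_b))
    return outcome_a == outcome_b and drift <= max_drift
```

An all-or-nothing rule corresponds to `max_drift=0`; loosening it turns the check into a set of constraints that a pair can satisfy without matching exactly.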

Then I accepted the obvious truth: regex-only was never going to clear an 85% gate on implicit, varied, LLM-style wording. There’s a real recall ceiling. I switched to a two-gate setup: a cheap regex gate for CI, and a semantic gate for actual quality.

The semantic gate is basically weak supervision + embeddings + a simple classifier per label. I wrote 30+ labeling functions across 7 signals (explicit keywords, indirect cues, metadata hints, speaker-role heuristics, plus “absent” functions to keep noise in check), combined them Snorkel-style with an EM label model, embedded with all-MiniLM-L6-v2, and trained LogisticRegression per label.
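
To give a feel for the labeling-function idea, here is a toy version for a single, hypothetical "price objection" signal. It combines votes with a simple majority rule as a stand-in for the EM label model mentioned above, and the keyword rules are invented for illustration.

```python
ABSTAIN, ABSENT, PRESENT = -1, 0, 1


def lf_explicit_keyword(text):
    """Explicit keyword cue."""
    return PRESENT if "too expensive" in text.lower() else ABSTAIN


def lf_indirect_cue(text):
    """Indirect cue that often accompanies the signal."""
    return PRESENT if "budget" in text.lower() else ABSTAIN


def lf_absent_smalltalk(text):
    """'Absent' function: pure small talk votes against the signal."""
    return ABSENT if text.lower().startswith(("hi", "thanks")) else ABSTAIN


LABELING_FUNCTIONS = [lf_explicit_keyword, lf_indirect_cue, lf_absent_smalltalk]


def weak_label(text):
    """Majority vote over non-abstaining functions; ties abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    present = sum(v == PRESENT for v in votes)
    absent = len(votes) - present
    if present == absent:
        return ABSTAIN
    return PRESENT if present > absent else ABSENT
```

The weak labels produced this way would then serve as training targets for the per-label classifier on top of the embeddings.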

Two changes made everything finally click:

  • I stopped doing naive CV and switched to GroupKFold by conversation_id. Before that, I was leaking near-identical windows from the same convo into train and test, which inflated scores and gave me thresholds that didn’t transfer.
  • I fixed the embedding/truncation issue with a multi-instance setup. Instead of embedding the whole conversation and silently chopping everything past ~256 tokens, I embedded 17k sliding windows of 3 turns and max-pooled them into a conversation-level prediction. That brought back signals that tend to show up late (stalls, objections).
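
Both changes can be sketched in plain Python. The `group_folds` helper below is a simplified stand-in for scikit-learn's GroupKFold (it does not balance fold sizes), and `score_window` is a placeholder for whatever model scores one window; both names are assumptions.

```python
from collections import defaultdict


def group_folds(conversation_ids, n_splits):
    """Yield (train_idx, test_idx) so each conversation lands on one side only."""
    by_group = defaultdict(list)
    for idx, conv_id in enumerate(conversation_ids):
        by_group[conv_id].append(idx)
    groups = sorted(by_group)
    for fold in range(n_splits):
        test_groups = set(groups[fold::n_splits])
        test_idx = [i for g in groups if g in test_groups for i in by_group[g]]
        train_idx = [i for g in groups if g not in test_groups for i in by_group[g]]
        yield train_idx, test_idx


def sliding_windows(turns, size=3):
    """Overlapping windows of `size` consecutive turns."""
    if len(turns) <= size:
        return [turns]
    return [turns[i:i + size] for i in range(len(turns) - size + 1)]


def conversation_score(turns, score_window, size=3):
    """Max-pool window scores into one conversation-level score."""
    return max(score_window(window) for window in sliding_windows(turns, size))


def toy_score(window):
    """Placeholder scorer: fires when any turn in the window mentions price."""
    return 1.0 if any("price" in turn for turn in window) else 0.1


turns = ["hello", "tell me more", "sounds good",
         "actually the price is a problem", "bye"]
late_signal = conversation_score(turns, toy_score)
```

Max-pooling means one confident window is enough to flag the whole conversation, which is why signals that appear late are no longer lost.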

I also dropped the idea of a global 0.5 threshold and optimized one threshold per signal from the PR curve. After that, the semantic gate macro F1 jumped from 56.08% → 78.86% (+22.78 points). Per-signal improvements were also big.
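
Per-signal threshold selection can be as simple as scanning the observed scores and keeping the cut-off with the best F1. The sketch below does that in plain Python on made-up scores; the `best_threshold` helper is hypothetical and would be run once per signal.

```python
def best_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 over the observed score values."""
    best = (0.5, 0.0)
    for threshold in sorted(set(scores)):
        pairs = list(zip(scores, labels))
        tp = sum(s >= threshold and y == 1 for s, y in pairs)
        fp = sum(s >= threshold and y == 0 for s, y in pairs)
        fn = sum(s < threshold and y == 1 for s, y in pairs)
        if tp == 0:
            continue
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Made-up scores for one signal whose positives sit well below 0.5.
scores = [0.10, 0.20, 0.30, 0.35, 0.80]
labels = [0, 0, 1, 1, 1]
threshold, f1 = best_threshold(scores, labels)
```

In this toy case a global 0.5 cut-off would miss two of the three positives, while the tuned threshold of 0.30 separates the classes cleanly. Thresholds should be picked on held-out folds so they transfer.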

Next up is active learning on the uncertain cases (uncertainty sampling & clustering for diversity is already wired), and then either a small finetune on corrected labels or sticking with LR if it keeps scaling.

If anyone here has done multi-label signal detection on transcripts: would you keep max-pooling for “presence” detection, or move to learned pooling/attention? And how do you handle thresholding/calibration cleanly when each label has totally different base rates and error costs?

submitted by /u/Euphoric_Network_887

UPDATED WITH TIMELINE – Audited $2.1B In Epstein Financial Records.

Hello again — I published N19 (Blueprint of a Financial Machine) recently and got requests for a timeline showing when the money moved. Built it out and added it to the bottom of N19. 158 months of dated transactions, 69 vetted persons of interest, 6 red-flag events. Wanted to post a follow-up letting folks know it’s there. Same link as before, just updated. Let me know if that’s good to go.

https://randallscott25-star.github.io/epstein-forensic-finance/narratives/19_grand_opus_narrative.html

submitted by /u/Specialist_Rip5492

Looking For Public Datasets Of English Idioms (idiom Text + Meaning + Example Sentences + Frequency If Possible)

I’m assembling a small resource to evaluate and improve “idiomaticity” in LLM rewrites (outputs can be fluent but still feel literal).
For that, I’m looking for datasets of English idiomatic expressions with:

  • idiom text (canonical form if possible)
  • meaning
  • example sentences
  • ideally some frequency signal
  • licensing that allows research
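
Put as a record type, the wish list above might look like the sketch below. The field names, the `is_usable` check, and the example entry (including its license string) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IdiomEntry:
    idiom: str                       # canonical form if available
    meaning: str
    examples: List[str] = field(default_factory=list)
    frequency: Optional[float] = None  # any frequency proxy, if the source has one
    license: Optional[str] = None


def is_usable(entry):
    """Require everything except the optional frequency signal."""
    return bool(entry.idiom and entry.meaning and entry.examples and entry.license)


# Hypothetical entry; the license value is a placeholder.
example = IdiomEntry(
    idiom="kick the bucket",
    meaning="to die",
    examples=["The old car finally kicked the bucket."],
    license="CC BY 4.0",
)
```

Keeping frequency optional lets corpora without counts still be merged, while the license field makes it easy to filter out sources that do not allow research use.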

Questions

  1. Are there any well-known public idiom corpora you’d recommend?
  2. Any good frequency proxies you’ve used for idioms?
  3. If you’ve built something similar: what fields ended up being most important?

If helpful, I can share the exact retrieval endpoint I’m using for testing — but mostly I’m looking for dataset pointers.

submitted by /u/Own-Importance3687

I Audited $2.1 Billion In Epstein Financial Records. Here’s Every Name The Money Touched.

This is the 19th data narrative from my forensic finance project analyzing the DOJ EFTA document releases. I built a relational database from 1.48 million public documents and traced 6,310 payments totaling $2.146 billion across 14 banks, 8 shell entities, and 123 connected nodes. This piece maps the full financial network — banks, shells, operators, and key persons — with every dollar amount sourced from publicly released DOJ documents, court filings, and SAR reports. All amounts are tagged (Unverified). No paywalls, no ads, no monetization. Pro bono work.

Full repository is public on GitHub.

Link: https://randallscott25-star.github.io/epstein-forensic-finance/narratives/19_grand_opus_narrative.html

Repository: https://github.com/randallscott25-star/epstein-forensic-finance

submitted by /u/Specialist_Rip5492