Data Quality Best Practices + Snowflake Connection For Sample Data

I’m seeking guidance on data quality management (DQ rules and data profiling) in Ataccama and on establishing a robust connection to Snowflake for sample data. What are your go-to strategies for profiling, cleansing, and enriching data in Ataccama? Are there any blogs or videos you would recommend?

submitted by /u/Substantial_Mix9205

Patterns In Data! Is There Any No-code Solution?

submitted by /u/Amazing_Database1964

[Resource] 20,000+ Pages Of U.S. House Oversight Epstein Estate Docs (OCR’d & Cleaned For RAG/Analysis)

submitted by /u/Ok-District-1330

Hello, I Am In Need Of A ‘Big’ Dataset.

The dataset I need should be at least 1 GB in size, and it will later be used with some ML algorithms. It can be for either a regression or a classification task. Thank you for the help!

submitted by /u/Mate0ff

Downloading Select Files / Avoiding Downloading Entire Datasets

https://cds.climate.copernicus.eu/

Assume that I have downloaded models, but I am unsure whether I have downloaded the full set of datasets.

I just want a way to get the provenance.json, provenance.png and the names of .nc files.

The rest is just comparing file names to confirm whether I have downloaded and placed the data correctly.
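
A rough sketch of that comparison step in Python, assuming the list of expected file names has already been obtained some other way (for example, copied from the dataset listing). The function name `compare_file_names` and its parameters are made up for illustration and are not part of any Copernicus tooling:

```python
from pathlib import Path


def compare_file_names(expected_names, local_dir, pattern="*.nc"):
    """Return (missing, unexpected) file names for local_dir.

    expected_names: names you believe the full dataset contains
                    (hypothetical input; how you get it is up to you).
    local_dir:      folder where the downloaded files were placed.
    """
    expected = set(expected_names)
    # Collect the names of matching files anywhere below local_dir.
    found = {path.name for path in Path(local_dir).rglob(pattern)}
    return sorted(expected - found), sorted(found - expected)
```

Calling `compare_file_names(names, "downloads")` gives two sorted lists: names that are expected but not on disk, and files on disk that were not expected. Only file names are compared, so a truncated download with the right name would not be detected.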

submitted by /u/__Muhammad_

We Built A Database Of 290,000 English Medieval Soldiers – Here’s What It Reveals

submitted by /u/cavedave

Are There Any Open Access Crop Row Datasets Like CRBD?

I am looking for stereo image datasets of crop rows taken from within the field (not aerial) for row identification, especially ones that include depth and segmentation. I came across CRBD and CropDeep, but the latter doesn’t seem to be publicly available yet. Any ideas would be really appreciated 🙂

submitted by /u/Majestic-Age-4636

Need Ideas For Utilizing GCP’s $300 Free Credits In The Next Three Days And Get The Most Long Term Value Out Of It (something That Stays Even After The Credits Expire)

My GCP account’s free trial expires in 3 days. I was hoping to get some long-term value out of it, something that stays even after the free credits expire, like maybe running a VM 24/7 for a data extraction process, but I’m not sure what kind of data to extract. Anything that can be of value to me later on, after the credits expire, would work; it doesn’t necessarily have to be datasets.

submitted by /u/Mean_Interest8611

Benchmarked TabPFN On 1M-10M Row Datasets

We just put out a blog post with TabPFN benchmarks on datasets from 1M to 10M rows.

For context: TabPFN is a transformer pretrained on millions of synthetic datasets that does in-context learning for tabular classification/regression. No hyperparameter tuning needed – you just give it training data at inference and it predicts.

TabPFNv2 published in Nature this year
TabPFN-2.5 beats models tuned for 4h (report here), #1 on TabArena leaderboard atm

Compared our Scaling Mode against CatBoost, XGBoost, LightGBM on internal classification datasets. Performance keeps improving with more data and the gap to gradient boosting isn’t shrinking.

Benchmark results show normalized scores across datasets plus individual results showing ROC AUC improvements. You can find them here: https://priorlabs.ai/technical-reports/large-data-model

Would be interesting to keep on benchmarking this on public large tabular datasets. Anyone know good large public tabular datasets?

submitted by /u/Diligent_Inside6746

Guidance On Beginning A Data Project On Matcha And Its Rise

Hello Reddit! Apologies if this isn’t the right sub, but I’m working on a fun data project exploring how matcha lattes have exploded in popularity over the last year or so.

The thing is, I’m having a hard time finding any datasets that actually include matcha sales. My backup idea is to look for a dataset from a boba or Thai tea shop (since they usually sell matcha) and compare those sales to a cafe over the same time period that may not sell matcha?

This project is just for fun—mainly an excuse for me to play around with Kaggle, SQL, R, etc.—so the dataset doesn’t have to be perfect. If anyone has suggestions, dataset ideas, or guidance on where to look, I’d really appreciate it!

submitted by /u/Pristine-Rhubarb-787

Looking For Science Education Data Sets

I have an introductory data science class, and my project requires me to do some basic analysis on a data set related to a topic I like. The topic I am genuinely interested in is computer science education, but I have had trouble finding a data set I can work with. I found the annual Stack Overflow survey, but I don’t think it will work because of how the questions were asked. I also found one that lists all the schools that offer computer science in the US, but my professor didn’t like that one. I have about two days to do the project, so I need to find the data today. I’ve decided that it can be something related to science in general, or even education in general. It’s a topic I want to study, but I have struggled so much to find a good data set that I am already pretty far from my original question anyway. Please, if anyone knows of anything, I’d love the help, and thanks to anyone who can!

submitted by /u/papiyou

I Have Tried To Scrape Current Premier League Table

I have tried to scrape the current Premier League table; the link is given here.

I will try to update it every week. If you like it, don’t forget to upvote it there and suggest what other datasets you want!

submitted by /u/Mental-Flight8195

96 Million iNaturalist Research-grade Plant Records Dataset (free And Open Source)

I’ve built a large-scale plant dataset from iNaturalist research-grade observations:
96.1 million rows containing:

species / genus / family names
GBIF taxonomy IDs
lat / lon
event dates
image URLs (iNat open data)
license information
dataset keys / source info

It’s meant for anyone doing:

image classification (plants, ecology, biodiversity)
large-scale ViT/ConvNext pretraining
location-aware species modelling
weak-supervised learning from image URLs
training LoRA adapters for regional plant ID

Dataset (parquet, streamable via HF Datasets):
https://huggingface.co/datasets/juppy44/gbif-plants-raw

let me know what you build with it!

submitted by /u/Lonely-Marzipan-9473

Synthetic HTTP Requests Dataset For AI WAF Training

This dataset is synthetically generated and contains a diverse set of HTTP requests, labeled as either ‘benign’ or ‘malicious’. It is designed for training and evaluating AI based Web Application Firewalls (WAFs).

submitted by /u/muneebdev

TagPilot – Image Dataset Preparation Tool

Hey guys, I just finished a simple tool to help you prepare your dataset for LoRA training. It suggests how to crop your images, tags all images using the Gemini API with several options, and more.

You can download it on GitHub: https://github.com/vavo/TagPilot

submitted by /u/no3us

I Asked An AI To “Generate A Poor Family” 5,000 Times. It Mostly Gave Me South Asians.

submitted by /u/Born_Shelter_8354

TikTok Trending Hashtags Dataset (2022-2025)

Introducing the tiktok-trending-hashtags dataset: a compilation of 1,830 unique trending hashtags on TikTok from 2022 to 2025. This dataset captures one-time and seasonal viral moments on TikTok and is well suited to researchers, marketers, and content creators studying viral content patterns on social media.

submitted by /u/Ok_Employee_6418

Can You Actually Make Money Building And Running A Digital-content E-commerce Platform From Scratch? “I Will Not Promote”

I’m thinking about building a digital-only e-commerce marketplace from scratch (datasets, models, data packages, technical courses). One-off purchases, subscriptions, licenses anyone can buy or sell. Does this still make sense today, or do competition and workload kill most of the potential profit?

submitted by /u/panspective

Is There A Reproducible Way To Quantify Dataset Drift Over Time?

I track daily extractions from several sources. Every month, something shifts – structure, language, or value ranges – and my models subtly degrade. I’d like a numeric drift score for datasets, not just ML features. Something that captures schema changes + statistical shifts + missing field ratios in one metric. Has anyone attempted that? What would your formula look like?
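
One possible shape for such a score, offered as a sketch rather than an established metric: combine the Jaccard distance between field sets (schema change), the mean absolute change in per-field missing ratios, and a squashed mean shift for numeric fields into one weighted sum. The function names, the default weights, and the choice of a standardized mean shift are all assumptions made for illustration; the result stays between 0 and 1 only when the weights sum to 1.

```python
from statistics import mean, pstdev


def _fields(rows):
    """All field names seen in a batch of dict records."""
    return {key for row in rows for key in row}


def _missing_ratio(rows, field):
    """Share of rows where the field is absent or None."""
    if not rows:
        return 0.0
    return sum(row.get(field) is None for row in rows) / len(rows)


def _numbers(rows, field):
    """Numeric values of a field, ignoring missing and non-numeric entries."""
    return [row[field] for row in rows
            if isinstance(row.get(field), (int, float))]


def drift_score(base, new, weights=(0.4, 0.3, 0.3)):
    """Weighted drift between two batches of records (weights are assumed)."""
    base_fields, new_fields = _fields(base), _fields(new)
    union = base_fields | new_fields
    shared = base_fields & new_fields

    # 1. Schema drift: Jaccard distance between the two field sets.
    schema = 1 - len(shared) / len(union) if union else 0.0

    # 2. Missing-value drift: mean absolute change in missing ratio.
    missing = mean(
        abs(_missing_ratio(base, f) - _missing_ratio(new, f)) for f in shared
    ) if shared else 0.0

    # 3. Value drift: mean shift in units of the baseline standard
    #    deviation, squashed into 0..1 with x / (1 + x).
    shifts = []
    for f in shared:
        old_vals, new_vals = _numbers(base, f), _numbers(new, f)
        if old_vals and new_vals:
            spread = pstdev(old_vals) or 1.0  # avoid division by zero
            x = abs(mean(new_vals) - mean(old_vals)) / spread
            shifts.append(x / (1 + x))
    value = mean(shifts) if shifts else 0.0

    w_schema, w_missing, w_value = weights
    return w_schema * schema + w_missing * missing + w_value * value
```

Comparing each day's extraction against a fixed reference batch keeps the score reproducible. Language shifts and changes in categorical fields are not covered here and would need their own component.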

submitted by /u/Vivid_Stock5288

Zillow Removes Data On Risk Of Homes To Disasters. Did Anyone Scrape It In Advance?

submitted by /u/cavedave

Data Share Platform (A Platform Where You Can Share Data, Targeted More Towards IT People)

submitted by /u/khaos238

I Built An API For Deep Web Research (with Country Filter) That Generates Reports With Source Excerpts And Crawl Logs

I’ve been working on an API that pulls web pages for a given topic, crawls them, and returns a structured research dataset.

You get the synthesized summary, the source excerpts it pulled from, and the crawl logs.
Basically a small pipeline that turns a topic into a verifiable mini dataset you can reuse or analyze.

I’m sharing it here because a few people told me the output is more useful than the “AI search” tools that hide their sources.

If anyone here works with web-derived datasets, I’d like honest feedback on the structure, fields, or anything that’s missing.

submitted by /u/Affectionate-Olive80

Network Structure Analysis: Detecting Anomalies In Redacted Public Records

submitted by /u/Old_Iron986

A Silent Data Void — Evidence Of Institutional Harm

It begins, as so many medical journeys regrettably do, with an act of faith. A person in crisis (frightened, disoriented, clinging to the thinnest thread of resolve) presents themselves to a system that advertises itself as a sanctuary. One imagines an orderly progression: distress recognized, risk assessed, treatment initiated, follow‑up secured. That is the mythology.

What actually unfolds bears little resemblance to such reassuring narratives. Instead, the patient is ushered through a succession of assessments — often repeated, conducted by staff stretched thin — and then discharged into a statistical void with nothing but vague promises of follow‑up. All of this proceeds with the serene confidence of an institution that knows no one is counting the outcomes. It is, in its quiet way, sinister.

I do not employ that word carelessly. “Sinister” is reserved for matters in which harm is not incidental but structural: the result of machinery designed without regard to the human beings ground within it. One thing becomes abundantly clear when tracing these medical peregrinations: the system is configured to avert its gaze precisely at the moments it should stare hardest.

The Growing Gap Between Demand and Capacity

Recent data show dramatic and sustained growth in demand for mental health services across England. In 2024/25, there were on average 453,930 new referrals to secondary mental health services every month, a 15% increase compared to 2022/23 (CQC, 2025). Yet despite this surge, systemic capacity has not scaled accordingly. Waiting times remain protracted, and bottlenecks continue to accumulate.

According to the most recent Care Quality Commission (CQC) “Community Mental Health Survey 2024,” which collected responses from over 14,000 people, a third (33%) reported waiting three months or more between their assessment and their first treatment appointment, and 14% waited more than six months (CQC, 2025). Meanwhile, two in five (40%) felt the waiting time was too long, and 42% reported their mental health worsened during that wait (CQC, 2025).

These findings reflect a severe mismatch — the system is accepting referrals, but it cannot guarantee timely treatment. And for many, “timely” is no longer meaningful if measured in months.

What the Data Does Not Capture — and Why That Silence Matters

If one draws a schematic of the typical pathway for a person in crisis — referral → assessment → treatment → outcome (improvement, stabilization, deterioration, or death) — a robust system would record every node. But in the current configuration of the national data‑sets, especially the Mental Health Services Data Set (MHSDS) and associated reporting frameworks, outcome data is scant or absent.

Specifically, publicly-available data rarely track whether:

each assessment (particularly crisis referrals) resulted in a first treatment contact within a clinically reasonable timeframe;

the person’s condition improved, stayed stable, deteriorated, or resulted in self-harm or suicide during the waiting period;

the assessment was conducted by appropriately qualified personnel (psychiatrist vs nurse vs unqualified staff);

there was continuity of care, repeated contacts, discharge, re-referral, or follow-up;

demographic variables — such as socioeconomic status, region, ethnicity — influence access, delay, or outcomes.

In short: there is no epidemiological “upstream‑to‑outcome” tracking for mental health crisis care. A system so structured effectively guarantees that failures — deterioration, relapse, suicide — may occur without ever being attributed back to the system’s delays or mismanagement. That “data‑void” is not incidental — it is functional. By omitting outcome‑tracking, the system immunises itself against systemic accountability.

The Human Cost — Testimony Speaks Where Quantitative Outcome Data Is Silent

Where quantitative, gold‑standard longitudinal outcome data fails, qualitative testimony still shows a consistent pattern of suffering and abandonment. In the 2025 survey by Rethink Mental Illness, many respondents described being left in crisis for months or years without meaningful support. The report quotes one individual:

“I received no help at all until it was too late. My psychosis was full‑on, and an attempted suicide was the only thing that got me help.” (Rethink Mental Illness, 2025, p. 7)

In that same survey, 3 in 4 respondents (73%) said they did not receive the right treatment at the right time (Rethink Mental Illness, 2025). A majority (83%) said their mental health had deteriorated while waiting, and approximately one in three (31%) reported they had attempted to take their own life during that wait (Rethink Mental Illness, 2025). Additional harms included increased self-harm behaviours, substance use, job loss, and repeated emergency‐service contact (Rethink Mental Illness, 2025).

When such qualitative testimonies are aggregated — repeated across hundreds of respondents — they form a pattern. A consistent motif of abandonment, institutional invisibility, and human cost. That this is experienced across different regions, conditions, and backgrounds suggests systemic failure — not just misfortune or isolated poor service.

Crisis Referrals: Escalation Without Resolution

The pressure on crisis services has surged. According to CQC 2024/25 data, the number of “very urgent” referrals to crisis teams rose sharply to 60,935 in 2024/25, a 77% increase compared with 2023/24 (CQC, 2025). Yet the capacity to respond has not kept pace: many people endure long waits, receive no follow-up, or are discharged after assessment without treatment. The report notes “inconsistencies in commissioning” and “huge variation in care depending on geography” (CQC, 2025).

These are not nominal failures — these are failures at the very moment of acute risk, when prompt intervention might make the difference between life and death.

The Structural Invisibility of Harm — Why “No Data” Means “No Accountability”

When a system fails to measure its outcomes, it removes the possibility of accountability. That is not just bureaucratic oversight — it is structural self‑preservation. Because we do not record:

how many people deteriorated or attempted self‑harm while waiting for treatment,

how many died by suicide following referral‑and-wait,

how many had repeated assessments without ever entering true treatment pathways,

which demographics are disproportionately harmed —

the system can survive waves of crisis, budget cuts, rising demand — and still claim “we met demand,” because what it counts is inputs (referrals, contacts, assessments, crisis calls) — not outcomes (recovery, stabilization, harm, death).

That is a disservice to the patients who fall through—and a betrayal of the social contract between public health and public trust.

Toward a Minimum Data Framework — What Real‑World Accountability Would Look Like

If one were to design a system that actually protected patients, rather than protected itself, one would demand the following data be collected and published (anonymised, aggregated, but with sufficient granularity):

Referral‑to‑treatment latency: for every referral or crisis assessment, record the date of first treatment contact; compute median, mean, distribution, disaggregated by risk level, region, demographic.

Longitudinal clinical outcomes: at defined intervals (e.g. 1, 3, 6, 12 months), record clinical status: stable, improved, worsened, self‑harm, suicide attempt, suicide.

Provider credentials data: for every assessment and treatment contact, record the role/qualifications of staff (psychiatrist, nurse, support worker, peer‑support, etc.).

Continuity and care trajectory: for each patient — number of repeated assessments, number of actual treatment interventions, discharges, re‑referrals, drop‑outs, follow‑up rates.

Equity / demographic metadata: age, gender, ethnicity, socioeconomic status, region — to reveal systemic inequalities and postcode‑lotteries.

Transparency and public reporting: annual publication of anonymised, aggregated outcome data — with sufficient detail to detect systemic failures, variation, and inequality.
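
As a minimal illustration of the first item, referral-to-treatment latency could be summarised per region roughly as follows. The `Referral` record and its field names are hypothetical and do not correspond to MHSDS fields; this is a sketch of the calculation, not a description of how any existing dataset is structured.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import mean, median
from typing import Optional


@dataclass
class Referral:
    """Hypothetical minimal record for one referral."""
    referral_date: date
    first_treatment_date: Optional[date]  # None = no treatment contact yet
    region: str


def latency_summary(referrals):
    """Per-region waiting statistics in days, plus the count still waiting."""
    waits = defaultdict(list)
    still_waiting = defaultdict(int)
    for r in referrals:
        if r.first_treatment_date is None:
            still_waiting[r.region] += 1
        else:
            days = (r.first_treatment_date - r.referral_date).days
            waits[r.region].append(days)

    regions = set(waits) | set(still_waiting)
    return {
        region: {
            "median_days": median(waits[region]) if waits[region] else None,
            "mean_days": mean(waits[region]) if waits[region] else None,
            "still_waiting": still_waiting[region],
        }
        for region in sorted(regions)
    }
```

Counting the people who never reached a first treatment contact alongside the waiting times matters here: a median computed only over those who were eventually treated would hide exactly the cases the essay is concerned with.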

In research‑terms: what is needed is a prospective longitudinal registry — analogous to those used in large‑scale chronic‑illness cohorts — but for mental‑health crisis referrals. Only such a registry could reveal the “mortality” of waiting lists, the morbidity of delay, and the human cost hidden within the clerical columns.

Why the Absence of Data Is Possibly the Strongest Evidence of Institutional Harm

We often regard bad data as a hindrance — something that complicates research. But in this context, “no data” is not an unfortunate oversight. It is likely the mechanism by which the system maintains plausible deniability.

If the system counted suicides that occur after referral‑and‑waiting, it might reveal a high mortality associated with waiting lists.

If it tracked repeated assessments without treatment, it might show that many people never receive care beyond a paper trail.

If it captured outcomes by region, it could expose inequalities and postcode‑lotteries.

If it recorded staff credentials, it would show how many assessments are done by under‑qualified staff — or outside recommended professional standards.

By failing to collect those data, the system ensures that such exposures are impossible.

The result: a healthcare institution that can truthfully claim “we handled X hundred thousand referrals this year,” while a large—and unknown—number of people deteriorated, self‑harmed, or died in limbo.

That is not negligence; that is structural self‑protection.

Conclusion: Silence Is Not Innocence — It’s Evidence

If one accepts that public‑health systems owe patients not only care but accountability, then the absence of outcome data for mental‑health crisis care must be understood as a failure of duty.

We do not have reliable epidemiological data on how many people assessed in crisis go on to receive timely, adequate treatment — nor on how many deteriorate or self‑harm or die while waiting. What we do have — in surveys and qualitative testimonies — is clear evidence that many endure intolerable delay, inadequate or inappropriate care, repeated institutional abandonment.

In research‑terms: this means the “denominator” (people assessed) is known — but the “numerator” (people treated successfully; people harmed; people lost) is invisible. A ratio that can never be calculated. A failure that can never be quantified.

Yet that invisibility is precisely where the greatest harm occurs. It is a void that swallows stories, strips suffering of official recognition, and renders statistical the fate of individuals.

This is not a benign omission — it is a method of institutional self‑preservation.

Until we insist — politically, socially, ethically — that mental health outcomes be tracked with the same rigor as physical health outcomes, the system will continue to shield itself behind the pretense of “data.” But that very pretense is the most damning data of all: the data that tells us the system does not care to know its failures — and in doing so, ensures they continue.

References

Care Quality Commission. (2025). High demand, long waits, and insufficient support, mean people with mental health issues still not getting the support they need [Press release].

Care Quality Commission. (2025). State of Care 2024/25 — Mental health: Access, demand and complexity.

Rethink Mental Illness. (2025). Right Treatment, Right Time 2025 report.

submitted by /u/AdventurousFeeling20

Total Users Of Music Streaming Services Each Year For The Past ~20 Years

I am looking for some well-sourced data that (in one way or another) shows the increase in popularity of music streaming services since their inception (or at least from fairly early on). This can be in the form of global revenue or total users, and ideally would be the total for multiple music streaming services (although just the top one is fine too).

TLDR: Any usable data accurately showing the usage of music streaming services year by year.

submitted by /u/padlowan

[PAID] I Have This Data On Hand – Global Business Activity Datasets (Jobs, News, Tech, And Business Connections)

I have access to a set of large-scale business activity datasets that might be interesting for anyone working on market research, enrichment, or business intelligence projects.

The data comes from company websites and public sources, focused on tracking real-world signals like hiring, funding, and partnerships.

Job Openings Data

Since 2018: 232M+ job openings detected
8.5M active job openings currently tracked (extracted directly from company websites)

News Events Data

Since 2016: 8.6M+ news events detected
Categorized into 29 event types such as receives funding, expanding locations, hiring C-level executives, etc.
Includes a subset dataset: Financing Events – 214K funding rounds tracked since 2016

Technologies Data

Since 2018: 1B+ technology adoptions detected
Coverage: 65M websites

Key Customers / Business Connections Data

Since 2019: 248M connections detected
Coverage: 50M websites
Uses an image recognition system to scan logos found on company websites and categorize relationships such as customers, partners, vendors, and investors

—————————————–

Used for: Sales and marketing intelligence, consulting, investment research, and trend analysis.

—————————————–

Feel free to drop me a question if you have any.

submitted by /u/Expensive_Horse6568

Looking To Find A Data Set From An Electric Company Based In The Philippines

For our stupid final project, we need to acquire a data set from an electric company, clean it, and write a concept paper on it. My team and I originally chose Mpower, but private companies just do not publish their data sets easily, so we’re looking for other companies that have a public data set we can work on.

submitted by /u/Enterinaf

I Built A Free Random Data Generator For Devs

submitted by /u/Intelligent_Noise_34

I’m Making An App That Predicts Where A Flood Will Occur And Tells Users What They Will Experience If They Travel There

To clarify, this app will cover three rural villages on the way to a main town. Do I need to request flood data from my country’s official weather agency to get the information I need to build this app so that its predictions are precise?

This is a project for my grade 12 subject. I had this idea, and my group members and I will build it.

I’ve been doing some research to find the necessary data for the three towns we picked, but I can’t find anything.

submitted by /u/victeriano

Transitioning From Java Spring Boot To Data Engineering: Where Should I Start And Is Python Mandatory?

submitted by /u/Madhudhanusu_K