Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Tiktok Trending Hashtags Dataset (2022-2025)

Introducing the tiktok-trending-hashtags dataset: a compilation of 1,830 unique trending hashtags on TikTok from 2022 to 2025. This dataset captures one-off and seasonal viral moments on TikTok and is perfect for researchers, marketers, and content creators studying viral content patterns on social media.

submitted by /u/Ok_Employee_6418

I Built An API For Deep Web Research (with Country Filter) That Generates Reports With Source Excerpts And Crawl Logs

I’ve been working on an API that pulls web pages for a given topic, crawls them, and returns a structured research dataset.

You get the synthesized summary, the source excerpts it pulled from, and the crawl logs.
Basically a small pipeline that turns a topic into a verifiable mini dataset you can reuse or analyze.
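
For illustration, here is a minimal sketch of what such a response could look like; the field names below are hypothetical, not the API’s actual schema.

```python
# Hypothetical response shape for one topic-research request.
# Field names are illustrative, not the actual API schema.
response = {
    "topic": "solid-state battery manufacturing",
    "summary": "Synthesized overview built from the crawled sources...",
    "sources": [
        {
            "url": "https://example.com/article",
            "excerpt": "The exact passage the summary drew on...",
            "retrieved_at": "2025-11-20T14:02:11Z",
        },
    ],
    "crawl_log": [
        {"url": "https://example.com/article", "status": 200, "depth": 1},
        {"url": "https://example.com/blocked", "status": 403, "depth": 2},
    ],
}
```

The point of the structure is that every claim in the summary can be traced back to a source excerpt, and the crawl log makes the collection process itself auditable.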

I’m sharing it here because a few people told me the output is more useful than the “AI search” tools that hide their sources.

If anyone here works with web-derived datasets, I’d like honest feedback on the structure, fields, or anything that’s missing.

submitted by /u/Affectionate-Olive80

A Silent Data Void — Evidence Of Institutional Harm

It begins, as so many medical journeys regrettably do, with an act of faith. A person in crisis — frightened, disoriented, clinging to the thinnest thread of resolve — presents themselves to a system that advertises itself as a sanctuary. One imagines an orderly progression: distress recognized, risk assessed, treatment initiated, follow‑up secured. That is the mythology.

What actually unfolds bears little resemblance to such reassuring narratives. Instead, the patient is ushered through a succession of assessments — often repeated, conducted by staff stretched thin — and then discharged into a statistical void with nothing but vague promises of follow‑up. All of this proceeds with the serene confidence of an institution that knows no one is counting the outcomes. It is, in its quiet way, sinister.

I do not employ that word carelessly. “Sinister” is reserved for matters in which harm is not incidental but structural: the result of machinery designed without regard to the human beings ground within it. One thing becomes abundantly clear when tracing these medical peregrinations: the system is configured to avert its gaze precisely at the moments it should stare hardest.

The Growing Gap Between Demand and Capacity

Recent data show dramatic and sustained growth in demand for mental health services across England. In 2024/25, there were on average 453,930 new referrals to secondary mental health services every month — a 15% increase compared to 2022/23 (CQC, 2025). Yet despite this surge, systemic capacity has not scaled accordingly. Waiting times remain protracted, and bottlenecks continue to accumulate.

According to the most recent Care Quality Commission (CQC) “Community Mental Health Survey 2024,” which collected responses from over 14,000 people, a third (33%) reported waiting three months or more between their assessment and their first treatment appointment, and 14% waited more than six months (CQC, 2025). Meanwhile, two in five (40%) felt the waiting time was too long, and 42% reported their mental health worsened during that wait (CQC, 2025).

These findings reflect a severe mismatch — the system is accepting referrals, but it cannot guarantee timely treatment. And for many, “timely” is no longer meaningful if measured in months.

What the Data Does Not Capture — and Why That Silence Matters

If one draws a schematic of the typical pathway for a person in crisis — referral → assessment → treatment → outcome (improvement, stabilization, deterioration, or death) — a robust system would record every node. But in the current configuration of the national data‑sets, especially the Mental Health Services Data Set (MHSDS) and associated reporting frameworks, outcome data is scant or absent.

Specifically, publicly-available data rarely track whether:

each assessment (particularly crisis referrals) resulted in a first treatment contact within a clinically reasonable timeframe;

the person’s condition improved, stayed stable, deteriorated, or resulted in self-harm or suicide during the waiting period;

the assessment was conducted by appropriately qualified personnel (psychiatrist vs nurse vs unqualified staff);

there was continuity of care, repeated contacts, discharge, re-referral, or follow-up;

demographic variables — such as socioeconomic status, region, ethnicity — influence access, delay, or outcomes.

In short: there is no epidemiological “upstream‑to‑outcome” tracking for mental health crisis care. A system so structured effectively guarantees that failures — deterioration, relapse, suicide — may occur without ever being attributed back to the system’s delays or mismanagement. That “data‑void” is not incidental — it is functional. By omitting outcome‑tracking, the system immunises itself against systemic accountability.

The Human Cost — Testimony Speaks Where Quantitative Outcome Data Is Silent

Where quantitative, gold‑standard longitudinal outcome data fails, qualitative testimony still shows a consistent pattern of suffering and abandonment. In the 2025 survey by Rethink Mental Illness, many respondents described being left in crisis for months or years without meaningful support. The report quotes one individual:

“I received no help at all until it was too late. My psychosis was full‑on, and an attempted suicide was the only thing that got me help.” (Rethink Mental Illness, 2025, p. 7)

In that same survey, nearly 3 in 4 respondents (73%) said they did not receive the right treatment at the right time (Rethink Mental Illness, 2025). A large majority (83%) said their mental health had deteriorated while waiting, and approximately one in three (31%) reported they had attempted to take their own life during that wait (Rethink Mental Illness, 2025). Additional harms included increased self-harm behaviours, substance use, job loss, and repeated emergency-service contact (Rethink Mental Illness, 2025).

When such qualitative testimonies are aggregated — repeated across hundreds of respondents — they form a pattern. A consistent motif of abandonment, institutional invisibility, and human cost. That this is experienced across different regions, conditions, and backgrounds suggests systemic failure — not just misfortune or isolated poor service.

Crisis Referrals: Escalation Without Resolution

The pressure on crisis services has surged. According to CQC 2024/25 data, the number of “very urgent” referrals to crisis teams rose sharply — to 60,935 in 2024/25, marking a 77% increase compared with 2023/24 (CQC, 2025). Yet the capacity to respond has not kept pace: many people endure long waits, receive no follow-up, or are discharged after assessment without treatment. The report notes “inconsistencies in commissioning” and “huge variation in care depending on geography” (CQC, 2025).

These are not marginal failures — these are failures at the very moment of acute risk, when prompt intervention might make the difference between life and death.

The Structural Invisibility of Harm — Why “No Data” Means “No Accountability”

When a system fails to measure its outcomes, it removes the possibility of accountability. That is not just bureaucratic oversight — it is structural self‑preservation. Because we do not record:

how many people deteriorated or attempted self‑harm while waiting for treatment,

how many died by suicide following referral‑and-wait,

how many had repeated assessments without ever entering true treatment pathways,

which demographics are disproportionately harmed —

the system can survive waves of crisis, budget cuts, rising demand — and still claim “we met demand,” because what it counts is inputs (referrals, contacts, assessments, crisis calls) — not outcomes (recovery, stabilization, harm, death).

That is a disservice to the patients who fall through—and a betrayal of the social contract between public health and public trust.

Toward a Minimum Data Framework — What Real‑World Accountability Would Look Like

If one were to design a system that actually protected patients, rather than protected itself, one would demand the following data be collected and published (anonymised, aggregated, but with sufficient granularity):

Referral‑to‑treatment latency: for every referral or crisis assessment, record the date of first treatment contact; compute median, mean, distribution, disaggregated by risk level, region, demographic.

Longitudinal clinical outcomes: at defined intervals (e.g. 1, 3, 6, 12 months), record clinical status: stable, improved, worsened, self‑harm, suicide attempt, suicide.

Provider credentials data: for every assessment and treatment contact, record the role/qualifications of staff (psychiatrist, nurse, support worker, peer‑support, etc.).

Continuity and care trajectory: for each patient — number of repeated assessments, number of actual treatment interventions, discharges, re‑referrals, drop‑outs, follow‑up rates.

Equity / demographic metadata: age, gender, ethnicity, socioeconomic status, region — to reveal systemic inequalities and postcode‑lotteries.

Transparency and public reporting: annual publication of anonymised, aggregated outcome data — with sufficient detail to detect systemic failures, variation, and inequality.

In research‑terms: what is needed is a prospective longitudinal registry — analogous to those used in large‑scale chronic‑illness cohorts — but for mental‑health crisis referrals. Only such a registry could reveal the “mortality” of waiting lists, the morbidity of delay, and the human cost hidden within the clerical columns.
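
To make the granularity concrete, a single record in such a registry might look like the sketch below; the field names are illustrative, not drawn from MHSDS or any existing standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Illustrative record for a crisis-referral outcome registry.
# Field names are hypothetical, not taken from MHSDS.
@dataclass
class CrisisReferralRecord:
    referral_date: date
    risk_level: str                        # e.g. "routine", "urgent", "very urgent"
    first_treatment_date: Optional[date]   # None = treatment never began
    assessor_role: str                     # "psychiatrist", "nurse", "support worker", ...
    outcomes: dict = field(default_factory=dict)  # e.g. {"3m": "improved", "12m": "stable"}
    n_assessments: int = 1
    discharged_without_treatment: bool = False
    region: str = ""
    deprivation_decile: Optional[int] = None

    @property
    def wait_days(self) -> Optional[int]:
        """Referral-to-treatment latency; None if treatment never began."""
        if self.first_treatment_date is None:
            return None
        return (self.first_treatment_date - self.referral_date).days
```

Aggregated across a national cohort, even this minimal record would make the latency distributions, outcome rates, and regional inequalities described above directly computable.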

Why the Absence of Data Is Possibly the Strongest Evidence of Institutional Harm

We often regard bad data as a hindrance — something that complicates research. But in this context, “no data” is not an unfortunate oversight. It is likely the mechanism by which the system maintains plausible deniability.

If the system counted suicides that occur after referral‑and‑waiting, it might reveal a high mortality associated with waiting lists.

If it tracked repeated assessments without treatment, it might show that many people never receive care beyond a paper trail.

If it captured outcomes by region, it could expose inequalities and postcode‑lotteries.

If it recorded staff credentials, it would show how many assessments are done by under‑qualified staff — or outside recommended professional standards.

By failing to collect those data, the system ensures that such exposures are impossible.

The result: a healthcare institution that can truthfully claim “we handled X hundred thousand referrals this year,” while a large—and unknown—number of people deteriorated, self‑harmed, or died in limbo.

That is not negligence; that is structural self‑protection.

Conclusion: Silence Is Not Innocence — It’s Evidence

If one accepts that public‑health systems owe patients not only care but accountability, then the absence of outcome data for mental‑health crisis care must be understood as a failure of duty.

We do not have reliable epidemiological data on how many people assessed in crisis go on to receive timely, adequate treatment — nor on how many deteriorate or self‑harm or die while waiting. What we do have — in surveys and qualitative testimonies — is clear evidence that many endure intolerable delay, inadequate or inappropriate care, repeated institutional abandonment.

In research‑terms: this means the “denominator” (people assessed) is known — but the “numerator” (people treated successfully; people harmed; people lost) is invisible. A ratio that can never be calculated. A failure that can never be quantified.

Yet that invisibility is precisely where the greatest harm occurs. It is a void that swallows stories, strips suffering of official recognition, and renders statistical the fate of individuals.

This is not a benign omission — it is a method of institutional self‑preservation.

Until we insist — politically, socially, ethically — that mental health outcomes be tracked with the same rigor as physical health outcomes, the system will continue to shield itself behind the pretense of “data.” But that very pretense is the most damning data of all: the data that tells us the system does not care to know its failures — and in doing so, ensures they continue.

References

Care Quality Commission. (2025). High demand, long waits, and insufficient support, mean people with mental health issues still not getting the support they need [Press release].

Care Quality Commission. (2025). State of Care 2024/25 — Mental health: Access, demand and complexity.

Rethink Mental Illness. (2025). Right Treatment, Right Time 2025 report.

submitted by /u/AdventurousFeeling20

Total Users Of Music Streaming Services Each Year For The Past ~20 Years

I am looking for some well-sourced data that (in one way or another) shows the rise in popularity of music streaming services since their inception (or at least from fairly early on). This can be in the form of global revenue or total users, and ideally would be the total across multiple music streaming services (although just the top one is fine too).

TLDR: Any usable data accurately showing the usage of music streaming services year-by-year.

submitted by /u/padlowan

[PAID] I Have This Data On Hand – Global Business Activity Datasets (Jobs, News, Tech, And Business Connections)

I have access to a set of large-scale business activity datasets that might be interesting for anyone working on market research, enrichment, or business intelligence projects.

The data comes from company websites and public sources, focused on tracking real-world signals like hiring, funding, and partnerships.

Job Openings Data

  • Since 2018: 232M+ job openings detected
  • 8.5M active job openings currently tracked (extracted directly from company websites)

News Events Data

  • Since 2016: 8.6M+ news events detected
  • Categorized into 29 event types such as receives funding, expanding locations, hiring C-level executives, etc.
  • Includes a subset dataset: Financing Events – 214K funding rounds tracked since 2016

Technologies Data

  • Since 2018: 1B+ technology adoptions detected
  • Coverage: 65M websites

Key Customers / Business Connections Data

  • Since 2019: 248M connections detected
  • Coverage: 50M websites
  • Uses an image recognition system to scan logos found on company websites and categorize relationships such as customers, partners, vendors, and investors


Used for: Sales and marketing intelligence, consulting, investment research, and trend analysis.


Feel free to drop me a question if you have any.

submitted by /u/Expensive_Horse6568

I’m Making An App That Predicts Where A Flood Will Appear And Tells Users What They Will Experience If They Travel There

To clarify: this app will cover the route from three rural villages to a main town. Do I need to request flood data from my country’s official weather agency to get the information I need to build this app so it produces precise results?

This is a project for my grade 12 subject. I had this idea, and my group members and I will build it.

I’ve been doing some research to find the necessary data for the three towns we picked, but I can’t find anything.

submitted by /u/victeriano

Is There A Practical Standard For Documenting Web-scraped Datasets?

Every dataset repo has its own README style: some list sources, others list fields, and almost none explain the extraction process. I’m thinking scraped data deserves its own metadata standard: crawl date, frequency, robots.txt compliance, schema history, coverage ratio. But no one seems to agree on how deep to go. How would you design a reproducible, lightweight standard for scraped-data documentation, something between a bare-minimum CSV README and an academic paper appendix?
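
As one possible starting point, here is a minimal sketch of such a standard expressed as a machine-readable sidecar kept next to the data; the fields are a design suggestion, not an existing specification.

```python
# One possible "scrape manifest" sidecar, stored next to the dataset.
# Fields are a design sketch, not an existing standard.
scrape_manifest = {
    "source_urls": ["https://example.com/listings"],
    "crawl_date": "2025-11-20",
    "crawl_frequency": "weekly",
    "robots_txt_compliant": True,
    "user_agent": "my-crawler/1.2 (contact@example.com)",
    "extraction_tool": "scrapy 2.11 + custom parsers",
    "schema_version": 3,
    "schema_history": {
        2: "2024-06-01: renamed `cost` to `price_usd`",
        3: "2025-02-15: added `currency` column",
    },
    "coverage_ratio": 0.94,  # rows captured / rows believed to exist
    "known_gaps": ["JS-rendered detail pages skipped before 2024-09"],
}
```

Kept to a single flat file like this, the standard stays lightweight enough to fill in by hand yet structured enough to validate automatically.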

submitted by /u/Vivid_Stock5288

I’ve Built An Automatic Data Cleaning Application. Looking For MESSY Spreadsheets To Clean/test.

Hello everyone!

I’m a data analyst/software developer. I’ve built data cleaning, processing, and analysis software, but I need datasets to clean and test it out thoroughly.

I’ve used AI-generated datasets, which work well at first but hallucinate a lot of random data after a while.

I’ve used datasets from kaggle but most of them are pretty clean.

I’m looking for any datasets in any industry to test the cleaning process. Preferably datasets that take a long time to clean and process before doing the data analysis.

CSV and xlsx file types. Anything helps! 🙂 Thanks

submitted by /u/spicytree21

Looking For Housing Price Dataset To Do Regression Analysis For School

Hi all, I’m looking through Kaggle to find a housing dataset with at least 20 columns of data, and I can’t find any that look good with over 20 columns. Do you know of one off the top of your head, or could you find one quickly?

I’m looking for one with attributes like “roof replaced x years ago”, garage size measured in cars, square footage, etc. Anything that might change the value of a house. The one I’ve got now has only 13 columns, which will work, but I would like to find a better one.

submitted by /u/labor_anoymous

What Your Data Provider Won’t Tell You: A Practical Guide To Data Quality Evaluation

Hey everyone!

Coresignal here. We know Reddit is not the place for marketing fluff, so we will keep this simple.

We are hosting a free webinar on evaluating B2B datasets, and we thought some people in this community might find the topic useful. Data quality gets thrown around a lot, but the “how to evaluate it” part usually stays vague. Our goal is to make that part clearer.

What the session is about

Our data analyst will walk through a practical 6-step framework that anyone can use to check the quality of external datasets. It is not tied to our product. It is more of a general methodology.

He will cover things like:

  • How to check data integrity in a structured way
  • How to compare dataset freshness
  • How to assess whether profiles are valid or outdated
  • What to look for in metadata if you care about long-term reliability

When and where

  • December 2 (Tuesday)
  • 11 AM EST (New York)
  • Live, 45 minutes + Q&A

Why we are doing it

A lot of teams rely on third-party data and end up discovering issues only after integrating it. We want to help people avoid those situations by giving a straightforward checklist they can run through before committing to any provider.

If this sounds relevant to your work, you can save a spot here:
https://coresignal.com/webinar/

Happy to answer questions if anyone has them.

submitted by /u/Coresignal

Looking For A Piracy Dataset On Games

My university requires me to do a data analysis capstone project, and I have decided to test a hypothesis about a country’s level of game piracy based on GDP per capita: that the prices these games sell for are unaffordable for the masses, and that the prices are unfair relative to GDP per capita. Comment on what you think, and if you have a better idea, please enlighten me. Also, please suggest a dataset for this, because I can’t find anything that’s publicly available.

submitted by /u/NecessaryBig2035

[Offer] Glassdoor MSCI Companies Job Review Dataset (2145 Companies, 1.31GB) – Preview Available

Hi everyone,

I’m offering a structured dataset of employee job reviews for MSCI index companies, built from public job review platforms (e.g. Glassdoor).

I’m sharing a free preview sample, and the full dataset (1.31 GB) is available on request.

🗂 Dataset Overview

  • Coverage: 2,145 MSCI-listed companies
  • Size: ~1.31 GB
  • Content: company-level job reviews, including:
      • overall rating information
      • job titles and review dates
      • free-text review content (pros/cons, comments, etc., where available)
  • Timeframe: recent data (latest version at time of collection)

The data is cleaned and structured for analytics and modeling (CSV / similar tabular format).

🔧 Potential Use Cases

  • HR & people analytics – benchmarking employee satisfaction across MSCI companies
  • NLP / LLM training – sentiment analysis, aspect-based opinion mining, topic clustering
  • Market & equity research – linking employee sentiment to performance, risk, or ESG signals
  • Academic / research projects – labor studies, organizational behavior, etc.

📥 Preview & Full Access

I’m happy to provide a small preview sample so you can check structure and suitability for your use case.

If you’re interested in the full version of this dataset, please contact me directly:

📧 [a.corradini0215@gmail.com](mailto:a.corradini0215@gmail.com)

We can discuss:

  • Use case (research vs. commercial)
  • Licensing / usage terms
  • Pricing and any customization (e.g., specific sectors, time ranges)

⚖️ Notes

Please ensure that any use of the dataset complies with your local laws, your organization’s policies, and the terms of the original review platforms. I’m happy to clarify the structure and collection approach if needed.

Thanks, and feel free to ask questions here or by email if you want more details about fields, schema, or example rows.

submitted by /u/Crafty_Beach_3733

[PAID] I Spent Months Scraping 140+ Low-cap Solana Memecoins From Launch (10s Intervals), Dataset Just Published!

Disclosure: This is my own dataset. Access is gated.

Hey everyone,

I’ve been working on a dataset since September, and finally published it on Hugging Face.

I’ve traded (well.. gambled) with Solana memecoins for almost 3 years now, and discovered an incredible amount of factors at play when trying to determine if a coin was worth buying.

I’d dabble mostly in low market cap coins, while keeping the vast majority of my crypto assets in mid-high cap coins, Bitcoin for example. It was upsetting seeing new narratives with high price potential go straight to 0, so I finally decided to start approaching this emotional game logically.

I ended up building a web scraper to constantly scrape new coin data as coins were deployed, while simultaneously making API calls for each coin’s social data, rugcheck data, and tons of other tokenomics.

The dataset includes a large number of features per token snapshot (one pulse at most every 10 seconds), such as:

  • market cap
  • volume
  • holders
  • top 10 holder %
  • bot holding estimates
  • dev wallet behavior
  • social links
  • linked website scraping analysis (title, HTML, reputation, etc.)
  • rugcheck scores
  • up to hundreds of other features

In total I collected thousands of coins’ chart histories and filtered them down to 140+ clean charts, each with nearly 300 data points on average.

With some quick exploratory analysis, I was able to spot smaller patterns, such as how the presence of social links can correlate with a higher market-cap ATH. I’m a data engineer, not a data scientist; I’m sure those with formal ML backgrounds could find much deeper patterns and predictive signals in this dataset than I can.
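
As a sketch of that kind of check (the file path and column names here are hypothetical; the real schema is on the dataset card):

```python
import pandas as pd

# Hypothetical file and column names; see the dataset card for the real schema.
df = pd.read_parquet("memecoin_snapshots.parquet")

# Collapse snapshots to one row per token: ATH market cap, plus whether
# the token ever had social links attached.
per_token = df.groupby("token_address").agg(
    ath_market_cap=("market_cap", "max"),
    has_socials=("social_links_count", lambda s: (s > 0).any()),
)

# Compare median ATH market cap with and without social links.
print(per_token.groupby("has_socials")["ath_market_cap"].median())
```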

For the full dataset description/structure/charts/and examples, see the Hugging Face Dataset Card.

submitted by /u/wtfmase

Times Higher Education World University Rankings Dataset (2011-2026) – 44K Records, CSV/JSON, Python Scraper Included

I’ve created a comprehensive dataset of Times Higher Education World University Rankings spanning 16 years (2011-2026).

📊 Dataset Details:

  • 44,000+ records from 2,750+ universities worldwide
  • 16 years of historical data (2011-2026)
  • Dual format: clean CSV files + full JSON backups
  • Two data types: rankings scores AND key statistics (enrollment, staff ratios, international students, etc.)

📈 What’s included:

  • Overall scores and individual metrics (teaching, research, citations, industry, international outlook)
  • Student demographics and institutional statistics
  • Year-over-year trends ready for analysis

🔧 Python scraper included: The repo includes a fast, reliable Python scraper that:

  • Uses direct API calls (no browser automation)
  • Fetches all data in 5-10 minutes
  • Requires only requests and pandas
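
The fetch pattern is roughly the following sketch; the endpoint URL and response fields below are placeholders, not THE’s actual API (the repo contains the real ones).

```python
import requests
import pandas as pd

# Placeholder endpoint and field names; the repo has the real ones.
BASE_URL = "https://example.com/api/rankings"

rows = []
for year in range(2011, 2027):
    resp = requests.get(BASE_URL, params={"year": year}, timeout=30)
    resp.raise_for_status()
    for entry in resp.json()["data"]:
        rows.append({
            "year": year,
            "university": entry["name"],
            "rank": entry["rank"],
            "overall_score": entry["overall_score"],
        })

# One tidy table covering all 16 years, ready for CSV export.
pd.DataFrame(rows).to_csv("the_rankings_2011_2026.csv", index=False)
```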

💡 Use cases:

  • Academic research on higher education trends
  • Data visualization projects
  • Institutional benchmarking
  • ML model training
  • University comparison tools

GitHub: https://github.com/c3nk/THE-World-University-Rankings

The scraper respects THE’s public API endpoints and is designed for educational/research purposes. All data is sourced from Times Higher Education’s official rankings.

Feel free to fork, star, or suggest improvements!

submitted by /u/cenkK

REST API To Dataset: Just A Few Prompts Away

Hey folks, senior data engineer and dlthub cofounder here (dlt = OSS Python library for data integration).

Most datasets are behind REST APIs. We created a system by which you can vibe-code a REST API connector (Python dict based, looks like config, easy to review), including LLM context, a debug app, and easy ways to explore your data.

We describe it as our “LLM native” workflow. Your end product is a resilient, self-healing, production-grade pipeline. We created 8,800+ contexts to facilitate this generation, but it also works without them, to a lesser degree. Our next step: early next year, we will generate running code.

Blog tutorial with video: https://dlthub.com/blog/workspace-video-tutorial

And once you have created this pipeline, you can access it via what we call the dataset interface (https://dlthub.com/docs/general-usage/dataset-access/dataset), which is a runtime-agnostic way to query your data (meaning we spin up a DuckDB on the fly if you load to files, but if you load to a database we use that).
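
For a feel of the workflow, here is a minimal sketch along the lines of dlt’s documented dict-based REST API source (check the docs linked above for the exact options; the API and table names here are just an example):

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Dict-based connector config: reads like configuration, easy to review.
source = rest_api_source({
    "client": {"base_url": "https://pokeapi.co/api/v2/"},
    "resources": ["pokemon", "berry"],
})

pipeline = dlt.pipeline(
    pipeline_name="poke_demo",
    destination="duckdb",   # could also be files or a warehouse
    dataset_name="poke_data",
)
pipeline.run(source)

# Dataset interface: query the loaded data the same way
# regardless of where it landed.
print(pipeline.dataset().pokemon.df().head())
```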

More education opportunities from us (data engineering courses): https://dlthub.learnworlds.com/

hope this was useful, feedback welcome

submitted by /u/Thinker_Assignment

University Statistics Report Confusion

I am doing a statistics report but I am really struggling. The task is this: describe the GPA variable numerically and graphically, and interpret your findings in context. I understand all the basic concepts, such as spread, variability, centre, etc., but how do I word it in the report, and in what order? Here is what I have written so far for the image posted (I split it into a numerical and a graphical summary).

The mean GPA of students is 3.158, indicating that the average student has a GPA close to 3.2, with a standard deviation of 0.398. This indicates that most GPAs fall within 0.4 points above or below the mean. The median is 3.2, which is slightly higher than the mean, suggesting a slight skew to the left. With Q1 at 2.9 and Q3 at 3.4, 50% of the students have GPAs between these values, suggesting there is little variation between student GPAs. The minimum GPA is 2 and the maximum is 4. Using the 1.5×IQR rule to determine potential outliers, the lower boundary is 2.15 and the upper boundary is 4.15. A minimum of 2 indicates potential outliers, explaining why the mean is slightly lower than the median.
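
As a quick sanity check on those fence values (a minimal sketch using only the summary statistics quoted above):

```python
# 1.5 * IQR outlier fences from the quartiles quoted above.
q1, q3 = 2.9, 3.4
iqr = q3 - q1                # 0.5
lower = q1 - 1.5 * iqr       # 2.15
upper = q3 + 1.5 * iqr       # 4.15
print(lower, upper)          # 2.15 4.15

# The minimum (2.0) falls below 2.15, so the low end contains potential
# outliers, consistent with the mean (3.158) sitting below the median (3.2).
```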

Because GPA is a continuous variable, a histogram is appropriate to show the distribution. The histogram shows a unimodal distribution that is mostly symmetrical with a slight left skew, indicating a cluster of higher GPAs and relatively few lower GPAs.

Here is what is asked of us when describing a variable: Demonstrates precision in summarising and interpreting quantitative and categorical variables. Justifies choice of graphs/statistics. Interprets findings critically within the report narrative, showing awareness of variable type and distributional meaning.

submitted by /u/Sad-Beautiful-7945

What’s Your Preferred Way To Store Incremental Updates For Large Datasets?

I’m maintaining a dataset that changes daily. Full refreshes are too heavy; diffs get messy. I’ve tried append-only logs, versioned tables, even storing compressed deltas. Each tradeoff hurts either readability, reproducibility, or storage. If you manage big evolving datasets, how do you structure yesterday + today without rewriting history or duplicating half your records?
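
One pattern that balances those tradeoffs is a versioned table with validity intervals (SCD type 2 style). A minimal sketch, with illustrative table and column names:

```python
import sqlite3

# Versioned table: each key keeps one open row (valid_to IS NULL) plus
# closed historical rows. Table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE records (
        key TEXT, value TEXT,
        valid_from TEXT, valid_to TEXT   -- NULL valid_to = current row
    )
""")

def upsert(key: str, value: str, today: str) -> None:
    """Close the current version only if the value changed, then insert."""
    cur = con.execute(
        "SELECT value FROM records WHERE key = ? AND valid_to IS NULL", (key,)
    ).fetchone()
    if cur is not None and cur[0] == value:
        return  # unchanged rows cost nothing: no rewrite, no duplicate
    con.execute(
        "UPDATE records SET valid_to = ? WHERE key = ? AND valid_to IS NULL",
        (today, key),
    )
    con.execute("INSERT INTO records VALUES (?, ?, ?, NULL)", (key, value, today))

upsert("a", "v1", "2025-11-19")
upsert("a", "v2", "2025-11-20")  # yesterday's row is closed, today's added
print(con.execute("SELECT * FROM records ORDER BY valid_from").fetchall())
```

“As of” queries (`valid_from <= day AND (valid_to IS NULL OR valid_to > day)`) then reconstruct any historical day without rewriting history or duplicating unchanged records.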

submitted by /u/Vivid_Stock5288