Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

In Need Of A Dataset That Has Over 1000 Rows

I'm currently doing a school project that requires a dataset with over 1000 rows that can be downloaded into Google Sheets. I'm on a Mac, so I was wondering if anyone could reply with links to datasets that would also open on this device. Thanks!

submitted by /u/formithica

Need Help To Access The IAM Handwriting Dataset

I need help with the IAM handwriting dataset, as I cannot access it from anywhere; I don't even have an account from which I could download it remotely.

Can anybody please provide a working link to the dataset (Google Drive, MEGA, anything)? If you have ever downloaded it and have it in your drive, could you please share it?

This is the link to the dataset: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database

submitted by /u/Chiragrvijay

Will A Tool Like This Help You In Visualisation?

We are working on Mokkup.ai:

https://www.mokkup.ai/

It's a dashboard wireframing tool that helps you create high-fidelity wireframes in minutes, even if you have no design acumen.

We are targeting data analysts, PMs, developers, HR teams, and other business teams and stakeholders. It's super simple to use, with drag-and-drop elements and 150+ pre-built templates spanning industries and several custom use cases.

Will a tool like this help you create a dashboard that translates your ideas before moving to work with real datasets?

I'd love to hear your reviews and thoughts about what we are creating! This year we have geared up to do some mad business, so your every insight and comment would be incredibly valuable. Thank you!

submitted by /u/Hamburgerleader

Looking For Multivariate Data For Assignment About Microbiology?

Hi everyone! As part of my doctoral training, I am taking a multivariate statistics course. For the exam, we need to complete an assignment in which we analyse a multivariate dataset of our choice using different methods (such as PCA, discriminant analysis, factor analysis, biplots, cluster analysis, ...).

Do you have recommendations for interesting datasets that are available online? It would be great if they related to microbiology (or bacteriophage research), since that is the subject of my doctoral research.
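As a quick illustration of one of the methods mentioned (PCA), here is a minimal pure-Python sketch that recovers the first principal component of a toy two-variable dataset via power iteration; the data and numbers are made up, and in practice you would use a library such as scikit-learn or R's prcomp.

```python
import math
import random

def pca_first_component(rows, iters=200):
    """Estimate the first principal component of a small dataset
    via power iteration on its covariance matrix (pure Python)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration: repeatedly multiply a random vector by cov, normalise
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

random.seed(0)
# Toy data: two strongly correlated "measurements" per sample
data = [[x, 2 * x + random.gauss(0, 0.1)] for x in range(10)]
pc1 = pca_first_component(data)
print(pc1)  # direction close to (1, 2) normalised
```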

Many thanks and happy new year!

submitted by /u/Subject-Extent5978

Is There A Dataset Of Fake/fraudulent/pseudoscientific Illnesses And Medical Conditions?

There's a system that allows users to add their medical conditions from a list. I found that the list contains some non-existent conditions, such as autistic enterocolitis.

I need a list of conditions that have been claimed to exist but are not recognised by mainstream medicine, so that I can write a script to detect the overlap.

Does such a list exist?
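As a sketch of what such an overlap script could look like (assuming you obtain a flagged-conditions list from somewhere), exact matching plus difflib-based fuzzy matching also catches spelling variants; the example lists below are hypothetical:

```python
import difflib

def find_overlap(conditions, flagged, cutoff=0.85):
    """Return conditions that exactly or approximately match an
    entry in a list of non-recognised conditions."""
    flagged_lower = {f.lower() for f in flagged}
    hits = {}
    for cond in conditions:
        key = cond.lower()
        if key in flagged_lower:
            hits[cond] = key  # exact match (case-insensitive)
            continue
        close = difflib.get_close_matches(key, flagged_lower, n=1, cutoff=cutoff)
        if close:
            hits[cond] = close[0]  # near match (spelling variant)
    return hits

# Hypothetical example lists
system_list = ["Asthma", "Autistic Enterocolitis", "Adrenal fatigue", "Diabetes"]
not_recognised = ["autistic enterocolitis", "adrenal fatigue", "leaky gut syndrome"]
print(find_overlap(system_list, not_recognised))
```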

submitted by /u/Defiant-Snow8782

TSA – Time Series Analysis Decomposition

I am studying a daily dataset from a game which contains information about peak player counts from 2013-2023. This is my first attempt at applying decomposition to see how the time series components behave. Later I would like to perform forecasting using some models; I have applied the ADF test, and it revealed the series to be stationary. I'm having trouble determining what value would best fit the period parameter with daily data.

This is the series over the whole time:
https://i.stack.imgur.com/jlsSm.png

Additive model:
https://i.stack.imgur.com/jlsSm.png

Multiplicative model:
https://i.stack.imgur.com/3MAoB.png

Given different types of time series data, such as annual, monthly, and daily, how should the period be chosen?
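As a rough rule, the period should equal the number of observations per seasonal cycle: 7 for daily data with a weekly pattern, roughly 365 for a yearly pattern, 12 for monthly data. Below is a minimal pure-Python additive decomposition sketch (odd periods only) on a synthetic daily series with a weekly pattern; a real analysis would use something like statsmodels' seasonal_decompose instead:

```python
def additive_decompose(series, period):
    """Minimal additive decomposition (odd periods only): centred
    moving-average trend plus period-wise mean seasonal component."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / len(window)
    # Detrended values exist only where the trend is defined
    detrended = [series[i] - trend[i] for i in range(n) if trend[i] is not None]
    offset = half  # detrended[k] corresponds to series[k + offset]
    seasonal_means = []
    for phase in range(period):
        vals = [detrended[k] for k in range(len(detrended))
                if (k + offset) % period == phase]
        seasonal_means.append(sum(vals) / len(vals))
    seasonal = [seasonal_means[i % period] for i in range(n)]
    return trend, seasonal

# Synthetic daily series: weekly (period=7) pattern plus a slow upward trend
series = [i * 0.1 + [5, 3, 1, 0, 2, 8, 9][i % 7] for i in range(70)]
trend, seasonal = additive_decompose(series, period=7)
print(seasonal[:7])
```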

submitted by /u/Dota_curious

Dataset For A Healtcare Triage System

Hey everyone! I had a personal side project idea and am looking for feedback on whether anyone knows of a dataset that could be applied to it:

I'm looking to build an AI model that can tell a user which type of healthcare facility they should go to, depending on their symptoms.

More specifically, I was planning to have the user input information relating to certain factors that would be used as model features, such as:

Age

Gender

Symptoms

Underlying conditions

So that the model would tell the user which type of healthcare facility is most optimal for them out of these options:

Hospital

Pediatrics

Clinic

Pharmacy

Long-Term Care (For older aged people)

Specialized Care (For non-emergency situations that require invasive procedures)

I've begun looking for datasets with this type of information, but haven't found any usable ones so far.

Does anyone know of datasets I could use to train this type of model? Would I have to create my own dataset instead?
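For what it's worth, before any dataset exists, a toy rule-based baseline can help pin down the label space. This is a purely hypothetical sketch (the thresholds and symptom sets are invented), not a trained model:

```python
def triage(age, symptoms, underlying_conditions):
    """Toy rule-based triage sketch mapping patient features to a
    facility type. A real system would learn such rules from data."""
    emergency = {"chest pain", "difficulty breathing", "severe bleeding"}
    minor = {"cough", "headache", "runny nose"}
    if emergency & set(symptoms):
        return "Hospital"
    if age < 18:
        return "Pediatrics"
    if age >= 75 and underlying_conditions:
        return "Long-Term Care"
    if set(symptoms) <= minor and not underlying_conditions:
        return "Pharmacy"
    return "Clinic"

print(triage(8, ["cough"], []))        # Pediatrics
print(triage(30, ["chest pain"], []))  # Hospital
print(triage(25, ["headache"], []))    # Pharmacy
```

A trained classifier would replace these hand-written rules, but the input/output schema (features in, facility label out) stays the same, which helps when evaluating candidate datasets.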

submitted by /u/CharacterAlbatross16

[Part-Synthetic] “Generative AI For Math: Part I — MathPile: A Billion-Token-Scale Pretraining Corpus For Math”

Paper: https://arxiv.org/abs/2312.17120

Datasets: https://huggingface.co/datasets/GAIR/MathPile

Code: https://github.com/GAIR-NLP/MathPile/

Project page: https://gair-nlp.github.io/MathPile/

Abstract:

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of “less is more”, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field.
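As an illustration of what the contamination-detection step can look like in general (this is not the paper's actual implementation), a common approach is to flag training documents that share word n-grams with benchmark test sets:

```python
def ngrams(text, n):
    """Set of word n-grams, a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc, test_doc, n=5, threshold=1):
    """Flag a training document if it shares at least `threshold`
    word n-grams with a benchmark test document."""
    shared = ngrams(train_doc, n) & ngrams(test_doc, n)
    return len(shared) >= threshold

test_q = "what is the derivative of x squared with respect to x"
doc = "exercise: what is the derivative of x squared with respect to x ?"
print(is_contaminated(doc, test_q))  # True: they share 5-grams
```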

submitted by /u/APaperADay

Request: Average Daily Temperature For Each Month Across The World

I am planning a trip around the world, and the goal is to have two distinct periods of warm weather and cool weather. This helps us minimise our clothes and stay comfortable.

My plan was to simply draw a line in QGIS for our route (and also use it to predict flight costs based on kilometres travelled), and then work out which areas are too cool or too hot in a given month.

Looking for:

A GIS vector file with a table of each month's average daily temperature, or a raster file with the same output.

Any help would be great! I had a deep dive this morning but could only find values aggregated yearly; I couldn't seem to find the right sources.
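For reference, monthly mean temperature rasters do exist (e.g., WorldClim's monthly climate layers), which you could sample along your QGIS route. Once you have monthly averages per stop, bucketing them is straightforward; here is a sketch with invented numbers and thresholds:

```python
def classify_months(monthly_avg_c, cool_max=15.0, warm_min=22.0):
    """Bucket each month's average temperature (deg C) into
    'cool', 'warm', or 'mild' for packing decisions."""
    labels = {}
    for month, temp in monthly_avg_c.items():
        if temp <= cool_max:
            labels[month] = "cool"
        elif temp >= warm_min:
            labels[month] = "warm"
        else:
            labels[month] = "mild"
    return labels

# Hypothetical monthly averages for one stop on the route
stop = {"Jan": 5.1, "Apr": 14.2, "Jul": 26.8, "Oct": 17.0}
print(classify_months(stop))
```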

submitted by /u/KICKERMAN360

US Census-designated Place Quality Of Bike / Public Transit Infrastructure

Hi, I'd like to know the quality of bike and/or public transit infrastructure for every place in the US. The census lists about 31k places (towns, villages, cities, etc.), which makes things difficult.

I could try the Walk Score API, but it requires specific lat/long coordinates, and those can be cherry-picked from place to place. Maybe I could request 10 random coordinates within each place?

A simpler route might be this peopleforbikes.org dataset, which ranks 1,500 US cities. Does anyone know of anything better?
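The random-coordinates idea is easy to sketch: sample points uniformly in each place's bounding box and average a per-point score. The score function below is a placeholder, not the actual Walk Score API; also note that bounding-box points may fall outside irregular place boundaries, so you would want to filter by the place polygon in practice:

```python
import random

def sample_points(bbox, n=10, seed=42):
    """Uniformly sample n (lat, lon) points within a bounding box
    given as (min_lat, min_lon, max_lat, max_lon)."""
    rng = random.Random(seed)
    min_lat, min_lon, max_lat, max_lon = bbox
    return [(rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon))
            for _ in range(n)]

def average_score(bbox, score_fn, n=10):
    """Average a per-coordinate score over random points, to avoid
    cherry-picking a single location per place."""
    points = sample_points(bbox, n)
    return sum(score_fn(lat, lon) for lat, lon in points) / n

# Placeholder score function standing in for a real API call
fake_score = lambda lat, lon: 50.0
print(average_score((44.9, -93.3, 45.1, -93.1), fake_score))  # 50.0
```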

submitted by /u/DataScience0

Looking For Dataset For Birds In Any Form Possible

Hello,

I am working on a project for my final thesis. The project is about birds and taking photos of them. I want to give the user the option of categorising their photos by choosing from a list of the most popular birds. The problem is that I couldn't find any dataset with bird species and some pictures of them. I would like at least 50-100 bird species with names and a few pictures each. Does anybody know of any APIs, CSV files, JSONs, or anything else from which I could get data for my project?

Thank you in advance, and happy holidays 🙂
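One possibility: GBIF's public species API covers bird taxonomy, and its occurrence records often link to images. Here is a sketch that only builds a search URL, with no network call; the endpoint and parameters are as I understand GBIF's documented API, so verify before relying on them:

```python
from urllib.parse import urlencode

# GBIF's public species API (no key required); fetching the URL with
# urllib.request or the requests library returns JSON.
BASE = "https://api.gbif.org/v1/species/search"

def species_search_url(query, limit=50):
    """Build a GBIF species-search URL for a free-text query,
    restricted to species-rank results."""
    params = {"q": query, "rank": "SPECIES", "limit": limit}
    return BASE + "?" + urlencode(params)

url = species_search_url("Parus major")
print(url)
```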

submitted by /u/CuddlyBear789

Disappearing Data Moats – Thanks To Synthetic Data

Hey crew – I wanted to share a blog post on the emergence of synthetic data and the implications on many companies’ data moats.

See the full post below and TLDR. For a better reading experience with graphics, images, links, etc. I recommend reading here.

If you’re interested in this content, feel free to subscribe to my weekly newsletter covering similar topics.

TL;DR: The blog post argues that synthetic data is overtaking human-generated data in AI, challenging the traditional value of data moats held by companies. With advancements in Generative AI, synthetic data is becoming more valuable due to its cleanliness and privacy benefits. By 2030, it’s expected to be the primary data source in AI, reducing the reliance on and value of human-generated data. Techniques like GANs and VAEs facilitate this shift, with the market and applications for synthetic data rapidly expanding.

FULL POST – Disappearing data moats

Are companies overvaluing their data moats in our new world of AI? A deeper exploration into synthetic data suggests this may be the case…

Many companies believe they have strategic moats made from consumer data. Data they’ve spent years aggregating, which now appears more valuable in this AI-centric world. The phrase “data is the new oil” has been used to describe data as an asset that differentiates the haves from the have-nots.

In addition to the perceived value of data, we’re seeing significant investment in Generative AI (GenAI) applications. To me, there was an obvious area for value to be extracted from the market – the infrastructure (NVIDIA) and data (Meta, Google, Amazon). Chamath Palihapitiya reaffirmed my conviction multiple times in different venues over the last year.

However, through researching data in generative AI, I discovered an under-discussed trend – synthetic data. This led me to realize that data is NOT the new oil and that these strategic data moats are shrinking. Human-generated data will likely become less valuable, shifting value towards synthetic data.

I'm not alone in this perspective. Gartner predicts that by 2024, 60% of the data used in AI will be synthetically generated. They also estimate that by 2030, synthetic data will overshadow real data in AI models, with nearly all training data likely to be synthetic.

But that’s not all, let’s see what Sam Altman has to say…

In May 2023, Sam was asked whether he was worried about regulatory probes into ChatGPT's potential privacy violations. He brushed it off, saying he was "pretty confident that soon all data will be synthetic data".

In a future where synthetic data is preferred over human-generated data, several key changes will emerge:

– Value Shift: Businesses focused on creating high-quality synthetic data will draw significant value, overshadowing firms that rely on decades of human-generated data.
– Enhanced Competition: Startups will find it easier to challenge established companies, especially in niche sectors like biology, due to lower market entry barriers.
– Privacy Solutions: Synthetic data offers a workaround for the privacy issues plaguing sensitive datasets, like financial and medical records, enabling AI training without risking personal data exposure.
– Reduced Human Data Dependence: The constant need to harvest human-generated data will decrease over time, along with the value from that data.
– … The list goes on…

Synthetic data is poised to surpass human-generated data in both volume and quality, challenging the notion that real data is always superior. Real-world data is often problematic – it’s messy, biased, and fraught with privacy issues. Synthetic data, on the other hand, is cleaner and more controlled.

Now, you may wonder how this is all possible.

Synthetic data generation – 101

Synthetic data generation, with a history spanning decades, primarily found its application in simulations. A notable example is Tesla’s extensive simulation of outlier scenarios for their self-driving vehicles.

The methods for generating synthetic data vary, each suited to particular uses and having its own set of trade-offs. Some of these methods include:

– Generative Adversarial Networks (GANs): Imagine two AI programs in a contest: one creates fake images, and the other tries to detect which are fake. Over time, the creator gets good at making realistic images.
– Variational Autoencoders (VAEs): These are a bit like GANs, but less complex. They learn the pattern of existing data and then use this understanding to make new, similar data. They’re known for being more stable and easier to handle than GANs.
– Simulation-Based Methods: Here, you create a virtual model of a real-world situation (like a weather system). By running simulations in this model, you can generate data that predicts how things might happen in real life. Like Deepmind’s recent breakthrough in weather predictions via Graphcast.
– Agent-Based Modeling: Imagine a video game where each character has rules and behaviors. By watching how these characters interact, you can gather data about similar situations in the real world.

Today’s primary methods for generating synthetic data start with “seed data,” which is originally human-generated. This seed data serves as a base to ensure the synthetic version remains statistically similar to the original.

Experts in synthetic data generation focus on three key quality metrics: fidelity, diversity, and utility.

– Fidelity: This is about how closely the synthetic data matches real data, it should look and act very similar to the real thing.
– Diversity: This ensures the synthetic data includes a wide range of scenarios, not just repeating the most common situations, but incorporating the outliers too.
– Utility: This is about how useful the synthetic data is based on its original purpose. The data should be good enough to help build and test systems effectively, just like real data.
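A crude first pass at the fidelity metric can simply compare marginal statistics of a real and a synthetic column; the sketch below uses invented numbers to illustrate the idea (real fidelity evaluation would also compare distributions and correlations):

```python
import statistics

def fidelity_report(real, synthetic):
    """Compare simple marginal statistics of a real and a synthetic
    numeric column -- a crude first-pass fidelity check."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }

real = [10, 12, 11, 13, 12, 11, 10, 14]
synthetic = [11, 12, 10, 13, 11, 12, 13, 10]
report = fidelity_report(real, synthetic)
print(report)
```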

For more sensitive datasets, balancing fidelity and privacy becomes crucial. Our goal should be to maximize fidelity while preserving privacy.

One method to protect individual privacy in real-world datasets used for synthetic data generation is differential privacy. Differential privacy adds a small amount of random noise to the data, making it hard to identify any one person's information while still maintaining the overall usefulness of the data. A real-world use case we interact with daily is auto-complete for words and emojis on both Apple and Google devices. For optimal results, this method should mainly be used on massive datasets.
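A minimal sketch of the Laplace mechanism, which underlies much differential-privacy work, assuming a counting query (sensitivity 1); the epsilon and count values here are illustrative:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Release a count under the Laplace mechanism: a counting query
    has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(7)
noisy = private_count(1000, epsilon=0.5, rng=rng)
print(noisy)
```

Smaller epsilon means stronger privacy but more noise, which is why the technique works best on large aggregates where the noise is relatively small.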

The state of synthetic data

The current landscape for synthetic data is interesting. There’s a growing market demand for synthetic data, with startups and incumbents aiming to fill that need.

This market can be broadly classified into two categories: structured and unstructured synthetic data generators.
– Structured: Data that can be systematically arranged in tables, such as those found in databases or spreadsheets. Examples include financial transactions and patient records. Its structured nature allows for more straightforward handling and organization.
– Unstructured: Data that doesn’t conform to a standard structure and is inherently more complex. It includes diverse forms such as text, images, videos, and audio. These are vital for applications in areas like speech recognition, autonomous driving, and robotics.

I predict that the real value capture and competition will center around unstructured data. This prediction is based on the use cases derived from unstructured data, most of which will focus on training AI models.

Advancements and challenges

Now that we understand the market structure, let’s explore recent advancements in training AI using synthetic data and the associated challenges.

The adoption of synthetic data is rapidly growing in the field of generative AI, primarily through a concept called “bootstrapping.” Bootstrapping involves training one model using data from another model. A typical example is using GPT-4 to train GPT-3.5 for specific tasks, like LLM evaluation.

"Recent research has shown that training small, efficient language models (SLMs) on high-quality, diverse data can achieve state-of-the-art results, even rivaling or surpassing LLMs 5x their size, such as Llama2-7b and Falcon-7b, on common tasks, as demonstrated by models like Microsoft's "phi-1.5" (from their paper "Textbooks Are All You Need"), Orca2, and IBM's Granite."

These small language models (SLMs) are paving the way for generating high-quality models using synthetic data, and this approach has the potential to scale to much larger models. Recent successes include Microsoft’s Phi-2 and Google’s ReST^EM.

Success in this field also brings its share of challenges, particularly within the realm of synthetic data. One crucial aspect is ensuring that synthetic data faithfully replicates real-world conditions. Failure to capture these complexities can lead to poor model performance in practical scenarios, which becomes challenging for complex data, like images.

Another significant concern voiced by skeptics of synthetic data is what’s known as “mode collapse.” This issue frequently arises when employing the GAN method mentioned earlier. Mode collapse occurs when an AI, originally designed to generate a wide range of diverse items, ends up repetitively producing a limited set of items instead of maintaining diversity. It’s like a chef who only cooks a handful of dishes despite knowing a vast cookbook, thus earning the term “mode collapse” as the AI converges into a single mode.
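Mode collapse can be screened for with a crude diversity check: compare the distinct outputs a generator produces against the modes you expect it to cover. The sketch below uses invented categorical samples:

```python
def mode_coverage(samples, expected_modes):
    """Fraction of expected modes that appear among generator
    samples -- a crude mode-collapse check."""
    seen = set(samples)
    return len(seen & set(expected_modes)) / len(expected_modes)

expected = ["cat", "dog", "bird", "fish"]
healthy = ["cat", "dog", "bird", "fish", "cat", "dog"]
collapsed = ["cat", "cat", "cat", "cat", "cat", "cat"]
print(mode_coverage(healthy, expected))    # 1.0
print(mode_coverage(collapsed, expected))  # 0.25
```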

Luckily, there are a variety of ways to fix most challenges with synthetic data. The key lies in ensuring the diversity of your data and continually validating it. Additionally, incorporating updated original data into the data generator on an ongoing basis helps maintain the high fidelity of synthetic data.

Throughout this post, I’ve argued that strategic data moats are diminishing while acknowledging the ongoing importance of human-generated “seed data.” I understand this might seem contradictory. Currently, human-generated data plays a crucial role in training AI models, but its significance is expected to diminish over time. Let me provide you with two recent research findings that further support this trend, in case you aren’t already convinced.

First, there’s MimicGen, which has demonstrated its ability to create high-quality synthetic data from small samples of real-world data. They successfully scaled up from 200 human demos to 50k synthetic demos. This research underscores the diminishing need for human-generated data.

Second, there’s the concept of “no shot” or “zero-shot” synthetic data generation, where data can be generated without any initial real-world data. Rendered AI has hinted at its success with this approach on multiple occasions (see here and here).

In the end, if we want powerful AI incorporated into all aspects of our lives, then synthetic data is critical. The quantity and quality of our real-world data are not enough.

submitted by /u/Dtdavis70