Global Advertising Data From 2013-2023

Anyone have any data sets showing global advertising data from the last decade. Thank you.

submitted by /u/Otherwise_Alarm9482
[link] [comments]

US Voting History By Census-Designated Place Or County

I’d like to see 2020 presidential election results by census-designated “place” preferably, but county-level would work too. Anyone know of a good dataset online?

NYT has a good map/dataset but even they’re missing data in certain states.

submitted by /u/DataScience0
[link] [comments]

0

Where Can I Find A Large Dataset Of Resumes In English For My Project?

I’m working on a machine learning project and need a substantial dataset of resumes in English. Any recommendations on where I can find one? Appreciate your insights!

submitted by /u/Ok-Assistance3681
[link] [comments]

0

Disappearing Data Moats – Thanks To Synthetic Data

Hey crew – I wanted to share a blog post on the emergence of synthetic data and the implications on many companies’ data moats.

See the full post below and TLDR. For a better reading experience with graphics, images, links, etc. I recommend reading here.

If you’re interested in this content, feel free to subscribe to my weekly newsletter covering similar topics.

TL;DR: The blog post argues that synthetic data is overtaking human-generated data in AI, challenging the traditional value of data moats held by companies. With advancements in Generative AI, synthetic data is becoming more valuable due to its cleanliness and privacy benefits. By 2030, it’s expected to be the primary data source in AI, reducing the reliance on and value of human-generated data. Techniques like GANs and VAEs facilitate this shift, with the market and applications for synthetic data rapidly expanding.

—

FULL POST – Disappearing data moats

—

Are companies overvaluing their data moats in our new world of AI? A deeper exploration into synthetic data suggests this may be the case…

Many companies believe they have strategic moats made from consumer data. Data they’ve spent years aggregating, which now appears more valuable in this AI-centric world. The phrase “data is the new oil” has been used to describe data as an asset that differentiates the haves from the have-nots.

In addition to the perceived value of data, we’re seeing significant investment in Generative AI (GenAI) applications. To me, there was an obvious area for value to be extracted from the market – the infrastructure (NVIDIA) and data (Meta, Google, Amazon). Chamath Palihapitiya reaffirmed my conviction multiple times in different venues over the last year.

However, through researching data in generative AI, I discovered an under-discussed trend – synthetic data. This led me to realize that data is NOT the new oil and that these strategic data moats are shrinking. Human-generated data will likely become less valuable, shifting value towards synthetic data.

I’m not alone in this perspective. Gartner predicts that by 2024, 60% of the data used in AI will be synthetically generated. They also estimate that by 2030, synthetic data will overshadow real data in AI models, meaning nearly all data used to train AI by 2030 will likely be synthetic.

But that’s not all, let’s see what Sam Altman has to say…

“In May of 2023, Sam was asked whether he was worried about regulatory probes into ChatGPT’s potential privacy violations. Sam brushed it off, saying he was “pretty confident that soon all data will be synthetic data”.”

In a future where synthetic data is preferred over human-generated data, several key changes will emerge:

– Value Shift: Businesses focused on creating high-quality synthetic data will draw significant value, overshadowing firms that rely on decades of human-generated data.
– Enhanced Competition: Startups will find it easier to challenge established companies, especially in niche sectors like biology, due to lower market entry barriers.
– Privacy Solutions: Synthetic data offers a workaround for the privacy issues plaguing sensitive datasets, like financial and medical records, enabling AI training without risking personal data exposure.
– Reduced Human Data Dependence: The constant need to harvest human-generated data will decrease over time, along with the value from that data.
– … The list goes on…

Synthetic data is poised to surpass human-generated data in both volume and quality, challenging the notion that real data is always superior. Real-world data is often problematic – it’s messy, biased, and fraught with privacy issues. Synthetic data, on the other hand, is cleaner and more controlled.

Now, you may wonder how this is all possible.

Synthetic data generation – 101

Synthetic data generation, with a history spanning decades, primarily found its application in simulations. A notable example is Tesla’s extensive simulation of outlier scenarios for their self-driving vehicles.

The methods for generating synthetic data vary, each suited to particular uses and having its own set of trade-offs. Some of these methods include:

– Generative Adversarial Networks (GANs): Imagine two AI programs in a contest: one creates fake images, and the other tries to detect which are fake. Over time, the creator gets good at making realistic images.
– Variational Autoencoders (VAEs): These are a bit like GANs, but less complex. They learn the pattern of existing data and then use this understanding to make new, similar data. They’re known for being more stable and easier to handle than GANs.
– Simulation-Based Methods: Here, you create a virtual model of a real-world situation (like a weather system). By running simulations in this model, you can generate data that predicts how things might happen in real life. Like Deepmind’s recent breakthrough in weather predictions via Graphcast.
– Agent-Based Modeling: Imagine a video game where each character has rules and behaviors. By watching how these characters interact, you can gather data about similar situations in the real world.

Today’s primary methods for generating synthetic data start with “seed data,” which is originally human-generated. This seed data serves as a base to ensure the synthetic version remains statistically similar to the original.

Experts in synthetic data generation focus on three key quality metrics: fidelity, diversity, and utility.

– Fidelity: This is about how closely the synthetic data matches real data, it should look and act very similar to the real thing.
– Diversity: This ensures the synthetic data includes a wide range of scenarios, not just repeating the most common situations, but incorporating the outliers too.
– Utility: This is about how useful the synthetic data is based on its original purpose. The data should be good enough to help build and test systems effectively, just like real data.

For more sensitive datasets, balancing fidelity and privacy becomes crucial. Our goal should be to maximize fidelity while preserving privacy.

One method to protect individual privacy in real-world datasets used for synthetic data is differential privacy. Differential privacy adds a small amount of random noise to the data, making it hard to identify any one person’s information, while still maintaining the overall usefulness of the data. A real-world use case we interact with daily would be auto-complete for words and emojis within both Apple and Google devices. For optimal results, this method should mainly be used on massive datasets.

The state of synthetic data

The current landscape for synthetic data is interesting. There’s a growing market demand for synthetic data, with startups and incumbents aiming to fill that need.

This market can be broadly classified into two categories: structured and unstructured synthetic data generators.
– Structured: Data that can be systematically arranged in tables, such as those found in databases or spreadsheets. Examples include financial transactions and patient records. Its structured nature allows for more straightforward handling and organization.
– Unstructured: Data that doesn’t conform to a standard structure and is inherently more complex. It includes diverse forms such as text, images, videos, and audio. These are vital for applications in areas like speech recognition, autonomous driving, and robotics.

I predict that the real value capture and competition will center around unstructured data. This prediction is based on the use cases derived from unstructured data, most of which will focus on training AI models.

Advancements and challenges

Now that we understand the market structure, let’s explore recent advancements in training AI using synthetic data and the associated challenges.

The adoption of synthetic data is rapidly growing in the field of generative AI, primarily through a concept called “bootstrapping.” Bootstrapping involves training one model using data from another model. A typical example is using GPT-4 to train GPT-3.5 for specific tasks, like LLM evaluation.

“Recent research has shown that training small, efficient language models (SLMs) on high-quality, diverse data can achieve state-of-the-art results- even rivaling or surpassing LLMs 5x the size such as Llama2-7b and Falcon-7b on common tasks, as demonstrated by models like Microsoft’s “phi-1.5” (from their paper “Textbooks Are All You Need”), Orca2, and IBM’s Granite. “

These small language models (SLMs) are paving the way for generating high-quality models using synthetic data, and this approach has the potential to scale to much larger models. Recent successes include Microsoft’s Phi-2 and Google’s ReST^EM.

Success in this field also brings its share of challenges, particularly within the realm of synthetic data. One crucial aspect is ensuring that synthetic data faithfully replicates real-world conditions. Failure to capture these complexities can lead to poor model performance in practical scenarios, which becomes challenging for complex data, like images.

Another significant concern voiced by skeptics of synthetic data is what’s known as “mode collapse.” This issue frequently arises when employing the GAN method mentioned earlier. Mode collapse occurs when an AI, originally designed to generate a wide range of diverse items, ends up repetitively producing a limited set of items instead of maintaining diversity. It’s like a chef who only cooks a handful of dishes despite knowing a vast cookbook, thus earning the term “mode collapse” as the AI converges into a single mode.

Luckily, there are a variety of ways to fix most challenges with synthetic data. The key lies in ensuring the diversity of your data and continually validating it. Additionally, incorporating updated original data into the data generator on an ongoing basis helps maintain the high fidelity of synthetic data.

Throughout this post, I’ve argued that strategic data moats are diminishing while acknowledging the ongoing importance of human-generated “seed data.” I understand this might seem contradictory. Currently, human-generated data plays a crucial role in training AI models, but its significance is expected to diminish over time. Let me provide you with two recent research findings that further support this trend, in case you aren’t already convinced.

First, there’s MimicGen, which has demonstrated its ability to create high-quality synthetic data from small samples of real-world data. They successfully scaled up from 200 human demos to 50k synthetic demos. This research underscores the diminishing need for human-generated data.

Second, there’s the concept of “no shot” or “zero-shot” synthetic data generation, where data can be generated without any initial real-world data. Rendered AI has hinted at its success with this approach on multiple occasions (see here and here).

In the end, if we want powerful AI incorporated into all aspects of our lives, then synthetic data is critical. The quantity and quality of our real world is not enough.

submitted by /u/Dtdavis70
[link] [comments]

0

Question About Complex Sampling Designs

Hello all. I am working with a large CDC survey combining multiple years of data and it is required that I use complex sampling procedures to analyze the data. Since this is a national survey and I’m analyzing multiple years combined, the sample size is quite large when raw and even larger when weighted (obviously!). I’m worried about being overpowered when I apply weights, however weighting it is required per CDC for more accurate interpretation of the findings and complex sampling procedures in SPSS require weighting to be input into the plan file. My question after all of this is 1) if anyone has general advice on what I described and 2) if weighting is always required when I am analyzing data that uses complex sampling designs? Thank you!!!

submitted by /u/PharmaNerd1921
[link] [comments]

0

Great Databases In Bulk Hope You Enjoy It

princeton_university_datasets.zip

crunchbase_organizations.zip

crunchbase_people.zip

All HERE 👉 Download Page

submitted by /u/DataExpx
[link] [comments]

0

Looking For Data On Hospital Equipment Usage

Did a quick search on this sub and I can see that hospital data is frequently requested and can be tricky to access. But as I understand this is mostly the case with patient information and the like. I’m looking for things like Operating room, radiology, x-ray usage rates.

I’ve looked around without much success so any help would be great. Thank you.

submitted by /u/TheShrlmp
[link] [comments]

0

Looking For A Dataset Breaking Down The Details For The Happiest People In The World

The World Happiness Report 2017‘s Figure 2.1 charts population-weighted distributions of happiness for various world regions, where ‘happiness’ is self-reported happiness on a 0-10 scale.

In every world region there are respondents who report 10/10; I’ve always been interested in these people (who are they? What can I learn about them regarding the other factors examined in the World Happiness Reports, from GDP per capita to healthy life expectancy to social support to generosity to perceptions of corruption to freedom to make life choices? What can I calculate? etc)

Unfortunately the accompanying dataset linked for Figure 2.1 doesn’t break down the data more granularly, it only reports the summarized chart values. Do any of you know of a more granular breakdown? The precise year (2017) and charts (Figure 2.1) don’t really matter to me; I really just want to see the data for the other factors corresponding to these self-reported happiness = 10/10 people. Thanks 🙂

submitted by /u/MoNastri
[link] [comments]

0

Building Your Sausage Machine For Data Products 🌭: Less Tech, More Strategy

submitted by /u/growth_man
[link] [comments]

0

Looking For A Free Use Disease Data Set With Medical And Lifestyle Features

Hi all,

I am looking for a free use dataset with an outcome variable of e.g has heart desiease, diabetes, stroke etc. I would like the dataset to have as many features as possible, including medical results, such as blood preasuere (things only a docotor could measure) as well as lifestyle factor features, like exercise, smoking etc. (things anyone could measure). Unfortunatly most datasets only seem to have medical, or life style and not both. I would hope to have around 10+ medical and 20+ lifestyle. Anyone know of any datasets?

many thanks,

submitted by /u/Josh_Bonham
[link] [comments]

0

Seeking Guidance On Extracting And Analyzing Subreddit/Post Comments Using ChatGPT-4?

Hello! While I have basic programming knowledge and a fair understanding of how it works, I wouldn’t call myself an expert. However, I am quite tech-savvy.

For research, I’m interested in downloading all the comments from a specific Subreddit or Post and then analyzing them using ChatGPT-4. I realize that there are likely some challenges in both collecting and storing the comments, as well as limitations in ChatGPT-4’s ability to analyze large datasets.

If someone could guide me through the process of achieving this, I would be extremely grateful. I am even willing to offer payment via PayPal for the assistance. Thank you!

submitted by /u/JackJackCreates
[link] [comments]

0

IAM Handwriting Database Internal Server Error

The Iam Handwriting Database demanding registeration for the download, when i try to register to the site it gives İnternal Server error. Are there anyone facing the same issue or is it my internet connection ?

submitted by /u/IdealRich2750
[link] [comments]

0

Nutritionist Recommended Food (fruit/vegetables) Dataset

Hi All, I am looking for dietician recommended food dataset for thyroid patients. Any idea on where I can find this dataset?

submitted by /u/Surek95
[link] [comments]

0

Require A Dataset For My Current Project

I am currently working on an agriculture based chatbot. So could some one of you please provide good sources of datasets about crops, climatic conditions for crops, plant diseases and preferred cure, land based crop cultivations etc.

submitted by /u/Sreehari_J_Nair
[link] [comments]

0

Anyone Have A Manual To Use MarketScan On Redivis?

There are almost no helpful videos/ guides that help me code in MarketScan, and I am getting extremely frustrated. Can someone please point me to a good resource?

submitted by /u/phymathnerd
[link] [comments]

0

I Have An Issue With Importing A Dataset From Kaggle. I Am A Novice And Want Tips To Learn ML Through AWS Sagemaker.

I am a novice on ML and want tips to where should I upload an image dataset. There some datasets of medical images named as ODIR-5K on Kaggle, and I can’t use Kaggle API to work with AWS Sagemaker notebooks. I tried on Google Collaboratory but they just work fine there instead, but for the sake of my own wallet, I prefer to use Sagemaker on a free tier. Is there any way to import a dataset from Kaggle without issues on a Jupyter Notebook / AWS SageMaker Notebook? Or is it best to change the place I store this dataset?

submitted by /u/MemeH4rd
[link] [comments]

0

What Is The Difference Between Apache Airflow And Apache NiFi

Are you confused between Apache Airflow and Apache NiFi? 🤔 Both are popular open-source data integration tools, but they serve different purposes. 🤷‍♂️
✅ Apache Airflow: is a platform for programmatically defining, scheduling, and monitoring workflows. It’s great for data engineering tasks, like ETL, data warehousing, and data processing. 📊
✅ Apache NiFi: is a data integration tool for real-time data processing and event-driven architecture. It’s designed for stream processing, data routing, and data transformation. 🌊
If you want to learn more about the differences between Apache Airflow and Apache NiFi, check out this article. 📄
In this article, you’ll get a detailed comparison of the two tools, including their features, use cases, and architecture. 🏗️
https://devblogit.com/what-is-the-difference-between-a-data-lake-and-a-delta-lake/
#ApacheAirflow #ApacheNiFi #DataIntegration #DataEngineering #ETL #DataWarehousing #DataProcessing #StreamProcessing #EventDrivenArchitecture #DataScience #DataEngineer #ITPro

submitted by /u/Bubbly_Bed_4478
[link] [comments]

0

IMDB Dataset – How Do I Get Film Posters?

I’m developing a film recommendation system using the IMDB datasets, using around 350,000 films after pre-processing. Does IMDB offer a way to access the relevant film poster for each items in its dataset, or does anyone know a different source or method to import these?

Any help would be appreciated

submitted by /u/wobowizard
[link] [comments]

0

Need Help With Physionet Databases…

Hey there!
I am a freshman currently working on an independent project that requires data from MIMIC III however I do not have physionet credentials and I literally have no one who can refer me in. Is there any other way to get access to the database? If you could refer me, I can provide you with a brief description of what I am building.

submitted by /u/Global_Landscape1119
[link] [comments]

0

Lung Cancer Dataset For AI Detection

I need to do a CNN that can detect cancer on x-ray images. Do you know any available (image or data) dataset on the internet that can help me ?

submitted by /u/Nearby_Plant_7949
[link] [comments]

0

Need Data For Conversation Between Agent And Customer

I need this data in context of late credit card payments. If you know any data source for other context then do mention that as well. The idea is to fine tune an LLM to assist the agent in future

submitted by /u/Evermore2307
[link] [comments]

0

I Want The Datasets Used In Tableau E-learning Platform. Where Can I Find Them?

I am looking specifically for the Active duty marital status navy dataset for now, but will need the other datasets for tableau Prep as well as Tableau later. Where can i find the dataset?

The link in data.gov is not working. Please help me find the dataset or a similar dataset to practice.

submitted by /u/thecrazytughlaq
[link] [comments]

0

Azure Synapse Analytics: A Step-by-Step Guide

submitted by /u/Bubbly_Bed_4478
[link] [comments]

0

Help With MIMIC 4 Extraction. I’m A Dummy

Hey. I need some help extracting data from MIMIC 4. I am bad at SQL and Google Big Query confuses me. If anyone has experience with this I would be happy to pay you for your help. Please PM me.

submitted by /u/sterigenic
[link] [comments]

0

How To Get A GDP Breakdown For Sub-industries?

Hi guys,

I need for a project to get the data of the GDP of countries by sub-industries and the best would be to have it breakdown using the Global Industry Classification Standard (or an other advanced standard that shows sub-industries).

I wasn’t able to found data that was that much precise (most GDP by sectors or some big sectors by not going into industries & sub). So maybe the data needed is on a special website that I don’t know or is hardly accessible by a simple Google search.

Thanks for any response / upvote / help.

submitted by /u/Haunting_Taste6349
[link] [comments]

0

Looking For US & UK Bill Of Lading Shippment Data

For my project i am looking for US & UK Bill of lading shippment dataset, i tried in data.gov search data set i couldn’t find out and i did same for uk search in uktrade.info but no luck. Prefering to download directly from their gov public webiste. Appreciate if you anybody advise exacts url’s please

submitted by /u/efor007
[link] [comments]

0

Data Set On AI Adoption In Consultants

Looking for a data set on AI adoption and usage by consultants by work / job task type and by gender for a research project

submitted by /u/Aggravating-Grape-22
[link] [comments]

0

How The List Of Most Polluted Cities Changed In Last 50 Years. There Seems To Be Lack Of Data/plots On This Question.

submitted by /u/albatgalbat
[link] [comments]

0

Dataset For Customer Churn Due To Bad Customer Service/support In Ecommerce Or Retail

I am doing a research on AI/ML on ecommerce industry. I would appreciate some help to find a dataset to prove that there is customer churn due to bad customer service and support in the commerce and retail industry.

submitted by /u/azl023a
[link] [comments]

0

Is There A Longitudinal Dataset On US Newspaper Ownership Such That I Can Track Changes In The Ownership Of Any Given US Newspaper/daily Over A Period Of Time?

I want to look at how change in ownership affects the type of information conveyed by a newspaper, especially in cases where the acquirer may have a vested commercial motive. For example, there has been a significant uptick in the number of US newspapers acquired by private equity players. I’d like to see if such acquisitions affect the choice and delivery of content that may have direct commercial implications for the private equity owner.

submitted by /u/Charming-Incident600
[link] [comments]

0

Category: Datatards