Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Textraction.ai Released! AI Text Parsing API

It lets you extract custom user-defined entities from free text. Very exciting!
It can extract exact values (e.g. names, prices, dates), as well as provide ChatGPT-like semantic answers (e.g. a text summary).
I like the interactive demo on their website (https://www.textraction.ai/); it let me try my own texts and entities within minutes. It works great 🙂
The service is also accessible as an API for any purpose via the RapidAPI platform: https://rapidapi.com/textractionai/api/ai-textraction (sign up to RapidAPI and get your own token)
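As a rough sketch, a call through RapidAPI usually looks like the following. The endpoint host, header names, and payload fields below follow the common RapidAPI pattern but are assumptions; check the API page on RapidAPI for the exact contract.

```python
import json

# Assumed endpoint following the usual {api}.p.rapidapi.com convention.
url = "https://ai-textraction.p.rapidapi.com/textraction"

headers = {
    "Content-Type": "application/json",
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_TOKEN",  # from your RapidAPI account
    "X-RapidAPI-Host": "ai-textraction.p.rapidapi.com",
}

# Free text plus the custom entities you want extracted from it.
# Field names here ("var_name", "description") are illustrative.
payload = {
    "text": "Order #1042 shipped to Jane Doe on 2023-05-01 for $49.99.",
    "entities": [
        {"var_name": "customer", "description": "name of the customer"},
        {"var_name": "price", "description": "total price paid"},
    ],
}

body = json.dumps(payload)
# A POST such as requests.post(url, data=body, headers=headers)
# would return the extracted values as JSON.
print(body[:40])
```

The same request can be issued from any language RapidAPI supports; only the token and host header change per account.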

submitted by /u/DoorDesigner7589
[link] [comments]

Datalab: Automatically Detect Common Real-World Issues In Your Datasets

Hello Redditors!

I’m excited to share Datalab โ€” a linter for datasets.

I recently published a blog post introducing Datalab and an open-source Python implementation that is easy to use for all data types (image, text, tabular, audio, etc.). For data scientists, I've made a quick Jupyter tutorial to run Datalab on your own data.

All of us who have dealt with real-world data know it's full of various issues like label errors, outliers, (near) duplicates, drift, etc. One line of open-source code, datalab.find_issues(), automatically detects all of these issues.

In Software 2.0, data is the new code, models are the new compiler, and manually-defined data validation is the new unit test. Datalab combines any ML model with novel data quality algorithms to provide a linter for this Software 2.0 stack that automatically analyzes a dataset for "bugs". Unlike data validation, which runs checks that you manually define via domain knowledge, Datalab adaptively checks for the issues that most commonly occur in real-world ML datasets without you having to specify their potential form. Whereas traditional dataset checks are based on simple statistics/histograms, Datalab's checks consider all the pertinent information learned by your trained ML model.
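To make the contrast concrete, here is a hypothetical hand-written validation check of the kind described above (toy data and field names invented for illustration); note that it only catches the issues you thought to write checks for:

```python
# Toy dataset with two problems a simple validator won't catch:
# a near-duplicate and a probable label error.
records = [
    {"text": "great product", "label": "positive"},
    {"text": "great product!", "label": "positive"},               # near-duplicate
    {"text": "terrible, broke after a day", "label": "positive"},  # likely mislabeled
]

# Manual validation ("the new unit test"): each check is written by hand
# from domain knowledge, so it only finds issues you anticipated.
allowed_labels = {"positive", "negative"}
assert all(r["label"] in allowed_labels for r in records)

# Exact-duplicate check: misses the near-duplicate above.
texts = [r["text"] for r in records]
exact_duplicates = len(texts) - len(set(texts))
print(exact_duplicates)  # 0 -- the hand-written checks find nothing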

Hope Datalab helps you automatically check your dataset for issues that may negatively impact subsequent modeling — it’s so easy to use you have no excuse not to ๐Ÿ˜›

Let me know your thoughts!

submitted by /u/jonas__m
[link] [comments]

Slct.ai – A Simple To Use, AI Tool To Get Any Data With Just One Url.

My friend and I are working on a small project to make it possible to request any data with one url easily. It’s best used for reference data, test data, education/teaching contexts and training data.

Here is an example of how you can use it to get any data in pandas:

url = “https://slct.ai/us_states_and_populations.csv” df = pd.read_csv(url)

Would love to get the community’s point of view on how they would like to see this evolve and general feedback.

submitted by /u/Upstairs-Security-66
[link] [comments]

Dataset Of EEG Recording During A Passive Viewing Video Task

Hi!

I am looking for datasets in which EEG recordings are made while participants are watching videos. I am not interested in specific videos, however I need a dataset in which both EEG recordings and videos employed in the experimental setup are provided. I would like to run some analysis correlating the visual properties of each frame (e.g. brightness) with EEG signals.

I have fond the SEEDs dataset, however the original video are not provided. Does anyone know of any dataset that provides both EEG recordings and videos?

submitted by /u/stephdaedalus
[link] [comments]

Market Research For Big Data – All Suggestions Welcome!

Hey Everyone,

This is a bit of an odd request, but I am looking for help from anyone who works with big data, now or in the past.

I am specifically trying to learn about buying and selling of user data – where is it bought, how much does it cost, who sells it, what is the process like, is it just a giant csv file? etc

Any help is very much appreciated!

submitted by /u/Crumbedsausage
[link] [comments]

Reliable Datasets For Tourism Arrivels Per Country?

I am looking for statistics about tourist arrivals in different countries. I found WorldData.info, World Bank, and Statista, but I am not sure if these sources are reliable and the numbers are accurate. It seems that the data on these websites is inconsistent because they confuse the terms โ€œtouristsโ€ (= people traveling for leisure) and โ€œvisitorsโ€ (= also including people traveling for business). Can anyone help me find a reliable and extensive dataset for tourism arrivals only?

submitted by /u/Edc312
[link] [comments]

Looking For Website Where “regular” People Upload Pdfs Publicaly Available

Hello ๐Ÿค—

I want to build a dataset of manipulated documents with the original document and the modified version because I work on a model to localize those forgeries ๐Ÿง The available public datasets that exist are not sufficient but I believe it is possible to build one without resourting to synthetic datasets. In the french gazette website, organizations and funds are required to upload their financial reports every year and they are publicly available. If they make a mistake, the wrong document is left on the website for a while and a rectified document has to be uploaded. Now if the two versions match everywhere pixel to pixel except for a tiny portion, the it has only been modified digitaly and not rescanned. I have been able to find a few pair of documents like that be no nearly enough to train a model. Do you know any websites that work the same way? Where people upload pdfs and these pdfs are sometimes rectified and both versions are still online? Preferably free form pdf and not a specific form like the US gazette.

Thank you for your help!

submitted by /u/VegetableMistake5007
[link] [comments]

Inexpensive Demographic Interests/hobbies Dataset?

I’m looking for a data set that links demographic background of a person (e.g. age, gender, education, etc.) to a list of personal interests like hobbies or buying habits (e.g. pets, sports, cars, etc.). The dataset could explain consumer behavior for e.g. marketing analysis or targeted advertising.

โ€‹

Is there such a dataset that is “inexpesive” (e.g. 1000 USD one time purchase) or ideally free?

โ€‹

The ones I found turned out to be very costly yearly subscriptions.

Thanks a lot for any recommendations and insights!

submitted by /u/Immediate-Albatross9
[link] [comments]

Where To Find Census Tract Racial Datasets?

Hey,

I’ve been mapping out Potentially Underserved Communities in North Carolina over the past few weeks, and have a time-series animation from 2010-2023 at the Census Tract Level, but my professors are wanting me to go further back with the data. It seems like the first American Community Survey 5-Year Estimates came out in 2010, so I think I’ll have to use Decennial Census Data, but was having trouble locating anything prior to 2010 on there website. Any tips?

submitted by /u/Riley_L27
[link] [comments]

[self-promotion] All TV Series Details Dataset From TheMovieDB

Hello /r/Datasets,

I present you a dataset including all the details of all the series (+155k) available on The Movie Database. The dataset is available on Kaggle.

Generation

This dataset was generated in ~10 hours by fetching each ID from The Movie Database API (+225k IDs).

You can generate the same dataset using my NodeJS application available in open-source on GitHub.

Missing data

Some data are missing on some series because The Movie Database API does not provide them. It happens on old TV series not very well known.

Including

id name original name overview tagline in production ? status original language origin country created by first air date last air date number of episodes number of seasons production companies poster path genres vote average vote count popularity

โ€‹

I hope I got this post right, I wasn’t sure how to go about it. I also hope this dataset can be useful to you!

submitted by /u/kodle
[link] [comments]

New Destinations For Mockingbird – FOSS Mock Data Stream Generator

When we launched Mockingbird a few weeks ago, the idea was to make it super simple to generate mock data from a schema that you could stream to any destination. When we launched it, you could send mock data streams to Tinybird and Upstash Kafka.

Now, we’ve added support for Ably, AWS SNS, and Confluent.

You can check out the UI here: https://tbrd.co/mock-rd and it’s also available as a CLI with npm install @tinybirdco/mockingbird-cli

Hope this helps when you can’t find the dataset you need!

submitted by /u/tinybirdco
[link] [comments]

How To Store 175 Million Rows And Query Them

Hey! I have many Json files that equate to 175 million pieces of data, I’m unsure how to store them in a database, I’ll need to first create one big json file or loop through each or the files and move the data from them into a database.

I’ll need to do querys against the whole dataset multiples times a day so the quicker the better.

I’ve already experimented with mongodb but I just can’t see past the way querys are written.

Any ideas?

submitted by /u/ScottishVigilante
[link] [comments]

Looking For Dataset For LLM Tokenization: Need Around 1GB Multi-lingual + Code

I’ve been working on a tokenizer that determines the best possible tokens to represent the test dataset in the least number of tokens for various different vocabulary sizes.

It works well but I’ve been testing with The Pile test data, but it’s mostly English so it’s a not good representation for multi-lingual. It also lacks a fair amount of code and tags.

I need around 1-2GB raw text uncleaned and uncensored, that represents a few different languages and a fair amount of code from different programming languages. Better to be raw, and include data both with HTML tags as it would be when scraped, and also without HTML tags (as it would prioritize the HTML tags too heavily if they were always present).

So just a good representation of general text.

I know I could build my own dataset from various different ones, but it seems to me that a dataset like this should already exist. Any leads would be helpful. Thank you.

submitted by /u/Pan000
[link] [comments]

Looking For US Data Set Of Corporate Employee Titles And Job Level

Hi. I’m looking for a data set of employee titles at US companies. Not specific companies and not specific people just generalities. Currently have a record set of a million plus job titles of all varieties from cashier to CEO… However looking to pair these up with job level. Wondering if anything like this exists anywhere. I’ve tried the SAMs database but that’s even messier than the records that I have.

Thanks

submitted by /u/cdtoad
[link] [comments]

[self-promotion] Hosted Embedding Marketplace โ€“ Stop Scraping Every New Data Source, Load It As Embeddings On The Fly For Your Large Language Models

We are building a hosted embedding marketplace for builders to augment their leaner open-source LLMs with relevant context. This lets you avoid all the infra for finding, cleaning, and indexing public and third-party datasets, while maintaining the accuracy that comes with larger LLMs.

Will be opening up early access soon, if you have any questions be sure to reach out and ask!

Learn more here

submitted by /u/achyutjoshi
[link] [comments]

Data On Humanitarian Aid In Form Of Aid-Data And ODA-Data

Hey guys, does someone maybe know how I can get a global ODA or Aid-Data dataset. I am curently working on my Bachelors thesis and therefore I would need these Datasets in a form that I could use with Stata.

I tried downloading the ODA-Data and import it to stata but for some reason I didn’t have any obervations. I would need Data which would contain information on the amount of aid delivered to countries in a given year for an analysis of the impact of humanitarian aid on conflict dynamics.

If someone has a tip or could help me, that would be realy nice.

submitted by /u/HighVoltageplay
[link] [comments]