Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Dataset Suggestions For These Requirements

Hey guys. I am currently starting work on my university project for the Fundamentals of Artificial Intelligence class. I would really appreciate it if you could suggest datasets that meet these requirements:

“Select a dataset that is suitable for a classification task. The student must avoid selecting the Iris dataset or the Palmer Archipelago (Antarctica) penguin dataset. In addition, the meaningfulness of the classification has to be considered, e.g. it is meaningless to classify continents by the number of Covid-19 cases because, first, there are only six continents and new ones will not appear soon, second, the number of Covid-19 cases is not a defining characteristic of continents;

• it is preferable to select a dataset that is already given in the format of a .csv datafile;
• the dataset should be well-documented (there should be information about who created the set, when and what the data source is);
• the dataset should be of reasonable size (at least 200 data objects);
• the dataset should be deeply annotated (there should be information about which features are stored and what they mean);
• the number of features should be between 5-15;
• the dataset should be labelled;
• the student must avoid datasets with many Boolean (true/false, 1/0, etc.) or categorical type feature (attribute) values. It is preferable to use datasets in which most of the attributes are represented by continuous attribute values;
• you should avoid datasets of unlabelled data (e.g. text corpora and raw images)”
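
For whatever dataset ends up suggested, I plan to sanity-check it against these requirements with a short script. A minimal sketch (the file name and label column are placeholders, not part of the assignment):

```python
import pandas as pd

# Hypothetical candidate dataset: replace the path and the label column with your own.
df = pd.read_csv("candidate_dataset.csv")
label_col = "label"

n_rows, n_cols = df.shape
n_features = n_cols - 1  # everything except the label
print(f"rows: {n_rows} (need at least 200)")
print(f"features: {n_features} (need 5-15)")

# Mostly continuous attributes are preferred: count numeric vs. non-numeric feature columns.
features = df.drop(columns=[label_col])
numeric = features.select_dtypes(include="number").columns
print(f"numeric features: {len(numeric)} of {n_features}")
print(f"label classes: {df[label_col].nunique()}")
```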

submitted by /u/kktsrvii

Looking For Java Exception/Error Datasets And Solutions

Hey fellow developers!

I hope you’re all doing well. I’m currently working on a project that involves analyzing Java exceptions and errors. To enhance the accuracy of my analysis, I’m in need of a comprehensive dataset that includes various Java exceptions, errors, and their corresponding solutions. I believe having such a dataset would greatly benefit the development community as a whole.

Therefore, I’m reaching out to you all to see if anyone knows of any existing datasets or resources that provide information about Java exceptions and errors. Specifically, I’m looking for a dataset that encompasses a wide range of exceptions, covering different classes, such as NullPointerException, ArrayIndexOutOfBoundsException, and IllegalArgumentException, among others.

Ideally, the dataset would include:

- Exception/Error name
- Description and context of the exception/error
- Stack trace (if available)
- Common causes/triggers of the exception/error
- Recommended solutions and best practices to handle or avoid the exception/error
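
To make the ask concrete, here is a sketch of what a single record with those fields could look like (the field names mirror the list above; the values are made up for illustration):

```python
import json

# Hypothetical record; the values are illustrative, not taken from any real dataset.
record = {
    "exception": "java.lang.NullPointerException",
    "description": "Thrown when an application uses null where an object is required.",
    "stack_trace": "java.lang.NullPointerException\n\tat com.example.Service.process(Service.java:42)",
    "common_causes": [
        "Calling a method on an uninitialized object",
        "Accessing a field or element of a null reference",
    ],
    "recommended_solutions": [
        "Validate inputs before use",
        "Use java.util.Optional for values that may be absent",
    ],
}

print(json.dumps(record, indent=2))
```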

I understand that documenting every exception and error might be an enormous task, but even a partial dataset or relevant resources would be highly appreciated. I’m willing to put in the effort to curate and organize the information into a cohesive format, making it accessible to the community.

Additionally, if you have any personal experiences or insights related to specific Java exceptions or errors, feel free to share them! Practical examples and real-life scenarios are often invaluable for understanding and addressing these issues effectively.

Thank you in advance for your time and assistance. Your contribution will not only aid my project but will also assist numerous developers who encounter similar challenges in their Java projects. Let’s collaborate and make Java development more seamless for everyone!

Looking forward to your suggestions, datasets, and insights.

Happy coding!

submitted by /u/Farjou69

Looking For A Dataset Of Letters. Any Ideas?

I’m doing a project for a website where I analyze the similarity in writing style and content of letters from different users and try to match each one to the user with the highest similarity. For that I need a dataset of letters/emails/long text messages, and that’s what I’m looking for. I’ve found the subreddits r/letters and r/loveletters, but they haven’t been satisfactory in terms of text quality. I’ve thought about building a dataset from sample letter texts in English exams, but since there is no authentic human writer behind those, it’s not the best source either. Historic archives exist, but since my focus is on modern, casual letter/email writing, I’ve decided to pass on them. If there were a blog, for example, where someone publicly wrote letters to someone, that would be great, but I’ve been unable to find any. Any help would be much appreciated!
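
For reference, the matching step I have in mind is roughly the following (the scikit-learn calls are real; the sample letters are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One representative letter per user; placeholder texts.
letters = {
    "user_a": "Dear friend, it has been far too long since my last letter...",
    "user_b": "Hey! Sorry for the late reply, things have been hectic lately...",
    "user_c": "Dearest, I keep thinking about our last conversation...",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(letters.values())
similarity = cosine_similarity(matrix)

users = list(letters)
for i, user in enumerate(users):
    # The best match is the other user with the highest cosine similarity.
    best = max((j for j in range(len(users)) if j != i), key=lambda j: similarity[i, j])
    print(user, "->", users[best], round(float(similarity[i, best]), 3))
```

Content similarity is only part of it; matching on writing style would need extra features (sentence length, punctuation habits, function-word frequencies), but the same nearest-neighbour matching applies.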

submitted by /u/cakeandflowers2202

You Haven’t Killed Anyone Driving, Have You? Of Course Not!

You might never have been in an accident and certainly not one where three people were sent to the hospital. Or morgue. I mean, that option was put on the table, too.

And you might not be that bad of a driver — no matter what the others say about you.

I’m in your corner here. I want you to know that. And to help you, my friend, here are 10 years of [Denver Traffic Accident data](https://www.kaggle.com/datasets/hrokrin/denver-traffic-accidents).

Now, you might be thinking: “How is this going to help me?” A valid question.

Cherry-picking is always a good option, but let’s not forget both obfuscation and actual analysis. Three solid options right there, and let’s be honest, this has already been worth your time.

Think of how good you’re going to look when you can *conclusively* (or not) show how accidents due to cell phone usage have been trending so that fender bender is not *technically* your fault.

The [attached notebook](https://www.kaggle.com/code/hrokrin/denver-traffic-accidents-eda) is there … just waiting for you. Your improvements; your questions. Just waiting.

What’s the best place to hit a pedestrian in a car? Just waiting. Which precinct does the worst job with its paperwork? Just waiting. What’s the best neighborhood to take a bike ride in case you don’t want to get hit? Just waiting. Is there a correlation between road conditions and accidents? Denver has great snow clearing right? Right? Just waiting.
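
If you want a head start on the cell-phone question, here is a rough sketch. The column names are guesses on my part, so check the actual Kaggle file for the real ones. It’s just waiting, too:

```python
import pandas as pd

# Hypothetical file and column names; inspect the Kaggle CSV for the actual schema.
df = pd.read_csv("denver_traffic_accidents.csv", parse_dates=["first_occurrence_date"])

phone = df[df["human_factor"].str.contains("cell phone", case=False, na=False)]
by_year = phone.groupby(phone["first_occurrence_date"].dt.year).size()
print(by_year)  # accidents involving cell phone use, per year
```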

Oh, and there’s a heat map.

This isn’t some picked-over dataset about people on a boat. Who cares? They’re dead already! Not that many in this dataset are.

Ok, so in all seriousness, I’d love feedback. And for you to take the two for a spin.

submitted by /u/hrokrin

ECG Data Using Apple Watch And HealthKit API In CSV Format

Hi, fellows. I need ECG data from an Apple Watch in .csv format for a project which is due in a week. I only need 10 samples to prove what I am doing. Unfortunately, I live in a region where Apple’s feature to collect and export ECG data in .csv format is not available. I need your help to get 10 ECG samples taken at rest from 10 different people using an Apple Watch and Apple’s official app, in .csv format. Can anyone here help me get the samples?

submitted by /u/u109e114

Textraction.ai Released! AI Text Parsing API

It allows extracting custom user-defined entities from free text. Very exciting!
It can extract exact values (e.g. names, prices, dates), as well as provide ChatGPT-like semantic answers (e.g. text summary).
I like the interactive demo on their website (https://www.textraction.ai/) – it allowed me to try my own texts and entities within minutes. It works great 🙂
The service is also accessible as an API for any purpose via the RapidAPI platform: https://rapidapi.com/textractionai/api/ai-textraction (sign up to RapidAPI and get your own token)
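
For anyone who wants to script it rather than use the demo, a RapidAPI call generally looks like the sketch below. The header convention is standard RapidAPI; the endpoint path and request fields here are my assumptions, so check the RapidAPI page for the actual contract:

```python
import requests

# Hypothetical endpoint and payload; consult the RapidAPI docs for the real schema.
url = "https://ai-textraction.p.rapidapi.com/textraction"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",  # from your RapidAPI account
    "X-RapidAPI-Host": "ai-textraction.p.rapidapi.com",
    "Content-Type": "application/json",
}
payload = {
    "text": "Invoice #123 from Acme Corp, total $450, due 2023-07-01.",
    "entities": ["company name", "total price", "due date"],  # user-defined entities
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```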

submitted by /u/DoorDesigner7589

Datalab: Automatically Detect Common Real-World Issues In Your Datasets

Hello Redditors!

I’m excited to share Datalab — a linter for datasets.

I recently published a blog introducing Datalab and an open-source Python implementation that is easy to use for all data types (image, text, tabular, audio, etc.). For data scientists, I’ve made a quick Jupyter tutorial to run Datalab on your own data.

All of us who have dealt with real-world data know it’s full of issues like label errors, outliers, (near) duplicates, drift, etc. One line of open-source code, datalab.find_issues(), automatically detects all of these issues.

In Software 2.0, data is the new code, models are the new compiler, and manually-defined data validation is the new unit test. Datalab combines any ML model with novel data quality algorithms to provide a linter for this Software 2.0 stack that automatically analyzes a dataset for “bugs”. Unlike data validation, which runs checks that you manually define via domain knowledge, Datalab adaptively checks for the issues that most commonly occur in real-world ML datasets without you having to specify their potential form. Whereas traditional dataset checks are based on simple statistics/histograms, Datalab’s checks consider all the pertinent information learned by your trained ML model.
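
Roughly, usage on a toy dataset looks like the sketch below (see the Jupyter tutorial linked above for the authoritative version; treat the exact argument names here as my reading of the docs):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab import Datalab  # assumption: the open-source package described in this post

# Toy data; any trained model works, Datalab only needs its out-of-sample predicted probabilities.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["label"] = y
pred_probs = cross_val_predict(LogisticRegression(), X, y, cv=5, method="predict_proba")

lab = Datalab(data=df, label_name="label")
lab.find_issues(pred_probs=pred_probs, features=X)
lab.report()  # summary of detected label errors, outliers, (near) duplicates, etc.
```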

Hope Datalab helps you automatically check your dataset for issues that may negatively impact subsequent modeling — it’s so easy to use you have no excuse not to 😛

Let me know your thoughts!

submitted by /u/jonas__m

Slct.ai – A Simple-To-Use AI Tool To Get Any Data With Just One URL

My friend and I are working on a small project to make it possible to easily request any data with one URL. It’s best used for reference data, test data, education/teaching contexts, and training data.

Here is an example of how you can use it to get any data in pandas:

```python
import pandas as pd

url = "https://slct.ai/us_states_and_populations.csv"
df = pd.read_csv(url)
```

Would love to hear how the community would like to see this evolve, plus any general feedback.

submitted by /u/Upstairs-Security-66

Dataset Of EEG Recording During A Passive Viewing Video Task

Hi!

I am looking for datasets in which EEG recordings are made while participants watch videos. I am not interested in specific videos; however, I need a dataset in which both the EEG recordings and the videos employed in the experimental setup are provided. I would like to run some analyses correlating the visual properties of each frame (e.g. brightness) with the EEG signals.
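
To make the goal concrete, the kind of analysis I have in mind looks roughly like this (the file names are placeholders and the alignment to the EEG sampling rate is deliberately crude):

```python
import cv2
import numpy as np

# Placeholder inputs: the video shown to the participant and one EEG channel.
cap = cv2.VideoCapture("stimulus_video.mp4")
brightness = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    brightness.append(gray.mean())  # mean luminance of each frame
cap.release()
brightness = np.array(brightness)

eeg = np.load("eeg_channel.npy")  # placeholder: one channel, already trimmed to the video
# Crude alignment: resample the EEG to one value per video frame.
eeg_per_frame = np.interp(
    np.linspace(0, 1, len(brightness)), np.linspace(0, 1, len(eeg)), eeg
)
print("corr(brightness, EEG):", np.corrcoef(brightness, eeg_per_frame)[0, 1])
```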

I have found the SEED dataset; however, the original videos are not provided. Does anyone know of a dataset that provides both the EEG recordings and the videos?

submitted by /u/stephdaedalus

Market Research For Big Data – All Suggestions Welcome!

Hey Everyone,

This is a bit of an odd request, but I am looking for help from anyone who works with big data, now or in the past.

I am specifically trying to learn about the buying and selling of user data: where it is bought, how much it costs, who sells it, what the process is like, whether it is just a giant CSV file, etc.

Any help is very much appreciated!

submitted by /u/Crumbedsausage

Reliable Datasets For Tourism Arrivals Per Country?

I am looking for statistics about tourist arrivals in different countries. I found WorldData.info, World Bank, and Statista, but I am not sure if these sources are reliable and the numbers are accurate. It seems that the data on these websites is inconsistent because they confuse the terms “tourists” (= people traveling for leisure) and “visitors” (= also including people traveling for business). Can anyone help me find a reliable and extensive dataset for tourism arrivals only?

submitted by /u/Edc312

Looking For Websites Where “Regular” People Upload Publicly Available PDFs

Hello 🤗

I want to build a dataset of manipulated documents, with the original document and the modified version, because I’m working on a model to localize those forgeries 🧐 The available public datasets are not sufficient, but I believe it is possible to build one without resorting to synthetic data. On the French gazette website, organizations and funds are required to upload their financial reports every year, and the reports are publicly available. If they make a mistake, the wrong document is left on the website for a while and a rectified document has to be uploaded. Now, if the two versions match pixel to pixel everywhere except for a tiny portion, then the document has only been modified digitally and not rescanned. I have been able to find a few pairs of documents like that, but not nearly enough to train a model. Do you know of any websites that work the same way, where people upload PDFs and those PDFs are sometimes rectified while both versions stay online? Preferably free-form PDFs rather than a specific form like the US gazette.
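
To clarify what I mean by matching pixel to pixel, the check I run is roughly the following (the file names and the tolerance/threshold are placeholders):

```python
import numpy as np
from pdf2image import convert_from_path  # requires poppler installed

# Placeholder paths: the original filing and its later rectified version.
pages_a = convert_from_path("report_original.pdf", dpi=150)
pages_b = convert_from_path("report_rectified.pdf", dpi=150)

for i, (a, b) in enumerate(zip(pages_a, pages_b)):
    a = np.asarray(a.convert("L"), dtype=np.int16)
    b = np.asarray(b.convert("L"), dtype=np.int16)
    if a.shape != b.shape:
        print(f"page {i}: different size, probably rescanned")
        continue
    changed = np.abs(a - b) > 20  # per-pixel difference above a small tolerance
    # A tiny changed area suggests a digital edit; a large one suggests a rescan or full replacement.
    print(f"page {i}: {changed.mean():.2%} of pixels differ")
```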

Thank you for your help!

submitted by /u/VegetableMistake5007

Inexpensive Demographic Interests/hobbies Dataset?

I’m looking for a data set that links demographic background of a person (e.g. age, gender, education, etc.) to a list of personal interests like hobbies or buying habits (e.g. pets, sports, cars, etc.). The dataset could explain consumer behavior for e.g. marketing analysis or targeted advertising.

Is there such a dataset that is “inexpensive” (e.g. 1000 USD one-time purchase) or, ideally, free?

The ones I found turned out to be very costly yearly subscriptions.

Thanks a lot for any recommendations and insights!

submitted by /u/Immediate-Albatross9

Where To Find Census Tract Racial Datasets?

Hey,

I’ve been mapping out Potentially Underserved Communities in North Carolina over the past few weeks, and I have a time-series animation from 2010-2023 at the census tract level, but my professors want me to go further back with the data. It seems the first American Community Survey 5-Year Estimates came out in 2010, so I think I’ll have to use Decennial Census data, but I was having trouble locating anything prior to 2010 on their website. Any tips?
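
One lead I’ve started looking at: the Census Bureau API also exposes the 2000 Decennial Census (Summary File 1) down to the tract level. A rough sketch follows; P001001 is just total population as a placeholder, and the race-table variable codes (the P003/P004 tables) would still need to be checked against the variable list:

```python
import requests

# 2000 Decennial Census, SF1, all tracts in North Carolina (state FIPS 37).
# Swap P001001 (total population) for the race-table variables after checking
# https://api.census.gov/data/2000/dec/sf1/variables.html
url = (
    "https://api.census.gov/data/2000/dec/sf1"
    "?get=NAME,P001001&for=tract:*&in=state:37%20county:*"
)
rows = requests.get(url).json()
header, data = rows[0], rows[1:]
print(header)
print(data[:3])  # first few tracts
```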

submitted by /u/Riley_L27

[self-promotion] All TV Series Details Dataset From TheMovieDB

Hello /r/Datasets,

I present you a dataset including the details of all the TV series (155k+) available on The Movie Database. The dataset is available on Kaggle.

Generation

This dataset was generated in ~10 hours by fetching each ID from The Movie Database API (225k+ IDs).

You can generate the same dataset using my NodeJS application, available as open source on GitHub.
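
If you just want to spot-check a single series rather than run the full app, the underlying API call looks roughly like this (you need your own free TMDB API key; 1399 is only an example ID):

```python
import requests

API_KEY = "YOUR_TMDB_API_KEY"  # free key from themoviedb.org
series_id = 1399               # example ID

resp = requests.get(
    f"https://api.themoviedb.org/3/tv/{series_id}",
    params={"api_key": API_KEY},
)
details = resp.json()
print(details["name"], details["first_air_date"], details["number_of_seasons"])
```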

Missing data

Some data is missing for some series because The Movie Database API does not provide it. This mostly happens with older, lesser-known TV series.

Including

id, name, original name, overview, tagline, in production?, status, original language, origin country, created by, first air date, last air date, number of episodes, number of seasons, production companies, poster path, genres, vote average, vote count, popularity

I hope I got this post right; I wasn’t sure how to go about it. I also hope this dataset can be useful to you!

submitted by /u/kodle