Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Offering To Scrape Datasets For Free

Hey all, if anyone wants a dataset that would be obtainable through web scraping, send me a request in a comment and I’ll scrape the data for you. Obviously I can’t just scrape anything, but I have quite a bit of scraping experience.

One rule: the data you want scraped can’t be behind a paywall or a login.

submitted by /u/k7r7f80d

Corpus Of Task-oriented Dialogues Focused On Quantities?

To analyse spontaneous but comparable speech samples, researchers often use task-oriented corpora, like the Montclair Map Task Corpus. These are, naturally, focused on location/answering the question ‘where are you?’

Is there anything like this, but focused on determining ‘how much’? Basically, sets of dialogues where speakers have to communicate quantities (price, size, number of marbles, etc)?

It doesn’t have to be only quantities; location or other information is fine, too. It’s just that the map corpora have very few explicit mentions of distances; they mostly contain direction and environment descriptions.

submitted by /u/dennu9909

Irish Gender Pay Gap Dataset – Paygap.ie [self-promotion]

I realise this is a little niche, but I’m posting it in case it is of use to someone out there. Since 2022, Irish companies with over 250 employees have had to publish a gender pay gap report (legislation very similar to that in the UK). However, the govt. here hasn’t standardised the format or provided the central portal it promised (which, again, the UK has). This makes the data itself difficult to gather and compare.

So I decided to fill that niche with paygap.ie. This site gathers as much available data as I could find for the 2022 and 2023 reports. There are searchable tables for each year, and the dataset can be downloaded per year or as a combined CSV. I’ve also linked to GitHub, where I maintain regular updates to the dataset, in case that’s anyone’s preferred jam. So hopefully it’s as accessible and reusable a dataset as possible for anyone looking to explore this kind of data and include Irish statistics in it.

Every single piece of data has been gathered by hand, by me, because there is absolutely no standard format so there’s no useful way to write a scraper or processor. But hopefully I’ve walked so others can run, since now the data is available in several usable formats.

Hopefully someone here finds this interesting!

Disclaimer: I’ve labelled this as self-promotion per the rules, because I made this website, but I really just want the data to be accessible to all. It’s under an open license, free of charge, there’s no advertising on the site. I get nothing from this other than the joy of solving a problem, and I legitimately wish our govt. would get their shit together with a central portal and make all my hard work redundant.

submitted by /u/zenbuffy

Dateno – A New Dataset Search Engine

Hi! We recently launched Dateno, a dataset search engine with a search index of 10M datasets from 4.9k data catalogs, near real-time search, 13 facets and filters, and data quality as a priority. It’s still very much a beta, with lots of duplicates, errors, broken links and so on, but it works and you can try it.

Inside the search engine is a Common Data Index, a registry of all available data catalogs that I worked on last year.

Nearly 10k data catalogs were collected, documented and analyzed, their APIs discovered, and so on. It was quite boring but necessary work to see the data catalog landscape around the world.

Dateno is the next step after these catalogs. We analyzed the existing APIs and tested several crawling techniques beyond OAI-PMH indexing and indexing schema.org Dataset objects. The search index is now complete, and an open API will come soon.
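
For readers unfamiliar with the schema.org baseline mentioned above, here is a minimal sketch of what indexing schema.org Dataset objects can look like: pull the JSON-LD blocks out of a page and keep the ones typed as `Dataset`. This is not Dateno’s crawler. The class and function names and the sample page are made up for illustration, and the sketch ignores variants such as `@graph` wrappers or `@type` given as a list.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return every top-level JSON-LD object whose @type is "Dataset"."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for block in collector.blocks:
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the crawl
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


# Hypothetical page, made up for illustration.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example rainfall data", "license": "CC-BY-4.0"}
</script>
</head><body></body></html>
"""

found = extract_datasets(SAMPLE_PAGE)
print([d["name"] for d in found])
```

Catalogs that publish no such markup are invisible to this approach, which is presumably why other crawling techniques are needed on top of it.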

The final goal is very ambitious: we would like to create an open search index and dataset search engine that is bigger, wider and deeper than Google Dataset Search (50M datasets in early 2023), with better data quality. We plan to add more than 20M datasets during 2024, along with more features, more filters, and better understanding and representation of dataset metadata.

I’d really like to hear your thoughts on this.

Disclaimer: I am the creator and founder of Dateno. Feel free to ask me anything about it and about dataset discovery.

submitted by /u/ivan-begtin

[REQUEST] Comprehensive Dataset Of Undergraduate College Programs And Salaries

Hello, I would like a dataset that captures undergraduate college programs and the subsequent salaries of their graduates. While College Scorecard provides an extensive amount of data, it only covers students who have accepted federal financial aid. This limitation means that approximately 40% of students, those who did not receive federal financial aid, are excluded from the dataset.
My objective is to conduct a thorough comparison of salaries from graduates across various programs / institutions, striving for as much uniformity in the comparison as possible — essentially comparing “apples to apples.” To achieve this, I’m seeking a dataset that includes, but is not limited to, the following features:
– Name of institution and their respective undergraduate programs
– For each program, mean salary of graduates post-completion of these programs (1 year, 4 year, mid-career — whatever is available)

Any pointers towards datasets that include non-aided students, or resources that might be pieced together to construct a broader picture, would be immensely appreciated.
Thx!

submitted by /u/Data-Solid-Spring

Dataset Request: 360 Images And Their 2D Images

Hi everyone,

I am a third-year computing science student and I was wondering if anyone has a dataset of warped 360 images and their non-warped 2D counterparts that show only part of the image.

I haven’t been able to find anything like that, and I would really appreciate it if you could help me find even a small dataset.

submitted by /u/jagashot

My Sorta Wikipedia For Data Proposal

I’ve had this idea that I can’t shake and I’d like to ask your advice.

Some years ago I was gifted silly.io. For a while I called it the Ministry of Silly Things and it had JSON data sets of US States, Countries, planets of the solar system, table of elements, letters of the alphabet and a few other things. A visitor could download the JSON, link directly to it from other environments like an experimental data language for kids that I was working on. You could also embed it as a table in your own page, or use it as a source to make interesting graphs, learning games, etc.

I’m thinking of rebooting the project to be a Wikipedia for Computable Data. It would be like Wikipedia in that anyone can add to it. It would be computable in that all fields have schemas and units. This would let you compute something like:

– show the thickness of iPhone models over time from 2007 to the present
– plot the atomic mass of elements vs their atomic number
– graph letters of the alphabet by number of syllables 🙂
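
As a rough illustration of what "all fields have schemas and units" could mean in practice, here is a minimal sketch. The `Field` class, the schema layout and the `series` helper are hypothetical, not anything silly.io actually provides; the three element rows use standard abridged atomic masses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column of a dataset: its name, Python type, and unit."""
    name: str
    dtype: type
    unit: str  # empty string for dimensionless fields


# Hypothetical schema for a table of elements.
ELEMENT_SCHEMA = [
    Field("symbol", str, ""),
    Field("atomic_number", int, ""),
    Field("atomic_mass", float, "u"),
]

ELEMENTS = [
    {"symbol": "H", "atomic_number": 1, "atomic_mass": 1.008},
    {"symbol": "He", "atomic_number": 2, "atomic_mass": 4.0026},
    {"symbol": "Li", "atomic_number": 3, "atomic_mass": 6.94},
]


def validate(rows, schema):
    """Raise TypeError if any value does not match its declared type."""
    for row in rows:
        for field in schema:
            if not isinstance(row[field.name], field.dtype):
                raise TypeError(f"{field.name}={row[field.name]!r} is not {field.dtype.__name__}")


def series(rows, schema, x, y):
    """Return (x_label, y_label, points), with units taken from the schema."""
    by_name = {field.name: field for field in schema}

    def label(name):
        unit = by_name[name].unit
        return f"{name} ({unit})" if unit else name

    points = sorted((row[x], row[y]) for row in rows)
    return label(x), label(y), points


validate(ELEMENTS, ELEMENT_SCHEMA)
print(series(ELEMENTS, ELEMENT_SCHEMA, "atomic_number", "atomic_mass"))
```

Because the unit lives in the schema rather than in the values, a plotting layer can label axes without guessing, which is the part that plain JSON downloads do not give you.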

Do you think this is a good idea? Should I spend time working on it and, if so, which datasets should I start with?

It would be completely open source and creative commons, BTW.

submitted by /u/joshmarinacci

Dataset For PL/SQL Unit Test Generation Using LLM

Hello peeps,

I am trying to build a PoC in which I train an LLM to generate unit tests (UTs) for a given PKS (package spec file) and PKB (package body).

I have trained Llama 2 and Mixtral on the NSQL-350m dataset (text to SQL) but couldn’t get any meaningful results.

Can somebody point me to a public GitHub repo with multiple PL/SQL packages and their UTs?

Or any dataset that has SQL-to-SQL generation prompts that could be helpful for this use case.

submitted by /u/DifficultyProud9291

Suggestion For Real Life Dataset For University Project

Hi, I need a real-life dataset that has more than 5000 records and can be broken down into at least 10 tables after BCNF or other normalisation methods. It can be from any domain. I checked various domains like e-commerce and medical fields on Kaggle, data.gov and data.world, but I am struggling to find a dataset that can be broken down into 10 tables.
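
One way to screen a candidate dataset before committing to it is to test which functional dependencies hold in the flat file: a dependency whose left-hand side is not a key hints at a group of columns that could become its own table. The sketch below is a minimal check under that assumption. The function name and the sample rows are made up for illustration, and a dependency that holds in a sample is only evidence, not proof, that it holds in the domain.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if the functional dependency determinant -> dependent
    holds in rows, i.e. equal determinant values always come with equal
    dependent values."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in determinant)
        value = tuple(row[column] for column in dependent)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical flat order data, made up for illustration.
ROWS = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Ana", "product_id": 100, "product_name": "Lamp"},
    {"order_id": 2, "customer_id": 10, "customer_name": "Ana", "product_id": 101, "product_name": "Desk"},
    {"order_id": 3, "customer_id": 11, "customer_name": "Ben", "product_id": 100, "product_name": "Lamp"},
]

print(fd_holds(ROWS, ["customer_id"], ["customer_name"]))  # True
print(fd_holds(ROWS, ["product_id"], ["product_name"]))    # True
print(fd_holds(ROWS, ["customer_id"], ["product_id"]))     # False
```

Here `customer_id` and `product_id` each determine a name but are not keys of the flat table, so the sample would split into at least customers, products and orders.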

Does anyone have any suggestions for a dataset or where I can find this type of dataset?
Thanks!

submitted by /u/cod5_1o

Advice For Textual Dataset For A NLP Project

Hi, I am doing an NLP-based project where I am grouping the communities of different apps and games to classify them as toxic, supportive or neutral. I then want to compare the different communities.

For apps and games, I am using Play Store and App Store reviews. For Reddit, I am using past data sets available for different subreddits.

I need suggestions for two types of data:
1. In-game chat from different Massively Multiplayer Online (MMO) games.
2. Posts and comments from community social media apps other than Reddit. I don’t want to use Twitter.

Any suggestions on how to get this data or other data sources that I can explore will be really helpful.

Thanks in advance.

submitted by /u/SilentScroller23

How Would You Guys Go About Cleaning Up PDF Data?

I’m trying to take the CDSs (common data sets) of a bunch of universities and compare them, but I need to find some way to automate the process of extracting the data from them (probably into a SQL database). The issue is that although the questions on the forms are standardized, some universities convey the answers very differently. For example, look at C7 on the Stanford and Princeton common data sets.

So how should I go about doing this? I tried to leverage Claude’s Sonnet model but it didn’t go too well: the context was too large for Claude and it was mixing up multiple fields.

And using something like tabula or pdfplumber doesn’t really help since the universities format it so differently.
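
One thing that might help with both the context size and the mixed-up fields is to cut the extracted text into one chunk per question before parsing it or sending it to a model. Below is a minimal sketch, assuming the text has already been extracted (for example with pdfplumber’s `page.extract_text()`) and that question IDs such as `C7` appear at the start of a line. The regular expression, the function name and the sample text are assumptions made for illustration, not real CDS wording.

```python
import re

# Assumption: an ID is a section letter A-J plus a 1-2 digit number at the
# start of a line, e.g. "C7". IDs with a letter suffix ("C8A") are not
# split out and stay inside the preceding chunk.
QUESTION_ID = re.compile(r"^([A-J]\d{1,2})\b", re.MULTILINE)


def split_by_question(text):
    """Map each question ID to its text, running up to the next ID."""
    matches = list(QUESTION_ID.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # If an ID occurs twice, the later chunk overwrites the earlier one.
        sections[match.group(1)] = text[match.start():end].strip()
    return sections


# Hypothetical extracted text, made up for illustration.
SAMPLE_TEXT = """C6 Open admission policy: No
C7 Relative importance of factors
Rigor of secondary school record  Very Important
Class rank  Considered
C8 Entrance exams
SAT or ACT  Required
"""

sections = split_by_question(SAMPLE_TEXT)
print(sorted(sections))
print(sections["C7"])
```

Each chunk can then be handled by a per-question prompt or parser, so a layout quirk in one university’s C7 cannot bleed into its C8.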

Any advice would be appreciated, thank you!

submitted by /u/Roxy201

Political Party Co-Preference Dataset

I’m running simulations of ranked-choice and other voting methods, and I want to find a survey-supported dataset of related preferences between US political parties, e.g. people who prefer the Green Party have some proportional preference for the Democratic Party. I would also accept a survey-supported metric or principal component analysis, quantitative or qualitative (e.g. a political spectrum), that captures meaningful variations of preference in survey samples. I would very strongly prefer non-partisan research; however, if that is simply not possible to find, I would at least need studies from multiple partisan organizations to compare.

(I’m also looking to learn more about who is doing research in this area so I can follow and look for more datasets that come up)
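
To make the requested quantity concrete, here is a minimal sketch of one possible definition of co-preference: the share of each party’s first-choice voters who rank each other party second. The function name is hypothetical, and the ballots use placeholder parties "A", "B" and "C" with made-up rankings, not survey data.

```python
from collections import Counter, defaultdict


def co_preference(ballots):
    """Estimate P(second choice = Y | first choice = X) from ranked ballots.

    Ballots that rank fewer than two parties carry no second-choice
    information and are skipped.
    """
    counts = defaultdict(Counter)
    for ballot in ballots:
        if len(ballot) >= 2:
            counts[ballot[0]][ballot[1]] += 1
    return {
        first: {second: n / sum(seconds.values()) for second, n in seconds.items()}
        for first, seconds in counts.items()
    }


# Hypothetical ballots, most preferred party first.
BALLOTS = [
    ("A", "B", "C"),
    ("A", "B"),
    ("A", "B", "C"),
    ("A", "C", "B"),
    ("B", "A"),
    ("C",),
]

matrix = co_preference(BALLOTS)
print(matrix["A"])  # {'B': 0.75, 'C': 0.25}
```

A matrix like this can be sampled from to generate synthetic ballots for a simulation, which is why survey data in ranked or pairwise form would be the most directly usable.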

submitted by /u/bduxbellorum

Seeking Health-Related Longitudinal Datasets

Hi all,

We’re looking for good sources of longitudinal/time-series datasets in the area of health. The datasets have to include repeated entries (e.g., one person over a long time period). The domains we are interested in include:

– exercise decisions (e.g., which days people choose to exercise/run etc)

– gym and fitness class attendance

– male/female birth order (per family) or in a delivery room

– dieting & nutrition (e.g., the order that people consume healthy or unhealthy foods each day)

– pain intensity

– weight development and progression

We have searched quite a bit on common repositories like Kaggle, Data World, and UCI Machine Learning, but we have not had much luck in finding data that meets our requirements and forms a decent time series. Any specific suggestions (e.g., organisations or repositories that have publicly available health data) would be very helpful.

Please note that we are excluding datasets that show trends that are monotonically increasing or decreasing. This generally removes broader health domains like disease spread (e.g., Covid case numbers), worldwide health development (e.g., global nutrition), life expectancy, and mortality rates.
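
For anyone screening candidate datasets against this criterion, a minimal sketch of the check is below. It treats a series as monotonic if it never decreases or never increases, with ties allowed; that definition, the function name and the sample series are assumptions for illustration. Real trends are noisy, so an exact check like this is only a crude first filter.

```python
def is_monotonic(values):
    """Return True if values never decrease or never increase (ties allowed).

    Series with fewer than two points count as monotonic.
    """
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Hypothetical series, made up for illustration.
weekly_gym_visits = [3, 1, 2, 4, 0]       # goes up and down: keep
cumulative_cases = [10, 25, 25, 60, 140]  # never decreases: exclude

print(is_monotonic(weekly_gym_visits))  # False
print(is_monotonic(cumulative_cases))   # True
```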

Thank you!

submitted by /u/Remarkable_Review327