Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Is There A Good Up-to-date Rotten Tomatoes Dataset?

I’m looking for a Rotten Tomatoes dataset that has user reviews, critic reviews and movies (metadata isn’t strictly necessary but would be preferred) for a recommendation system I’m trying to build. Are there any good datasets that would work for this, or would I need to scrape it myself? (I have zero web scraping experience.)

submitted by /u/RealHellcharm

AI Datasets Built By Community – Need Feedback

hey there,

after 5 years of building AI models from scratch, I know to the bone how much the dataset matters to model quality. OpenAI is where it is solely because of its high-quality datasets.

I haven’t seen a good “service” that offers a way to build a dataset (any task: chat, instruct, QA, speech, etc.) that’s backed by the community.

I’m thinking of starting a service that will help companies & individuals build a dataset by rewarding people with a crypto coin as an incentive mechanism. once the dataset is built and data collection is finalized, it could be sent to HF or any other service for model training / finetuning.

what’s your feedback, folks? what do you think about this? does the market exist?

submitted by /u/betimd

Offering To Scrape Datasets For Free

Hey all, if anyone wants a dataset that would be obtainable through web scraping, send me a request in a comment and I’ll scrape the data for you. Obviously I can’t just scrape anything, but I have quite a bit of scraping experience.

One rule: the data you want scraped can’t be behind a paywall or a login.

submitted by /u/k7r7f80d

Corpus Of Task-oriented Dialogues Focused On Quantities?

To analyse spontaneous but comparable speech samples, researchers often use task-oriented corpora, like the Montclair Map Task Corpus. These are, naturally, focused on location/answering the question ‘where are you?’

Is there anything like this, but focused on determining ‘how much’? Basically, sets of dialogues where speakers have to communicate quantities (price, size, number of marbles, etc)?

It doesn’t have to be only quantities; it could include location or other information, too. It’s just that the map corpora have very few explicit mentions of distances; they’re mostly direction/environment descriptions.

submitted by /u/dennu9909

Irish Gender Pay Gap Dataset – Paygap.ie [self-promotion]

I realise this is a little niche, but in case it is of use to someone out there. Since 2022 Irish companies with over 250 employees have had to publish a gender pay gap report (legislation very similar to that in the UK). However the govt. here haven’t standardised the format or provided the central portal they promised (again like they have in the UK). This makes the data itself difficult to gather and compare.

So I decided to fill that niche with paygap.ie. The site gathers as much data as I could find from the 2022 and 2023 reports. There are searchable tables for each year, and the dataset can be downloaded per year or as a combined CSV. I’ve also linked to GitHub, where I maintain regular updates to the dataset, in case that’s anyone’s preferred jam. So hopefully it’s as accessible and reusable a dataset as possible for anyone looking to explore this kind of data and include Irish statistics in it.

Every single piece of data has been gathered by hand, by me, because there is absolutely no standard format so there’s no useful way to write a scraper or processor. But hopefully I’ve walked so others can run, since now the data is available in several usable formats.

Hopefully someone here finds this interesting!

Disclaimer: I’ve labelled this as self-promotion per the rules, because I made this website, but I really just want the data to be accessible to all. It’s under an open license, free of charge, there’s no advertising on the site. I get nothing from this other than the joy of solving a problem, and I legitimately wish our govt. would get their shit together with a central portal and make all my hard work redundant.

submitted by /u/zenbuffy

Dateno – A New Dataset Search Engine

Hi! We recently launched Dateno, a dataset search engine with a search index of 10M datasets from 4.9k data catalogs, near real-time search, and 13 facets and filters, built with data quality as a priority. It’s still very much a beta, with lots of duplicates, errors, broken links and so on, but it works and you can try it.

Inside the search engine is a Common Data Index, a registry of all available data catalogs that I worked on last year.

Nearly 10k data catalogs were collected, documented and analyzed, their APIs discovered, and so on. It is quite boring but necessary work for seeing the data catalog landscape around the world.

Dateno is the next step after these catalogs. We analyzed existing APIs and tested several crawling techniques beyond OAI-PMH indexing or indexing schema.org dataset objects. The search index is now complete, and an open API will come soon.

The final goal is very ambitious: we would like to create an open search index and dataset search engine that is bigger, wider, deeper and of better data quality than Google Dataset Search (50M datasets in early 2023). We plan to add more than 20M datasets during 2024, along with more features, more filters, and better understanding and representation of dataset metadata.

I’d really like to hear your thoughts on this.

Disclaimer: I am the creator and founder of Dateno. Feel free to ask me anything about it or about dataset discovery in general.

submitted by /u/ivan-begtin

[REQUEST] Comprehensive Dataset Of Undergraduate College Programs And Salaries

Hello, I would like a dataset that captures undergraduate college programs and the subsequent salaries of their graduates. While College Scorecard provides an extensive amount of data, it only covers students who have accepted federal financial aid. This limitation means that approximately 40% of students, those who did not receive federal financial aid, are excluded from the dataset.
My objective is to conduct a thorough comparison of salaries from graduates across various programs / institutions, striving for as much uniformity in the comparison as possible — essentially comparing “apples to apples.” To achieve this, I’m seeking a dataset that includes, but is not limited to, the following features:
– Name of institution and their respective undergraduate programs
– For each program, mean salary of graduates post-completion of these programs (1 year, 4 year, mid-career — whatever is available)

Any pointers towards datasets that include non-aided students, or resources that might be pieced together to construct a broader picture, would be immensely appreciated.
Thx!

submitted by /u/Data-Solid-Spring

Dataset Request: 360 Images And Their 2D Images

Hi everyone,

I am a third-year computing science student and I was wondering if anyone has a dataset of 360 warped images and their 2D non-warped counterparts that only show part of the image.

I haven’t been able to find anything like that, and I would really appreciate it if you could help me find even a small dataset.

submitted by /u/jagashot

My Sorta Wikipedia For Data Proposal

I’ve had this idea that I can’t shake and I’d like to ask your advice.

Some years ago I was gifted silly.io. For a while I called it the Ministry of Silly Things and it had JSON data sets of US States, Countries, planets of the solar system, table of elements, letters of the alphabet and a few other things. A visitor could download the JSON, link directly to it from other environments like an experimental data language for kids that I was working on. You could also embed it as a table in your own page, or use it as a source to make interesting graphs, learning games, etc.

I’m thinking of rebooting the project to be a Wikipedia for Computable Data. It would be like Wikipedia in that anyone can add to it. It would be computable in that all fields have schemas and units. This would let you compute something like:

– show the thickness of iPhone models over time from 2007 to the present
– plot the atomic mass of elements vs their atomic number
– graph letters of the alphabet by number of syllables 🙂
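
To make the “fields have schemas and units” idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `Field` class, the `validate` and `series` helpers, the schema layout and the three sample rows are hypothetical and not taken from silly.io.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column of a dataset: a name, a Python type, and a unit."""
    name: str
    dtype: type
    unit: str  # "" means dimensionless; "u" is the unified atomic mass unit


# Hypothetical schema and sample rows, for illustration only.
ELEMENT_SCHEMA = [
    Field("symbol", str, ""),
    Field("atomic_number", int, ""),
    Field("atomic_mass", float, "u"),
]

ELEMENTS = [
    {"symbol": "Li", "atomic_number": 3, "atomic_mass": 6.94},
    {"symbol": "H", "atomic_number": 1, "atomic_mass": 1.008},
    {"symbol": "He", "atomic_number": 2, "atomic_mass": 4.0026},
]


def validate(rows, schema):
    """Raise TypeError if a row is missing a field or has the wrong type."""
    for row in rows:
        for field in schema:
            if not isinstance(row.get(field.name), field.dtype):
                raise TypeError(f"{field.name!r} must be {field.dtype.__name__}: {row}")


def series(rows, schema, x, y):
    """Return sorted (x, y) points plus axis labels that carry the units."""
    by_name = {field.name: field for field in schema}

    def label(name):
        unit = by_name[name].unit
        return f"{name} ({unit})" if unit else name

    points = sorted((row[x], row[y]) for row in rows)
    return points, (label(x), label(y))


validate(ELEMENTS, ELEMENT_SCHEMA)
points, labels = series(ELEMENTS, ELEMENT_SCHEMA, "atomic_number", "atomic_mass")
print(labels)
print(points)
```

Because the unit lives in the schema rather than in the values, a plotting tool could label the axes (or refuse to compare incompatible units) without guessing.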

Do you think this is a good idea? Should I spend time working on it, and if so, which datasets should I start with?

It would be completely open source and creative commons, BTW.

submitted by /u/joshmarinacci

Dataset For PL/SQL Unit Test Generation Using LLM

Hello peeps,

I am trying to build a PoC in which I train an LLM to generate unit tests (UTs) for a given .pks (package spec) and .pkb (package body).

I have trained Llama 2 and Mixtral on the NSQL-350m dataset (text to SQL) but couldn’t get any meaningful results.

Can somebody point me to a public GitHub repo with multiple PL/SQL packages and their UTs?

Or any dataset with SQL-to-SQL generation prompts that could be helpful for this use case.
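
If such a repo turns up, one possible way to turn it into training inputs is to pair each spec file with its body by file name. The sketch below is only an illustration under stated assumptions: it assumes a package’s spec and body share a base name (for example `billing.pks` and `billing.pkb`), and the `pair_packages` and `build_prompt` helpers and the prompt wording are hypothetical.

```python
from pathlib import PurePath


def pair_packages(paths):
    """Group paths by base name and keep packages that have both files.

    Assumes a spec and its body share a base name, compared
    case-insensitively (an assumption about the repo layout).
    """
    specs, bodies = {}, {}
    for raw in paths:
        path = PurePath(raw)
        suffix = path.suffix.lower()
        if suffix == ".pks":
            specs[path.stem.lower()] = path
        elif suffix == ".pkb":
            bodies[path.stem.lower()] = path
    complete = sorted(specs.keys() & bodies.keys())
    return {name: {"spec": specs[name], "body": bodies[name]} for name in complete}


def build_prompt(spec_text, body_text):
    """Assemble one model input; the wording is a placeholder, not a tested prompt."""
    return (
        "Write unit tests for the following PL/SQL package.\n\n"
        f"-- Package spec\n{spec_text}\n\n"
        f"-- Package body\n{body_text}\n"
    )


pairs = pair_packages(["src/Billing.pks", "src/Billing.pkb", "src/orphan.pks"])
print(list(pairs))
```

The matching unit test file for each package would become the training target; how to locate it depends on the naming convention of the repo in question.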

submitted by /u/DifficultyProud9291