I have to do a project on data mining to complete my degree in statistics.
Can you recommend a specific dataset that isn’t very hard to mine? I know literally nothing about this.
submitted by /u/Aston28
[link] [comments]
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
Hi guys, I am trying to analyse the credit and payment behaviour of Indian customers. For this, I am looking for substantial external public data on customer demographics, affluence, and spending, broken down by location.
TIA
submitted by /u/nerdy-oged
[link] [comments]
Hey all, if anyone wants a dataset that would be obtainable through web scraping, send me a request in a comment and I’ll scrape the data for you. Obviously I can’t just scrape anything, but I have quite a bit of scraping experience.
One rule: the data you want scraped can’t be behind a paywall or a login.
submitted by /u/k7r7f80d
[link] [comments]
What are some good, large PET-scan datasets containing PET scans of patients? They do not need to be full-body; a PET scan of any part of the body would be just fine.
submitted by /u/MrShikaslad
[link] [comments]
What are some good, large PET-scan datasets containing PET scans of patients? They do not need to be full-body; any kind of PET scan of any part of the body is fine.
submitted by /u/MrShikaslad
[link] [comments]
To analyse spontaneous but comparable speech samples, researchers often use task-oriented corpora, like the Montclair Map Task Corpus. These are, naturally, focused on location/answering the question ‘where are you?’
Is there anything like this, but focused on determining ‘how much’? Basically, sets of dialogues where speakers have to communicate quantities (price, size, number of marbles, etc)?
It need not be only quantities; location or other information is fine, too. It’s just that the map corpora have very few explicit mentions of distances; they mostly contain direction/environment descriptions.
submitted by /u/dennu9909
[link] [comments]
Does anyone know of any source or website I can scrape for data on serial killers: trauma, reason for killing, geographical locations, victimology, etc.? Please, this is urgent; it’s for a project I’m working on and it’s due in 2 days.
submitted by /u/Toottootyarabamoot
[link] [comments]
Hi, can anyone suggest a way to generate a list of manufacturing companies in the US? I am not looking for a specific industry, but for manufacturing companies across all industries. Thanks!
submitted by /u/todays_dumbest
[link] [comments]
I realise this is a little niche, but in case it is of use to someone out there. Since 2022 Irish companies with over 250 employees have had to publish a gender pay gap report (legislation very similar to that in the UK). However the govt. here haven’t standardised the format or provided the central portal they promised (again like they have in the UK). This makes the data itself difficult to gather and compare.
So I decided to fill that niche with paygap.ie. This site gathers as much available data as I could find for 2022 and 2023 reports. There are searchable tables for each year, and the dataset can be downloaded per year or as a combined CSV. I’ve also linked to GitHub, where I maintain regular updates to the dataset in case that’s anyone’s preferred jam. So hopefully it’s as accessible and reusable a dataset as possible for anyone looking to explore this kind of data and include Irish statistics in it.
Every single piece of data has been gathered by hand, by me, because there is absolutely no standard format so there’s no useful way to write a scraper or processor. But hopefully I’ve walked so others can run, since now the data is available in several usable formats.
Hopefully someone here finds this interesting!
Disclaimer: I’ve labelled this as self-promotion per the rules, because I made this website, but I really just want the data to be accessible to all. It’s under an open license, free of charge, there’s no advertising on the site. I get nothing from this other than the joy of solving a problem, and I legitimately wish our govt. would get their shit together with a central portal and make all my hard work redundant.
submitted by /u/zenbuffy
[link] [comments]
What datasets do you recommend that have information about products (name, price, brand, reviews, etc.)? I have seen that NHTSA has car recall data, and CPSC also has good data. Any other sources?
submitted by /u/Green-Piano-2545
[link] [comments]
Preferably for the presidential election, but for all others as well. I would like name, address, city, state, zip code, and other related information.
submitted by /u/Prodigious1995
[link] [comments]
Basically, I want to see which songs people are listening to in different countries. Unfortunately, this data is not available through the Spotify API.
The bigger the data set and the further back it reaches, the better.
Thank you
submitted by /u/ForeskinBiter
[link] [comments]
I’m looking to build an audio dataset for ASR. Can someone suggest how to go about it?
submitted by /u/Trysem
[link] [comments]
Hi! We recently launched Dateno, a dataset search engine with a search index of 10M datasets from 4.9k data catalogs, near-real-time search, 13 facets and filters, and data quality as a priority. It’s still very much a beta, with lots of duplicates, errors, broken links and so on, but it works and you can try it.
Inside the search engine is a Common Data Index, a registry of all available data catalogs that I worked on last year.
Nearly 10k data catalogs were collected, documented, and analyzed, and their APIs were discovered. It was quite boring but necessary work for seeing the data catalog landscape around the world.
Dateno is the next step after these catalogs. We analyzed existing APIs and tested several crawling techniques beyond OAI-PMH indexing and indexing schema.org Dataset objects. The search index is now complete, and an open API will come soon.
The final goal is very ambitious: we would like to create an open search index and dataset search engine that is bigger, wider, and deeper, with better data quality, than Google Dataset Search (50M datasets in early 2023). We plan to add more than 20M datasets during 2024, along with more features, more filters, and better understanding and representation of dataset metadata.
Really want to see your thoughts on this.
Disclaimer: I am the creator and founder of Dateno. Feel free to ask me anything about it or about dataset discovery.
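For readers unfamiliar with the schema.org approach mentioned above, here is a minimal sketch of pulling Dataset objects out of a page's JSON-LD. It is not Dateno's crawler; the class and function names are made up for illustration, and it deliberately ignores `@graph` wrappers and list-valued `@type` fields that real pages sometimes use.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so concatenate them.
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def extract_datasets(html):
    """Return every top-level JSON-LD object whose @type is exactly "Dataset"."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD rather than fail the whole page
        items = data if isinstance(data, list) else [data]
        datasets += [
            item for item in items
            if isinstance(item, dict) and item.get("@type") == "Dataset"
        ]
    return datasets


# Hypothetical page used only to exercise the function.
page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Dataset", "name": "Example"}'
    "</script></head><body></body></html>"
)
names = [d["name"] for d in extract_datasets(page)]
```

A real crawler would also need fetching, deduplication, and metadata normalisation, which this sketch leaves out.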
submitted by /u/ivan-begtin
[link] [comments]
Hello, I would like a dataset that captures undergraduate college programs and the subsequent salaries of their graduates. While College Scorecard provides an extensive amount of data, it only covers students who have accepted federal financial aid. This limitation means that approximately 40% of students, those who did not receive federal financial aid, are excluded from the dataset.
My objective is to conduct a thorough comparison of salaries from graduates across various programs / institutions, striving for as much uniformity in the comparison as possible — essentially comparing “apples to apples.” To achieve this, I’m seeking a dataset that includes, but is not limited to, the following features:
– Name of institution and their respective undergraduate programs
– For each program, mean salary of graduates post-completion of these programs (1 year, 4 year, mid-career — whatever is available)
Any pointers towards datasets that include non-aided students, or resources that might be pieced together to construct a broader picture, would be immensely appreciated.
Thx!
submitted by /u/Data-Solid-Spring
[link] [comments]
Hi everyone,
I am a third-year computing science student and I was wondering if anyone has a dataset of warped 360 images and their non-warped 2D counterparts that show only part of the image.
I haven’t been able to find anything like that, and I would really appreciate it if you could help me find even a small dataset.
submitted by /u/jagashot
[link] [comments]
Fortunately, it is pretty easy to find the top websites through Wikipedia and Similarweb.
However this really only applies to desktop web browsing… any chance anyone knows where I can find the mobile phone equivalent?
submitted by /u/paprika-orimoto
[link] [comments]
I’ve had this idea that I can’t shake and I’d like to ask your advice.
Some years ago I was gifted silly.io. For a while I called it the Ministry of Silly Things and it had JSON data sets of US States, Countries, planets of the solar system, table of elements, letters of the alphabet and a few other things. A visitor could download the JSON, link directly to it from other environments like an experimental data language for kids that I was working on. You could also embed it as a table in your own page, or use it as a source to make interesting graphs, learning games, etc.
I’m thinking of rebooting the project to be a Wikipedia for Computable Data. It would be like Wikipedia in that anyone can add to it. It would be computable in that all fields have schemas and units. This would let you compute something like:
– show the thickness of iPhone models over time from 2007 to the present
– plot the atomic mass of elements vs their atomic number
– graph letters of the alphabet by number of syllables 🙂
Do you think this is a good idea? Should I spend time working on it, and if so, which datasets should I start with?
It would be completely open source and creative commons, BTW.
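To make the "schemas and units" idea concrete, here is a rough sketch of what a computable dataset could look like. The `Field` class, the schema layout, and the helper names are all hypothetical, not an existing silly.io format; the three atomic masses are standard rounded values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """One column of a dataset: a name, a Python type, and an optional unit."""
    name: str
    dtype: type
    unit: Optional[str] = None  # e.g. "u" for unified atomic mass units


# Hypothetical schema for a "table of elements" dataset.
ELEMENT_SCHEMA = [
    Field("name", str),
    Field("atomic_number", int),
    Field("atomic_mass", float, unit="u"),
]

ELEMENTS = [
    {"name": "Lithium", "atomic_number": 3, "atomic_mass": 6.94},
    {"name": "Hydrogen", "atomic_number": 1, "atomic_mass": 1.008},
    {"name": "Helium", "atomic_number": 2, "atomic_mass": 4.0026},
]


def validate(rows, schema):
    """Raise TypeError if any row is missing a field or has the wrong type."""
    for row in rows:
        for field in schema:
            if not isinstance(row.get(field.name), field.dtype):
                raise TypeError(f"{field.name!r} must be {field.dtype.__name__}: {row}")


def series(rows, schema, x, y):
    """Return sorted (x, y) points plus axis labels that carry the units."""
    by_name = {field.name: field for field in schema}

    def label(name):
        unit = by_name[name].unit
        return f"{name} ({unit})" if unit else name

    points = sorted((row[x], row[y]) for row in rows)
    return points, label(x), label(y)


validate(ELEMENTS, ELEMENT_SCHEMA)
points, x_label, y_label = series(ELEMENTS, ELEMENT_SCHEMA, "atomic_number", "atomic_mass")
```

Because the unit lives in the schema rather than in the values, a plotting or query layer can label axes and refuse to mix incompatible fields without guessing.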
submitted by /u/joshmarinacci
[link] [comments]
Hello peeps,
I am trying to build a PoC in which I train an LLM to generate unit tests (UTs) for a given pks (spec file) and pkb (package body).
I have trained Llama 2 and Mixtral on the NSQL-350m dataset (text to SQL) but couldn’t get any meaningful results.
Can somebody point me to a public GitHub repo with multiple PL/SQL packages and their UTs?
Or any dataset with SQL-to-SQL generation prompts that could be helpful for this use case?
submitted by /u/DifficultyProud9291
[link] [comments]
I need datasets of facial grimaces like left eye blink, right eye blink, tongue out, duckface, open mouth, etc. I don’t know how or where to look.
submitted by /u/Jetza99
[link] [comments]
Hi, I need a real-life dataset that has more than 5000 records and can be broken down into at least 10 tables after BCNF or other normalisation methods. It can be from any domain. I checked various domains like e-commerce and medicine on Kaggle, data.gov, and data.world, but I am struggling to find a dataset that can be broken down into 10 tables.
Does anyone have any suggestions for a dataset or where I can find this type of dataset?
Thanks!
submitted by /u/cod5_1o
[link] [comments]
Does anyone know of any database or dataset on indoor and outdoor plants for gardening, with scientific names, characteristics, light requirements (low light/more light), need for water, preferred soil, etc.? I reckon some nurseries might maintain such datasets. Thank you in advance!
submitted by /u/hyeppy
[link] [comments]
Hi, I am doing an NLP-based project in which I classify the communities of different apps and games as toxic, supportive, or neutral. I then want to compare the different communities.
For apps and games, I am using Play Store and App Store reviews. For Reddit, I am using past datasets available for different subreddits.
I need suggestions for two types of data: 1. In-game chat from different massively multiplayer online (MMO) games. 2. Posts and comments from community social media apps other than Reddit. I don’t want to use Twitter.
Any suggestions on how to get this data or other data sources that I can explore will be really helpful.
Thanks in advance.
submitted by /u/SilentScroller23
[link] [comments]
I need labelled osteoarthritis datasets to train an AI model. The images can be either MRI scans or X-rays. Does anyone know where I can find them?
submitted by /u/IDAB3002
[link] [comments]
I’m trying to take the CDSs (Common Data Sets) of a bunch of universities and compare them, but I need to find some way to automate extracting the data from them (probably into a SQL database). The issue is that although the questions on the forms are standardized, some universities present the answers very differently. For example, look at C7 on the Stanford and Princeton common data sets.
So how should I go about doing this? I tried to leverage Claude’s Sonnet model, but it didn’t go too well: the context was too large for Claude, and it was mixing up multiple fields.
And using something like tabula or pdfplumber doesn’t really help since the universities format it so differently.
Any advice would be appreciated, thank you!
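One possible way around the context-size problem is to split the extracted text by item label first, so each model call (or hand-written parser) sees a single item such as C7. The sketch below assumes the PDF text has already been extracted by some tool and that each item starts on its own line with a label like "C7"; real CDS files will break that assumption in places. The function name and the sample text are invented for illustration.

```python
import re

# CDS item labels are a section letter (A-J) followed by a number, e.g. "C7".
# Assumption: each item begins at the start of a line with its label.
ITEM_LABEL = re.compile(r"^([A-J]\d{1,2})\b", re.MULTILINE)


def split_by_item(text):
    """Map each item label to the text from that label up to the next label."""
    matches = list(ITEM_LABEL.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        label = match.group(1)
        # A label can repeat (e.g. on a continuation page), so append to it.
        sections[label] = (sections.get(label, "") + text[match.start():end]).strip()
    return sections


# Invented placeholder text, not taken from any university's CDS.
sample = """C6 Open admission policy: No
C7 Relative importance of each factor
Rigor of secondary school record: Very Important
C8 Entrance exams
SAT or ACT: Considered
"""

sections = split_by_item(sample)
c7_only = sections["C7"]
```

Each value in `sections` is small enough to process on its own, and the label becomes a natural key for a SQL table of (university, item, raw text) rows.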
submitted by /u/Roxy201
[link] [comments]
Hi, I am searching for datasets for my NILM disaggregation project, but all the links I found are down. Can anyone share a link or send a dataset to me?
submitted by /u/marouska91
[link] [comments]
I’m running simulations of ranked-choice and other voting methods, and I want to find a survey-supported dataset of relative preferences between US political parties, e.g. people who prefer the Green Party have some proportional preference for the Democratic Party. I would also accept a survey-supported metric or principal component analysis on quantitative or qualitative data, e.g. a political spectrum that captures meaningful variations of preference in survey samples. I would very strongly prefer non-partisan research; however, if that is simply not possible to find, it would at least be necessary to find studies from multiple partisan organizations to compare.
(I’m also looking to learn more about who is doing research in this area so I can follow and look for more datasets that come up)
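For context, here is a minimal sketch of how such preference data could feed a ranked-choice simulation. It uses instant-runoff counting and abstract parties A, B, and C; the first-choice shares and affinity weights are made-up placeholders, which is exactly the part a survey-supported dataset would replace. All names are illustrative.

```python
import random
from collections import Counter


def instant_runoff(ballots):
    """Return the instant-runoff winner for a non-empty list of ranked ballots."""
    candidates = {c for ballot in ballots for c in ballot}
    while True:
        # Count each ballot for its highest-ranked candidate still in the race.
        tally = Counter({c: 0 for c in candidates})
        for ballot in ballots:
            for choice in ballot:
                if choice in candidates:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        leader, votes = max(tally.items(), key=lambda kv: kv[1])
        if votes * 2 > total or len(candidates) == 1:
            return leader
        # Eliminate the lowest candidate (ties broken alphabetically,
        # an arbitrary choice made for this sketch).
        loser = min(sorted(candidates), key=lambda c: tally[c])
        candidates.remove(loser)


def sample_ballot(rng, first_share, affinity):
    """Draw a full ranking: first choice by share, the rest by affinity weights."""
    parties = list(first_share)
    first = rng.choices(parties, weights=[first_share[p] for p in parties])[0]
    ballot = [first]
    remaining = [p for p in parties if p != first]
    while remaining:
        weights = [affinity[first][p] for p in remaining]
        pick = rng.choices(remaining, weights=weights)[0]
        ballot.append(pick)
        remaining.remove(pick)
    return ballot


# Placeholder numbers for illustration only, NOT survey results.
FIRST_SHARE = {"A": 0.45, "B": 0.40, "C": 0.15}
AFFINITY = {
    "A": {"B": 0.3, "C": 0.7},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"A": 0.8, "B": 0.2},
}

rng = random.Random(0)
ballots = [sample_ballot(rng, FIRST_SHARE, AFFINITY) for _ in range(1000)]
winner = instant_runoff(ballots)
```

Swapping `instant_runoff` for another counting rule while keeping the same ballots is one way to compare voting methods on identical simulated electorates.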
submitted by /u/bduxbellorum
[link] [comments]
Hi all,
We’re looking for good sources of longitudinal/time-series datasets in the area of health. The datasets have to include repeated entries (e.g., one person over a long time period). The domains we are interested in include:
– exercise decisions (e.g., which days people choose to exercise/run etc)
– gym and fitness class attendance
– male/female birth order (per family) or in a delivery room
– dieting & nutrition (e.g., the order that people consume healthy or unhealthy foods each day)
– pain intensity
– weight development and progression
We have searched quite a bit on common repositories like Kaggle, Data World, and UCI Machine Learning, but we have not had much luck in finding data that meets our requirements and is a decent time series. Any specific suggestions (e.g., organisations or repositories that have publicly available health data) would be very helpful.
Please note that we are excluding datasets that show trends that are monotonically increasing or decreasing. This generally removes broader health domains like disease spread (e.g., Covid case numbers), worldwide health development (e.g., global nutrition), life expectancy, and mortality rates.
Thank you!
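For anyone screening candidate datasets against the monotonicity criterion above, a quick check along these lines may help. It is a sketch that treats a series as monotonic if it never decreases or never increases; the function name and both example series are invented for illustration.

```python
def is_monotonic(values):
    """True if the series never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


# Invented example series, not real data.
candidates = {
    "cumulative_case_count": [10, 25, 25, 60, 140],  # only ever goes up
    "daily_pain_score": [3, 5, 4, 4, 6, 2],          # moves both ways
}

# Keep only the series that are not monotonic.
kept = [name for name, series in candidates.items() if not is_monotonic(series)]
```

Noisy real data may need a looser test, for example checking a smoothed version of the series, since a single measurement blip would make a long-term trend pass this strict check.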
submitted by /u/Remarkable_Review327
[link] [comments]