submitted by /u/cavedave
Is per-part servicing/replacement data for automobiles and motorcycles available? If so, where can I access it?
Example: date serviced = 01/01/2020, part replaced = front driver’s side shock absorber, odometer during service = 20,000 km.
submitted by /u/officialisma
There are two parts to this question.
First question: Let’s say I want to train an ML model to detect a basic disease from an image, say of a brain. I can find a large dataset of regular (healthy) brains. Then I find multiple smaller datasets, each with fewer images of diseased brains. So I combine all these smaller datasets into one, then use this new dataset (brains with disease) and the other dataset (the large one with regular brains) for classification. Is this possible?
Second question: can we extend this to multiple classes? Say we have a disease that requires many conditions/symptoms to detect. Can I find these conditions in multiple datasets (one contains characteristics, one contains duration, one includes images, etc.) and essentially merge them all into one, as long as they all relate to the same disease?
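To make the first idea concrete, here is a minimal sketch, assuming each dataset is just a list of image paths. The file names, the source names and both helper functions are hypothetical. It pools the small disease datasets into one positive class, keeps track of which dataset each image came from, and computes inverse-frequency class weights, since the healthy class will usually be much larger.

```python
from collections import Counter

def combine_datasets(healthy, disease_sets):
    """Return (path, label, source) rows; label 0 = healthy, 1 = disease."""
    rows = [(path, 0, "healthy_large") for path in healthy]
    for source_name, paths in disease_sets.items():
        rows.extend((path, 1, source_name) for path in paths)
    return rows

def class_weights(rows):
    """Inverse-frequency weights so the rarer class is not drowned out."""
    counts = Counter(label for _, label, _ in rows)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# Hypothetical file names standing in for real image paths.
healthy = [f"healthy_{i}.png" for i in range(8)]
disease_sets = {"set_a": ["a1.png", "a2.png"], "set_b": ["b1.png", "b2.png"]}

rows = combine_datasets(healthy, disease_sets)
weights = class_weights(rows)
```

Keeping the source column around makes it possible to check later whether a model is picking up the disease or just differences between the datasets (scanner, resolution, cropping).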
submitted by /u/ResearchingTinBot
Does anyone have a working link to the Million Song Dataset? The original one that was hosted on AWS (https://aws.amazon.com/datasets/million-song-dataset/) does not exist anymore. Even if you only have a copy somewhere, please do share. This is for a class project and I’d be grateful for any help.
submitted by /u/Aspiring_DE
For my ML project I need scans or PDFs of banking statements to train a model. Synthetic data may do; the main thing is that I need a diverse set.
Business banking statements are needed too.
submitted by /u/i_kramer
My question is regarding this Formula 1 dataset
https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
It contains multiple CSV files: circuit data, driver IDs, lap times, results, etc. I’m currently trying to merge these into a single usable CSV. I’m very new to data analysis/coding, so is this something that is possible? If it is, how would I go about doing that? Appreciate the help!
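For what it's worth, a merge like this is usually a join on the shared ID columns. Below is a minimal sketch using only Python's standard library. The tiny inline tables stand in for the real files, and the column names (`driverId`, `raceId`, and so on) are assumptions that should be checked against the actual CSV headers.

```python
import csv
import io

# Tiny stand-ins for the real files; the column names are assumptions.
RESULTS_CSV = """resultId,raceId,driverId,positionOrder
1,10,100,1
2,10,101,2
"""
DRIVERS_CSV = """driverId,surname
100,Hamilton
101,Vettel
"""
RACES_CSV = """raceId,year,name
10,2019,British Grand Prix
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def index_by(rows, key):
    """Build a lookup table from an ID column to its row."""
    return {row[key]: row for row in rows}

def merge_results(results, drivers, races):
    """Attach driver and race columns to every result row (a left join)."""
    drivers_by_id = index_by(drivers, "driverId")
    races_by_id = index_by(races, "raceId")
    merged = []
    for result in results:
        row = dict(result)
        row.update(drivers_by_id.get(result["driverId"], {}))
        row.update(races_by_id.get(result["raceId"], {}))
        merged.append(row)
    return merged

merged = merge_results(
    read_rows(RESULTS_CSV), read_rows(DRIVERS_CSV), read_rows(RACES_CSV)
)
```

With the real files, `open(path, newline="")` would replace `io.StringIO`, and `csv.DictWriter` can write `merged` back out as a single CSV. Lap times have many rows per driver per race, so joining them in as well would multiply the number of rows.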
submitted by /u/FalconStone95
Hi everyone,
I’m currently working on an assignment for a Junior Data Engineer role, and I could use some guidance. The task involves merging three datasets from different sources (Facebook, Google, and Company Website) into one comprehensive dataset. The columns I’m focusing on are:
Domain (most reliable)
Phone Number (second most reliable)
Name
Category
Address
I’ve mostly cleaned the datasets, but I need to merge them accurately. My main goals are to:
Merge the datasets using one or two columns (Domain and Phone Number).
Ensure there is no duplicated information and that the rows complement one another to create the most accurate and reliable data.
Could anyone suggest the best steps to take for this process? Should I use tools like Power Query or MySQL? Any recommendations for tutorials or YouTube videos would also be greatly appreciated.
Thanks in advance for your help!
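One possible shape for the merge, sketched in plain Python under stated assumptions: records are dicts, sources are passed most trusted first (that ordering is an assumption, not part of the assignment), records are matched on the normalized domain, and the phone number is used as a fallback when the domain is missing. All function names here are hypothetical.

```python
FIELDS = ("domain", "phone", "name", "category", "address")

def normalize_domain(domain):
    """Lower-case the domain and drop a leading 'www.'."""
    d = (domain or "").strip().lower()
    return d[4:] if d.startswith("www.") else d

def normalize_phone(phone):
    """Keep digits only so differently formatted numbers compare equal."""
    return "".join(ch for ch in (phone or "") if ch.isdigit())

def merge_sources(sources):
    """Merge record lists; earlier sources win when a field conflicts."""
    merged = {}
    phone_to_key = {}
    for records in sources:
        for rec in records:
            domain = normalize_domain(rec.get("domain"))
            phone = normalize_phone(rec.get("phone"))
            if not domain and phone in phone_to_key:
                key = phone_to_key[phone]   # match an earlier record by phone
            else:
                key = domain or phone
            if not key:
                continue                    # nothing to match on
            target = merged.setdefault(key, {})
            clean = dict(rec, domain=domain, phone=phone)
            for field in FIELDS:
                # Fill gaps only; never overwrite a more trusted value.
                if not target.get(field) and clean.get(field):
                    target[field] = clean[field]
            if phone:
                phone_to_key.setdefault(phone, key)
    return list(merged.values())

# Made-up records for illustration.
website = [{"domain": "www.Acme.com", "phone": "", "name": "Acme Inc"}]
google = [{"domain": "acme.com", "phone": "+1 (555) 010-0000", "category": "Hardware"}]
facebook = [{"domain": "", "phone": "15550100000", "name": "ACME", "address": "1 Main St"}]

companies = merge_sources([website, google, facebook])
```

A limitation of this sketch: a record keyed only by phone is not re-linked if a later record supplies both that phone and a domain. The same logic can be expressed in SQL with joins and `COALESCE`, or in Power Query with merge steps.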
submitted by /u/FortaDeMunca
Can anyone please tell me where I can find a US temperature dataset covering all the years of this century? In particular, I am looking for temperatures in Fahrenheit, averaged per month or per day, for all states; it doesn’t have to be for each city. I couldn’t really find a good one online.
submitted by /u/Boring-Baker-3716
Hello everyone, I would like to work on my data analysis skills and am on the hunt for a few datasets that I could work on. I want to work on my Excel, SQL and Tableau skills. I would love to get hold of some datasets that range from extremely easy to intermediate level so that I can improve my skills gradually. Any recommendations on a data viz tool to use, or anything else, are highly appreciated too. Thank you!
submitted by /u/Shoddy-Scallion4712
It would be really helpful if someone can share some sources for fetching real-time and historic data for blockchain metrics, the following parameters to be specific:
Average block size
Number of user addresses
Number of transactions
Miners’ revenue
The data should preferably begin from the year of 2017.
submitted by /u/Mustaksi
I am trying to find a way to get all bills that were in Congress (Senate and House) with their information (such as the title of the bill, what the bill is about, etc.) and to find the distribution of votes on each bill by representative and state.
I looked into
1) https://api.congress.gov/#/bill/bill_list_all – it seems you can find a specific bill, but there is no way to search for and download all of, say, the roughly 2,000 bills of the 118th Congress (2023-2024) at once. I was also unable to find vote information.
2) https://projects.propublica.org/represent/ – no longer working
3) https://www.govtrack.us/congress/votes – for example https://www.govtrack.us/congress/votes/118-2024/h328#details . This option seems to have the information I am looking for but they are no longer allowing bulk data.
For option 3, I guess I could brute-force it by collecting all the URLs from the HTML, then writing a script to visit each URL and parse the HTML into JSON/XML of some sort, but that doesn’t seem great.
Would love to know if anyone has any suggestions.
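The URL-collecting step of that brute-force idea could look roughly like this. It is a sketch using only Python's standard library. The sample HTML is made up, the `VoteLinkParser` class is hypothetical, and the only thing taken from the post is the `/congress/votes/` path pattern from the example URL. The real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class VoteLinkParser(HTMLParser):
    """Collect unique absolute URLs of links that look like vote pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "/congress/votes/" in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:   # skip duplicates, keep page order
                self.links.append(url)

# Made-up listing page; a real run would feed in downloaded HTML instead.
SAMPLE_HTML = """
<a href="/congress/votes/118-2024/h328">H.R. vote 328</a>
<a href="/about">About</a>
<a href="/congress/votes/118-2024/h328">duplicate</a>
"""

parser = VoteLinkParser("https://www.govtrack.us")
parser.feed(SAMPLE_HTML)
vote_links = parser.links
```

Each collected URL would then be fetched and parsed in a second pass, with the extracted fields written out via `json.dump`.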
submitted by /u/psychic_shadow_lugia
I am trying to further my Excel skills, and eventually also Python, Power BI and SQL. I just find it fun and I think they are good skills to have.
My question is: what are some of the first things to examine after getting a dataset and cleaning it?
I’m working with some datasets from Kaggle.
Are there some things experienced people always do? Like making a top 5 of valuables, or of top sellers, etc.? Or is it something completely different that I am skipping?
submitted by /u/FuegoFlamingo
Hello, suppose I have built a “user review on products” dataset by scraping from a website.
Now I want to publish the dataset. 1. Do I need to get their consent to publish it? 2. What if I can’t reach them to get consent?
If you all could kindly point me to a solution, thanks.
submitted by /u/Second_Naf
I want to do a practice project for the agriculture or food industries.
Open to adjacent industries as well.
Thanks!
submitted by /u/jd2004ed
Hi all, I just released a lot of SEC datasets that you can access either through Dropbox or with my Python package datamule.
Datasets:
Every 10-K & 10-Q since 2001 (~200 GB unzipped each, split into archives of ~1 GB)
Every FTD since 2004
Company Metadata (e.g. SIC code, address)
Company Former names
If you’re interested in SEC data, I recommend taking a look at the package as it has a lot of nice features & contains information on the data sources. (Also XBRL, etc…)
Links: https://github.com/john-friedman/datamule-python, https://www.dropbox.com/scl/fo/byxiish8jmdtj4zitxfjn/AAaiwwuyaYp_zRfFyqfBUS8?rlkey=g1zk5pg7iendbsa34ltnokuxl&st=t7cb6pp5&dl=0
submitted by /u/status-code-200
Hi, I need an influencer dataset: raw data on Instagram influencers. It looks easy, but I cannot find it in API form; every site that has this data turns it into a web-based search, and I don’t need that. I just need the data for my startup. API-based would be perfect, but .csv is also fine. I need to update it every month.
I need to search by followers, category/niche and location (of the influencer or target audience).
Hope somebody can help me…
PS: I’d also appreciate knowing whether I can get this using some easy Instagram scraper; I don’t know much about these.
submitted by /u/Fancy_Way5065
Couldn’t find a dataset on photocatalysts.
Material ID,Material Name,Synthesis Method,Type of Defect,Defect Engineering Method,Characterization Techniques,Band Gap Energy (eV),Photocatalytic Activity,Applications,References
1,Titanium Dioxide,Nanosheets,Sulfur Doping,Doping,XPS,2.8,85% degradation of dye in 2h,Water purification,"Smith et al., Journal of Catalysis, 2020"
2,Zinc Oxide,Nanoparticles,Oxygen Vacancies,H2 Treatment,PL,3.0,90% degradation of pollutant in 1h,Air purification,"Johnson et al., Applied Surface Science, 2019"
It should be similar to this format. If anyone could help me find datasets on this …
submitted by /u/SENBONZAKURASOUL13
Hi.
I’m looking for any datasets related to historical box scores, game logs, or season totals for players in the NFL. I was previously using SportsData.io, but I found major inaccuracies with their season-long statistics for players (including some players having impossible records like 0.3 sacks).
I understand a lot of good datasets in this field are locked behind commercial grade API access, but I am curious if anyone here knows of any directions I can explore. Thanks!
submitted by /u/Even_Contribution_32
I have a project that needs log data from Linux machines recording when users change their screen brightness, change their volume level, open apps, and change other settings and options like these. I searched online and found some datasets, but they were more focused on systemd and kernel logs. Is there a dataset with this kind of data, or a way to log such actions?
submitted by /u/moTheastralcat
Guys, for a project I need a dataset of floor plans with dimensions. If you know of any such dataset, please link it here. Thanks in advance!
submitted by /u/Asta-12
Hello everyone,
I’m currently working on a university project where I need to build a machine learning system from scratch to recognize handwritten digits. The dataset I’m using is derived from the UCI Optical Recognition of Handwritten Digits Data Set but is relatively small—about 2,800 samples with 64 features each, split into two sets.
Constraints:
I must implement the algorithm(s) myself without using existing machine learning libraries for core functionalities.
The BASE goal is to surpass the baseline performance of a K-Nearest Neighbors classifier using Euclidean distance, as reported on the UCI website; my goal is to find the best algorithm out there that can deal with this kind of dataset, as I plan on using the results of this coursework for another University’s application.
I cannot collect or use additional data beyond what is provided.
What I’m Looking For:
Algorithm Suggestions: Which algorithms perform well on small datasets and can be implemented from scratch? I’m considering SVMs, neural networks, ensemble methods, or advanced KNN techniques.
Overfitting Prevention: Best practices for preventing overfitting when working with small datasets.
Feature Engineering: Techniques for feature selection or dimensionality reduction that could enhance performance.
Distance Metrics: Recommendations for alternative distance metrics or weighting schemes to improve KNN performance.
Resources: Any tutorials, papers, or examples that could guide me in implementing these algorithms effectively.
I’m aiming for high performance and would appreciate any insights or advice!
Thank you!
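For reference, the baseline named in the constraints (K-Nearest Neighbors with Euclidean distance) fits in a few lines of library-free Python. This is only a sketch: the toy 2-feature points below are made up and stand in for the 64-feature digit vectors, and the function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: distance(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test points whose predicted label matches the true one."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return correct / len(test_y)

# Made-up toy data: two well-separated clusters.
train_X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[1, 1], [8, 8]]
test_y = [0, 1]

score = accuracy(train_X, train_y, test_X, test_y, k=3)
```

The `distance` parameter is there so that another metric (for example Manhattan distance) can be swapped in without touching the voting logic. Ties in the vote fall to whichever label was seen first among the neighbours, which is a simplification.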
submitted by /u/Shin-Zantesu
Hello guys
I am looking for a dataset that contains videos. It would be better if it is a benchmark dataset (completely optional).
I thought CamVid had videos in it, but it turns out it doesn’t. They are frames from a dashcam.
submitted by /u/maifee
Hello, I’m a university student and I’m making a machine learning model that will predict how much the population in a city would grow according to its infrastructure.
I have been able to extract and create my own infrastructure dataset with the OSM Python library, but I’m having trouble finding and/or creating the population dataset.
So far I’ve found a few datasets with city population, but unfortunately they only contain data for one or two years, and I would like data covering at least 5 years.
If anyone knows of one, I’d appreciate the help! 😀
submitted by /u/Top_Hyena1923
I’m working on a project to roughly estimate the GHG impact of flights going in and out of particular U.S. airports. A dataset including the airport code and individual flights with sources/destinations, aircraft type and airline would be ideal. Does anyone know if there is something publicly available like this?
submitted by /u/dalberts
PROJECT 2
REGRESSION PROJECT GUIDELINES
One of the most versatile and powerful tools of econometric analysis is the multiple regression model. This project will give you practical experience in applying multiple regression analysis to a “real-world” problem. You will do the following:
1. Formulate a relationship between some variable of interest (call it Y) and a set of explanatory variables, X1, X2, X3, etc.
2. Gather observations on Y and X1, X2, X3, etc.
3. At least one of the variables should be a dummy variable (0/1).
4. At least 30-50 observations (companies, people, countries, etc., as the case may be).
5. At least 6 variables (pieces of information about the observations; e.g., stock price, revenues, profits, salaries, gender, etc.).
6. The dependent variable can’t be a 0/1 variable. It has to be a continuous variable.
7. Perform regression analysis on the relationship and possible alternative specifications.
8. Test a number of hypotheses about the relationship.
9. Hold out between 5 and 7 observations from model building.
10. Summarize your results, qualifying them and drawing appropriate conclusions.
I. PROPOSAL
The topic should have an economic or business emphasis; however, you should feel free to introduce any dimensions or variables that you feel are important in explaining your model. Choose a topic that interests you and about which you have some knowledge. Feel free to speak to any professor from another class (or even me) about a possible topic. The topic must be a clear, analytical topic. You must pose a hypothesis or relationship, gather evidence or data, and come to conclusions about the relationship you have specified. This is not simply a descriptive paper. The paper must be technically challenging; in other words, the conclusion cannot be drawn by a casual look at the data. Choose a topic for which you can find data.
II. FINAL PAPER – OUTLINE
1. Title: The title must be related to the topic of your paper. It is acceptable to phrase your title as a question. Do not call your paper “Multiple Regression …,” since that is a technique, not a topic or problem.
2. Introduction: The introduction provides a concise, descriptive statement introducing the background (nature), objective, and scope of the study. The reason for the study should be explained, such as testing a particular hypothesis.
3. Theoretical Model: State the hypothesis you are testing. Describe your dependent and independent variables. Explain why you include them and what impact you think they will have on your dependent variable.
4. Empirical Results: From the regression results, present your findings and discuss them. Interpret the results of the regression analysis in a report of no more than one page (per model) using non-technical language. This interpretation should be meaningful to a person who has never had a statistics course.
5. Hold Out Sample: Remove any variables that you think do not make sense from a p-value or sign perspective. Use the hold-out sample to predict the value. Compare with the actual value. How close do you come to the actual value?
6. Conclusion: Sum up your results. Mention the key points of your analysis. Are there any implications from your research? (no more than one page)
7. Page Limit: at least 4 but no more than 5 pages
Case Evaluation
Your case will be evaluated on the following criteria:
• Quality of data
• Quality of writing: how well do you communicate your approach to the problem and your analysis of results? How well do you express technical issues in “plain English”?
• Correctness of analysis and conclusions.
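The fit-then-hold-out workflow in the guidelines can be sketched as follows. This is an illustration only. The data is synthetic and noise-free (generated from y = 2 + 3·X1 + 5·dummy), the sample is far smaller than the 30-50 observations required, and `fit_ols` and `predict` are hypothetical helpers written from scratch rather than functions of any statistics package.

```python
def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    n = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(coefs, x):
    """Predicted Y for one observation x = [X1, X2, ...]."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))

# Synthetic, noise-free toy data: y = 2 + 3*x1 + 5*dummy (made up).
X = [[float(i), float(i % 2)] for i in range(12)]
y = [2 + 3 * x1 + 5 * d for x1, d in X]

# Hold out the last 5 observations; build the model on the rest.
train_X, hold_X = X[:-5], X[-5:]
train_y, hold_y = y[:-5], y[-5:]

coefs = fit_ols(train_X, train_y)
errors = [abs(predict(coefs, x) - actual) for x, actual in zip(hold_X, hold_y)]
```

With real data the hold-out errors will not be near zero; their size relative to the spread of Y is what answers the "how close do you come" question. p-values and hypothesis tests need a statistics package or a spreadsheet's regression tool and are not covered by this sketch.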
submitted by /u/mollykakers
Either my google-fu is failing me or they really do keep this close to the chest. I was hoping to settle a debate between my friends and me about certain preference settings men use.
Anyone know where or if I would be able to find this?
submitted by /u/logikgames
Looking to get a dataset which includes information such as group names, ages, locations, and whether group sizes have increased or decreased.
submitted by /u/Over_Wrangler_6882