Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Dataset To Measure How Frequently Vehicular Parts Are Subjected To Wear And Tear For A Specific Brand / Specific Model (ANY WILL DO).

So a few months back our professor asked for thesis topics. I had been absent for a few days beforehand, so I didn’t know about it. He started asking for everyone’s topic, which could still be changed. Everyone before me was naming complex ML projects or data analysis topics, so I just panicked and chose this topic. Fast forward a few months: I procrastinated on all my projects, so when the time came I just gave a rough proposal, and it turns out you cannot change your topic anymore. I searched on Kaggle but just can’t seem to find the dataset. I literally have no clue where or how to search for it, so even if I can’t find it here, where should I begin to search?

Sorry for the poor English.

submitted by /u/a_non_weeb

Dataset For Malicious Posts On Reddit

I submitted a project proposal for detecting and analyzing posts with malicious intent (scams, phishing, etc.) on Reddit. But later I realized that Reddit is a very well moderated platform (at least the most popular subreddits), where there are usually no such posts. So is there any dataset covering subreddits where I can find such posts? I don’t want to change the topic of the proposal now.

submitted by /u/psbankar

Need Help Making My UCR Data Readable. Got It From ICPSR But 5 Years Are In A Strange Format I Can’t Do Anything With

Need urgent help on converting data

I’m doing a project using UCR crime data from this source https://www.icpsr.umich.edu/web/ICPSR/series/57?start=0&SERIESQ=57&ARCHIVE=ICPSR&PUBLISH_STATUS=PUBLISHED&sort=score%20desc&rows=50&q=County%20level%20arrest

The data from 2003-2008 is only available in a strange format, while 1994-2002 and 2009-2016 are available as complete datasets in either R or Stata format. Can someone please help with that?

submitted by /u/ItsRickDalton

What Is The Best Way To Build A Model From A Dataset That Has Many Dummy Variables?

Hello everyone,

I have a linear regression model with a single dependent variable and several independent variables. Among the independent variables, I have 4 categorical variables that have been turned into dummies. However, some of the categorical variables have many levels and consequently many dummies were created…

I need to fit the model at a 95% confidence level, so I’m running the Stepwise algorithm on the model. The Stepwise algorithm “deleted” many of the dummies that had been created, so that, for example, a categorical variable that previously had 10 dummies referring to it now has only 2. That happened because some of the dummies were not significant at a confidence level of 95%…

My question is: should I discard the categorical variables that had some of their dummies excluded during the Stepwise algorithm and keep only the categorical variables whose dummies were all preserved? Or should I keep the categorical variables whose dummies have been partly excluded? Which of these 2 options is better for a predictive model?

Grateful for anyone who can help.
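
One commonly used alternative to judging dummies one by one is to test all the dummies of a categorical variable jointly, so that the variable is kept or dropped as a whole. The following is only a minimal sketch of such a partial F-test using NumPy; the data is synthetic and every name and number in it is hypothetical, not taken from the model described above.

```python
import numpy as np


def residual_sum_of_squares(X, y):
    """Fit ordinary least squares and return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return float(residuals @ residuals)


def group_f_statistic(X_other, dummies, y):
    """F statistic for dropping every dummy of one categorical variable at once."""
    X_full = np.column_stack([X_other, dummies])
    n, p_full = X_full.shape
    q = dummies.shape[1]  # number of dummies tested together
    rss_reduced = residual_sum_of_squares(X_other, y)
    rss_full = residual_sum_of_squares(X_full, y)
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - p_full))


# Synthetic example (hypothetical): one numeric predictor, one 4-level category.
rng = np.random.default_rng(0)
n = 200
levels = rng.integers(0, 4, size=n)
x = rng.normal(size=n)
effects = np.array([0.0, 5.0, 10.0, 15.0])
y = 1.0 + 2.0 * x + effects[levels] + rng.normal(size=n)

X_other = np.column_stack([np.ones(n), x])  # intercept + numeric predictor
dummies = np.eye(4)[levels][:, 1:]          # first level is the baseline
f_stat = group_f_statistic(X_other, dummies, y)
print(f"joint F statistic for the categorical variable: {f_stat:.1f}")
```

The statistic would then be compared against an F distribution with `q` and `n - p_full` degrees of freedom at the chosen significance level.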

submitted by /u/7inchesdream

Taylor Swift (42 Albums) Lyrical Data In Textual Format [self-promotion]

I started on this idea of how there’s a Taylor Swift song for almost every generic scenario one could think of, and thought maybe I could analyse the sentiment of them. I quickly found out that I’d have to collect the data on my own, since the other sources I found (mainly on Kaggle) were not in the desired format (I wanted completely textual data).

So sharing it here in case it helps anyone else.

Dataset link

This data was collected using the lyricsgenius python library and the Genius API.

Also sharing the other datasets I found, in case they help someone:

https://www.kaggle.com/datasets/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums
https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums

submitted by /u/ishika_jo

Open Public Domain Exercise Dataset In JSON Format, 800+ Exercises & Images With A Browsable, Searchable Public Frontend [self-promotion]

I started building another fitness related app and was looking for free/open source exercise datasets and imagery, and I stumbled upon exercises.json, though it needed a bit of cleaning up & restructuring, so I:

Renamed/restructured the JSON to be more usable
Added JSON Schema for validation
Added some useful Makefile build tasks to concatenate the JSON into one single file or for importing into PostgreSQL if needed
Added a browsable/searchable frontend available at https://yuhonas.github.io/free-exercise-db/

The repo is available at

https://github.com/yuhonas/free-exercise-db
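
For anyone who prefers Python over the Makefile tasks, the "concatenate the JSON into one single file" step could be sketched roughly as below. This assumes one JSON file per exercise inside a single directory; the function name, directory layout and output path are hypothetical and not taken from the repo.

```python
import json
from pathlib import Path


def concatenate_json(source_dir, output_file):
    """Merge every *.json file in source_dir into one JSON array in output_file."""
    records = []
    # Sort the paths so the output order is stable between runs.
    for path in sorted(Path(source_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    with open(output_file, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    return len(records)
```

A call such as `concatenate_json("exercises", "all_exercises.json")` (both paths hypothetical) would return the number of files merged.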

All feedback welcome but above all else enjoy!

submitted by /u/yuhonas

How To Get Data Containing The Country, The Phone Country Code And The Mobile Phone Digits Length?

I’m looking for data containing the name of the country, the phone country code, and the MSISDN length (without the phone country code), like this:

[{"name": "United States", "code": "+1", "length": 10}, {"name": "China", "code": "+86", "length": 11}]

I don’t need the landline number length, just the mobile number length. If a country can have multiple number lengths, like 7 and 8, then the dataset should have the max length, in this case 8.

It would be great if I could get this data in JSON, but any format would do. I’m only finding this data on Wikipedia and similar sites, with no reliable sources, and it is hard to work with because I need to copy and paste from web pages.

I need this data to validate input field phone numbers. I currently have 80 countries already, I’d like to have the complete list though.
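
To make the intended use concrete, here is a rough sketch of how such a list could drive input validation. The two entries are the ones from the example above; the function name and the decision to treat "length" as an upper bound (because the dataset would store the max length) are my own assumptions.

```python
import re

# Sample data in the shape described above (only the two example entries).
COUNTRIES = [
    {"name": "United States", "code": "+1", "length": 10},
    {"name": "China", "code": "+86", "length": 11},
]


def validate_mobile(raw, countries=COUNTRIES):
    """Return the matching country name, or None if the number does not fit."""
    # Keep digits and "+" only; drop spaces, dashes, parentheses and so on.
    cleaned = re.sub(r"[^\d+]", "", raw)
    # Try longer country codes first so a short code cannot shadow a longer one.
    for entry in sorted(countries, key=lambda e: len(e["code"]), reverse=True):
        if cleaned.startswith(entry["code"]):
            national = cleaned[len(entry["code"]):]
            # "length" holds the maximum length, so it is checked as an upper bound.
            if national.isdigit() and len(national) <= entry["length"]:
                return entry["name"]
    return None


print(validate_mobile("+1 (415) 555-0100"))
print(validate_mobile("+44 7700 900123"))
```

The first call matches the "+1" entry and the second returns None because "+44" is not in the sample list.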

submitted by /u/lynob

Historical Financial Data Available On DoltHub

You can find the following repositories on DoltHub:

Earnings: Financial statements (balance sheets, income statements, cash flow statements). Annual figures back to 2012; quarterly figures back to 2016. Covers stocks listed in the US. Analyst estimates (sales and earnings per share). Recorded weekly. Data goes back to 2018. Covers stocks listed in the US.

Options: Option prices, vols, greeks for SPDR ETFs and ETF components. Recorded Monday, Wednesday, and Friday. Saves 2, 4, 8 week expirations and does not save all strikes. Data goes back to 2019. Records 30 ATM volatility history for easy computation of implied volatility rank.

Rates: US Treasury interpolated yield curve as published by the Fed. Recorded daily. Data goes back to 1990.

Stocks: Daily prices, splits, dividends, and symbol info for US listed stocks. Data goes back to 2018. Symbols that have been delisted are still present in the data set.

DoltHub is an interface to Dolt where you can query data using the same SQL as you would in MySQL. This allows for much more flexible and powerful querying across datasets than extracting data from multiple CSVs.

Note: This is not self promotion as I am not affiliated with DoltHub

submitted by /u/funkinaround

Searching For A Dataset For NFL Salaries Going Back As Far As 1967

I know this may be a pipe dream, but I’ve been searching for salary information on 10-15 NFL players, most of whom played in the ’90s, going as far back as 1967. Does anyone have any idea where I could find this? nflfastr only dates back to 1999, and sportsdata.io only goes back a few years.

For reference, I am doing a draft analysis and I am trying to compare previous draft pick trades. I have a “draft pick value” chart, which gives me a number value for each pick in the draft. I want to compare each side of the trade and say whether it was a “good deal” or a “bad deal” based on the total value from each side. This works fine when only picks are being traded, but when there are players involved, I am having a hard time objectively comparing each side. My thought is that the first overall pick is paid a certain amount of money when they are drafted, so I can find the value of a player by taking the value of the first overall pick and multiplying it by how many first-overall-pick salaries fit within that player’s contract (e.g. pick 1.01 is worth 3000 and is paid $7M/year; a player is paid $14M per year and is worth 6000 because 3000*2=6000). Any ideas on a different way to try this would help too.
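
Written out as code, the salary-scaling idea looks roughly like this. It is only a sketch of the arithmetic in the example above; the function names are made up, and the 1800 used for the extra pick is a placeholder rather than a real chart value.

```python
def player_trade_value(player_salary, top_pick_salary, top_pick_value):
    """Scale the top pick's chart value by how many top-pick salaries the player earns."""
    if top_pick_salary <= 0:
        raise ValueError("top_pick_salary must be positive")
    return top_pick_value * (player_salary / top_pick_salary)


def trade_balance(side_a_values, side_b_values):
    """Positive result: side A gave up more chart value than side B."""
    return sum(side_a_values) - sum(side_b_values)


# Example from the post: pick 1.01 is worth 3000 and paid $7M/year,
# the player is paid $14M/year.
player_value = player_trade_value(14_000_000, 7_000_000, 3000)
print(player_value)  # 6000.0

# Hypothetical trade: the player for pick 1.01 plus a pick worth 1800 (placeholder).
print(trade_balance([player_value], [3000, 1800]))  # 1200.0
```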

submitted by /u/jordanar189

Inorganic Chlorides And Their Associated Boiling Points

Hey!

I’m not sure if this is the right place to ask this, but I was curious if anyone here knew of a list of inorganic chlorides and their associated boiling points.

I have been working on a project to “distill” specific chlorides from a mixed group of salts. I am looking for a list of inorganic chlorides and their boiling points so I can determine the temperature ranges and the elements that are in each temperature band.

Also, if folks know of a place to find phase diagrams for inorganic chlorides, that would help too.

submitted by /u/minimalweirdness

Shapefile For 1987 Westminster Constituencies

I am struggling to find a single source for a shapefile of the England, Wales and Scotland Westminster constituency boundaries for 1987 (which I believe are the same as for 1983). I want to make a choropleth of some data I have on MPs, but I can only find separate shapefiles. I would piece them together, but I’m a beginner with this stuff, so I would like to avoid that if I can. Many thanks.

submitted by /u/MacAnBhacaigh

[Q] Need Free Dataset For Business That Applied “industry 4.0” Related Innovations

Hi,

As the headline says, I need guidance or direction on where to retrieve data for my homework. I have 15 days to complete it and the teacher is no help. Apparently I should work on my own.

I am working on a paper where I should analyze Industry 4.0: what it is, which technologies are involved, what the potential applications are, etc. (EASY PART), and then pick specific showcases and analyze how the newly implemented technologies improved their performance in comparison to the competition.

I tried to cite some annual reports but was told that’s not what I should deliver. I should run my own OLS on my own data source and do a multivariable analysis (e.g. productivity of an assembly line increased due to: trend / multiple variables usually involved / gross fixed capital formation, justified on the side as the new technology investment).

To be honest, I am lost. School is not much help. If I don’t do this in 15 days I am expelled… I also work FTE and pay the tuition, so the lack of guidance is not what I need right now. You might see this spammed all over reddit now.

Does anyone know how I can actually retrieve data for this?

submitted by /u/-Belon-

Need A Free, Public-Access, Very Easy Data Set. Please. For R Ggplot Exercise

Hello redditors,
For a university task on R visualizations with ggplot and Shiny, I’m working on a COVID in Spain data set, but I’m running into many difficulties: RStudio running slow because of 2 million rows, charts not working completely… So I’m giving up and would like to start over with an easier and smaller data set.
The only requirement is that it be free and publicly accessible.

Could you recommend a dataset, maybe related to cars, sales or tourist destinations (something easy to analyze), that contains around 5-8 columns at most and no more than a few thousand rows? The topic is free.

Thank you 🙂

submitted by /u/aquakeyblademaster

Dataset On The Arts & Culture Sector Of The United States

SMU DataArts offers detailed financial, operational, and programmatic information from thousands of nonprofit arts and cultural organizations nationwide. Files contain disaggregated unprocessed data fields in Comma Separated Value (CSV) format, and are intended for academics, students, and independent researchers with experience using raw structured data to perform calculations and analyses. Data access fee is waived for those using data for academic purposes.

https://www.culturaldata.org/what-we-do/for-researchers-advocates/access-the-dataset/

submitted by /u/planbecca

I Want A Dataset For Topic Modeling In JSON Format

I need a dataset for this assignment:
Using the concept of topic modeling, implement it using:

(i) Rule-based method

(ii) Latent Dirichlet Allocation LDA method

For your convenience, take any unlabeled dataset

Perform data cleaning

Use TF-IDF vectorizer and any clustering method in case of Rule-based method

Fit LatentDirichletAllocation estimator in case of LDA method
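
For orientation, the two methods named in the assignment could be sketched with scikit-learn roughly as below. The four-sentence corpus is made up purely for illustration, the function names are hypothetical, and fitting LDA on raw term counts (via CountVectorizer) rather than on TF-IDF is an assumption on my part, not something the assignment text specifies.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus; swap in any unlabeled dataset.
DOCS = [
    "The team won the football match after a late goal",
    "The striker scored twice and the coach praised the team",
    "The central bank raised interest rates to fight inflation",
    "Markets fell as investors worried about inflation and rates",
]


def cluster_topics(docs, n_clusters=2):
    """TF-IDF features followed by k-means; returns one cluster label per document."""
    # Lowercasing and English stop-word removal serve as minimal data cleaning.
    features = TfidfVectorizer(stop_words="english").fit_transform(docs)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(features)


def lda_topics(docs, n_topics=2):
    """Fit LDA on raw term counts; returns the document-topic matrix."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)


labels = cluster_topics(DOCS)
doc_topics = lda_topics(DOCS)
print(labels)
print(doc_topics.round(2))
```

Each row of the LDA output is a probability distribution over topics for one document, so the rows sum to 1.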

submitted by /u/Particular-Pie-1640