Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Blu-ray Film Disc Metadata Info (Color Primaries)

Is there a dataset that provides metadata from the disc info, like the following?

Format : Blu-ray Playlist
File size : 3.81 KiB
Duration : 2 h 1 min
Overall bit rate mode : Variable
Overall bit rate : 4 b/s

Video #1
ID : 4113 (0x1011)
Menu ID : 1 (0x1)
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@High
HDR format : SMPTE ST 2094 App 4, Version 1, HDR10+ Profile A compatible
Codec ID : 36
Duration : 35 s 952 ms
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate : 23.976 (24000/1001) FPS
Color space : YUV
Chroma subsampling : 4:2:0 (Type 2)
Bit depth : 10 bits
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : PQ
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : Display P3
Mastering display luminance : min: 0.0001 cd/m2, max: 1000 cd/m2
Maximum Content Light Level : 233 cd/m2
Maximum Frame-Average Light Level : 63 cd/m2
format_identifier : HDMV
Source : 00687.m2ts

Video #3
ID : 4113 (0x1011)
Menu ID : 1 (0x1)
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@High
HDR format : SMPTE ST 2094 App 4, Version 1, HDR10+ Profile A compatible
Codec ID : 36
Duration : 1 h 55 min
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate : 23.976 (24000/1001) FPS
Color space : YUV
Chroma subsampling : 4:2:0 (Type 2)
Bit depth : 10 bits
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : PQ
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : Display P3
Mastering display luminance : min: 0.0001 cd/m2, max: 1000 cd/m2
Maximum Content Light Level : 737 cd/m2
Maximum Frame-Average Light Level : 130 cd/m2
format_identifier : HDMV
Source : 00688.m2ts

Video #5
ID : 4113 (0x1011)
Menu ID : 1 (0x1)
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@High
HDR format : SMPTE ST 2094 App 4, Version 1, HDR10+ Profile A compatible
Codec ID : 36
Duration : 2 min 1 s
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate : 23.976 (24000/1001) FPS
Color space : YUV
Chroma subsampling : 4:2:0 (Type 2)
Bit depth : 10 bits
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : PQ
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : Display P3
Mastering display luminance : min: 0.0001 cd/m2, max: 1000 cd/m2
Maximum Content Light Level : 1000 cd/m2
Maximum Frame-Average Light Level : 18 cd/m2
format_identifier : HDMV
Source : 00674.m2ts

Video #7
ID : 4113 (0x1011)
Menu ID : 1 (0x1)
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@High
HDR format : SMPTE ST 2094 App 4, Version 1, HDR10+ Profile A compatible
Codec ID : 36
Duration : 3 min 28 s
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate : 23.976 (24000/1001) FPS
Color space : YUV
Chroma subsampling : 4:2:0 (Type 2)
Bit depth : 10 bits
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : PQ
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : Display P3
Mastering display luminance : min: 0.0001 cd/m2, max: 1000 cd/m2
Maximum Content Light Level : 505 cd/m2
Maximum Frame-Average Light Level : 13 cd/m2
format_identifier : HDMV
Source : 00689.m2ts

I am primarily interested in the section below, as it is what's most often missing in all the data I have access to.

Color primaries : BT.2020
Transfer characteristics : PQ
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : Display P3
Mastering display luminance : min: 0.0001 cd/m2, max: 1000 cd/m2
Maximum Content Light Level : 233 cd/m2
Maximum Frame-Average Light Level : 63 cd/m2
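If no ready-made dataset turns up, these fields can be scripted out of the discs directly with MediaInfo. A minimal sketch using the pymediainfo wrapper (it assumes MediaInfo is installed and the .m2ts/.mpls files are readable locally; the filename is taken from the listing above):

    # A minimal sketch, assuming MediaInfo plus the pymediainfo wrapper are installed
    # and the playlist/stream files are readable locally.
    from pymediainfo import MediaInfo

    def hdr_color_info(path):
        """Collect the colour/HDR-related fields MediaInfo reports for each video track."""
        media_info = MediaInfo.parse(path)
        wanted = ("color", "colour", "transfer", "matrix", "mastering", "light")
        for track in media_info.tracks:
            if track.track_type != "Video":
                continue
            data = track.to_data()  # dict of every field MediaInfo knows for this track
            picked = {k: v for k, v in data.items() if any(w in k.lower() for w in wanted)}
            print(path, picked)

    hdr_color_info("00687.m2ts")  # path taken from the post; adjust to your own rip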

submitted by /u/Weak_Ad9730

[self-promo] All US Healthcare Providers On Snowflake

Processed NPPES data on all US healthcare providers, along with mapped taxonomies.

How many cardiologists in Philadelphia? Who are this year’s batch of medical students? How has the number of nurse practitioners changed over time?

We have a 30-day free trial. If you want to use this for academic reasons, just email/DM me and we can make it available for free. The cost is to cover our compute/effort in cleaning this up.

https://app.snowflake.com/marketplace/listing/GZTSZAS2KFN/cybersyn-inc-us-healthcare-providers
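For a sense of the kind of query the listing enables, here is a hedged sketch using snowflake-connector-python; the connection parameters and the database/table/column names are placeholders, not the listing's actual schema:

    # A hedged sketch of a "how many cardiologists in Philadelphia" style query.
    # The database, table, and column names below are assumptions for illustration.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="your_account", user="your_user", password="your_password",
        warehouse="your_warehouse",
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT COUNT(*)
        FROM us_healthcare_providers.public.providers   -- hypothetical table name
        WHERE taxonomy ILIKE '%cardiology%'             -- hypothetical column names
          AND city = 'PHILADELPHIA';
    """)
    print(cur.fetchone()[0])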

submitted by /u/aiatco2

New Dataset On Holes Drilled Into The Earth For Fun, Science, And Profit — But Mostly Profit.

I just put up another dataset and accompanying notebook on Kaggle. It’s the USGS Core Sample Catalog.

I'd love feedback on either, but if you'd rather answer the burning questions that keep you up at night, such as "What is the easternmost sample well drilled in the US?", "Why are there 64 wells drilled in the Pacific Ocean?", or "Why does the US Geological Survey have nine well samples that aren't in the US? It's not like we're going to invade Canada and take their oil — or are we?", well, I'd understand.
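If anyone does want to poke at those questions, a rough pandas sketch is about all it takes (the CSV filename and the longitude/latitude column names below are guesses, not the actual Kaggle schema):

    # A rough sketch, assuming the catalog ships as a CSV with longitude/latitude
    # columns; the file and column names below are guesses, not the actual schema.
    import pandas as pd

    df = pd.read_csv("usgs_core_sample_catalog.csv")  # hypothetical filename

    # Easternmost sample: in the western hemisphere, "easternmost" means the
    # largest (least negative) longitude.
    easternmost = df.loc[df["longitude"].idxmax()]
    print(easternmost)

    # Wells that fall outside a very rough continental-US bounding box.
    outside = df[(df["latitude"] < 24) | (df["latitude"] > 50) |
                 (df["longitude"] < -125) | (df["longitude"] > -66)]
    print(len(outside))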

submitted by /u/hrokrin

Dataset To Measure How Frequently Vehicle Parts Are Subjected To Wear And Tear For A Specific Brand / Specific Model (ANY WILL DO)

So a few months back our professor asked for thesis topics. I was absent for a few days before, so I didn't know about it. He started asking for everyone's topic, which could still be changed at that point. Everyone before me was proposing complex ML projects or data analysis topics, so I just panicked and chose this topic. Fast-forward a few months: I procrastinated on all my projects, so when the time came I just gave a rough proposal, and it turns out you cannot change your topic anymore. I searched on Kaggle but just can't seem to find the dataset. I literally have no clue where and how to search for it, so even if I can't find it here, where should I begin to search?

Sorry for the poor English.

submitted by /u/a_non_weeb

Dataset For Malicious Posts On Reddit

I submitted a project proposal for detecting and analyzing posts with malicious intent (scams, phishing, etc.) on Reddit. But later I realized that Reddit is a very well-moderated platform (at least the most popular subreddits are) where there usually are no such posts. So is there any dataset, or any subreddit, where I can find such posts? I don't want to change the proposal topic now.

submitted by /u/psbankar

Need Help Making My UCR Data Readable. Got It From ICPSR But 5 Years Are In A Strange Format I Can't Do Anything With

I need urgent help converting this data.

I’m doing a project using UCR crime data from this source https://www.icpsr.umich.edu/web/ICPSR/series/57?start=0&SERIESQ=57&ARCHIVE=ICPSR&PUBLISH_STATUS=PUBLISHED&sort=score%20desc&rows=50&q=County%20level%20arrest

The data from 2003-2008 is only available in a strange format, while 1994-2002 and 2009-2016 are available as complete datasets in either R or Stata format. Can someone please help with that?
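Depending on what ICPSR actually ships for those years (SPSS, fixed-width ASCII with a setup file, or something else), pandas can often read it directly. A hedged sketch with placeholder filenames:

    # A hedged sketch for getting the odd-format years into the same shape as the
    # R/Stata years. Filenames are placeholders; which branch applies depends on
    # what ICPSR actually ships for 2003-2008.
    import pandas as pd

    # Case 1: the download includes an SPSS file (.sav or .por)
    df = pd.read_spss("ucr_county_2003.sav")          # needs the pyreadstat package

    # Case 2: the download turns out to be a Stata file after all
    # df = pd.read_stata("ucr_county_2003.dta")

    # Case 3: fixed-width ASCII; take the column positions from the codebook/setup file
    # df = pd.read_fwf("ucr_county_2003.txt", colspecs=[(0, 2), (2, 5)], names=["state", "county"])

    df.to_csv("ucr_county_2003.csv", index=False)     # save in a friendlier format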

submitted by /u/ItsRickDalton

What Is The Best Way To Build A Model From A Dataset That Has Many Dummy Variables?

Hello everyone,

I have a linear regression model with a single dependent variable and several independent variables. Among the independent variables, I have 4 categorical variables that have been turned into dummies. However, some of the categorical variables have many levels and consequently many dummies were created…

I need to fit the model at a 95% confidence level, so I'm running a stepwise algorithm on it. The stepwise algorithm "deleted" many of the dummies that had been created, so that, for example, a categorical variable that previously had 10 dummies referring to it now has only 2. That happened because some of the dummies were not significant at the 95% confidence level…

My question is: should I discard the categorical variables that had some of their dummies excluded during the stepwise run and keep only the categorical variables whose dummies were all preserved? Or should I keep the categorical variables whose dummies were partially excluded? Which of these two options is better for a predictive model?
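For reference, the group-level test that evaluates all of a categorical variable's dummies together (rather than dummy by dummy, as stepwise does) looks roughly like this in statsmodels; a sketch with made-up variable and column names:

    # A sketch of testing a categorical predictor as one block instead of dummy by
    # dummy. Dataset and variable names are made up for illustration.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("mydata.csv")                     # hypothetical dataset

    # C(...) expands each categorical into dummies behind the scenes
    model = smf.ols("y ~ x1 + x2 + C(cat1) + C(cat2)", data=df).fit()

    # Type-II ANOVA reports one F-test (and p-value) per categorical variable,
    # so cat1 is kept or dropped as a whole rather than dummy by dummy.
    print(sm.stats.anova_lm(model, typ=2))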

Grateful for anyone who can help.

submitted by /u/7inchesdream

Taylor Swift (42 Albums) Lyrical Data In Textual Format [self-promotion]

I started with the idea that there's a Taylor Swift song for almost every generic scenario one could think of, and thought maybe I could run sentiment analysis on it. I quickly found out that I'd have to collect the data on my own, since the other sources I found (mainly on Kaggle) were not in the desired format (I wanted completely textual data).

So sharing it here in case it helps anyone else.

Dataset link

This data was collected using the lyricsgenius python library and the Genius API.
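For anyone wanting to reproduce or extend the collection, here is a minimal sketch of that lyricsgenius flow (you need your own Genius API access token; the song count is just an example):

    # A minimal sketch of the collection flow described above; requires your own
    # Genius API client access token.
    import lyricsgenius

    genius = lyricsgenius.Genius("YOUR_GENIUS_ACCESS_TOKEN")
    genius.remove_section_headers = True               # drop [Chorus]/[Verse] markers

    artist = genius.search_artist("Taylor Swift", max_songs=5, sort="title")
    for song in artist.songs:
        with open(f"{song.title}.txt", "w", encoding="utf-8") as f:
            f.write(song.lyrics)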

Also sharing the other datasets I found, in case they help someone:

https://www.kaggle.com/datasets/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums
https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums

submitted by /u/ishika_jo

Open Public Domain Exercise Dataset In JSON Format, 800+ Exercises & Images, With A Browsable Public Searchable Frontend [self-promotion]

I started building another fitness-related app and was looking for free/open-source exercise datasets and imagery. I stumbled upon exercises.json, though it needed a bit of cleaning up & restructuring, so I:

- Renamed/restructured the JSON to be more usable
- Added a JSON Schema for validation
- Added some useful Makefile build tasks to concatenate the JSON into one single file or to import it into PostgreSQL if needed
- Added a browsable/searchable frontend available at https://yuhonas.github.io/free-exercise-db/

The repo is available at

https://github.com/yuhonas/free-exercise-db
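A quick sketch of pulling the concatenated JSON into Python; the exact URL and field names are assumptions based on the repo layout, so check the README if they differ:

    # A quick sketch of loading the concatenated JSON. The URL and field names
    # are assumptions based on the repo layout; consult the README if they differ.
    import json
    import urllib.request

    URL = "https://yuhonas.github.io/free-exercise-db/dist/exercises.json"  # assumed path
    with urllib.request.urlopen(URL) as resp:
        exercises = json.load(resp)

    print(len(exercises), "exercises")
    # e.g. list everything that primarily works the quadriceps (field name assumed)
    quads = [e["name"] for e in exercises if "quadriceps" in e.get("primaryMuscles", [])]
    print(quads[:10])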

All feedback welcome but above all else enjoy!

submitted by /u/yuhonas

How To Get Data Containing The Country, The Phone Country Code And The Mobile Phone Number Length?

I'm looking for data containing the name of the country, the phone country code, and the MSISDN length (without the phone country code), like this:

[{"name": "United States", "code": "+1", "length": 10}, {"name": "China", "code": "+86", "length": 11}]

I don't need the landline number length, just the mobile number length. If a country can have multiple number lengths, like 7 and 8, then the dataset should have the max length, in this case 8.

It would be great if I could get this data in JSON, but any format would do. I'm only finding this data on Wikipedia and similar pages, with no reliable sources, and it's hard to work with since I need to copy and paste from webpages.

I need this data to validate phone numbers in input fields. I already have 80 countries; I'd like to have the complete list, though.
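If a ready-made file doesn't turn up, one fairly reliable route is to derive the list from Google's libphonenumber metadata via the phonenumbers package. A sketch follows; it uses each region's example mobile number, so it yields a typical national-number length rather than the guaranteed maximum, and country names would still need a lookup such as pycountry:

    # A sketch that derives the list from libphonenumber metadata rather than
    # scraping Wikipedia. It uses each region's *example* mobile number, so it gives
    # a typical national-number length, not necessarily the maximum the post asks for.
    import json
    import phonenumbers

    rows = []
    for region in sorted(phonenumbers.SUPPORTED_REGIONS):
        example = phonenumbers.example_number_for_type(region, phonenumbers.PhoneNumberType.MOBILE)
        if example is None:
            continue
        rows.append({
            "name": region,  # two-letter region code; map to full names with e.g. pycountry
            "code": f"+{phonenumbers.country_code_for_region(region)}",
            "length": len(str(example.national_number)),
        })

    print(json.dumps(rows[:5], indent=2))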

submitted by /u/lynob

Historical Financial Data Available On DoltHub

You can find the following repositories on DoltHub:

Earnings
- Financial statements (balance sheets, income statements, cash flow statements). Annual figures back to 2012; quarterly figures back to 2016. Covers stocks listed in the US.
- Analyst estimates (sales and earnings per share). Recorded weekly. Data goes back to 2018. Covers stocks listed in the US.
Options
- Option prices, vols, greeks for SPDR ETFs and ETF components. Recorded Monday, Wednesday, and Friday. Saves 2, 4, 8 week expirations and does not save all strikes. Data goes back to 2019. Records 30 ATM volatility history for easy computation of implied volatility rank.
Rates
- US Treasury interpolated yield curve as published by the Fed. Recorded daily. Data goes back to 1990.
Stocks
- Daily prices, splits, dividends, and symbol info for US listed stocks. Data goes back to 2018. Symbols that have been delisted are still present in the data set.

DoltHub is an interface to dolt where you can query for data using the same SQL as you would in MySQL. This allows for much more flexible and powerful querying across datasets as opposed to extracting data from multiple CSVs.
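Since Dolt speaks the MySQL wire protocol, one way to script against a cloned repository is to run dolt sql-server locally and connect with any MySQL client. A sketch follows; the database, table, and column names are assumptions, not the repositories' actual schema:

    # A sketch of querying a locally cloned Dolt database over the MySQL protocol
    # (run `dolt sql-server` inside the cloned repo first). Database/table/column
    # names here are assumptions, not the repositories' actual schema.
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="127.0.0.1", port=3306, user="root", database="rates")
    cur = conn.cursor()
    cur.execute("""
        SELECT date, maturity, yield_rate
        FROM us_treasury_yield_curve        -- hypothetical table and column names
        WHERE date >= '2020-01-01'
        ORDER BY date, maturity;
    """)
    for row in cur.fetchall():
        print(row)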

Note: This is not self promotion as I am not affiliated with DoltHub

submitted by /u/funkinaround

Searching For A Dataset For NFL Salaries Going Back As Far As 1967

I know this may be a pipe dream, but I've been searching for salary information on 10-15 NFL players, most of whom played in the '90s, going as far back as 1967. Does anyone have any idea where I could find this? nflfastR only dates back to 1999, and sportsdata.io only goes back a few years.

For reference, I am doing a draft analysis and trying to compare previous draft pick trades. I have a "draft pick value" chart, which gives me a number value for each pick in the draft. I want to compare each side of the trade and say whether it was a "good deal" or a "bad deal" based on the total value from each side. This works fine when only picks are being traded, but when players are involved, I am having a hard time objectively comparing each side. My thought is that the first overall pick is paid a certain amount of money when they are drafted, so I can find the value of a player by giving them the value of the first overall pick multiplied by how many first-round salaries fit within that player's contract (e.g., pick 1.01 is worth 3000 and is paid $7M/year; a player paid $14M/year is worth 6000 because 3000*2=6000). Any ideas on a different way to try this would help too.
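That salary-scaling heuristic is easy to sanity-check in code once the salary data exists; a tiny sketch using the numbers from the example above:

    # A tiny sketch of the heuristic described above: value a player by scaling the
    # chart value of pick 1.01 by how many "first-overall salaries" fit in his contract.
    def player_trade_value(player_salary, top_pick_salary=7_000_000, top_pick_value=3000):
        return top_pick_value * (player_salary / top_pick_salary)

    print(player_trade_value(14_000_000))   # 6000.0, matching the example in the post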

submitted by /u/jordanar189

Inorganic Chlorides And Their Associated Boiling Points

Hey!

I'm not sure if this is the right place to ask, but I was curious whether anyone here knows of a list of inorganic chlorides and their associated boiling points.

I have been working on a project to “distill” specific chlorides from a mixed group of salts. I am looking for a list of inorganic chlorides and their boiling points so I can determine the temperature ranges and the elements that are in each temperature band.

Also, if folks know of a place to find phase diagrams for inorganic chlorides, that would be helpful too.

submitted by /u/minimalweirdness