Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Seeking Assistance: Categorizing 500K Food Products Into Specific Categories

I’m currently faced with the task of categorizing a massive inventory of 500,000 food products into specific categories such as meat, dairy, pastry, and more. Despite extensive searches, I haven’t been able to locate a dataset that provides products with their corresponding categories.
I’ve scoured various sources, including old posts on this Subreddit, but unfortunately, I found nothing. If anyone could point me in the right direction or share a relevant dataset, I would greatly appreciate the help. Thank you in advance!
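Absent a pre-labeled dataset, a keyword-based first pass can bootstrap category labels for later review. The category names come from the post; the keyword lists below are illustrative assumptions, and a real run over 500K products would need a much larger map plus manual spot-checking.

```python
# Keyword-based first-pass categorizer for product names.
# Keyword lists are illustrative assumptions, not an exhaustive taxonomy.
CATEGORY_KEYWORDS = {
    "meat": ["beef", "chicken", "pork", "ham", "sausage", "turkey"],
    "dairy": ["milk", "cheese", "yogurt", "butter", "cream"],
    "pastry": ["croissant", "donut", "muffin", "cake", "pie"],
}

def categorize(product_name: str) -> str:
    """Return the first category whose keyword appears in the name."""
    name = product_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in name for kw in keywords):
            return category
    return "uncategorized"  # flag for manual review or a second pass

products = ["Smoked Turkey Breast", "Greek Yogurt 500g", "Blueberry Muffin", "Olive Oil"]
labels = {p: categorize(p) for p in products}
```

Anything left "uncategorized" becomes the (hopefully small) pile that needs a human or a trained classifier.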

submitted by /u/omar_zr

Looking For A Set With OTC/RX Medications, Recommended Dosages, And Safe Dosing Intervals

I’m brainstorming a project and while I’m sure a set like this exists in the world, I imagine the risk of misuse and liability makes it difficult for someone without a doctorate to get their hands on. Looking to you folks for even a mock dataset/csv that would have something like

Ibuprofen, 200-400mg, 4-6 hours

Eventually, I would like to work towards a more complete dataset that factors in body mass where applicable, but something to this effect with even just OTC recommendations would be a huge boon.
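A mock file matching the schema sketched above is easy to stand up. Only the ibuprofen row is taken from the post; the other rows use placeholder names ("DrugA"/"DrugB") purely to illustrate the shape, and none of the numbers should be read as real dosing guidance.

```python
import csv
import io

# Mock CSV matching the post's sketch: drug, recommended dose, dosing interval.
# Only the Ibuprofen row comes from the post; DrugA/DrugB are placeholders.
MOCK_CSV = """\
drug,dose_mg,interval_hours
Ibuprofen,200-400,4-6
DrugA,10-20,8-12
DrugB,500,24
"""

def load_dosing_table(text: str) -> list[dict]:
    """Parse the CSV into a list of row dicts keyed by the header."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_dosing_table(MOCK_CSV)
```

A body-mass column could later be added as an optional field without breaking existing readers, since `DictReader` keys rows by header name.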

TIA!

submitted by /u/Life-Particular-9708

Princeton University ML Great Datasets

Princeton University ML Datasets

contents [8puzzle.zip – aol.zip – assign.zip – autocomplete-tst.zip – autocomplete.zip – backtrack.zip – bacon.zip – baseball.zip – batcher.zip – bins.zip – bottle.zip – burrows.zip – circle.zip – collinear.zip – factor.zip – goldberg.zip – kdtree.zip – linksort.zip – location.zip – map.zip – markov.zip – model.zip – moviedb-3.24.zip – netflix.zip – paths.zip – percolation.zip – puzzle.zip – queues.zip – redundant.zip – rogue.zip – seamCarving.zip]

Link 1

https://www.up-4ever.net/pskmv8n6p3p4

Link 2

https://www.file-upload.org/wa9xtfas8fd1

submitted by /u/DataExpx

How To Sift Through Papers More Accurately Using These Search Terms?

Hi everyone

I’m trying to create a search for an analysis I’m doing on rural health in Australia, but I’m unable to sift through any more of the papers; my current search is yielding 10,938 results.

How can I improve my MeSH search term?

(((((((australia*[Title/Abstract] OR victoria*[Title/Abstract] OR tasmania*[Title/Abstract] OR western australia*[Title/Abstract] OR south australia*[Title/Abstract] OR northern territor*[Title/Abstract] OR queensland*[Title/Abstract] OR new south wales[Title/Abstract] OR australian capital territory[Title/Abstract]) AND (2013:2024[pdat])) OR (((australia or victoria or tasmania or western australia or south australia or northern territory or queensland or new south wales or australian capital territory[MeSH Terms]) AND (2013:2024[pdat])) OR (((australia[Affiliation] OR wa[Affiliation] OR sa[Affiliation] OR nsw[Affiliation] OR vic[Affiliation] OR nt[Affiliation] OR act[Affiliation] OR qld[Affiliation] OR tas[Affiliation])) OR (western australia[Affiliation] OR south australia[Affiliation] OR new south wales[Affiliation] OR victoria[Affiliation] OR northern territory[Affiliation] OR australian capital territory[Affiliation] OR queensland[Affiliation] OR tasmania[Affiliation]) AND (2013:2024[pdat])))) AND (rural health OR rural health services OR rural population OR rural nursing OR hospitals, rural[MeSH Terms] AND (2013:2024[pdat]))) AND (rural*[Title/Abstract] OR regional[Title/Abstract] OR remote*[Title/Abstract] AND (2013:2024[pdat]))
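A query this long is hard to keep balanced by hand (note the lowercase `or`s and the `[MeSH Terms]` tag that only attaches to the last state name in the MeSH clause). One way to avoid those slips is to assemble the query from term lists so every term gets its field tag and the parentheses always balance. The term lists below are taken from the query; the overall shape (geography AND topic AND setting, with one shared date filter) is an assumed simplification of its intent.

```python
# Build the PubMed query from term lists so field tags apply to every
# term and parentheses stay balanced. The three-group structure is an
# assumed simplification of the query above.
STATES = ["australia*", "victoria*", "tasmania*", "western australia*",
          "south australia*", "northern territor*", "queensland*",
          "new south wales", "australian capital territory"]
TOPICS = ["rural health", "rural health services", "rural population",
          "rural nursing", "hospitals, rural"]
SETTING = ["rural*", "regional", "remote*"]

def tag_group(terms, field):
    """OR together terms, tagging each one with its own [field]."""
    return "(" + " OR ".join(f"{t}[{field}]" for t in terms) + ")"

query = " AND ".join([
    tag_group(STATES, "Title/Abstract"),
    tag_group(TOPICS, "MeSH Terms"),
    tag_group(SETTING, "Title/Abstract"),
    "(2013:2024[pdat])",
])
```

Applying the date filter once at the top level, instead of repeating it inside every clause, also makes the query much easier to audit.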

submitted by /u/Efficient_Mud_5072

Where Can I Get A Zillow Rentals Dataset?

I need a Zillow dataset of rentals, along with all their details, for a research project. I know Zillow is very possessive of its data, but it need not be current. Is there a way to get a dataset of old rental listings from somewhere?
Alternatively, is there a different dataset that I could use that would provide a similar level of detail on rentals? I know there are probably a lot of sources where I could get square footage, bedrooms/bathrooms, and a price, but Zillow provides data such as washer/dryer unit availability, pet policy and pet rent, etc. Are there any datasets like that available?

Thank you in advance

submitted by /u/SofisticatiousRattus

NIST Ballistics Toolmark Research Database

The NIST Ballistics Toolmark Research Database (NBTRD) is an open-access research database of bullet and cartridge case toolmark data. The development of the database is sponsored by the U.S. Department of Justice’s National Institute of Justice. The database is being developed to:

- foster the development and validation of measurement methods, algorithms, metrics, and quantitative confidence limits for objective firearm identification, and

- improve the scientific knowledge base on the similarity of marks from different firearms and the variability of marks from the same firearm, and ease the transition to the application of three-dimensional surface topography data in firearms identification.

The database contains traditional reflectance microscopy images and three-dimensional surface topography data acquired by NIST or submitted by database users. The goal is a collection of data sets that:

- represents the large variety of ballistic toolmarks encountered by forensic examiners, and

- represents challenging identification scenarios, such as those posed by consecutively manufactured firearm components.

submitted by /u/lurklord_

Question: How To Find Individualized Datasets

Hey! Sorry if this is the wrong sub!
I’m doing a project for school and I just need a dataset that has individualized demographic data (as in, each row refers to a different person and describes as many demographic traits as possible, such as race, income, education, etc.). I don’t know why, but it’s been impossible to find individualized data rather than aggregate data at the census tract level or something like that.
Does anyone have any recommendations on where to look or how to search for this? I don’t really care about the specifics of the data like what region it’s in or anything

submitted by /u/moose_on_a_hus

I Created Baseball.computer – An Open, Comprehensive Play-by-play Database You Can Query From Anywhere [self-promotion]

I’ve been working on this database for about a year during my sabbatical and released a preview version of it this week: https://baseball.computer/
I have two goals for the project – to facilitate reproducible baseball research and to create the most fun and interesting “toy dataset” possible for educational settings.

From a technical standpoint, the database runs entirely inside of your browser, which means that you can write SQL against event-level data and visualize the results directly on the website. The tables are all available to download as flat files, and there are instructions for connecting to the data in Python and R.
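The same SQL-against-event-data workflow works offline with a downloaded flat file. The sketch below uses Python's stdlib `sqlite3` on an inline sample; the table name and columns here are hypothetical stand-ins, not baseball.computer's actual schema.

```python
import csv
import io
import sqlite3

# Load one (hypothetical) downloaded flat file into SQLite and run
# event-level SQL locally, mirroring the in-browser workflow.
SAMPLE = """\
game_id,batter_id,event_type,runs_scored
X1,b001,home_run,1
X1,b002,strikeout,0
X2,b001,single,0
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (game_id TEXT, batter_id TEXT, event_type TEXT, runs_scored INT)"
)
rows = [tuple(r.values()) for r in csv.DictReader(io.StringIO(SAMPLE))]
conn.executemany("INSERT INTO event VALUES (?, ?, ?, ?)", rows)

# Aggregate runs per batter across all events.
runs_by_batter = dict(conn.execute(
    "SELECT batter_id, SUM(runs_scored) FROM event GROUP BY batter_id"
))
```

For real use, swapping the inline sample for a downloaded table (or pointing DuckDB/pandas at the flat file directly) follows the same pattern.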
From a baseball standpoint, it contains thousands of individual columns that pre-calculate as many building blocks as possible for statistical analysis. These include:

- Repeatable construction of WAR components like linear weights, win/run expectancy, and park factors
- An example of a Keras deep-and-cross deep learning model that can train using the entire dataset on a laptop
- Tables that correctly merge event-level, box-level, game-level, and season-level raw data
- Taxonomies and additional metadata for outcome types, batted balls, and pitches
- 100+ event-level atomic “counting stats” including granular information on traditional stats, baserunning advances, pitches, and batted-ball location/trajectory
- Detailed event state tables that can be combined with the counting stats for calculating splits
- Inference/deduction for handling missing batted ball data, unknown fielders, and unusual scorekeeper tendencies

Extensive-but-spotty documentation is available for all tables on the site. This includes all of the source (SQL) code, the upstream and downstream dependencies of each table, and a link to directly download the table as a flat file (here is an example). There are also several hundred tests and data constraints. This is nowhere near enough coverage to guarantee ease of use or data integrity, but it will hopefully serve as a foundation for both as the project evolves.

A couple of requests for anyone interested in playing around with it – please send me any feedback (bugs, feature requests, use cases, etc.) and, if you find it interesting, please share with your other data communities!

submitted by /u/PaginatedSalmon

Screen Content Video Dataset With Descriptions/Captions

I’m looking for a dataset that has screen recording videos (either videos or video compressions) and (ideally) accompanying descriptions of the actions completed in the video (e.g. user adds a table to a Word document). The descriptions are optional, but the dataset must contain videos. This will be used to train a video-captioning model.

Does anyone know where I can download this kind of dataset?

submitted by /u/danh3

Spreadsheet Of US Solar Farms By State

https://app.gigasheet.com/spreadsheet/US-Large-Scale-Solar-Farms-By-State/4d4b9325_fa5c_475a_84e1_b31ea4f9348e

Source: https://eerscmap.usgs.gov/uspvdb/data/

The United States Large-Scale Solar Photovoltaic Database (USPVDB) provides the locations and array boundaries of U.S. ground-mounted photovoltaic (PV) facilities with capacity of 1 megawatt or more. Large-scale facility data are collected and compiled from various public and private sources, digitized and position-verified from aerial imagery, and quality checked. The USPVDB is available for download in a variety of tabular and geospatial file formats to meet a range of user/software needs. Cached and dynamic web services are available for users that wish to access the USPVDB as a Representational State Transfer (RESTful) web service.
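Once the tabular download is in hand, per-state summaries are a short script. The column names below (`p_state`, `p_cap_ac`) are assumptions about the CSV layout; check the USPVDB codebook for the actual field names.

```python
import csv
import io
from collections import defaultdict

# Aggregate large-scale PV capacity (MW AC) by state from the tabular
# download. Column names are assumed, not verified against the codebook.
SAMPLE = """\
p_state,p_cap_ac
CA,120.5
CA,35.0
TX,200.0
"""

def capacity_by_state(text: str) -> dict[str, float]:
    """Sum AC capacity per state across all facility rows."""
    totals: defaultdict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["p_state"]] += float(row["p_cap_ac"])
    return dict(totals)

totals = capacity_by_state(SAMPLE)
```

The same loop applies unchanged to the full download, since `DictReader` streams rows rather than loading the file at once.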

submitted by /u/n1nja5h03s

Presenting Open Source Tool That Collects Reddit Data In A Snap! (for Academic Researchers)

Hi all!

For the past few months, after uploading this post in r/PushShift, I have had quite a lot of discussions with academic researchers about this. I soon noticed that sharing a historical database often goes against universities’ IRB policies (and definitely Reddit’s new t&c), so that project had to be shut down. But based on those discussions, I worked on a new tool that adheres strictly to Reddit’s terms and conditions while also maintaining alignment with the majority of Institutional Review Board (IRB) standards.

The tool is called RedditHarbor and it is designed specifically for researchers with limited coding backgrounds. While PRAW offers flexibility for advanced users, most researchers simply want to gather Reddit data without headaches. RedditHarbor handles all the underlying work needed to streamline this process. After the initial setup, RedditHarbor collects data through intuitive commands rather than dealing with complex clients.

Here’s what RedditHarbor does:

- Connects directly to the Reddit API and downloads submissions, comments, user profiles, etc.
- Stores everything in a Supabase database that you control
- Handles pagination for large datasets with millions of rows
- Customizable and configurable collection from subreddits
- Exports the database to CSV/JSON formats for analysis

Why I think it could be helpful to other researchers:

- No coding needed for the data collection after initial setup. (I tried maximizing simplicity for researchers without coding expertise.)
- While it does not give you access to the entire historical data (like PushShift or Academic Torrents), it complies with most IRBs. By using approved Reddit API credentials tied to a user account, the data collection meets guidelines for most institutional review boards. This ensures legitimacy and transparency.
- Fully open source Python library built using best practices
- Deduplication checks before saving data
- Custom database tables adjusted for Reddit metadata
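The deduplication-before-save idea can be sketched in a few lines: keep a set of IDs already in the database and skip incoming rows whose ID has been seen. This is an illustration of the concept only, not RedditHarbor's actual code.

```python
# Concept sketch of dedup-before-save: skip rows whose ID is already
# stored. Not RedditHarbor's actual implementation.
def dedupe_rows(rows: list[dict], seen_ids: set[str]) -> list[dict]:
    """Return only rows with IDs not yet in seen_ids, updating the set."""
    fresh = []
    for row in rows:
        if row["id"] not in seen_ids:
            seen_ids.add(row["id"])
            fresh.append(row)
    return fresh

batch = [{"id": "t3_a", "title": "first"},
         {"id": "t3_b", "title": "second"},
         {"id": "t3_a", "title": "duplicate"}]
unique = dedupe_rows(batch, seen_ids=set())
```

In a real pipeline the seen-ID set would be backed by the database (e.g. a primary-key constraint) rather than held in memory.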

Please check it out and let me know your thoughts! I would love to hear any feedback and feature requests 🙂

Actively maintained and adding new features (e.g., collecting submissions by keyword)

submitted by /u/nickshoh