Category: Other Nonsense & Spam

Looking For Active List Of Domain Names

There seem to be about 350 million registered domain names (if not more), but I can’t seem to find any source that offers this data for download.

I am only interested in root domains, e.g. dailynews.com. I came across this repo: https://github.com/tb0hdan/domains. But after filtering for root domains I end up with about 150 million. There are also paid services such as zonefiles.io, which offers about 260 million domains. Does anyone know of any other sources that provide the complete set?
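For anyone attempting the same root-domain filtering, here is a minimal sketch of how it might be done. It assumes a tiny hand-written suffix set (`SAMPLE_SUFFIXES`) purely for illustration; a real run would need the full Public Suffix List, and the function names are made up for this example.

```python
# Tiny illustrative suffix set (an assumption for this sketch).
# A real run would load the full Public Suffix List instead.
SAMPLE_SUFFIXES = {"com", "org", "net", "co.uk", "com.au"}


def root_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the registrable (root) domain of hostname, or None if unknown."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in suffixes:
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None


def unique_roots(hostnames):
    """Collapse a list of hostnames into a sorted list of distinct root domains."""
    roots = set()
    for hostname in hostnames:
        root = root_domain(hostname)
        if root is not None:
            roots.add(root)
    return sorted(roots)


print(unique_roots(["www.dailynews.com", "dailynews.com", "shop.example.co.uk"]))
# → ['dailynews.com', 'example.co.uk']
```

Hostnames whose suffix is not in the set are dropped rather than guessed at, which is one reason the count after filtering depends heavily on how complete the suffix list is.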

Thanks in advance.

P.S. Is it worth it to set up your own crawlers for this type of thing?

submitted by /u/activelearning23

Washington, D.C. 2010-2020 Felony Offense And Sentence Overview

Hello again! I came across this dataset and found it interesting. It covers major felony crimes, mostly in the D.C. area, between 2010 and 2020. The information also includes gender, race, year, felony charge, offense, time served, and a lot more!

Click here to view the dataset: https://app.gigasheet.com/spreadsheet/Felony-Sentence-2010-2020-csv/71dbef04_e629_43ca_b8c4_007de9244fd6

It looks like “drug” charges are usually the most common over the course of the 10 years, and 2012 was the worst year for crime between 2010 and 2020.

Dataset Source: https://opendata.dc.gov/datasets/DCGIS::felony-sentences/explore

submitted by /u/sheetheadd

Any Sleep Quality Datasets Based On Lifestyle Factors?

For a data analysis project, I’m looking for a reliable dataset about how sleep quality is affected by different genetic and lifestyle qualities.

Things like: gender, age, caffeine/alcohol consumption, exercise frequency, etc.

Something with labels like this one would be optimal: https://www.kaggle.com/datasets/equilibriumm/sleep-efficiency – however I can’t confirm the authenticity of this data.

Any resources would be greatly appreciated!

submitted by /u/Ok_Afternoon_1720

Dataset Of Medical Case Scenarios And Appropriate Diagnosis

I’m looking for a dataset that contains medical case examples and the diagnosis presented in each case.

Example of what I’m talking about:

[“Bob has been having issues with excessive thirst and blurry vision. He has elevated levels of glucose in the urine and blood.”, “Diabetes”]

I’m not too picky about the format as long as the diagnosis is separate from the scenario and the formatting is consistent.
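To make that requirement concrete, here is a rough sketch of one possible consistent format: one JSON object per line, with the scenario and the diagnosis in separate fields. The `CaseRecord` class and the field names are my own assumptions for illustration, not an existing dataset schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CaseRecord:
    scenario: str   # free-text case description
    diagnosis: str  # label, kept in its own field


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(record)) for record in records)


def from_jsonl(text):
    """Parse JSON Lines back into records, rejecting inconsistent rows."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        # Every row must have exactly the two expected fields.
        if set(obj) != {"scenario", "diagnosis"}:
            raise ValueError(f"unexpected fields: {sorted(obj)}")
        records.append(CaseRecord(**obj))
    return records


example = CaseRecord(
    scenario="Bob has been having issues with excessive thirst and blurry vision. "
             "He has elevated levels of glucose in the urine and blood.",
    diagnosis="Diabetes",
)
assert from_jsonl(to_jsonl([example])) == [example]
```

Any other two-field layout (CSV, a list of pairs) would work just as well; the point is only that every row has the same shape.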

Artificial datasets are okay, maybe even preferred, as long as they’re accurate.

submitted by /u/flavorfulcherry

[Synthetic] DatasetGPT – A Command-line Tool To Generate Datasets By Inferencing LLMs At Scale. It Can Even Make Two ChatGPT Agents Talk With One Another.

GitHub: https://github.com/radi-cho/datasetGPT

It can generate texts by varying input parameters and using multiple backends. But, personally, the conversations dataset generation is my favorite: It can produce dialogues between two ChatGPT agents.

Possible use cases may include:

- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over large amounts of input text. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging APIs of especially big LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model with them.

What would you use it for?

submitted by /u/radi-cho

Suggestions For Ecology Dataset For Classification

I’m looking for a dataset similar to the Amphibians dataset from UCI for an undergraduate data science project. It should be a classification problem, i.e. presence/absence of a species dependent on habitat characteristics such as temperature, type of vegetation, size of water reservoir, amount of rainfall, distance to roads/civilisation, etc.

It should include:

- >15 numerical and categorical features
- >300 observations
- temporal and/or spatial data if possible, so I can play around with some heat maps or time series analysis.

Any hints are highly appreciated as I’m a beginner and I’ve been scrolling my eyes out on Kaggle etc. all weekend.

submitted by /u/apex—-predator

Finding Datasets For Computer Vision

Hello! I’m a senior electronics engineering student. My friend is trying to build a blind-assistance device that helps blind people tell apart objects with the same shape, such as Coca-Cola vs. Sprite. He designed the hardware around an ESP8266 and uses a cloud service for storing datasets. We created a dataset by taking photos of Cokes, but it’s hard to do that for every product. Is there any solution or resource for finding datasets of everyday objects? We have dug through a lot of open datasets (CIFAR, Berkeley, Kaggle, COCO, MNIST), but our ML model requires 224×224-pixel images.
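One common workaround for a fixed 224×224 input is to center-crop each image to a square and then resize it, so existing datasets of other sizes can still be used. The sketch below covers only the geometry (crop box and scale factor); the function names are hypothetical, and the actual pixel resampling would be left to whatever image library is in use.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def scale_to_target(width, height, target=224):
    """Scale factor that maps the cropped square onto target x target pixels."""
    return target / min(width, height)


# A 640x480 photo is cropped to its central 480x480 square, then shrunk.
print(center_crop_box(640, 480))   # → (80, 0, 560, 480)

# A 32x32 image (the CIFAR size) would need 7x upscaling.
print(scale_to_target(32, 32))     # → 7.0
```

Upscaling by a large factor adds no detail, so very small source images will look blurry at 224×224 and may not help much with telling similar products apart.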

submitted by /u/yagmurxyildiz

Where Can We Actually Buy Big Data For A Company?

Hi

I’m wondering where I can buy machine learning data directly for my project/product. Let’s say it’s a music or allergy app. I would like to connect a chat/predictor which, based on a few data points, is able to give a percentage likelihood of something. However, large amounts of data are needed to train such algorithms. Where can you actually buy them?

submitted by /u/jackoborm

The Largest Dataset Of Graded Diamonds On Kaggle

Hi there!

I just put up a new dataset on Kaggle. It’s cryptically titled “The largest diamond dataset currently on Kaggle”.

It has just under 220,000 diamonds and 25 columns of data, making it about 3x larger than the next largest. I think it’s perfect for regression models and there is an attached notebook.

This is my first submission to Kaggle so I’d be very much interested in any feedback you might have.

Thanks!

submitted by /u/hrokrin

Crimes In Boston During Covid-19 (2020-2021)

Interesting dataset pulled from Boston’s official government site. I definitely heard about the spike in crime that occurred during the height of Covid, so I decided to merge the two CSVs from 2020 and 2021. It also helps depict/infer the safest streets in Boston.
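For anyone wanting to do a similar merge, here is a minimal sketch of stacking two yearly CSV exports into one, assuming both share the same header. The function name is hypothetical, and the column names and rows in the usage lines are placeholders, not the real Boston schema or data.

```python
import csv
import io


def merge_csv_texts(*texts):
    """Concatenate CSV texts that share one header; return the merged CSV text."""
    header = None
    rows = []
    for text in texts:
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            # Refuse to stack files whose columns do not line up.
            raise ValueError("CSV headers differ; align columns before merging")
        rows.extend(row for row in reader if row)
    if header is None:
        raise ValueError("no input given")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)   # header is written once
    writer.writerows(rows)    # then all data rows, in input order
    return out.getvalue()


# Placeholder content standing in for the 2020 and 2021 files.
csv_2020 = "year,street\n2020,A St\n"
csv_2021 = "year,street\n2021,B St\n"
print(merge_csv_texts(csv_2020, csv_2021))
```

If the two years use different column names or orders, the header check fails on purpose, since silently stacking misaligned columns would corrupt the merged data.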

Curious, is anyone else interested in a specific location/city and its crime data? I see tons of datasets like this online. Would love to share and see some interesting ones!

Click here to view the dataset: https://app.gigasheet.com/spreadsheet/2020-2021-Covid-Crime-in-Boston/94982770_3c8c_48fb_9176_efeb72becdd8

submitted by /u/sheetheadd

Does Anyone Know Where I Can Find A Reliable Dataset That Lists All Airports With Geolocation?

Hey everyone,

I’m working on a map project that needs a list of all airports worldwide along with their geolocation coordinates. I’ve searched online, but I’m having trouble finding a reliable, up-to-date source.

I was wondering if anyone here knows of a dataset that has it? It would be great if the data included the airport IATA code, and latitude/longitude coordinates.

If anyone has any suggestions or recommendations, I’d greatly appreciate it.
Thank you in advance!

submitted by /u/px07x

Poker Hands (with Labels For Raise, Check And Fold)

I was wondering if anybody knows where I could get a dataset with the structure described in the title. I’m looking to create a supervised learning classification model that takes a set of poker hands (hold ’em style, I think) and predicts raise, check or fold based on the cards presented. If it were trained on a dataset from professional poker players, I’d imagine it would make plays very similar to theirs, so it could be rather successful.
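As a rough illustration of what one training row for such a classifier could look like, here is a sketch that one-hot encodes the cards into a 52-slot vector and maps the action to an integer label. The card notation ("As", "Td"), the slot ordering, and the label order are all assumptions made for this example.

```python
RANKS = "23456789TJQKA"
SUITS = "cdhs"  # clubs, diamonds, hearts, spades
ACTIONS = ("fold", "check", "raise")


def card_index(card):
    """Map a card like 'As' or 'Td' to an index in 0..51."""
    rank, suit = card[0].upper(), card[1].lower()
    return SUITS.index(suit) * len(RANKS) + RANKS.index(rank)


def encode_hand(cards):
    """Return a 52-element 0/1 vector with a 1 for each card present."""
    vector = [0] * 52
    for card in cards:
        vector[card_index(card)] = 1
    return vector


def encode_example(cards, action):
    """Pair the feature vector with an integer class label."""
    return encode_hand(cards), ACTIONS.index(action)


features, label = encode_example(["As", "Kd", "7h", "7c", "2s"], "raise")
print(sum(features), label)  # → 5 2
```

This encoding ignores everything except the cards (position, bet sizes, stack depth), so a model trained on it could only learn what the post describes: an action based on the cards presented.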

My only other option for gathering this data, I thought, would be to host a simple web app that shows the user 5 cards and asks them whether they want to raise, check or fold, post it on forums (here?), and gather the data from the responses into a large database. This, however, may result in bad plays from users who don’t know how to play poker, as well as bogus answers, so I’d rather stay away from that.

submitted by /u/ryanward02

Looking For Galaxy Dataset Containing Celestial Object Location For A Snapshot In Time

Hi, I’m looking for a space dataset about a specific galaxy. Any galaxy will do. It needs to have spatial information for each celestial body (planet, star, black hole) for a snapshot in time, so I’m thinking an x, y, z value. I want to know each object’s location in the galaxy. It would also be nice if the dataset contained what each object is (star, planet, black hole). It could also go into more specifics about the class of object, like dwarf star, gas planet, etc., and the size of the object or its radius. I’m planning on using this dataset for an art project for one of my classes. Thank you.
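One thing that may help: catalogs often list positions as right ascension, declination and distance rather than x, y, z, but the conversion is a standard spherical-to-Cartesian transform. A minimal sketch, assuming angles in degrees and an observer-centered equatorial frame (so the origin is the observer, not the center of the galaxy); the function name is made up for this example.

```python
import math


def to_cartesian(ra_deg, dec_deg, distance):
    """Convert right ascension, declination and distance to (x, y, z).

    The result is in the same unit as `distance`, with the observer at the origin.
    """
    ra = math.radians(ra_deg)
    dec = math.radians(dec_deg)
    x = distance * math.cos(dec) * math.cos(ra)
    y = distance * math.cos(dec) * math.sin(ra)
    z = distance * math.sin(dec)
    return (x, y, z)


# An object at RA 0, Dec 0 lies along the x axis.
print(to_cartesian(0.0, 0.0, 10.0))  # → (10.0, 0.0, 0.0)
```

For an art project this is probably enough to place points in 3D; shifting the origin to the galactic center would be a further step not shown here.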

submitted by /u/michaelbschulte21

[Self-promo] Carbon Removal & Intensity Data From CDR.fyi And Our World In Data On Snowflake

Cybersyn data available on Snowflake Marketplace: https://app.snowflake.com/marketplace/listing/GZTSZAS2KEU/cybersyn-inc-environmental-tracking

Data sourced from CDR.fyi and Our World in Data.

Our World in Data publishes the carbon intensity of electricity in grams CO2e per kWh by country by year from 2000. This data measures how much CO2 it takes to produce a given amount of electricity. Determine which countries have improved their carbon footprint over time and compare which countries are the most efficient as it relates to carbon emissions from electric use.
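As a quick sketch of how the carbon intensity metric can be used: multiplying intensity (grams CO2e per kWh) by electricity consumed gives emissions, and comparing two years gives the relative improvement. The function names and the numbers in the usage lines are made up for illustration and are not taken from the dataset.

```python
def emissions_kg(intensity_g_per_kwh, energy_kwh):
    """CO2e emitted, in kilograms, for a given amount of electricity."""
    return intensity_g_per_kwh * energy_kwh / 1000.0


def percent_change(old_intensity, new_intensity):
    """Relative change in carbon intensity; negative means improvement."""
    return (new_intensity - old_intensity) / old_intensity * 100.0


# Illustrative values only: 1,000 kWh at 400 g CO2e/kWh.
print(emissions_kg(400, 1000))  # → 400.0

# Illustrative values only: a grid that went from 500 to 400 g CO2e/kWh.
change = percent_change(500, 400)
print(f"{change:.1f}%")
```

Ranking countries by `percent_change` between their first and latest available year is one simple way to see which have improved the most.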

CDR.fyi consolidates purchases, deliveries, and verification of carbon removed and stored for 100+ years. Carbon dioxide removal (CDR) is the process of removing CO2 from the atmosphere and durably storing it to create negative emissions. This data set shows activity in the marketplace for carbon credits including CDR sales, deliveries, and price. The data shows which buyers and suppliers are most active in the CDR market as well as which types of CDRs are gaining and losing share. Note that all deals have CO2 tonnage associated with them, but only a subset of deals have dollar sales and price.

About Us: Cybersyn is a DaaS (data-as-a-service) company, whose mission is to make the world’s economic data transparent to governments, businesses, and entrepreneurs and enable a new generation of decision makers.

submitted by /u/aiatco2