Category: Other Nonsense & Spam

[Synthetic] DatasetGPT – A Command-line Tool To Generate Datasets By Inferencing LLMs At Scale. It Can Even Make Two ChatGPT Agents Talk With One Another.

GitHub: https://github.com/radi-cho/datasetGPT

It can generate texts by varying input parameters and using multiple backends, but personally, conversation dataset generation is my favorite: it can produce dialogues between two ChatGPT agents.
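To make the two-agent idea concrete, here is a rough, illustrative sketch of the underlying loop written directly against the OpenAI Python client; this is not datasetGPT's actual interface, and the prompts are made up for the example:

# Illustrative sketch (not datasetGPT's API): two chat agents alternate turns,
# each replying to the other's latest message. Prompts are invented examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reply(system_prompt, history):
    """Ask the model for the next utterance given a system prompt and history."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return response.choices[0].message.content

agent_a = "You are a customer asking about a delayed order."
agent_b = "You are a polite support agent for an online store."

dialogue = ["Hi, my order still hasn't arrived."]
for _ in range(3):  # three back-and-forth exchanges
    # For brevity, each agent only sees the other's most recent message.
    dialogue.append(reply(agent_b, [{"role": "user", "content": dialogue[-1]}]))
    dialogue.append(reply(agent_a, [{"role": "user", "content": dialogue[-1]}]))

print("\n".join(dialogue))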

Possible use cases may include:

- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over large amounts of input text, for example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging the APIs of especially large LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model with them.

What would you use it for?

submitted by /u/radi-cho

Suggestions For Ecology Dataset For Classification

I’m looking for a dataset similar to the Amphibians dataset from UCI for an undergraduate data science project. It should be a classification problem, i.e. presence/absence of a species dependent on habitat characteristics such as temperature, type of vegetation, size of water reservoir, amount of rainfall, distance to roads/civilisation, etc.

It should include:

- >15 numerical and categorical features
- >300 observations
- temporal and/or spatial data if possible, so I can play around with some heat maps or time series analysis.

Any hints are highly appreciated as I’m a beginner and I’ve been scrolling my eyes out on Kaggle etc. all weekend.

submitted by /u/apex—-predator

Finding Datasets For Computer Vision

Hello! I’m a senior electronics engineering student. My friend is trying to build a blind-assistant device that helps blind people differentiate objects with the same form, such as Coca-Cola vs. Sprite. He designed the hardware around an ESP8266 and uses a cloud service to store the datasets. We created a dataset by taking photos of Coke cans, but it’s hard to do that for every object. Is there any solution or resource for finding datasets of everyday objects? We have dug through a lot of open datasets (CIFAR, Berkeley, Kaggle, COCO, MNIST), but we need 224×224 pixel images for our ML model.
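For reference, resizing any of those datasets (or our own photos) to the 224×224 input the model needs is straightforward; here is a minimal sketch with Pillow, where the folder names are just placeholders:

# Resize an arbitrary image folder to the 224x224 input a typical CNN expects.
# Folder names are placeholders; adjust them for your own data layout.
from pathlib import Path
from PIL import Image

SRC = Path("raw_images")    # original photos, any resolution
DST = Path("resized_224")   # output folder for 224x224 images
DST.mkdir(exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    with Image.open(img_path) as img:
        img = img.convert("RGB")          # drop alpha channel / palette modes
        img = img.resize((224, 224))      # default resampling filter
        img.save(DST / img_path.name)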

submitted by /u/yagmurxyildiz

Where Do We Actually Buy Big Data For A Company?

Hi

I’m wondering where I can buy machine learning data directly for my project/product. Let’s say it’s a music or allergy app. I would like to hook up a chat/predictor which, based on a few data points, can indicate the likelihood of something as a percentage. However, large amounts of data are needed to train such algorithms. Where can you actually buy them?

submitted by /u/jackoborm

The Largest Dataset Of Graded Diamonds On Kaggle

Hi there!

I just put up a new dataset on Kaggle. It’s cryptically titled “The largest diamond dataset currently on Kaggle”.

It has just under 220,000 diamonds and 25 columns of data, making it about 3x larger than the next largest. I think it’s perfect for regression models, and there is an attached notebook.
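If you want a quick starting point for regression, a minimal sketch with pandas and scikit-learn could look like the following; the file name and column names ("carat", "price") are assumptions for illustration rather than a description of the actual schema:

# Minimal regression baseline; the CSV name and column names ("carat", "price")
# are placeholder assumptions, so map them to the dataset's real columns.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("diamonds.csv")
X = df[["carat"]]          # single numeric predictor for a first baseline
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out diamonds:", model.score(X_test, y_test))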

This is my first submission to Kaggle so I’d be very much interested in any feedback you might have.

Thanks!

submitted by /u/hrokrin

Looking For Data On Chinese Solar PV Subsidies

Hi all,

I’m a college student working on an econometric research project trying to determine the effect of Chinese government subsidies on solar PV manufacturing share. I’m having trouble finding data on:

- the $ or yuan amount of subsidy available for Chinese solar PV manufacturing each year
- Chinese solar PV manufacturing revenue each year

If anyone can recommend how I can go about finding this data, I would really appreciate the help. I do have access to several paid/subscription data sources through my university. Thank you!

submitted by /u/evacuatethepremises

How To Find A Great Data Set? How To Nail A Data Project?

So my stats class requires a data project as the final project (which is worth about 40% of the grade, so I’ll have to nail it to get an A in the class). I’ve been looking for datasets but can’t find much, and nothing that really sparks my interest. I’m wondering if anyone has suggestions for where I could find datasets and what type of data would be cool to analyze. Also, I’d highly appreciate any advice on how to do an exceptional data project :)

submitted by /u/Ancient_Ad_5430

Find All Utility And Public Works Buildings For Three States?

Finding all utility and public works addresses in three states?

How might I go about finding the locations above? Is there a big dataset out there? I attempted using OpenStreetMap with BigQuery, but I can’t say whether I wrote the query correctly. I also tried a place query with the ESRI geocoder city by city for each of the states, but that was a disaster. I have 6 years of GIS experience and am semi-proficient in Python and other coding languages.
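For what it’s worth, a rough sketch of one alternative route via the Overpass API in Python is shown below; the tags used (power=substation, man_made=water_works) are assumptions about how such facilities are mapped in OpenStreetMap, and the state code is just an example:

# Query OpenStreetMap's Overpass API for utility-type facilities in one state.
# The tags below (power=substation, man_made=water_works) are assumptions about
# how such facilities are mapped; extend the list and swap in your own states.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

query = """
[out:json][timeout:180];
area["ISO3166-2"="US-OH"]->.state;   // Ohio as an example state
(
  nwr["power"="substation"](area.state);
  nwr["man_made"="water_works"](area.state);
);
out center;
"""

response = requests.post(OVERPASS_URL, data={"data": query})
response.raise_for_status()
elements = response.json()["elements"]
print(f"Found {len(elements)} candidate facilities")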

submitted by /u/Different_Camp4002

WebScraping Specific Zip Code Data From Zillow

Hello, I have a data science project I’m interested in doing. I want to web-scrape housing data from the Zillow website within a 15-mile radius of a potential career location. I don’t have much experience in web scraping, but I know I need to use Selenium (an automated browser) and Python’s Beautiful Soup library to execute this part of my project. Does anyone have experience in web scraping Zillow’s website specifically? Any advice or YouTube videos to help me get started?
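The general Selenium + Beautiful Soup pattern looks roughly like the sketch below; the URL and CSS selector are placeholders rather than Zillow’s real markup, which would need to be inspected in the browser:

# General Selenium + BeautifulSoup pattern: let the browser render the page,
# then hand the HTML to BeautifulSoup. The URL and CSS selector below are
# placeholders, not Zillow's actual markup.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homes/EXAMPLE-QUERY/")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Hypothetical selector for listing cards; inspect the page to find real ones.
for card in soup.select("article.listing-card"):
    print(card.get_text(strip=True))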

P.S. I was informed to check whether Zillow has an API. I checked, and it looks like the best I’ll be able to get from an API is via RapidAPI: 40 records of data per GET request, with a monthly limit of 20 GET requests (800 records).

submitted by /u/juangui37

CleanVision: Audit Your Image Datasets For Better Computer Vision

To all my computer vision friends working on real-world applications with messy image data, I just open-sourced a Python library you may find useful!

CleanVision audits any image dataset to automatically detect common issues such as images that are blurry, under/over-exposed, oddly sized, or near duplicates of others. It’s just 3 lines of code to discover what issues lurk in your data before you dive into modeling, and CleanVision can be used for any image dataset — regardless of whether your task is image generation, classification, segmentation, object detection, etc.

from cleanvision.imagelab import Imagelab

imagelab = Imagelab(data_path="path_to_dataset")
imagelab.find_issues()
imagelab.report()
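If you prefer to work with the results programmatically rather than read the printed report, the Imagelab object also exposes them as DataFrames; the attribute names in this sketch are my reading of the project’s docs, so double-check them against the repo:

# After find_issues(), results can also be inspected as pandas DataFrames.
# Attribute names here should be verified against your installed version.
summary = imagelab.issue_summary   # per-issue-type counts
per_image = imagelab.issues        # per-image issue flags and scores
print(summary)
print(per_image.head())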

As leaders like Andrew Ng and OpenAI have lately repeated: models can only be as good as the data they are trained on. Before diving into modeling, quickly run your images through CleanVision to make sure they are ok — it’s super easy!

Github: https://github.com/cleanlab/cleanvision

Disclaimer: I am affiliated with Cleanlab.

submitted by /u/jonas__m

Scrape Thousands Of Records Of Housing Data Using Python [Self-Promotion]

Hey r/datasets,

I originally posted this library earlier this week, but it got downvoted once within 10 minutes and was never heard from again. And I get it, this is a place for posting/requesting datasets.

So, here’s an actual dataset of CA housing data I generated using the RedfinScraper library. Scraping these 47,000 records took just over 3 minutes.

While this data may be useful today, the fact is, it will only be useful for about a week longer. The high-velocity nature of housing data means that datasets need to be updated frequently.

This issue was the driving force for sharing this library publicly: to allow users to quickly scrape the latest housing data at their leisure.

I hope you find this library useful, and I am excited to see what you create with it.

submitted by /u/ryan_s007