
[Synthetic] DatasetGPT – A Command-line Tool To Generate Datasets By Inferencing LLMs At Scale. It Can Even Make Two ChatGPT Agents Talk With One Another.

GitHub: https://github.com/radi-cho/datasetGPT

It can generate texts by varying input parameters and using multiple backends. Personally, my favorite feature is conversation dataset generation: it can produce dialogues between two ChatGPT agents.

Possible use cases may include:

- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over large volumes of input texts. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging APIs of especially big LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model with them.
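To illustrate the two-agent idea, here is a minimal sketch of a turn-taking loop. It is not DatasetGPT's actual implementation or API: `run_conversation`, `echo_agent`, and `counting_agent` are hypothetical names, and the stub agents stand in for real LLM calls so the example runs without any API access.

```python
from typing import Callable, List, Tuple

# An "agent" is any function that maps the conversation so far to a reply.
# In practice it would wrap a call to an LLM backend (not shown here).
Turn = Tuple[str, str]
Agent = Callable[[List[Turn]], str]


def run_conversation(agent_a: Agent, agent_b: Agent, opener: str, turns: int) -> List[Turn]:
    """Alternate between two agents, starting from an opening line by agent A."""
    history: List[Turn] = [("A", opener)]
    speakers = [("B", agent_b), ("A", agent_a)]
    for i in range(turns):
        name, agent = speakers[i % 2]
        history.append((name, agent(history)))
    return history


# Hypothetical stub agents; replace them with real model calls.
def echo_agent(history: List[Turn]) -> str:
    _, last_text = history[-1]
    return f"You said: {last_text}"


def counting_agent(history: List[Turn]) -> str:
    return f"This is message number {len(history) + 1}."


if __name__ == "__main__":
    for speaker, text in run_conversation(counting_agent, echo_agent, "Hello!", turns=3):
        print(f"{speaker}: {text}")
```

Each returned history is one dialogue record; repeating the loop with different openers would give a conversations dataset.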

What would you use it for?

submitted by /u/radi-cho

Suggestions For Ecology Dataset For Classification

I’m looking for a dataset similar to the Amphibians dataset from UCI for an undergraduate data science project. It should be a classification problem, e.g. presence/absence of a species depending on habitat characteristics such as temperature, type of vegetation, size of water reservoir, amount of rainfall, distance to roads/civilisation, etc.

It should include:

- >15 numerical and categorical features
- >300 observations
- temporal and/or spatial data if possible, so I can play around with some heat maps or time series analysis.
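The size criteria above can be checked automatically when screening candidate files. Below is a minimal sketch, assuming the dataset is a CSV with a header row and one label column; `screen_dataset` and the column name `present` are hypothetical, and the demo table is generated placeholder data, not a real dataset.

```python
import csv
import io


def screen_dataset(csv_file, target_column, min_features=15, min_rows=300):
    """Report whether a CSV has more than min_features feature columns
    (every column except the target) and more than min_rows data rows."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    has_target = target_column in columns
    n_features = len([c for c in columns if c != target_column])
    n_rows = sum(1 for _ in reader)
    return {
        "n_features": n_features,
        "n_rows": n_rows,
        "has_target": has_target,
        "passes": has_target and n_features > min_features and n_rows > min_rows,
    }


if __name__ == "__main__":
    # Placeholder table: 16 feature columns plus a label column, 301 rows.
    header = [f"f{i}" for i in range(16)] + ["present"]
    rows = [",".join(["0"] * 16 + ["1"]) for _ in range(301)]
    demo = io.StringIO("\n".join([",".join(header)] + rows))
    print(screen_dataset(demo, target_column="present"))
```

For a real file, pass an open file handle, e.g. `open(path, newline="")`, instead of the in-memory demo.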

Any hints are highly appreciated as I’m a beginner and I’ve been scrolling my eyes out on Kaggle etc. all weekend.

submitted by /u/apex—-predator

Finding Datasets For Computer Vision

Hello! I’m a senior electronics engineering student. My friend is trying to build a blind-assistance device that helps blind people tell apart objects with the same shape, such as Coca-Cola vs. Sprite. He designed the hardware around an ESP8266 and uses a cloud service to store the datasets. We created a dataset by taking photos of Cokes, but it is hard to do this for every product. Is there any solution or resource for finding datasets of everyday objects? We have dug through a lot of open datasets (CIFAR, Berkeley, Kaggle, COCO, MNIST), but we need 224×224-pixel images for our ML model.
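One note on the 224×224 requirement: images from existing datasets can be resized to the model's input size instead of being ruled out. Here is a minimal nearest-neighbor sketch in pure Python, assuming an image is a list of pixel rows; `resize_nearest` is a hypothetical helper, and in practice an image library would do this with better interpolation.

```python
def resize_nearest(pixels, out_w=224, out_h=224):
    """Resize a 2D grid of pixels to out_w x out_h by nearest-neighbor sampling.

    `pixels` is a list of rows; each entry can be any pixel value
    (an int for grayscale, a tuple for RGB). Non-square inputs are
    stretched, so crop to a square first to keep the aspect ratio.
    """
    in_h = len(pixels)
    in_w = len(pixels[0])
    return [
        [pixels[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


if __name__ == "__main__":
    # A 32x32 dummy image (the size of a CIFAR image), each pixel storing its own coordinates.
    small = [[(row, col) for col in range(32)] for row in range(32)]
    big = resize_nearest(small)
    print(len(big), len(big[0]))
```

Upscaling tiny images this way adds no detail, so larger source images such as those in COCO are likely a better starting point than CIFAR or MNIST.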

submitted by /u/yagmurxyildiz

Where Do We Actually Buy Big Data For A Company?

Hi

I’m wondering where I can buy machine learning data directly for my project/product. Let’s say it’s a music or allergy app. I would like to add a chat/predictor which, based on a few data points, can estimate the probability of something. However, large amounts of data are needed to train such algorithms. Where can you actually buy this data?

submitted by /u/jackoborm

The Largest Dataset Of Graded Diamonds On Kaggle

Hi there!

I just put up a new dataset on Kaggle. It’s cryptically titled “The largest diamond dataset currently on Kaggle”.

It has just under 220,000 diamonds and 25 columns of data, making it about 3x larger than the next largest. I think it’s perfect for regression models, and there is an attached notebook.

This is my first submission to Kaggle so I’d be very much interested in any feedback you might have.

Thanks!

submitted by /u/hrokrin

[Self-promo] Carbon Removal & Intensity Data From CDR.fyi And Our World In Data On Snowflake

Cybersyn data available on Snowflake Marketplace: https://app.snowflake.com/marketplace/listing/GZTSZAS2KEU/cybersyn-inc-environmental-tracking

Data sourced from CDR.fyi and Our World in Data.

Our World in Data publishes the carbon intensity of electricity in grams CO2e per kWh by country and year, starting from 2000. This data measures how much CO2 is emitted to produce a given amount of electricity. Determine which countries have improved their carbon footprint over time and compare which countries are the most efficient in terms of carbon emissions from electricity use.

CDR.fyi consolidates purchases, deliveries, and verification of carbon removed and stored for 100+ years. Carbon dioxide removal (CDR) is the process of removing CO2 from the atmosphere and durably storing it to create negative emissions. This data set shows activity in the marketplace for carbon credits including CDR sales, deliveries, and price. The data shows which buyers and suppliers are most active in the CDR market as well as which types of CDRs are gaining and losing share. Note that all deals have CO2 tonnage associated with them, but only a subset of deals have dollar sales and price.

About Us: Cybersyn is a DaaS (data-as-a-service) company, whose mission is to make the world’s economic data transparent to governments, businesses, and entrepreneurs and enable a new generation of decision makers.

submitted by /u/aiatco2

A Dataset Containing Baby Images, Preferably Annotated, And Containing Babies Who Are Both Awake And Asleep

I’m currently working on a project involving a baby care AI system. As part of my research, I’m looking for a dataset of annotated baby images that include both awake and asleep babies.

Ideally, I’m hoping to find a dataset labeled with whether they are awake or asleep in the images. It would also be great if the dataset included multiple images of each baby to account for variations in lighting, angles, and facial expressions.

If anyone knows of a dataset that fits this description or has access to a collection of baby images that they would be willing to share, I would be extremely grateful. This project is important to me, and having a high-quality dataset would be incredibly beneficial.

Even if the images are unlabeled, or only contain sleeping/awake babies, that would be great.

Thank you in advance for any leads or suggestions you can offer!

submitted by /u/sapomh

Supply Chain Location Factors- Free Datasets

I’m writing my master’s thesis and I’m struggling to find decent data for my analysis. As I need my variables by country and year, I have found it hard to get free data that covers enough years and countries for a robust analysis.

My topic is supply chain location factors. Both cost-based and security-based/geopolitical factors are relevant to my research question.

These are some of the variables for which I only have very poor datasets with lots of missing data, and for which I would like some suggestions:

- Logistics performance or infrastructure performance
- Energy cost or price of gasoline
- Economic policy uncertainty

Any other relevant variables that are accessible for free would also be great!

If you know any free online source for this data (other than World Bank data), please let me know :))

Thanks in advance!!

submitted by /u/Pleasant_Savings_256

Looking For Data On Chinese Solar PV Subsidies

Hi all,

I’m a college student working on an econometric research project trying to determine the effect of Chinese government subsidies on solar PV manufacturing share. I’m having trouble finding data on:

- the $ or yuan amount of subsidy available for Chinese solar PV manufacturing each year
- Chinese solar PV manufacturing revenue each year

If anyone can recommend how I can go about finding this data, I would really appreciate the help. I do have access to several paid/subscription data sources through my university. Thank you!

submitted by /u/evacuatethepremises

How To Find A Great Data Set? How To Nail A Data Project?

So my Stats class requires a data project as a final project (it’s worth about 40% of the grade, so I’ll have to nail it to get an A in the class). I’ve been looking for data sets, but I can’t find much, and nothing that sparks my interest. I’m wondering if anyone has suggestions on where I could find data sets and what type of data would be cool to analyze. Also, I’d highly appreciate any advice on how to do an exceptional data project :)

submitted by /u/Ancient_Ad_5430