Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

I Built A Free Tool That Auto-generates Scrapers For Any Website With AI

I got frustrated with the time and effort required to code and maintain custom web scrapers for collecting data, so my friends and I built an LLM-based solution for data extraction from websites. AI should automate tedious and uncreative work, and web scraping definitely fits this description.

Try it out for free on our playground https://kadoa.com/playground and let me know what you think!

We’re leveraging LLMs to understand the website structure and generate the DOM selectors for it. Using LLMs for every data extraction, as most comparable tools do, would be way too expensive and very slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient and maintenance-free.

How it works (the playground uses a simplified version of this):

1. Loading the website: automatically decide what kind of proxy and browser we need
2. Analyzing network calls: try to find the desired data in the network calls
3. Preprocessing the DOM: remove all unnecessary elements and compress it into a structure that GPT can understand
4. Selector generation: use an LLM to find the desired information with the corresponding selectors
5. Data extraction in the desired format
6. Validation: hallucination checks and verification that the data is actually on the website and in the right format
7. Data transformation: clean and map the data (e.g. if we need to aggregate data from multiple sources into the same format). LLMs are great at this task too
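
To make the preprocessing, extraction, and validation steps a bit more concrete, here is a minimal sketch in Python using only the standard library. It is not Kadoa's code. It assumes the model returns, for each field, a class name plus the value it claims to have found; the names `ClassTextParser`, `extract`, `failed_claims`, and the sample page are all hypothetical. Real selectors would be full CSS or XPath expressions rather than bare class names.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}               # dropped during preprocessing
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements without a closing tag


class ClassTextParser(HTMLParser):
    """Collects visible text, grouped by the class names of enclosing elements."""

    def __init__(self):
        super().__init__()
        self.open_elements = []   # (tag, classes) for every currently open element
        self.skip_depth = 0       # > 0 while inside a skipped tag
        self.text_by_class = {}   # class name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.open_elements.append((tag, classes))
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in [t for t, _ in self.open_elements]:
            return  # stray closing tag, ignore it
        while True:
            open_tag, _ = self.open_elements.pop()
            if open_tag in SKIP_TAGS:
                self.skip_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        for _, classes in self.open_elements:
            for name in classes:
                self.text_by_class.setdefault(name, []).append(text)


def extract(html, selectors):
    """selectors: field name -> class name (stand-in for a generated selector)."""
    parser = ClassTextParser()
    parser.feed(html)
    parser.close()
    return {field: " ".join(parser.text_by_class.get(cls, []))
            for field, cls in selectors.items()}


def failed_claims(html, claims):
    """claims: field -> (class name, value the model says it found).

    Returns the fields where the selector does not reproduce the claimed value,
    i.e. candidates for a hallucinated value or a wrong selector.
    """
    extracted = extract(html, {f: cls for f, (cls, _) in claims.items()})
    return [f for f, (_, value) in claims.items() if extracted[f] != value]


# Hypothetical sample page; the script content must not leak into the results.
SAMPLE_HTML = """
<html><head><script>var price = "$0.00";</script></head>
<body><div class="product">
  <h1 class="product-title">Blue Widget</h1>
  <span class="price">$19.99</span>
</div></body></html>
"""

if __name__ == "__main__":
    claims = {"title": ("product-title", "Blue Widget"), "price": ("price", "$18.99")}
    print(extract(SAMPLE_HTML, {"title": "product-title", "price": "price"}))
    print(failed_claims(SAMPLE_HTML, claims))
```

The point of the split is that the model is only consulted once to propose selectors; after that, `extract` runs without any LLM call, and `failed_claims` flags fields where the page no longer matches what the model claimed.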

The vision is fully autonomous and maintenance-free data processing from sources like websites or PDFs, basically “prompt-to-data” 🙂 It’s far from perfect yet, but we’ll get there.

submitted by /u/madredditscientist

Spanish LaLiga And Premier League Historical Dataset

Is anyone aware of places that have a complete dataset of matches, players, and their actions in those matches, such as goal kicks, shots that went in, yellow cards, red cards, etc.?

It can be websites where the data is readily available, APIs, or blogs. I would prefer La Liga over the Premier League.

I’ve been searching around but could only reliably find SofaScore and Marca as sources of information.

Thanks!

submitted by /u/Technopulse

How To Pull Fixed And Floating Coupon Details On Eikon

Hi all,

Looking to find the Data Item Codes on Refinitiv Eikon for the fixed and floating segments of fixed income coupons. Pulling the data on plain vanilla fixed coupons is quite easy and straightforward, but there appear to be no Data Item Codes for the 2nd leg (usually floating) for fields like frequency, accrual basis, or even the rate. I was thinking of using the cash flow schedule but got stuck with tr.fifirstcoupondate and tr.filastcoupondate. At best it’s giving me the dates of the first and last coupons, when what I’m trying to get is the first and last rates, to capture data for both payment legs.

submitted by /u/mossackfonseca1656

What Dataset And How To Get That To Link My Analysis To EEOC Dataset?

Hi,

I have a current dataset, and it looks like this:

Year, Nation, Region, Division, State, County, Sector, Race, Sex, Job, Number of Employee

https://www.eeoc.gov/data/job-patterns-minorities-and-women-private-industry-eeo-1-0

What additional dataset can I add to support my analysis?

I’m trying to find salaries by state, gender, occupation, and level, but it seems too hard to find one in CSV format.
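
For anyone answering: once a salary table exists, linking it comes down to a join on the shared columns. Below is a rough sketch in plain Python. The salary rows, the `median_salary` column, and the helper name `join_on` are placeholders invented for illustration; they are not real EEO-1 or salary figures, and the key columns a real salary source offers may differ.

```python
def normalize(value):
    """Make join keys comparable: strings are trimmed and lower-cased."""
    return value.strip().lower() if isinstance(value, str) else value


def join_on(left_rows, right_rows, keys):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for row in right_rows:
        index[tuple(normalize(row[k]) for k in keys)] = row

    joined = []
    for row in left_rows:
        match = index.get(tuple(normalize(row[k]) for k in keys))
        if match is not None:
            # Values from the left table win if a column exists in both.
            joined.append({**match, **row})
    return joined


# Placeholder rows, NOT real data.
eeo_rows = [
    {"Year": 2021, "State": "Texas", "Sex": "Female", "Job": "Professionals",
     "Number of Employee": 100},
    {"Year": 2021, "State": "Ohio", "Sex": "Male", "Job": "Technicians",
     "Number of Employee": 40},
]
salary_rows = [
    {"State": "texas ", "Sex": "female", "Job": "professionals", "median_salary": 50000},
]

if __name__ == "__main__":
    for row in join_on(eeo_rows, salary_rows, ["State", "Sex", "Job"]):
        print(row)
```

Normalizing the keys first matters in practice, because state and occupation labels rarely match character for character between two sources.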

submitted by /u/GliGli991

Real Estate Scraping Library For Zillow, Realtor.com & Redfin

Hey everyone,

My friend and I put together a Python real estate scraper that aggregates listings from Zillow, Realtor.com & Redfin. It’s requests-based and quite fast (relative to the search size). You can search for rentals, properties for sale, or those recently sold. And it’s super easy to output to CSV/Excel with to_csv() or to_excel().

Feel free to give feedback in the comments, we would love to hear your suggestions.

https://github.com/ZacharyHampton/HomeHarvest

submitted by /u/kevinc9

[PAID] [SELF-PROMOTION] 84K TikTok Influencer/creator Profiles And 1.9M Of Their Videos

Self-promoting this dataset as well as the tools I developed to generate the dataset.

This dataset contains 84,000 TikTok UGC influencer/creator profiles and 1,900,000 of their videos. The data was collected from networks of UGC creators using my TikTok Following Export Tool. It is intended for digital marketing and creator discovery. It can also be used for ML purposes, for example to determine which videos perform well or go viral.

Link to dataset: https://sellagen.com/item/6509bc2f10cf50605711c5e0

Each profile has the following info:

- User ID
- Sec UID
- Nickname
- Number of followers
- Number of following
- Verified
- Video count
- Private account
- Seller account
- Region

Each video comes with the following data:

- Video ID
- Caption
- Diggs
- Shares
- Comments
- Plays
- Duration
- Creation date and time
- Hashtag list
- Mentions list
- Music details

submitted by /u/jankybiz

Database Of 10,000+ Keyword Ideas For Programmatic SEO From 1,000+ Different Niches

The pSEO keywords database has the following 16 data points (columns):

- Topics
- Main industries
- Niches
- Examples
- Use cases
- Searcher’s persona
- Datasets ideas
- Content lifecycle stage
- Search intent
- Suggested image types
- Interactive elements
- Update frequency
- Related queries
- Alternate page titles
- Rough outline
- FAQs

KW ideas are from 1,000+ different niches.

You can get it from the link https://untalkedseo.com/store/pseo-keyword-ideas-database/

The database is available as a Google Sheets file and also as a Microsoft Excel file.

submitted by /u/bikashkampo

[self-promotion] Global Weather From 100K Stations Direct To Your Snowflake Instance

Cybersyn Weather & Environmental Essentials now includes weather events from over 100K stations across 180 countries. Data is sourced from NOAA’s National Centers for Environmental Information (NCEI).

Access on Snowflake Marketplace

Example use cases:

- Track prevalence of severe weather in a given region
- Assess climate-related risks in an area or validate insurance claims related to weather events
- Inform real estate investment decisions and retail location planning by analyzing weather trends within specific zips
- Enrich location data with historical weather events and trends

submitted by /u/aiatco2

DoltHub Data Bounties Are No More. Thanks To R/datasets For All The Support Over The Years.

Hi r/datasets,

Over the years, this subreddit has been a great supporter of Data Bounties both for bounty hunters and usage of the datasets created. We are ending the data bounty program. Thanks for all the support.

https://www.dolthub.com/blog/2023-09-18-bye-bye-bounties/

That blog explains our rationale and what we learned from the experiment. We may bring bounties back eventually.

submitted by /u/timsehn

Remote Sensing: High Resolution/UHR Dataset Of Sub-Saharan African Cities W. Ground Truth Labels For Semantic Segmentation

Hi. I’m working on a supervised learning computer vision project to segment green spaces in sub-Saharan African cities. I saw the OpenCities challenge dataset, but it lacks labels for anything but building footprints.

I can’t seem to find any datasets that meet my needs. Ideally this would have around 5 labels (e.g., road, building, vegetation, water, background), but anything you may know of helps. I know there are various ones for cities around the world, but these aren’t useful for my project, unfortunately.

Would really appreciate any help! Can’t find anything on huggingface, google, or kaggle.

submitted by /u/Atticus_ass

Looking For A Dataset Regarding Australian Businesses Affected By Rainfall

I was looking in the energy, solar, and public transport sectors for publicly available datasets and trying to correlate them with the weather and rainfall datasets from the Bureau Of Meteorology (http://www.bom.gov.au/climate/data/index.shtml?bookmark=136&zoom=3&lat=-32.5355&lon=147.74&layers=B00000TFFFFFFFTFFFFFFFFFFFFFFFFFFFFTTT&dp=IDC10002-d).

But the correlation between public transport passenger numbers and rainfall seems weak across the few states I looked at (NSW, ACT, QLD). The same goes for the correlations between electricity demand and rainfall, and between electricity demand (peak and off-peak) and weather. I’m guessing that’s because people simply turn on their heaters in cold weather and their ACs in hot weather. Similarly, with population increases, it would make sense that rainfall doesn’t affect public transport much.
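
One thing worth checking with the heater/AC guess: a plain Pearson correlation only measures linear association. If demand goes up in both cold and hot weather, the relationship is U-shaped and the coefficient can sit near zero even though demand clearly depends on temperature. A minimal sketch with made-up toy numbers (not BOM or electricity market data; the variable names are illustrative):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy data: temperature as deviation from a comfortable baseline,
# demand rising symmetrically on both the cold and the hot side.
temp_deviation = [-2, -1, 0, 1, 2]
demand = [4, 1, 0, 1, 4]

if __name__ == "__main__":
    # Linear correlation misses the U-shape entirely.
    print(pearson(temp_deviation, demand))
    # Using the distance from the baseline instead recovers the dependence.
    print(pearson([abs(t) for t in temp_deviation], demand))
```

In this toy case the first coefficient comes out at zero while the second is strongly positive, so a weak correlation on the raw variables does not by itself rule out a relationship. Transforming the weather variable (distance from a comfortable temperature, or a rain/no-rain flag) before correlating may be worth a try.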

So, I was wondering if anyone knows of any public datasets where I could find a reasonably strong relationship between rainfall and revenue or something similar. Maybe retail and restaurants, but they don’t really have their datasets out on display.

submitted by /u/D3V1RG1NATOR

Check Out The New Global Crypto Currency Price Database!

Dataset Link: https://www.kaggle.com/datasets/lasaljaywardena/global-cryptocurrency-price-database. This dataset has prices for 7,500+ cryptocurrencies against USD, and it gets updated daily. It is a useful resource for anyone interested in exploring digital currencies and analyzing their market behavior. It includes not only popular coins such as BTC, ETH, and SOL but also newly released coins.

submitted by /u/Common_Protection667

I Am Trying To Get ANY Open-source Datasets Created By You Guys!

I’ve just launched a website repository where people can share and access free datasets, with the goal of making datasets more accessible. I’m also planning to integrate a donation feature to encourage people to support contributors if they wish. If you have a dataset you’d like to share, please don’t hesitate to reach out—I’m interested, it’s super easy to post/list!

submitted by /u/nobilis_rex_