Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Exploring Opportunities: How To Utilize A 25-Million-Product E-commerce Dataset For Tools And Dashboards?

As a back-end developer, I’ve scraped a dataset of 25 million products, with no duplicates, from the largest e-commerce websites in the Middle East. For each product, the dataset includes basic information, price history, descriptions, specifications, image links, category and breadcrumbs, recommended products, and more. How can I leverage this data, and what tools and dashboards can I develop and potentially offer to other e-commerce websites?
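Not the poster’s stack, but to make one dashboard idea concrete: a minimal pandas sketch of a “deal alert” feed built from the price-history slice, assuming a hypothetical (product_id, observed_at, price) layout.

```python
import pandas as pd

# Hypothetical schema: one row per (product_id, observed_at, price) observation.
history = pd.read_csv("price_history.csv", parse_dates=["observed_at"])
history = history.sort_values(["product_id", "observed_at"])

# Most recent price per product.
latest = history.groupby("product_id").tail(1).set_index("product_id")

# 90-day median as a reference price for each product.
cutoff = history["observed_at"].max() - pd.Timedelta(days=90)
reference = (
    history[history["observed_at"] >= cutoff]
    .groupby("product_id")["price"]
    .median()
    .rename("reference_price")
)

deals = latest.join(reference)
deals["discount_pct"] = 1 - deals["price"] / deals["reference_price"]

# Products currently at least 20% below their 90-day median: a "deal alert" feed.
print(deals[deals["discount_pct"] >= 0.20].sort_values("discount_pct", ascending=False))
```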

submitted by /u/HajiIman

[Request] Big Dataset Of Fiction With Titles?

I’m looking for a dataset of full texts of short stories or novellas with their titles (clearly delimited, and everything in English) to train a model for title generation by abstractive summarization. The bigger the better.

Preferably erotica, thriller, or drama, but anything that isn’t sci-fi would work. Any ideas on where I could find that?
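For what it’s worth, one common route is to treat text → title as abstractive summarization and fine-tune a seq2seq model. A minimal sketch with Hugging Face transformers and t5-small; the records and column names here are placeholders.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Placeholder records: each pairs a story's full text with its title.
data = Dataset.from_dict({
    "text": ["Once upon a time..."],
    "title": ["A Beginning"],
})

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    # T5 uses task prefixes; long stories get truncated to the model's window.
    enc = tok(["summarize: " + t for t in batch["text"]],
              max_length=512, truncation=True)
    enc["labels"] = tok(batch["title"], max_length=32, truncation=True)["input_ids"]
    return enc

train = data.map(preprocess, batched=True, remove_columns=["text", "title"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="title-gen", num_train_epochs=3),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```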

submitted by /u/SCP_radiantpoison

[self-promotion] Company Index Mapped To Public Identifiers (CIKs, LEIs, EINs) And Identifiers From Market Data Providers (PermID, OpenFIGI)

Cybersyn is building a Company Index (“security master” for finance nerds) to support joining companies, subsidiaries, and their brands together in a hierarchy. This is a persistent problem across companies and a major missing join key.

Our recent SEC Filings release on Snowflake Marketplace marks a first, small step towards building a reference spine, which we refer to as our Company Index. We map our Company Index to public identifiers (e.g. CIKs, LEIs, EINs) and identifiers from market data providers (PermID, OpenFIGI).

To start, we’re working with public companies, but coverage will soon extend beyond them.
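As an illustration of why a shared spine matters (not Cybersyn’s actual schema): once two tables carry the same public identifier, the join becomes trivial. A toy pandas sketch keyed on CIK.

```python
import pandas as pd

# Hypothetical extracts: a filings table keyed by CIK, and an index that maps
# CIK to other identifiers (LEI, PermID, ...). Column names are illustrative.
filings = pd.DataFrame({"cik": ["0000320193"], "filing_type": ["10-K"]})
company_index = pd.DataFrame({
    "cik": ["0000320193"],
    "lei": ["HWUPKR0MPOU8FGXBT394"],  # Apple Inc.'s LEI, as an example
    "name": ["Apple Inc."],
})

# With a shared identifier, joining filings to market data is a plain merge.
enriched = filings.merge(company_index, on="cik", how="left")
print(enriched)
```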

submitted by /u/aiatco2

Seeking Dataset: NAICS Codes Vs. Business Descriptions

I’m in search of a dataset that pairs NAICS codes with business descriptions, but not the standard generic descriptions. I’m interested in how businesses describe themselves in relation to NAICS codes. Ideally, I’d like around 500 descriptions for each NAICS code. I’ve scoured various sources without success. Does anyone know where I can find such a dataset? Any leads or suggestions would be greatly appreciated!

submitted by /u/coder903

I Have A Massive Dataset Of Flirting / Dating-app Messages. What To Do?

Without going into specifics, my company has legally and internally (through our app) acquired a massive dataset of millions of flirting-related conversations from dating apps, Instagram DMs, and text messages.

How much do you think these transcripts are worth? What interesting projects could I build or AI models could I train with this data? Let me know if you have any other recommendations about what to do with this dataset!

***not interested in any nefarious, illegal, or immoral recommendations***

Thanks!

submitted by /u/Blake_CS_Fit

I Built A Free Tool That Auto-generates Scrapers For Any Website With AI

I got frustrated with the time and effort required to code and maintain custom web scrapers for collecting data, so my friends and I built an LLM-based solution for data extraction from websites. AI should automate tedious and uncreative work, and web scraping definitely fits that description.

Try it out for free on our playground https://kadoa.com/playground and let me know what you think!

We’re leveraging LLMs to understand the website structure and generate the DOM selectors for it. Using LLMs for every data extraction, as most comparable tools do, would be way too expensive and very slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient and maintenance-free.

How it works (the playground uses a simplified version of this):

1. Loading the website: automatically decide what kind of proxy and browser we need
2. Analyzing network calls: try to find the desired data in the network calls
3. Preprocessing the DOM: remove all unnecessary elements and compress it into a structure that GPT can understand
4. Selector generation: use an LLM to find the desired information with the corresponding selectors
5. Data extraction in the desired format
6. Validation: hallucination checks and verification that the data is actually on the website and in the right format
7. Data transformation: clean and map the data (e.g. if we need to aggregate data from multiple sources into the same format); LLMs are great at this task too
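A rough sketch of the generate-once, reuse-many pattern described above (not Kadoa’s actual code): an LLM proposes a CSS selector a single time, and plain requests + BeautifulSoup reuse it on every subsequent page. The model name, prompt, and URLs are illustrative.

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_selector(html_snippet: str, field: str) -> str:
    # One-time LLM call: ask for a CSS selector instead of the data itself.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Return only a CSS selector for the {field} "
                       f"in this HTML:\n{html_snippet}",
        }],
    )
    return resp.choices[0].message.content.strip()

# Generate once against a sample page...
sample_html = requests.get("https://example.com/product/1").text
price_selector = generate_selector(sample_html[:4000], "product price")

# ...then reuse the cheap, fast selector on every subsequent page.
for url in ["https://example.com/product/2", "https://example.com/product/3"]:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    node = soup.select_one(price_selector)
    print(url, node.get_text(strip=True) if node else "selector broke; regenerate")
```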

The vision is fully autonomous and maintenance-free data processing from sources like websites or PDFs, basically “prompt-to-data” 🙂 It’s far from perfect yet, but we’ll get there.

submitted by /u/madredditscientist

Spanish LaLiga And Premier League Historical Dataset

Is anyone aware of places that have a complete dataset of matches, players, and their actions in those matches: goal kicks, shots that resulted in goals, yellow and red cards, etc.?

It can be websites where the data is readily available, APIs, or blogs. I would prefer La Liga over the Premier League.

I’ve been searching around but could only reliably find Sofascore and Marca as sources of information.

Thanks!

submitted by /u/Technopulse

How To Pull Fixed And Floating Coupon Details On Eikon

Hi all,

Looking to find the Data Item Codes on Refinitiv Eikon for the fixed and floating segments of fixed-income coupons. Pulling the data on plain-vanilla fixed coupons is easy and straightforward, but there appear to be no Data Item Codes for the second leg (usually floating) for fields like frequency, accrual basis, or even the rate. I was thinking of using the cash-flow schedule but got stuck with TR.FiFirstCouponDate and TR.FiLastCouponDate: at best they give me the dates of the first and last coupons, when I’m trying to get the first and last rates to capture data for both payment legs.
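For anyone trying this via the Eikon Data API for Python rather than Excel, a minimal get_data() sketch. The two TR.Fi* date fields are the ones from the post; the other field names and the instrument code are guesses to verify in Eikon’s Data Item Browser (DIB), which is the usual way to hunt down field codes.

```python
import eikon as ek

ek.set_app_key("YOUR_APP_KEY")

# TR.FiFirstCouponDate / TR.FiLastCouponDate are from the post; the other two
# field names are guesses that should be confirmed in the Data Item Browser.
fields = ["TR.FiFirstCouponDate", "TR.FiLastCouponDate",
          "TR.FiCouponRate", "TR.FiMaturityDate"]

df, err = ek.get_data(["XS1234567890"], fields)  # placeholder ISIN
print(df)
```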

submitted by /u/mossackfonseca1656

What Dataset And How To Get That To Link My Analysis To EEOC Dataset?

Hi,

I have a current dataset, and it looks like this:

Year, Nation, Region, Division, State, County, Sector, Race, Sex, Job, Number of Employees
https://www.eeoc.gov/data/job-patterns-minorities-and-women-private-industry-eeo-1-0

What additional datasets can I add to support my analysis?

I’m trying to find salaries by state, gender, occupation, and level, but it seems too hard to find a CSV version.
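One commonly paired source is BLS OEWS (wages by state and occupation, downloadable as spreadsheets; note it has no gender split, so something like Census ACS would be needed for that dimension). A hedged pandas sketch of the join, with file and column names as placeholders:

```python
import pandas as pd

# EEO-1 aggregate (columns from the post) and a hypothetical wage extract,
# e.g. BLS OEWS state-by-occupation tables saved to CSV. Names are placeholders.
eeo1 = pd.read_csv("eeo1_job_patterns.csv")          # Year, State, Job, Sex, ...
wages = pd.read_csv("oews_state_occupation.csv")     # State, Job, MedianWage

# Join on the shared dimensions; occupation labels usually need a crosswalk
# (EEO-1 job categories vs. SOC codes), which is the hard part in practice.
merged = eeo1.merge(wages, on=["State", "Job"], how="left")
print(merged.head())
```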

submitted by /u/GliGli991

Real Estate Scraping Library For Zillow, Realtor.com & Redfin

Hey everyone,

My friend and I put together a Python real estate scraper that aggregates listings from Zillow, Realtor.com, and Redfin. It’s requests-based and quite fast (relative to the search size). You can search for rentals, properties for sale, or those recently sold, and it’s super easy to output to CSV/Excel with to_csv() or to_excel().

Feel free to give feedback in the comments, we would love to hear your suggestions.

https://github.com/ZacharyHampton/HomeHarvest
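A usage sketch based on the repo’s README at the time of posting (argument names may since have changed):

```python
from homeharvest import scrape_property

# Hedged sketch; check the README for the current signature.
properties = scrape_property(
    location="85281",                             # zip code, address, or city
    site_name=["zillow", "redfin", "realtor.com"],
    listing_type="for_sale",                      # or "for_rent", "sold"
)

print(len(properties))
properties.to_csv("properties.csv", index=False)  # a pandas DataFrame out of the box
```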

submitted by /u/kevinc9

[PAID] [SELF-PROMOTION] 84K TikTok Influencer/creator Profiles And 1.9M Of Their Videos

Self-promoting this dataset as well as the tools I developed to generate it.

This dataset contains 84,000 TikTok UGC influencer/creator profiles and 1,900,000 of their videos. The data was gathered by collecting data from networks of UGC creators using my TikTok Following Export Tool. It is intended to be used for digital marketing and creator discovery. It can also be used for ML purposes, e.g. to determine which videos perform well or go viral (see the sketch after the field lists below).

Link to dataset: https://sellagen.com/item/6509bc2f10cf50605711c5e0

Each profile has the following info:

User ID, Sec UID, Nickname, Number of followers, Number of following, Verified, Video count, Private account, Seller account, Region

Each video comes with the following data:

Video ID, Caption, Diggs, Shares, Comments, Plays, Duration, Creation date and time, Hashtag list, Mentions list, Music details
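To make the go-viral idea concrete, a small pandas sketch deriving an engagement feature and a classification target from the video fields above; column names are adapted from the list, not guaranteed:

```python
import pandas as pd

videos = pd.read_csv("tiktok_videos.csv")  # columns adapted from the list above

# Simple engagement feature; "diggs" are TikTok's like counts.
videos["engagement_rate"] = (
    videos["diggs"] + videos["comments"] + videos["shares"]
) / videos["plays"].clip(lower=1)

# Label the top 1% of videos by plays as "viral" for a classification target.
videos["viral"] = videos["plays"] >= videos["plays"].quantile(0.99)
print(videos.groupby("viral")["engagement_rate"].describe())
```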

submitted by /u/jankybiz

Database Of 10,000+ Keyword Ideas For Programmatic SEO From 1,000+ Different Niches

The pSEO keywords database has the following 16 data points (columns):

Topics, Main industries, Niches, Examples, Use cases, Searcher’s persona, Datasets ideas, Content lifecycle stage, Search intent, Suggested image types, Interactive elements, Update frequency, Related queries, Alternate page titles, Rough outline, FAQs

KW ideas are from 1,000+ different niches.

You can get it from the link https://untalkedseo.com/store/pseo-keyword-ideas-database/

The database is available as a Google Sheets file and also as a Microsoft Excel file.

submitted by /u/bikashkampo

[self-promotion] Global Weather From 100K Stations Direct To Your Snowflake Instance

Cybersyn Weather & Environmental Essentials now includes weather events from over 100K stations across 180 countries. Data is sourced from NOAA’s National Centers for Environmental Information (NCEI).

Access on Snowflake Marketplace

Example use cases:

- Track prevalence of severe weather in a given region
- Assess climate-related risks in an area or validate insurance claims related to weather events
- Inform real estate investment decisions and retail location planning by analyzing weather trends within specific zips
- Enrich location data with historical weather events and trends
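For reference, querying a mounted share from Python with the Snowflake connector looks roughly like the sketch below; the database, table, and column names are placeholders, since the real ones depend on how the share is mounted in your account.

```python
import snowflake.connector

# Connection details and object names are placeholders.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
)

query = """
SELECT station_id, COUNT(*) AS severe_events
FROM weather_environmental_essentials.cybersyn.noaa_weather_metrics
WHERE country = 'US' AND event_severity = 'severe'
GROUP BY station_id
ORDER BY severe_events DESC
LIMIT 10
"""

for row in conn.cursor().execute(query):
    print(row)
```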

submitted by /u/aiatco2

DoltHub Data Bounties Are No More. Thanks To R/datasets For All The Support Over The Years.

Hi r/datasets,

Over the years, this subreddit has been a great supporter of Data Bounties, both for bounty hunters and for users of the datasets created. We are ending the Data Bounty program. Thanks for all the support.

https://www.dolthub.com/blog/2023-09-18-bye-bye-bounties/

That blog post explains our rationale and what we learned from the experiment. We may bring bounties back eventually.

submitted by /u/timsehn