Tool To Get Customer Review And Comment Data

Not sure if this is the right sub to ask, but we’re going for it anyways

I’m looking for a tool that can pull customer review and comment data from e-commerce sites (Amazon, walmart.com, etc.), third-party review sites like Trustpilot, and social media sources. I’d like the data loaded into a Snowflake data warehouse or an Azure Blob container for Snowflake ingestion.

Let me know what you use, what you like, and what you don’t… I’m starting from scratch

submitted by /u/Apprehensive-Ad-80

How Can I Get Chapter Data For Nonfiction Books Using API?

I am trying to create a books database and need an API that provides chapter data for books. I tried the Open Library and Google Books APIs, but neither of them offers consistent chapter data; it seems to be hit or miss. Is there any reliable source for this data, especially for nonfiction books? I would appreciate any advice.

submitted by /u/Snorlax_lax

Dataset Of Simple English Conversations?

I’m looking for a dataset of easy English dialogues for beginner language learning, covering basic topics like greetings, shopping, etc.

Any suggestions?

submitted by /u/Reasonable_Set_1615

Help Needed To Find A Dataset Example Comprising At Least 1000 Rows And At Least 5 Columns Which Contain Both Categorical (At Least 2) And Numerical (At Least 3) Variables.

Hi, I’m a bit stuck on an assignment where I have to use a dataset comprising at least 1000 rows and at least 5 columns, which contain both categorical (at least 2) and numerical (at least 3) variables. I also have to cite the source. It would be great if you could please help me out…
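For anyone screening candidate files against these requirements, here is a minimal sketch of a checker. It assumes the data is available as CSV text with a header row and treats a column as numerical only if every value parses as a number; the function names and that rule are illustrative assumptions, not part of the assignment.

```python
import csv
import io


def is_number(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_requirements(csv_text, min_rows=1000, min_cols=5,
                       min_categorical=2, min_numerical=3):
    """Report whether CSV text meets the row/column requirements.

    Assumption: a column counts as numerical only if every value parses
    as a number; everything else is treated as categorical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    numerical = [c for c in columns
                 if rows and all(is_number(r[c]) for r in rows)]
    categorical = [c for c in columns if c not in numerical]
    ok = (len(rows) >= min_rows
          and len(columns) >= min_cols
          and len(categorical) >= min_categorical
          and len(numerical) >= min_numerical)
    return {"rows": len(rows), "numerical": numerical,
            "categorical": categorical, "ok": ok}
```

Calling `check_requirements(open("candidate.csv").read())` on a downloaded file (the filename is hypothetical) returns the detected column types along with a pass/fail flag.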

submitted by /u/OkDark1310

[Synthetic] [self-promotion] We Built An Open-source Dataset To Test Spatial Pathfinding And Reasoning Skills In LLMs

Large language models often lack pathfinding and reasoning skills. Reasoning models have improved on this, but we are still missing datasets that quantify these skills. Improving LLMs in this domain can be useful for robotics, which often relies on an LLM to create an action plan for a specific task. We therefore created the Spatial Pathfinding and Reasoning Challenge (SPaRC) dataset, based on the game “The Witness”. The task requires the LLM to create a path from a given start point to an end point on a 2D grid while satisfying specific rules placed on the grid.
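To make the task concrete, here is a minimal sketch of the structural part of a path check. The representation (a list of `(x, y)` points) and the function name are assumptions for illustration, not the SPaRC data format, and the puzzle-specific rules placed on the grid are deliberately left out.

```python
def is_valid_path(path, start, end, width, height):
    """Check the basic structure of a path given as a list of (x, y) points.

    Only generic constraints are covered: correct endpoints, staying on
    the grid, single orthogonal steps, and no point visited twice.
    Puzzle-specific rules would need additional checks.
    """
    if not path or path[0] != start or path[-1] != end:
        return False
    if len(set(path)) != len(path):  # a point was visited twice
        return False
    for x, y in path:
        if not (0 <= x < width and 0 <= y < height):
            return False
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:  # not a single orthogonal step
            return False
    return True
```

A model’s answer would first have to pass a check like this before any rule-specific validation is applied.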

More details, an interactive demonstration and the paper for the dataset can be found under: https://sparc.gipplab.org

In the paper, we compared the capabilities of current SOTA reasoning models with a human baseline:

Human baseline: 98% accuracy
o4-mini: 15.8% accuracy
QwQ 32B: 5.8% accuracy

This shows that there is still a large gap between humans and the capabilities of reasoning models.

Each of these puzzles is assigned a difficulty score from 1 to 5. While humans solve 100% of level 1 puzzles and 94.5% of level 5 puzzles, LLMs struggle much more: o4-mini solves 47.7% of level 1 puzzles, but only 1.1% of level 5 puzzles. Additionally, we found that these models fail to increase their reasoning time proportionally to puzzle difficulty. In some cases, they use less reasoning time, even though the human baseline requires a stark increase in reasoning time.

submitted by /u/Sral248

Looking For A Collection Of Images Of Sleep Deprived Individuals

Preferably categorically divided on the level of sleep debt or number of hours.

Would appreciate it, as I have not been able to find any at all which are publicly available.

I am not looking for fatigue detection datasets, as that is mainly what I have found.

Thanks so much!

submitted by /u/One_Tonight9726

Looking For Skilled ‘romantic’ Texting Dataset, From Either Gender.

Title

submitted by /u/VastMaximum4282

NLSY97 Data – In NLSY97 I See Weeks Marked “Employed” But No Job Record. Has Anyone Else Run Into This?

Hi all,

I’m working with NLSY97 and ran into something that’s confusing me. I’ve built employment status spells (based on weekly employment status) and job spells (based on start/end dates and employer IDs), and then merged them to see how things line up.

Most of it looks great. The job spells and employment spells match up really well. But in a few places a person is marked as “employed” with no corresponding job record: no start date, no end date, no employer ID. These gaps last anywhere from 3 days up to 2-3 weeks.
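The mismatch described above can be isolated with a check along these lines. This is a minimal sketch under simplifying assumptions: weeks are plain integer indices, the status label is the string "employed", and job spells are `(start_week, end_week, employer_id)` tuples. None of these are actual NLSY97 variable names or codes.

```python
def uncovered_employed_weeks(weekly_status, job_spells):
    """Return employed weeks that no job spell covers.

    weekly_status: dict mapping week index -> status string (hypothetical labels)
    job_spells: list of (start_week, end_week, employer_id), inclusive bounds
    """
    covered = set()
    for start, end, _employer in job_spells:
        covered.update(range(start, end + 1))
    return sorted(week for week, status in weekly_status.items()
                  if status == "employed" and week not in covered)
```

Running this per respondent yields the list of weeks to inspect, which makes it easier to see whether the gaps cluster around interview dates.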

Is this normal in NLSY97? Could it have something to do with how the interviews were conducted, like status being carried over between interviews, or data being lagged?

I’ve checked my code and raw event files, and it doesn’t seem like I’m dropping rows or mismatching things. The issue only shows up occasionally, which makes me wonder if it’s just part of how the data is structured rather than an error on my end.

If anyone has seen this or knows how to handle it, I’d really appreciate your thoughts. I’m happy to share code snippets if that helps.

Thanks so much in advance!

submitted by /u/Exciting-Skin3341

Looking For Uncommon / Niche Time Series Datasets (Updated Daily & Free)

Hi everyone,

I’m starting a side project where I compile and transform time series data from different sources. I’m looking for interesting datasets or APIs with the following characteristics:

Must be downloadable (e.g., via cronjob or script-friendly API)
Updated at least daily
Includes historical data
Free to use
Not crypto or stock trading-related
Related to human activity (directly or indirectly)
The more niche or unusual, the better!

Here’s an example of something I really liked:
🔗 Queue Times API — it provides live and historical queue times for theme parks.

Some ideas I had (but haven’t found sources for yet):

Number of Amazon orders per day
Electricity consumption by city or country
Cars in a specific parking lot
Foot traffic in a shopping mall

Basically, I’m after uncommon but fun time series datasets—things you wouldn’t usually see in mainstream data science projects.

Any suggestions, links, or ideas to explore would be hugely appreciated. Thanks!

submitted by /u/JdeHK45

Do You Know Of Any Datasets Containing Users’ Spotify Song Histories?

Hi, do you know of any datasets containing users’ song histories?
I found one, but it doesn’t say which user listened to which songs, or whether it’s just data from a single user.

submitted by /u/Moistlos

Are There Good Datasets On The Lifespan Of Various Animals?

I am looking for something like this – given a species there should be the recorded ages of animals belonging to that species.

submitted by /u/Exciting_Point_702

Open 3D Architecture Dataset For Radiance Fields

submitted by /u/MasterPa

Can You Help Me Find A Copy Of The Reddit Comment Dataset?

I recall that a long time back you could download the Reddit comment dataset; it was huge. I lost my hard drive to gravity a few weeks ago and was hoping someone knew where I could get my hands on another copy.

submitted by /u/CarbonAlpine

My Dream Project Is Finally Live: An Open-source AI Voice Agent Framework.

Hey community,

I’m Sagar, co-founder of VideoSDK.

I’ve been working in real-time communication for years, building the infrastructure that powers live voice and video across thousands of applications. But now, as developers push models to communicate in real-time, a new layer of complexity is emerging.

Today, voice is becoming the new UI. We expect agents to feel human, to understand us, respond instantly, and work seamlessly across web, mobile, and even telephony. But developers have been forced to stitch together fragile stacks: STT here, LLM there, TTS somewhere else… glued with HTTP endpoints and prayer.

So we built something to solve that.

Today, we’re open-sourcing our AI Voice Agent framework, a real-time infrastructure layer built specifically for voice agents. It’s production-grade, developer-friendly, and designed to abstract away the painful parts of building real-time, AI-powered conversations.

We are live on Product Hunt today and would be incredibly grateful for your feedback and support.

Product Hunt Link: https://www.producthunt.com/products/video-sdk/launches/voice-agent-sdk

Here’s what it offers:

Build agents in just 10 lines of code
Plug in any models you like – OpenAI, ElevenLabs, Deepgram, and others
Built-in voice activity detection and turn-taking
Session-level observability for debugging and monitoring
Global infrastructure that scales out of the box
Works across platforms: web, mobile, IoT, and even Unity
Option to deploy on VideoSDK Cloud, fully optimized for low cost and performance
And most importantly, it’s 100% open source

We didn’t want to create another black box. We wanted to give developers a transparent, extensible foundation they can rely on and build on top of.

Here is the GitHub repo: https://github.com/videosdk-live/agents
(Please do star the repo to help it reach others as well)

This is the first of several launches we’ve lined up for the week.

I’ll be around all day and would love to hear your feedback, questions, or what you’re building next.

Thanks for being here,

Sagar

submitted by /u/videosdk_live

Just Started Learning Data Analysis. It’s Tough, But I’m Enjoying It So Far.

Hey everyone, I recently started learning data analysis. Right now I’m going through Excel, SQL, and Python (Pandas is confusing but interesting).

I come from a non-tech background, so everything feels new. Some days are frustrating, but I’m slowly getting the hang of it.

If anyone here has tips for beginners or good free resources, I’d really appreciate it. Also, if you’ve switched careers into data — how was your journey?

Thanks in advance

submitted by /u/ManufacturerFar2134

Help Needed! UK Traffic Videos For ALPR

I am currently working on an ALPR (Automatic License Plate Recognition) system made exclusively for UK traffic, as UK number plates follow a specific coding system. As I don’t live in the UK, can someone help me obtain the dataset needed for this?

submitted by /u/Moonwolf-

Wikipedia Integration Added – Comprehensive Dataset Collection Tool

Demo video: https://www.reddit.com/r/SideProject/comments/1ltlzk8/tool_built_a_web_crawling_tool_for_public_data/

Major Update

Our data crawling platform has added Wikipedia integration with advanced filtering, metadata extraction, and bulk export capabilities. Ideal for NLP research, knowledge graph construction, and linguistic analysis.

Why This Matters for Researchers

Large-Scale Dataset Collection

Bulk Wikipedia Harvesting: Systematically collect thousands of articles
Structured Output: Clean, standardized data format with rich metadata
Research-Ready Format: Excel/CSV export with comprehensive metadata fields

Advanced Collection Methods

Random Sampling – Unbiased dataset generation for statistical research
Targeted Collection – Topic-specific datasets for domain research
Category-Based Harvesting – Systematic collection by Wikipedia categories

Technical Architecture

Comprehensive Wikipedia API Integration

Dual API Approach: REST API + MediaWiki API for complete data access
Real-time Data: Fresh content with latest revisions and timestamps
Rich Metadata Extraction: Article summaries, categories, edit history, link analysis
Intelligent Parsing: Clean text extraction with HTML entity handling

Data Quality Features

Automatic Filtering: Removes disambiguation pages, stubs, and low-quality content
Content Validation: Ensures substantial article content and metadata
Duplicate Detection: Prevents redundant entries in large datasets
Quality Scoring: Articles ranked by content depth and editorial quality

Research Applications

Natural Language Processing

Text Classification: Category-labeled datasets for supervised learning
Language Modeling: Large-scale text corpora
Named Entity Recognition: Entity datasets with Wikipedia metadata
Information Extraction: Structured knowledge data generation

Knowledge Graph Research

Structured Knowledge Extraction: Categories, links, semantic relationships
Entity Relationship Mapping: Article interconnections and reference networks
Temporal Analysis: Edit history and content evolution tracking
Ontology Development: Category hierarchies and classification systems

Computational Linguistics

Corpus Construction: Domain-specific text collections
Comparative Analysis: Topic-based document analysis
Content Analysis: Large-scale text mining and pattern recognition
Information Retrieval: Search and recommendation system training data

Dataset Structure and Metadata

Each collected article provides comprehensive structured data:

Core Content Fields

Title and Extract: Clean article title and summary text
Full Content: Complete article text with formatting preserved
Timestamps: Creation date, last modified, edit frequency

Rich Metadata Fields

Categories: Wikipedia category classifications for labeling
Edit History: Revision count, contributor information, edit patterns
Link Analysis: Internal/external link counts and relationship mapping
Media Assets: Image URLs, captions, multimedia content references
Quality Metrics: Article length, reference count, content complexity scores

Research-Specific Enhancements

Citation Networks: Reference and bibliography extraction
Content Classification: Automated topic and domain labeling
Semantic Annotations: Entity mentions and concept tagging

Advanced Collection Features

Smart Sampling Methods

Stratified Random Sampling: Balanced datasets across categories
Temporal Sampling: Time-based collection for longitudinal studies
Quality-Weighted Sampling: Prioritize high-quality, well-maintained articles
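As an illustration of the stratified idea only, here is a minimal sketch that draws up to a fixed number of items per category. It is not the tool’s implementation; the function name, the dictionary shape of an article, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to per_stratum items from each group defined by key(item).

    A fixed seed keeps the draw reproducible; strata are processed in
    sorted order so the output order is stable as well.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

With articles represented as dictionaries, `stratified_sample(articles, lambda a: a["category"], 100)` would return at most 100 articles per category.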

Systematic Category Harvesting

Complete Category Trees: Recursive collection of entire category hierarchies
Cross-Category Analysis: Multi-category intersection studies
Category Evolution Tracking: How categorization changes over time
Hierarchical Relationship Mapping: Parent-child category structures

Scalable Collection Infrastructure

Batch Processing: Handle large-scale collection requests efficiently
Rate Limiting: Respectful API usage with automatic throttling
Resume Capability: Continue interrupted collections seamlessly
Export Flexibility: Multiple output formats (Excel, CSV, JSON)

Research Use Case Examples

NLP Model Training

Target: Text classification model for scientific articles
Method: Category-based collection from "Category:Science"
Output: 10,000+ labeled scientific articles
Applications: Domain-specific language models, scientific text analysis

Knowledge Representation Research

Target: Topic-based representation analysis in encyclopedic content
Method: Systematic document collection from specific subject areas
Output: Structured document sets showing topical perspectives
Applications: Topic modeling, knowledge gap identification

Temporal Knowledge Evolution

Target: How knowledge representation changes over time
Method: Edit history analysis with systematic sampling
Output: Longitudinal dataset of article evolution
Applications: Knowledge dynamics, collaborative editing patterns

Collection Methodology

Input Flexibility for Research Needs

Random Sampling: [Leave empty for unbiased collection]
Topic-Specific: "Machine Learning" or "Climate Change"
Category-Based: "Category:Artificial Intelligence"
URL Processing: Direct Wikipedia URL processing

Quality Control and Validation

Content Length Thresholds: Minimum word count for substantial articles
Reference Requirements: Articles with adequate citation networks
Edit Activity Filters: Active vs. abandoned article identification

Value for Academic Research

Methodological Rigor

Reproducible Collections: Standardized methodology for dataset creation
Transparent Filtering: Clear quality criteria and filtering rationale
Version Control: Track collection parameters and data provenance
Citation Ready: Proper attribution and sourcing for academic use

Scale and Efficiency

Bulk Processing: Collect thousands of articles in single operations
API Optimization: Efficient data retrieval without rate limiting issues
Automated Quality Control: Systematic filtering reduces manual curation
Multi-Format Export: Ready for immediate analysis in research tools

Getting Started at pick-post.com

Quick Setup

Access Tool: Visit https://pick-post.com
Select Wikipedia: Choose Wikipedia from the site dropdown
Define Collection Strategy:
- Random sampling for unbiased datasets (leave input field empty)
- Topic search for domain-specific collections
- Category harvesting for systematic coverage
Set Collection Parameters: Size, quality thresholds
Export Results: Download structured dataset for analysis

Best Practices for Academic Use

Document Collection Methodology: Record all parameters and filters used
Validate Sample Quality: Review subset for content appropriateness
Consider Ethical Guidelines: Respect Wikipedia’s terms and contributor rights
Enable Reproducibility: Share collection parameters with research outputs

Perfect for Academic Publications

This Wikipedia dataset crawler enables researchers to create high-quality, well-documented datasets suitable for peer-reviewed research. The combination of systematic collection methods, rich metadata extraction, and flexible export options makes it ideal for:

Conference Papers: NLP, computational linguistics, digital humanities
Journal Articles: Knowledge representation research, information systems
Thesis Research: Large-scale corpus analysis and text mining
Grant Proposals: Demonstrate access to substantial, quality datasets

Ready to build your next research dataset? Start systematic, reproducible, and scalable Wikipedia data collection for serious academic research at pick-post.com.

submitted by /u/PerspectivePutrid665

Thoughts On This Data Cleaning Project?

Hi all, I’m working on a data cleaning project and I was wondering if I could get some feedback on this approach.

Step 1: The chatbot recommends a data type for each variable and suggests which columns are useful. The user must confirm which columns should be analyzed and the type of each variable (numeric, categorical, monetary, dates, etc.).

Step 2: The chatbot gives recommendations on missingness, impossible values (think dates far in the future or homes being priced at $0 or $5), and formatting standardization (think different currencies or similar names such as New York City or NYC). User must confirm changes.

Step 3: User can preview relevant changes through a before and after of summary statistics and graph distributions. All changes are updated in a version history that can be restored.
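The kind of suggestions described in Step 2 could be sketched roughly as follows. This is a minimal illustration, not the project’s code: the field names, the alias table, and the price threshold are all hypothetical, and suggestions are only returned so the user can confirm them, never applied automatically.

```python
from datetime import date

# Hypothetical alias table for name standardization.
ALIASES = {"nyc": "New York City", "new york city": "New York City"}


def suggest_fixes(record, today, min_price=1000):
    """Return (field, message) suggestions for one record; nothing is changed.

    Assumed fields: "price" (number), "sale_date" (date), "city" (str).
    min_price is an illustrative threshold for implausibly cheap homes.
    """
    suggestions = []
    price = record.get("price")
    if price is None:
        suggestions.append(("price", "missing value"))
    elif price < min_price:
        suggestions.append(("price", f"implausibly low: {price}"))
    sale_date = record.get("sale_date")
    if sale_date is not None and sale_date > today:
        suggestions.append(("sale_date", "date is in the future"))
    city = record.get("city")
    canonical = ALIASES.get(city.strip().lower()) if city else None
    if canonical and canonical != city:
        suggestions.append(("city", f"standardize to {canonical}"))
    return suggestions
```

Keeping detection separate from applying changes fits the confirm-then-apply flow, and each accepted suggestion can be logged as one entry in the version history.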

Thank you all for your help!

submitted by /u/Academic_Meaning2439

Question About Podcast Dataset On Hugging Face

Hey everyone!

A little while ago, I released a conversation dataset on Hugging Face (linked if you’re curious), and to my surprise, it’s become the most downloaded one of its kind on the platform. A lot of people have been using it to train their LLMs, which is exactly what I was hoping for!

Now I’m at a bit of a crossroads — I’d love to keep improving it or even spin off new variations, but I’m not sure what the community actually wants or needs.

So, a couple of questions for you all:

Is there anything you’d love to see added to a conversation dataset that would help with your model training?
Are there types or styles of datasets you’ve been searching for but haven’t been able to find?

Would really appreciate any input. I want to make stuff that’s genuinely useful to the data community.

submitted by /u/ready_ai

Announcing The Launch Of The Startup Catalyst Program For Early-stage AI Teams.

We’ve started a Startup Catalyst Program at Future AGI for early-stage AI teams working on things like LLM apps, agents, or RAG systems – basically anyone who’s hit the wall when it comes to evals, observability, or reliability in production.

This program is built for high-velocity AI startups looking to:

Rapidly iterate and deploy reliable AI products with confidence
Validate performance and user trust at every stage of development
Save engineering bandwidth to focus on product development instead of debugging

The program includes:

$5k in credits for our evaluation & observability platform
Access to Pro tools for model output tracking, eval workflows, and reliability benchmarking
Hands-on support to help teams integrate fast
Some of our internal, fine-tuned models for evals + analysis

It’s free for selected teams – mostly aimed at startups moving fast and building real products. If it sounds relevant for your stack (or someone you know), apply here: https://futureagi.com/startups

submitted by /u/bubbless__16

[For Sale] 🔥 500 GB De‑identified Facial CT Dataset + Expert Segmentations 🚀

Hello!

I’m Anjan Boro, a Biomedical Engineer and freelance Imaging‑AI specialist. I’ve curated a 500 GB collection of de‑identified DICOM CT scans—complete with voxel‑accurate, technician‑validated segmentations of mandible, maxilla, teeth, and sinuses.

🔍 Dataset Highlights

Modality & Scale: ~500 GB of head CT volumes, DICOM format
Anatomical Coverage: Mandible, maxilla, full dentition, & virtual sinus models
Segmentation Quality: Expert-reviewed masks generated with industry‑standard tools
Compliance: Fully anonymized (HIPAA/GDPR‑ready), zero PHI in metadata or voxels
Metadata Included: Scanner make/model, slice thickness, reconstruction kernels, segmentation protocols

🚀 Why This Matters

AI Development: Accelerate training of orthodontic‑planning and surgical‑guide models
Academic Research: Support morphometric studies, biomechanics simulations, and teaching
Clinical Tooling: Build robust templates for automated maxillofacial analysis

💰 Pricing & Licensing

Preview Pack: 10 cases + metadata — $500 USD
Full Dataset: All 500 GB — $5,000 USD
Custom Licenses: Flexible terms for commercial vs. research use. Let’s discuss!

📩 Interested?

• Comment below or DM me for sample previews under NDA
• Or email: [anjanbme@gmail.com](mailto:anjanbme@gmail.com)

submitted by /u/B4R069

Sharing My Google Trends API For Keyword & Trend Data

I put together a simple API that lets you access Google Trends data — things like keyword interest over time, trending searches by country, and related topics.

Nothing too fancy. I needed this for a personal project and figured it might be useful to others here working with datasets or trend analysis. It abstracts the scraping and formatting, so you can just query it like any regular API.

It’s live on RapidAPI here (has a free tier): https://rapidapi.com/shake-chillies-shake-chillies-default/api/google-trends-insights

Let me know if you’ve worked on something similar or if you think any specific endpoint would be useful.

submitted by /u/Small-Hope-9388

I’m Analyzing 300k Remote Job Postings: Trends And Opportunities.

I realized many roles are only posted on internal career pages and never appear on classic job boards. So I built an AI script that scrapes listings from 70k+ corporate websites.

Then I wrote an ML matching script that filters only the jobs most aligned with your CV, and yes, it actually works.

You can try it here (for free).

(If you’re still skeptical but curious to test it, you can just upload a CV with fake personal information, those fields aren’t used in the matching anyway.)

submitted by /u/Elieroos

Dataset For Ad Classification (Multi-class)

I’m looking for a dataset that contains ad descriptions (text) and their corresponding labels based on business type/category.

submitted by /u/Alanuhoo

Where Can I Find APIs (or Legal Ways To Scrape) All Physics Research Papers, Recent And Historical?

I’m working on a personal tool that needs access to a large dataset of research papers, preferably focused on physics (but ideally spanning all fields eventually).

I’m looking for any APIs (official or public) that provide access to:

Recent and old research papers
Metadata (title, authors, etc.)
PDFs if possible

Are there any known APIs or sources I can legally use?

I’m also open to scraping, but want to know what the legal implications are, especially if I just want this data for personal research.

Any advice appreciated 🙂 especially from academics or data engineers who’ve built something similar!

submitted by /u/SeriousTruth

Data Sets From The History Of Statistics And Data Visualization

submitted by /u/cavedave

South-Asian Urban Mobility Sensor Dataset: 2.5 Hours High Density Multi-Sensor Data

Data Collection Context

Location: Metropolitan city of India (Kolkata)
Duration: 2 hours 30 minutes of continuous logging
Event Context: Travel to/from a local gathering
Collection Type: Round-trip journey data
Urban Environment: Dense metropolitan area with mixed transportation modes

Dataset Overview

This sensor logger dataset captures 2.5 hours of continuous multi-sensor data collected during urban travel in Kolkata, India, specifically during travel to and from a large social gathering with approximately 500 attendees. The dataset provides insights into urban transportation dynamics, Wi-Fi network patterns during crowd movement, human movement, GPS data, and gyroscopic data.

DM if interested

submitted by /u/Original_Celery_1306

Tldarc: Common Crawl Domain Names – 200 Million Domain Names

I wanted the zone files to create a namechecker MCP service, but they aren’t freely available. So I spent the last 2 weeks downloading Common Crawl’s 10TB of indexes, streaming out the org-level domains, and deduplicating them. After ~50TB of processing, and my laptop melting my legs, I’ve published them to Zenodo.

all_domains.tsv.gz contains the main list in dns,first_seen,last_seen format, from 2008 to 2025. Dates are in YYYYMMDD format. The intermediate tar.gz files (duplicate domains for each url with dates) are CC-MAIN.tar.gz.tar
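A minimal sketch of how the main list could be read. It assumes the columns are tab-separated, as the .tsv extension suggests, and it skips any row whose date fields are not numeric, which also covers a header row if the file has one. The function names are illustrative and not part of the tldarc repo.

```python
import csv
import gzip
from datetime import date, datetime


def parse_day(text: str) -> date:
    """Convert a YYYYMMDD string into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def parse_domains(lines):
    """Yield (dns, first_seen, last_seen) tuples from an iterable of TSV lines.

    Assumption: three tab-separated columns per row.
    """
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines, malformed rows, and a header row if one exists.
        if len(row) != 3 or not row[1].isdigit() or not row[2].isdigit():
            continue
        yield row[0], parse_day(row[1]), parse_day(row[2])


def read_domains(path="all_domains.tsv.gz"):
    """Stream records from the gzipped file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from parse_domains(handle)
```

Iterating over `read_domains()` streams one record at a time, which matters for a list of roughly 200 million rows.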

Source code can be found in the github repo: https://github.com/bitplane/tldarc

submitted by /u/david-song

Is Looker Studio Still In Demand In 2025? Real Use Cases And Career Impact?

submitted by /u/lookerstudioexpert