Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Looking For Worldwide First Names Dataset By Country

Hi everyone,
I’m trying to find a dataset that contains first names by country, ideally sorted by popularity or frequency – something similar to what census.name offers (they have a paid database of 1.5M+ names across 200+ countries).

Does anyone know of:

  • A free alternative
  • A mirror or archived version of the census.name database
  • Or any large dataset with realistic global first names?

Open to Kaggle, GitHub, or even academic/public resources.
Thanks in advance for any leads!

submitted by /u/flavvius1

New Research Shows The Impact Of Inflation, Tariffs On Consumer Spending

Sharing original research recently collected by a quant + qual survey of 1,000 consumers nationwide (US) trying to better understand current consumer sentiment, and how consumer spending habits have or have not changed in the past year due to things like inflation/shrinkflation, tariff concerns, higher cost of living and more.

In a Highlight survey taken the week of July 7, 2025, we polled our proprietary panel of nationwide consumers, achieving 1,000 completions with an even gender split (500 men and 500 women).

Among other questions, we asked them: In terms of your personal finances, how do you feel today compared with this time last year?

62% of respondents said money feels somewhat or much tighter than a year ago, while only 10% said money feels somewhat or much easier than a year ago. Over a quarter of respondents (28%) say that money feels about the same as compared with this time last year.

In an open-ended question, respondents were given the opportunity to describe how their consumption habits and saving strategies have changed in their own words. Highlight asked: Thinking about your everyday routines, purchases, or habits–is there anything you’re doing now that you weren’t doing a year ago? Here’s the full breakdown of respondents’ qualitative responses:

No/Not really: This or similar phrases like “Nope it’s the same,” “No changes,” “nothing,” “I don’t think so,” or “everything is basically the same” appear 93 times. This indicates a significant portion of the respondents haven’t changed their habits much.

“I shop the same overall.” – She/her, 47 years old, North Carolina

Exercising more/Working out more: This theme appears 47 times. Many respondents mentioned exercising, working out, going to the gym, walking more, or increasing physical activity.

“Drinking more iced coffee, working out more, traveling less, reading audiobooks more.” – He/him, 36 years old, Illinois

Eating healthier/Better food choices: This theme appears 39 times. Responses include eating healthier, eating more vegetables, focusing on protein, buying organic, or making healthier food choices.

“I’m eating better. I’m putting better stuff in my body. I’m working out more. Also I’m buying different things that I need for a healthier life.” – He/him, 43 years old, Texas

Budgeting/Saving money/More conscious of spending/Looking for sales: This broad category appears 65 times. Many people are trying to save money, be more budget-conscious, look for sales, use coupons, or buy less.

“[I’m] budgeting better. Picked up a second job.” – He/him, 39 years old, Tennessee

Shopping online more: This response appears 25 times.

“I visit Sam’s Club more often for bulk purchases and savings. I also shop online more frequently for pick up or shipped items from CVS.” – She/her, 61 years old, Florida

Cooking more/Eating at home more: This theme appears 14 times.

“I’m watching my money more as things get more expensive. We’re also eating out less as restaurant prices have risen tremendously.” – She/her, 58 years old, Pennsylvania

In this same Highlight survey of 1,000 Americans, we also asked respondents: What are you doing to better manage your spending?

In a multiple choice question where respondents were invited to select all that apply, this is how panelists responded, from most popular to least popular responses:

  • 67% of respondents are eating at home more often
  • 57% are shopping sales more actively
  • 55% are buying fewer non-essential products
  • 54% are holding off on major purchases (e.g., tech, furniture)
  • 43% are avoiding eating out
  • 39% are switching to more affordable brands
  • 33% are canceling subscriptions
  • 32% are traveling less
  • 30% are choosing private label/store brands
  • 29% are buying in bulk
  • 23% are using budgeting apps or tracking spending more closely
  • 17% are cutting back on wellness and/or beauty spending
  • 9% said none of the above

In a multiple choice question, Highlight asked respondents: Which of the following, if any, are you not willing to sacrifice–even when budgets are tight? (Select up to three.) These were their answers, from most to least popular:

  • 42% of respondents are not willing to give up high-quality food & beverages
  • 39% say they are not willing to give up their self-care and wellness routines
  • 31% don’t want to give up their streaming services or other entertainment
  • 30% say they won’t part with their preferred brands
  • 29% won’t give up travel or experiences
  • 23% said they won’t give up products that make them feel good or confident
  • 15% said they won’t give up conveniences like delivery
  • 7% said they won’t give up products that support sustainability or ethics

Highlight also gave respondents the opportunity to say what habits they are not willing to change or products they are not willing to give up in their own words.

Overall, the qualitative results mirrored the quantitative: Consumers mentioned over and over again that they are unwilling to give up buying food, especially healthy, quality, or favorite foods.

While respondents across genders agreed high-quality food is their non-negotiable item, women most frequently mentioned their unwillingness to give up coffee specifically. Their open-ended responses mentioned iced coffee, Starbucks, Dunkin, “good coffee,” “homemade coffee,” and other specific brands.

“I MUST have my favorite coffee even though it’s more expensive even now.” – She/her, 61 years old, Iowa

Women respondents were also more likely to mention these topics in their open-ended answers:

  • Specifically, healthy food was mentioned approximately 40 times, often paired with words like “quality,” “organic,” and “produce.”
  • Personal care and self-care purchases were mentioned approximately 30 times, including terms like manicures, skincare, hair care, beauty, and nails.
  • Pets and pet products (dog food, cat food, vet care, pet supplies and more) were mentioned approximately 30 times.

“I still buy extra healthy food. The healthier the food, the more it will cost. I will not buy cheap food.” – She/her, 66 years old, Arizona

“Hair color and nail appointments.” – She/her, 55 years old, Texas

“My dog’s food and heartworm medication. I will always make sure to buy her the good healthy food she is on and make sure she has her heartworm medication to take each month.” – She/her, 25 years old, Florida

Male respondents also placed a premium on high-quality food and eating well. When it comes to themes that were repeated most frequently in their open-ended responses, nothing else came close to quality food, which was mentioned upwards of 60 times.

“I will still purchase organic produce and look for items that are healthier.” – He/him, 43 years old, Arizona

But when we look at the honorable mentions, a few stand out:

  • Men do not want to part with their streaming services, television, and other entertainment (mentioned approximately 20 times)
  • Men also mentioned travel, vacations, and getaways as a non-negotiable (mentioned approximately 20 times)
  • Men mentioned not wanting to give up purchases that support a healthy lifestyle (eating, gym, working out), but mentioned this less frequently than female respondents did (approximately 15 times versus 40 for women)

“I pay for a number of TV streaming services that I would feel deprived not to have.” – He/him, 55 years old, Texas

“My grocery bill and gym membership.” – He/him, 47 years old, Oregon

“We still go on trips and vacations.” – He/him, 50 years old, New York

“My kid’s favorite snack: She loves Takis. They’re a bit expensive but I give up things for her. She is all that matters.” – He/him, 40 years old, North Carolina

Original source

submitted by /u/lets_highlight

Faster Datasets With Parquet Content Defined Chunking

A gold mine of info on optimizing Parquet: https://huggingface.co/blog/parquet-cdc

Here is the idea: chunk and deduplicate your data, and you will speed up uploads and downloads.

Hugging Face uses this to speed up data workflows on their platform (they use a dedupe-based storage called Xet).

Pretty excited by this. It looks like it can really speed up data workflows, especially operations like append/delete/edit/insert. Happy to have this enabled for Hugging Face, where the AI datasets community is amazing too. What do you think?
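To make the "chunk and deduplicate" idea concrete, here is a minimal sketch of content-defined chunking on raw bytes. It is not the Parquet or Xet implementation from the linked blog post: the rolling hash, the `GEAR` table, the `cdc_chunks` and `chunk_ids` names, and the size parameters are all assumptions chosen for illustration. The point it shows is that chunk boundaries follow the content, so an insert in the middle of a file leaves most chunks byte-identical and reusable.

```python
import hashlib
import random

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
# (Illustrative "gear"-style table; real systems use their own constants.)
_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Split bytes into chunks whose boundaries depend on content, not offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # rolling hash over recent bytes
        size = i - start + 1
        # Cut when the hash hits the pattern (after min_size) or the chunk gets too big.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(chunks):
    """Content hashes that a dedupe-based store could use as chunk keys."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


data_rng = random.Random(0)
original = bytes(data_rng.getrandbits(8) for _ in range(4096))
edited = original[:2000] + b"INSERTED ROWS" + original[2000:]  # insert in the middle

old_chunks = cdc_chunks(original)
new_chunks = cdc_chunks(edited)
reused = chunk_ids(old_chunks) & chunk_ids(new_chunks)
print(f"{len(reused)} of {len(new_chunks)} chunks in the edited file already exist")
```

With fixed-size chunks, the same insert would shift every later boundary and nothing after the edit would deduplicate, which is why the boundary rule is tied to content here.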

submitted by /u/qlhoest

Looking For LFM‑2b Or LFM‑1b Last.fm Listening Dataset (No Longer Available)

I’m a researcher working on model-agnostic meta-learning (MAML) for personalized music recommendation. I urgently need access to either the LFM‑2b or LFM‑1b dataset, which used to be hosted by JKU Linz but has since been removed due to licensing constraints.

I’ve already checked Kaggle, GitHub, Zenodo, and official sources; no mirrors exist.

If anyone has a copy and is willing to share (for research use only), please DM me or point me to a working archive/mirror.
Alternatively, any help with locating subsets or working alternatives would also be appreciated.

Thanks in advance.

submitted by /u/hugeballssmolpp

Where Do You Usually Get High-quality Web Data For Scraping Projects?

I’ve been working on a few projects recently where I needed structured data from e-commerce and social media sites (like prices, product descriptions, user reviews, etc.). I used to rely on my own scrapers with BeautifulSoup or Scrapy, but as you know, many sites now have rate-limiting, bot detection, or constantly changing layouts.

Lately, I’ve experimented with Bright Data to access web data from different regions/IPs — mostly for testing, not large-scale production. It worked surprisingly well, but I’m curious:

🔹 What sources or services are you all using when you need consistent or hard-to-access datasets from the web?

🔹 Any experiences with open APIs, rotating proxies, or maybe even public datasets that saved you a ton of work?

Would love to hear your approach, especially for projects where the public datasets don’t quite cut it.

submitted by /u/ysn_annaimi

Panicking And Need Help Finding Data Sets

Finishing a data visualization class and I need to find two separate, but related data sets. One has to have at least 300 records and 4 fields, the other has to have 100 records and 3 fields. I have to show something happening over time, and a geographical component. I’ve been searching for hours and am obviously not creative enough. Any help is deeply appreciated.

submitted by /u/ConclusionOld5538

Helping You Get Export Import DATA Customer/buyer Direct Leads , The Choice Of Your HSN Code Or Product Name [PAID]

I deal in import-export data and have direct sources with customs, allowing me to provide accurate and verified data based on your specific needs.

You can get a sample dataset, based on your product or HSN code. This will help you understand what kind of information you’ll receive. If it’s beneficial, I can then share the complete data as per your requirement—whether it’s for a particular company, product, or all exports/imports to specific countries.

This data is usually expensive due to its value, but I offer it at negotiable prices based on the number of rows your HSN code fetches in a given month.

If you want a clearer picture, feel free to DM me. I can also search for specific companies: who they exported to, in what quantity, to which countries, and for what amount.

Let me know how you’d like to proceed. Let’s grow our businesses together.

I pay huge yearly fees to get import-export data for my own company, and I thought I could recover a small part of that by helping others get the same service. It’s a win-win.

submitted by /u/Outside_Eagle_5527

How Do I Structure My Dataset To Train My Model To Generate Questions?

I am trying to train a T5 model to generate Data Structure questions, but I am not sure whether the data I scraped is correctly formatted. I’ve trained it without context, and it’s generating questions that are barebones, not properly formatted, or that don’t make sense. What do I need to do to fix this problem?

I’m training my model with this code:

import json

import numpy as np
from datasets import Dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
)


def main():
    global tokenizer

    with open('./datasets/final.json', 'r', encoding='utf-8') as f:
        data = json.load(f)

    dataset = Dataset.from_list(data)
    dataset = dataset.train_test_split(test_size=0.1)

    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

    tokenized = dataset.map(tokenize, batched=True)
    tokenized_train = tokenized["train"].shuffle(seed=42)
    tokenized_eval = tokenized["test"].shuffle(seed=42)

    training_args = Seq2SeqTrainingArguments(
        output_dir="./outputs_T5",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=10,
        save_strategy="epoch",
        learning_rate=5e-5,
        predict_with_generate=True,
        logging_dir="./logs_bart",
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    eval_results = trainer.evaluate()
    print(eval_results)


def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    # -100 marks ignored positions and cannot be decoded; swap in the pad token first.
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact_matches = sum(p.strip() == l.strip() for p, l in zip(decoded_preds, decoded_labels))
    return {"accuracy": exact_matches / len(decoded_preds)}


def tokenize(examples):
    model_inputs = tokenizer(examples["input_text"], max_length=128, truncation=True, padding="max_length")
    # text_target tokenizes the labels; as_target_tokenizer() is deprecated.
    labels = tokenizer(text_target=examples["target_text"], max_length=128, truncation=True, padding="max_length")
    # Replace padding ids with -100 so padded positions are ignored by the loss.
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs


if __name__ == "__main__":
    main()

and here’s what my dataset currently looks like:

{
  "input_text": "Topic: Graph\nDifficulty: Easy\nContext: The kth power of an adjacency matrix gives the number of paths of length k between any two vertices in a graph. Each entry A^k[i][j] equals the number of such paths from i to j.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.",
  "target_text": "Let A be an adjacency matrix of a graph G. The ijth entry in the matrix AK , gives, , Choices: ['A\nThe number of paths of length K from vertex Vi to vertex \n Vj.', 'B\nShortest path of K edges from vertex Vi to vertex Vj.', 'C\nLength of a Eulerian path from vertex Vi to vertex Vj.', 'D\nLength of a Hamiltonian cycle from vertex Vi to vertex \n Vj.'], Answer: B\nShortest path of K edges from vertex Vi to vertex Vj."
},
{
  "input_text": "Topic: Tree\nDifficulty: Easy\nContext: In an AVL tree, after inserting a node, the balance factor of nodes along the path to the root may need to be updated. This ensures the tree remains balanced.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.",
  "target_text": "To restore the AVL property after inserting a element, we start at the insertion point and move towards root of that tree. is this statement true?\na) true\nb) false\n\n\nAnswer: a"
},
{
  "input_text": "Topic: Tree\nDifficulty: Easy\nContext: AA-Trees and Red-Black Trees are both self-balancing binary search trees. They have similar properties and performance characteristics.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.",
  "target_text": "Which of the following trees is similar to that of an AA-Tree?\na) Splay Tree\nb) B+ Tree\nc) AVL Tree\nd) Red-Black Tree\n\n\nAnswer: d"
},
{
  "input_text": "Topic: Theory\nDifficulty: Easy\nContext: In hashing theory, probe sequences like linear and quadratic probing determine how collisions are resolved. Expression evaluation and conversion also fall under theory topics, such as converting infix to postfix using stacks.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.",
  "target_text": "What would be the Prefix notation for the given equation?\n\na) ^^^ABCD\nb) ^A^B^CD\nc) ABCD^^^\nd) AB^C^D\n\nAnswer: b"
},
{
  "input_text": "Topic: Theory\nDifficulty: Easy\nContext: Linked list manipulations require careful updates of pointers. The given code removes the first node in a circular list and returns its value.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.",
  "target_text": "What is the functionality of the following code? Choose the most appropriate answer.\n\npublic int function() {\n if(head == null) return Integer.MIN_VALUE;\n int var;\n Node temp = head;\n while(temp.getNext() != head) temp = temp.getNext();\n if(temp == head) {\n var = head.getItem();\n head = null;\n return var;\n }\n temp.setNext(head.getNext());\n var = head.getItem();\n head = head.getNext();\n return var;\n}\n\na) Return data from the end of the list\nb) Returns the data and deletes the node at the end of the list\nc) Returns the data from the beginning of the list\nd) Returns the data and deletes the node from the beginning of the list\n\nAnswer: d"
},
{
  "input_text": "Topic: Array\nDifficulty: Easy\nContext: Breadth First Traversal (BFS) is implemented using a queue. This data structure allows level-order traversal in graphs or trees.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.",
  "target_text": "The data structure required for Breadth First Traversal on a graph is?\na) Stack\nb) Array\nc) Queue\nd) Tree\n\n\nAnswer: c"
},
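One thing visible in the sample records is that the targets do not share a single layout: the first one embeds a `Choices: [...]` list with upper-case labels, while the others use lettered `a)` to `d)` lines followed by `Answer: <letter>`. A possible first step, before retraining, is to check every record against one template. The sketch below assumes the lettered layout is the one to keep; `check_target`, the two regular expressions, and the shortened sample strings are illustrative and not part of the original post.

```python
import re

# Assumed template: question text, then lines "a) ...", "b) ...", and a final "Answer: <letter>".
OPTION_RE = re.compile(r"^([a-d])\)[ \t]*(.+)$", re.MULTILINE)
ANSWER_RE = re.compile(r"^Answer:\s*([a-d])\s*$", re.MULTILINE)


def check_target(target_text):
    """Return a list of formatting problems found in one target_text string."""
    problems = []
    labels = [m.group(1) for m in OPTION_RE.finditer(target_text)]
    if labels not in (["a", "b"], ["a", "b", "c", "d"]):
        problems.append(f"unexpected option labels: {labels}")
    answers = ANSWER_RE.findall(target_text)
    if len(answers) != 1:
        problems.append("expected exactly one 'Answer: <letter>' line")
    elif answers[0] not in labels:
        problems.append(f"answer '{answers[0]}' is not one of the options")
    return problems


lettered = (
    "Which of the following trees is similar to that of an AA-Tree?\n"
    "a) Splay Tree\nb) B+ Tree\nc) AVL Tree\nd) Red-Black Tree\n\n\nAnswer: d"
)
choices_list = (
    "Let A be an adjacency matrix of a graph G. The ijth entry in the matrix AK , gives, , "
    "Choices: ['A\nThe number of paths of length K', 'B\nShortest path of K edges'], "
    "Answer: B\nShortest path of K edges from vertex Vi to vertex Vj."
)

for name, text in [("lettered", lettered), ("choices_list", choices_list)]:
    print(name, check_target(text))
```

Records that fail the check could be rewritten into the chosen layout or dropped, so the model sees only one output format during training.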

submitted by /u/Loud-Dream-975

Tool To Get Customer Review And Comment Data

Not sure if this is the right sub to ask, but we’re going for it anyways

I’m looking for a tool that can get us customer review and comment data from e-commerce sites (Amazon, walmart.com, etc.), third-party review sites like Trustpilot, and social media sources. We want it loaded into a Snowflake data warehouse or an Azure Blob container for Snowflake ingestion.

Let me know what you have, like, don’t like… I’m starting from scratch

submitted by /u/Apprehensive-Ad-80

Help Needed To Find A Dataset Example Comprising Of At Least 1000 Rows And At Least 5 Columns Which Contain Both Categorical (at Least 2) And Numerical (at Least 3) Variables.

Hi, I’m a bit stuck on an assignment where I have to use a dataset comprising at least 1000 rows and at least 5 columns, which contain both categorical (at least 2) and numerical (at least 3) variables. I also have to cite the source. It would be great if you could help me out…

submitted by /u/OkDark1310

[Synthetic] [self-promotion] We Build An Open-source Dataset To Test Spatial Pathfinding And Reasoning Skills In LLMs

Large language models often lack pathfinding and spatial reasoning skills. This has improved with the development of reasoning models, but we are missing the datasets to quantify these skills. Improving LLMs in this domain can be useful for robotics, which often relies on an LLM to create an action plan for a specific task. Therefore, we created the dataset Spatial Pathfinding and Reasoning Challenge (SPaRC), based on the game “The Witness”. The task requires the LLM to create a path from a given start point to an end point on a 2D grid while satisfying specific rules placed on the grid.
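To give a feel for what checking such a path involves, here is a deliberately simplified sketch. It only verifies the basic shape of a path (starts and ends at the right cells, stays on the grid, moves one step at a time, never revisits a cell). The puzzle-specific rules that SPaRC places on the grid are not described in this post and are not modelled; the `is_valid_path` name and the coordinate convention are assumptions.

```python
def is_valid_path(path, start, end, width, height):
    """Check basic path shape on a width x height grid of (x, y) cells."""
    if not path or path[0] != start or path[-1] != end:
        return False
    if len(set(path)) != len(path):  # no cell may be visited twice
        return False
    for x, y in path:
        if not (0 <= x < width and 0 <= y < height):
            return False
    # Each move must go to an orthogonally adjacent cell.
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            return False
    return True


good = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
diagonal = [(0, 0), (1, 1), (2, 2)]
print(is_valid_path(good, (0, 0), (2, 2), 3, 3))
print(is_valid_path(diagonal, (0, 0), (2, 2), 3, 3))
```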

More details, an interactive demonstration and the paper for the dataset can be found under: https://sparc.gipplab.org

In the paper, we compared the capabilities of current SOTA reasoning models with a human baseline:

  • Human baseline: 98% accuracy
  • o4-mini: 15.8% accuracy
  • QwQ 32B: 5.8% accuracy

This shows that there is still a large gap between humans and the capabilities of reasoning models.

Each of these puzzles is assigned a difficulty score from 1 to 5. While humans solve 100% of level 1 puzzles and 94.5% of level 5 puzzles, LLMs struggle much more: o4-mini solves 47.7% of level 1 puzzles, but only 1.1% of level 5 puzzles. Additionally, we found that these models fail to increase their reasoning time proportionally to puzzle difficulty. In some cases, they use less reasoning time, even though the human baseline requires a stark increase in reasoning time.

submitted by /u/Sral248

NLSY97 Data – In NLSY97 I See Weeks Marked “employed” But No Job Record Has Anyone Else Run Into This?

Hi all,

I’m working with NLSY97 and ran into something that’s confusing me. I’ve built employment status spells (based on weekly employment status) and job spells (based on start/end dates and employer IDs), and then merged them to see how things line up.

Most of it looks great. The job spells and employment spells match up really well. But in a few places a person is marked as “employed” for a stretch (from 3 days up to 2-3 weeks) with no corresponding job record: no start date, no end date, no employer ID.
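For anyone who wants to reproduce this kind of check, the comparison can be sketched roughly as follows. The data layout is a stand-in: a dict of week number to status and a list of `(start_week, end_week, employer_id)` tuples are assumptions for illustration, not actual NLSY97 variable names, and `uncovered_employed_weeks` is a hypothetical helper.

```python
def uncovered_employed_weeks(weekly_status, job_spells):
    """Group weeks marked employed that fall outside every job spell into runs."""
    covered = set()
    for start, end, _employer_id in job_spells:
        covered.update(range(start, end + 1))  # spells are inclusive of both ends
    gaps = sorted(
        week for week, status in weekly_status.items()
        if status == "employed" and week not in covered
    )
    runs = []
    for week in gaps:
        if runs and week == runs[-1][1] + 1:
            runs[-1][1] = week  # extend the current run of consecutive weeks
        else:
            runs.append([week, week])
    return [tuple(run) for run in runs]


weekly_status = {1: "employed", 2: "employed", 3: "employed",
                 4: "employed", 5: "not employed", 6: "employed"}
job_spells = [(1, 2, "E1"), (6, 6, "E2")]
print(uncovered_employed_weeks(weekly_status, job_spells))
```

Listing the uncovered runs with their lengths makes it easier to see whether they cluster around interview dates or spell boundaries.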

Is this normal in NLSY97? Could it have something to do with how the interviews were conducted, like status being carried over between interviews, or data being lagged?

I’ve checked my code and raw event files, and it doesn’t seem like I’m dropping rows or mismatching things. The issue only shows up occasionally, which makes me wonder if it’s just part of how the data is structured rather than an error on my end.

If anyone has seen this or knows how to handle it, I’d really appreciate your thoughts. I’m happy to share code snippets if that helps.

Thanks so much in advance!

submitted by /u/Exciting-Skin3341

Looking For Uncommon / Niche Time Series Datasets (Updated Daily & Free)

Hi everyone,

I’m starting a side project where I compile and transform time series data from different sources. I’m looking for interesting datasets or APIs with the following characteristics:

  • Must be downloadable (e.g., via cronjob or script-friendly API)
  • Updated at least daily
  • Includes historical data
  • Free to use
  • Not crypto or stock trading-related
  • Related to human activity (directly or indirectly)
  • The more niche or unusual, the better!

Here’s an example of something I really liked:
🔗 Queue Times API — it provides live and historical queue times for theme parks.

Some ideas I had (but haven’t found sources for yet):

  • Number of Amazon orders per day
  • Electricity consumption by city or country
  • Cars in a specific parking lot
  • Foot traffic in a shopping mall

Basically, I’m after uncommon but fun time series datasets—things you wouldn’t usually see in mainstream data science projects.

Any suggestions, links, or ideas to explore would be hugely appreciated. Thanks!

submitted by /u/JdeHK45

My Dream Project Is Finally Live: An Open-source AI Voice Agent Framework.

Hey community,

I’m Sagar, co-founder of VideoSDK.

I’ve been working in real-time communication for years, building the infrastructure that powers live voice and video across thousands of applications. But now, as developers push models to communicate in real-time, a new layer of complexity is emerging.

Today, voice is becoming the new UI. We expect agents to feel human, to understand us, respond instantly, and work seamlessly across web, mobile, and even telephony. But developers have been forced to stitch together fragile stacks: STT here, LLM there, TTS somewhere else… glued with HTTP endpoints and prayer.

So we built something to solve that.

Today, we’re open-sourcing our AI Voice Agent framework, a real-time infrastructure layer built specifically for voice agents. It’s production-grade, developer-friendly, and designed to abstract away the painful parts of building real-time, AI-powered conversations.

We are live on Product Hunt today and would be incredibly grateful for your feedback and support.

Product Hunt Link: https://www.producthunt.com/products/video-sdk/launches/voice-agent-sdk

Here’s what it offers:

  • Build agents in just 10 lines of code
  • Plug in any models you like – OpenAI, ElevenLabs, Deepgram, and others
  • Built-in voice activity detection and turn-taking
  • Session-level observability for debugging and monitoring
  • Global infrastructure that scales out of the box
  • Works across platforms: web, mobile, IoT, and even Unity
  • Option to deploy on VideoSDK Cloud, fully optimized for low cost and performance
  • And most importantly, it’s 100% open source

We didn’t want to create another black box. We wanted to give developers a transparent, extensible foundation they can rely on and build on top of.

Here is the GitHub repo: https://github.com/videosdk-live/agents
(Please do star the repo to help it reach others as well)

This is the first of several launches we’ve lined up for the week.

I’ll be around all day and would love to hear your feedback, questions, or what you’re building next.

Thanks for being here,

Sagar

submitted by /u/videosdk_live

Just Started Learning Data Analysis. It’s Tough, But I’m Enjoying It So Far.

Hey everyone, I recently started learning data analysis. Right now I’m going through Excel, SQL, and Python (Pandas is confusing but interesting).

I come from a non-tech background, so everything feels new. Some days are frustrating, but I’m slowly getting the hang of it.

If anyone here has tips for beginners or good free resources, I’d really appreciate it. Also, if you’ve switched careers into data — how was your journey?

Thanks in advance

submitted by /u/ManufacturerFar2134