Natural Language Translation Dataset In A Specified Domain

Is a natural language translation dataset from ENG to another language in a very specific domain worthwhile to curate for conference submission?

I am a part-time translator working in this specific domain who is originally a student wondering if this could be a potential submission. I have quite several peers who are willing to put in the effort to curate a decent sized dataset (~2k) translated scripts for research use for conference submission.

However, I am not quite confident as to how useful or meaningful of a contribution this will be to the community.

submitted by /u/AdGlittering3010
[link] [comments]

0

I Need Datasets For An Academic Project About Housing , Renting And Buying

Hello everyone,
I’m an engineering student currently taking a course called Applied Machine Learning. As part of the course, I need to develop a web application that demonstrates key machine learning concepts such as segregation and classification. I’m looking for datasets related to housing markets or middle-class neighborhoods. Additionally, I’d appreciate any review-based datasets, as I plan to incorporate NLP into my project.
Thank you in advance!

submitted by /u/mendaX20
[link] [comments]

0

Does Anybody Have Car-1000 Dataset For FGVC Task?

I’m currently working on a car classification project for a university-level neural network course. The Car-1000 dataset is the ideal candidate for our fine-grained visual categorization task.

The official paper cites a GitHub repository for the dataset’s release (toggle1995/Car-1000), but unfortunately, the repository appears to contain only the README.md and no actual data files.

Has anyone successfully downloaded or archived the full Car-1000 image dataset (140,312 images across 1,000 models)? If so, I would be very grateful if you could share a link or guide me to an alternative download source.

Any help with this academic project is highly appreciated! Thank you.

submitted by /u/Porsche_Lover2002
[link] [comments]

0

Dataset About Diplomatic Visits By Chinese Leaders

I created a dataset for a research project to get data about the diplomatic visits by Chinese leaders form 1950 to 2025.

submitted by /u/janethelame_
[link] [comments]

0

Need A Dataset Of Videos Or Images Of Swifts Feeding And Not Feeding From Birdbox Cams

Hi guys,

Doing a bit of research here for school but i really need a dataset of images/videos of swifts in their nests/birdboxes getting fed or not fed, or just videos from birdbox cams of swifts in general. Not really that urgent but any help is appreciated.

Thanks

submitted by /u/Horror-Tower2571
[link] [comments]

0

Where Can I Find Reliable, Up-to-date U.S. Businesses Data?

Looking out for a free/open source/publicly available data for US businesses data for my project.

The project is a weather engine, connecting affected customers to nearby prospects.

submitted by /u/BrilliantSea8202
[link] [comments]

0

I Need Two Datasets, Each >100mb That I Can Draw Correlations From

Any ideas =(

Everything i’ve liked has been under a 100mb so far.

submitted by /u/TokkiJK
[link] [comments]

0

Leading Websites Homepage Images Dataset – Constantly Expanding

A little bird from mangoblogger.com told me that all the images from world’s leading website homepages can be found here – http://cdn.mangoblogger.com

Maybe good for training models or running experiments. Not sure how long this will be public but users of mangoblogger.com can always access this. The dataset drills down from the top level domains to individual websites.

submitted by /u/Pristine-Arachnid-41
[link] [comments]

0

Japanese Language Difficulty Dataset

https://huggingface.co/datasets/ronantakizawa/japanese-text-difficulty

This dataset gathered texts from Aozora Bunko (A corpus of Japanese texts) and marked them with jReadability scores, plus detailed metrics on kanji density, vocabulary, grammar, and sentence structure.

This is an excellent dataset if you want to train your LLM to understand the complexities of the Japanese language 👍

submitted by /u/Ok_Employee_6418
[link] [comments]

0

Looking For [PAID] Large-scale B2B Or Firmographic Dataset For Behavioral Research

Hi everyone, I’m conducting a research project on business behavior patterns and looking for recommendations on legally licensed, large-scale firmographic or B2B datasets.

Purpose: strictly for data analysis and AI behavioral modeling and not for marketing, lead generation, or outreach.

What I’m looking for:

Basic business contact structure (first name, last name, job title, company name)
Optional firmographics like industry, company size, or revenue range
Ideally, a dataset with millions of records from a verified or commercial source

Requirements:

Must be legally licensed or open for research use
GDPR/CCPA compliant or anonymized
I’m open to [PAID] licensed vendors or public/open datasets

If anyone has experience with trusted data providers or knows of reputable sources that can deliver at this scale, I’d really appreciate your suggestions.

Mods: this post does not request PII, only guidance on compliant data sources. Happy to adjust wording if needed.

submitted by /u/Axiata244
[link] [comments]

0

[self-promotion] Every Number On The Internet, Structured And Queryable.

Hi, datasets!

Want to know France’s GDP growth? You’re checking Eurostat, World Bank, OECD… then wrestling with CSVs, different formats, inconsistent naming. It’s 2025, and we’re still doing this manually.

qoery.com makes every time-series statistic queryable in plain English or SQL. Just ask “What’s the GDP growth rate for France?” and get structured data back instantly:

... "id": "14256", "entity": { "id": "france", "name": "France" }, "metric": { "id": "gdp_growth_rate", "name": "GDP change percent" }, ... "observations": [ { "timestamp": "1993-12-31T00:00:00+00:00", "value": "1670080000000.0000000000" }, { "timestamp": "1994-12-31T00:00:00+00:00", "value": "1709890000000.0000000000" }, { "timestamp": "1995-12-31T00:00:00+00:00", "value": "1749300000000.0000000000" }, ...

We’ve indexed 50M observations across 1.2M series from ~10,000 sources, including the World Bank, Our World in Data, and more.

Right now we’re focused on economic/demographic data, but I’m curious:
– What statistics do YOU constantly need but struggle to access?

We have a free tier (250 queries/month) so you can try it today. Would love your feedback on what data sources to prioritize next!

submitted by /u/SammieStyles
[link] [comments]

0

Leetcode Python Solutions Code Dataset

submitted by /u/big_hole_energy
[link] [comments]

0

Leetcode Solutions In Python Dataset

submitted by /u/big_hole_energy
[link] [comments]

0

Best Sites To Get Free And Copyright-free Images Per Category (e.g. Dog Breeds, Instruments, Etc)?

Hey folks! 👋

I’m looking for good websites where I can find free, copyright-free (or Creative Commons) images that are already organized or easy to browse by category — for example: • Dog breeds 🐶 • Musical instruments 🎸 • Football teams ⚽️ • Landmarks, foods, etc.

Basically, something I could use for an educational or guessing-style game project. I’ve checked Unsplash and Pexels, but they’re quite general — not very structured by category.

Any recommendations for sites or archives that have structured collections or datasets of free images? They should be easy to scrap or download.

Bonus points if they allow attribution-free use or have clear licensing info.

I have found something but usually they ask to pay a subscription.

Thanks in advance! 🙌

submitted by /u/Vanals
[link] [comments]

0

May I Ask Where I Can Find The Network Datasets In The Thesis?

Recently, I have been reading papers on social networks, in which some social network datasets were used for experiments（Email、NetScience、Facebook、Wiki-Vote、PGP、NetHEPT、CondMat、NetPHY）. I couldn’t find several of these network data on the Stanford nasp or the networkrepository website, such as NetHEPT, NetPHY, and CondMat. May I ask where I can find these social network data?

submitted by /u/Remarkable-Scale2170
[link] [comments]

0

Looking For Food Images Dataset For Ai

submitted by /u/lasxavier
[link] [comments]

0

Any Affordable API That Actually Gives Flight Data Like Terminals, Gates, And Real-time Departure Or Arrival Info?

Hey Guys, I’m building a small dashboard that shows live flight information, and I really need terminal and gate data for each flight.

Does anyone know of an API that actually provides that kind of airport-level detail? I’m looking for an affordable but reliable option.

submitted by /u/Glum_Buyer_9777
[link] [comments]

0

I Scraped Thousands Of Guitar Gear Sales And Turned It Into Monthly CSV Packs (indie Data Project)

Hey folks 👋,
I’ve been working on a side project where I collect sales data for music gear and package it into clean CSV datasets. The idea is to help musicians, collectors, and resellers spot trends — like which guitars/pedals are moving fastest, average used vs new prices, etc.

I’m putting them up as monthly “data packs” — each one’s thousands of real-world listings, cleaned and formatted. They cover new/used guitars, pedals, and more.

If you’re curious, you can check them out here:
👉 Automaton Labs on Etsy

Would love feedback on what you’d find most useful (specific brands? types of gear? pricing breakdowns?).

submitted by /u/KaleidoscopeSafe747
[link] [comments]

0

I Built A Claude MCP That Lets You Query Real Behavioral Data

(self promotion disclaimer, but I truly believe the dataset is cool!)

I just built an MCP server you can connect to Claude that turns it into a real-time market research assistant.

Instead of AI making things up, it uses actual behavioral data collected from our live panel. so you can ask questions like:

What are Gen Z watching on YouTube right now?

Which cosmetics brands are trending in the past week?

What do people who read The New York Times also buy online?

How to try it (takes <1 min): 1. Add the MCP to Claude — instructions here → https://docs.generationlab.org/getting-started/quickstart 2. Ask Claude any behavioral question.

Example output: https://claude.ai/public/artifacts/2c121317-0286-40cb-97be-e883ceda4b2e

It’s free! I’d love your feedback or cool examples of what you discover.

submitted by /u/AdTemporary2475
[link] [comments]

0

Vogue Or Other Datasets With The Magazine Covers

Hi everyone,

I wanted to ask here if anyone knows whether there is a dataset with vogue covers or other magazine covers. This is because I have a university exam about Artificial Intelligence for Multimedia and I have to create a model on Google Colab and train it on a dataset and I thought about making a Vogue Cover generator.

I already saw that the archive does not provide APIs or anything useful for AI training and development

Thank you so much in advance for your replies 😀

submitted by /u/HauteGina
[link] [comments]

0

Skip Kaggle Hunting. Free And Open Source AI Data Generator

We built this AI data generator for our own demos, then realized everyone needed it.

So here it is, free and hosted: realistic business datasets from simple dropdowns. No account required, unlimited exports. Perfect for testing, prototyping, or when Kaggle feels stale.

Open source repo included if you want to hack on it.

O

submitted by /u/Ramirond
[link] [comments]

0

Are You Using Synthetic Data From ML/LLM To Enrich Your Datasets ?

Hey, I recently started working with ML and needed to expand my dataset. I was wondering how common it is to use synthetic data.

Also, I noticed some companies use external services like Gretel or Mostly (for CTGAN/TVAE), but why not run models locally? Is it a cost thing, convenience, or something else?

Curious to hear about your experiences.

submitted by /u/Tall_Insect7119
[link] [comments]

0

Hear AI Papers, A Podcast That Summarise AI Papers

https://open.spotify.com/show/33HniLxQd1QdYzSdwFQs2u?si=F4Qp5K-7QxiTrIrHn6T5MA

submitted by /u/ayoubelma
[link] [comments]

0

Collecting News Headlines From The Last 2 Years

Hey Everyone,

So we are working on our Masters Thesis and need to collect the data of News Headlines in the Scandinavian market. More precisely: Newsheadlines from Norway, Denmark, and Sweden. We have never tried webscraping before but we are positive on taking on a challenge. Does anyone know the easiest way to gather this data? Is it possible to find it online, without doing our own webscraping?

submitted by /u/hiddenman12345
[link] [comments]

0

[Research] [Question] & [Carreer] Is There A Good Source For The Average NFL Ticket Prices Of All Teams Since 2015?

I need this data for my thesis, please help

submitted by /u/Flaky-Ad-234
[link] [comments]

0

Offering Free Jobs Dataset Covering Thousands Of Companies, 1 Million+ Active/expired Job Postings Over Last 1 Year

Hi all, I run a job search engine (Meterwork) that I built from the ground up and over the last year I’ve scraped jobs data almost daily directly from the career pages of thousands of companies. My db has well over a million active and expired jobs.

I fee like there’s a lot of potential to create some cool data visualizations so I was wondering if anyone was interested in the data I had. My only request would be to cite my website if you plan on publishing any blog posts or infographics using the data I share.

I’ve tried creating some tools using the data I have (job duration estimator, job openings tracker, salary tool – links in footer of the website) but I think there’s a lot more potential for interesting use of the data.

So if you have any ideas you’d like to use the data for just let me know and I can figure out how to get it to you.

submitted by /u/jjzwork
[link] [comments]

0

Best Approach For Open-Ended VQA: Fine-tuning A VL Model Vs. Using An Agentic Framework (LangChain)?

submitted by /u/Fit-Musician-8969
[link] [comments]

0

Title: Steam Dataset 2025 – 263K Games With Multi-modal Database Architecture (PostgreSQL + Pgvector)

I’ve been working on a modernized Steam dataset that goes beyond the typical CSV dump approach. My third data science project, and my first serious one that I’ve published on Zenodo. I’m a systems engineer, so I take a bit of a different approach and have extensive documentation.

Would love a star on the repo if you’re so inclined or get use from it! https://github.com/vintagedon/steam-dataset-2025

After collecting data on 263,890 applications from Steam’s official API (including games, DLC, software, and tools), I built a multi-modal database system designed for actual data science workflows. Both as an exercise, a way to ‘show my work’ and also to prep for my own paper on the dataset.

What makes this different:

Architecture-first approach: Instead of flat CSV files, this uses PostgreSQL 16 for normalized relational data, Neo4j for publisher/developer relationship graphs, and pgvector for semantic search on game descriptions. The goal was to make it analytically-native from the start.

Comprehensive coverage: 263K applications compared to the 27K in the popular 2019 Kaggle dataset. Includes rich HTML descriptions with embedded media, international pricing, detailed metadata, and Steam’s full application catalog as of January 2025.

Semantic search ready: Game descriptions are vectorized using sentence-transformers, enabling queries like “find games similar to Baldur’s Gate 3” based on actual content similarity rather than just tags.

Use cases: – NLP projects using game descriptions (avg 270 words) – Price prediction models with international market data – Semantic search and recommendation systems – Time-series analysis of gaming trends

Data quality notes: – ~56% API success rate (Steam delists games, regional restrictions, content type diversity) – Conservative rate limiting (1.5s delays) for sustainable collection – All data from official Steam Web API only (no third-party scrapers) – Comprehensive error handling and retry logic

The dataset is fully documented with setup guides, analysis examples, and architectural decision rationale. Built using Python 3.12+, all collection and processing code included.

Repository: https://github.com/vintagedon/steam-dataset-2025

Zenodo Release: https://zenodo.org/records/17266923

Quick stats: – 263,890 total applications – ~150K successful detailed records – International pricing across 40+ currencies – 50+ metadata fields per game – Vector embeddings for 100K+ descriptions

This is an active project – still refining collection strategies and adding analytical examples. Open to feedback on what analysis would be most useful to include.

Technical stack: Python, PostgreSQL 16, Neo4j, pgvector, sentence-transformers, official Steam Web API

submitted by /u/vintagedon
[link] [comments]

0

Here’s A Relational DB Of All Space Biology Papers Since 2010 (with Author Links, Text & More)

I just compiled every space biology publication from 2010–2025 into a clean SQLite dataset (with full text, authors, and author–publication links). 📂 Download the dataset on Kaggle 💻 See the code on GitHub

Here are some highlights 👇

🔬 Top 5 Most Prolific Authors

Name	Publications
Kasthuri Venkateswaran	54
Christopher E Mason	49
Afshin Beheshti	29
Sylvain V Costes	29
Nitin K Singh	24

👉 Kasthuri Venkateswaran and Christopher Mason are by far the most prolific contributors to space biology in the last 15 years.

👥 Top 5 Publications with the Most Authors

Title	Author Count
The Space Omics and Medical Atlas (SOMA) and international consortium to advance space biology	109
Cosmic kidney disease: an integrated pan-omic, multi-organ, and multi-species view	105
Molecular and physiologic changes in the Spaceflight-Associated Neuro-ocular Syndrome	59
Single-cell multi-ome and immune profiles of the International Space Station crew	50
NASA GeneLab RNA-Seq Consensus Pipeline: Standardization for spaceflight biology	45

👉 The SOMA paper had 109 authors, a clear example of how massive collaborations in space biology research have become.

📈 Publications per Year

Year	Publications
2010	9
2011	16
2012	13
2013	20
2014	30
2015	35
2016	28
2017	36
2018	43
2019	33
2020	57
2021	56
2022	56
2023	51
2024	66
2025	23

👉 Notice the surge after 2020, likely tied to Artemis missions, renewed ISS research, and a broader push in space health.

Disclaimer: This dataset was authored by me. Feedback is very welcome! 📂 Dataset on Kaggle 💻 Code on GitHub

submitted by /u/union4breakfast
[link] [comments]

0

Open-source Bluesky Social Activity Monitoring Pipeline!

The AT Protocol from 🦋 Bluesky Social is an open-source networking paradigm made for social app builders. More information here: https://docs.bsky.app/docs/advanced-guides/atproto

The OSS community has shipped a great 🐍 Python SDK with a data firehose endpoint, documented here: https://atproto.blue/en/latest/atproto_firehose/index.html

🧠 MOSTLY AI users can now access this streaming endpoint whilst chatting with the MOSTLY AI Assistant!Check out the public dataset here: https://app.mostly.ai/d/datasets/9e915b64-93fe-48c9-9e5c-636dea5b377e

This is a great tool to monitor and analyze social media and track virality trends as they are happening!

Check out the analysis the Assistant built for me here: https://app.mostly.ai/public/artifacts/c3eb4794-9de4-4794-8a85-b3f2ab717a13

Disclosure: MOSTLY AI Affiliate

submitted by /u/SyllabubNo626
[link] [comments]

0

Category: Datatards

Natural Language Translation Dataset In A Specified Domain

I Need Datasets For An Academic Project About Housing , Renting And Buying

Does Anybody Have Car-1000 Dataset For FGVC Task?

Dataset About Diplomatic Visits By Chinese Leaders

Need A Dataset Of Videos Or Images Of Swifts Feeding And Not Feeding From Birdbox Cams

Where Can I Find Reliable, Up-to-date U.S. Businesses Data?

I Need Two Datasets, Each >100mb That I Can Draw Correlations From

Leading Websites Homepage Images Dataset – Constantly Expanding

Japanese Language Difficulty Dataset

Looking For [PAID] Large-scale B2B Or Firmographic Dataset For Behavioral Research

[self-promotion] Every Number On The Internet, Structured And Queryable.

Leetcode Python Solutions Code Dataset

Leetcode Solutions In Python Dataset

Best Sites To Get Free And Copyright-free Images Per Category (e.g. Dog Breeds, Instruments, Etc)?

May I Ask Where I Can Find The Network Datasets In The Thesis?

Looking For Food Images Dataset For Ai

Any Affordable API That Actually Gives Flight Data Like Terminals, Gates, And Real-time Departure Or Arrival Info?

I Scraped Thousands Of Guitar Gear Sales And Turned It Into Monthly CSV Packs (indie Data Project)

I Built A Claude MCP That Lets You Query Real Behavioral Data

Vogue Or Other Datasets With The Magazine Covers

Skip Kaggle Hunting. Free And Open Source AI Data Generator

Are You Using Synthetic Data From ML/LLM To Enrich Your Datasets ?

Hear AI Papers, A Podcast That Summarise AI Papers

Collecting News Headlines From The Last 2 Years

[Research] [Question] & [Carreer] Is There A Good Source For The Average NFL Ticket Prices Of All Teams Since 2015?

Offering Free Jobs Dataset Covering Thousands Of Companies, 1 Million+ Active/expired Job Postings Over Last 1 Year

Best Approach For Open-Ended VQA: Fine-tuning A VL Model Vs. Using An Agentic Framework (LangChain)?

Title: Steam Dataset 2025 – 263K Games With Multi-modal Database Architecture (PostgreSQL + Pgvector)

Here’s A Relational DB Of All Space Biology Papers Since 2010 (with Author Links, Text & More)

Here are some highlights 👇

🔬 Top 5 Most Prolific Authors

👉 Kasthuri Venkateswaran and Christopher Mason are by far the most prolific contributors to space biology in the last 15 years.

👥 Top 5 Publications with the Most Authors

👉 The SOMA paper had 109 authors, a clear example of how massive collaborations in space biology research have become.

📈 Publications per Year

👉 Notice the surge after 2020, likely tied to Artemis missions, renewed ISS research, and a broader push in space health.

Open-source Bluesky Social Activity Monitoring Pipeline!

Recent Posts

Recent Comments

18+ Content

Here are some highlights 👇

🔬 Top 5 Most Prolific Authors

👉 Kasthuri Venkateswaran and Christopher Mason are by far the most prolific contributors to space biology in the last 15 years.

👥 Top 5 Publications with the Most Authors

👉 The SOMA paper had 109 authors, a clear example of how massive collaborations in space biology research have become.

📈 Publications per Year

👉 Notice the surge after 2020, likely tied to Artemis missions, renewed ISS research, and a broader push in space health.

Recent Posts

Recent Comments