I have never tried to train an AI model before. I need some datasets of car sounds and images, both damaged and in good condition. This is for a personal project. Also, any advice on how to approach this field 😅?
submitted by /u/soojobless
[link] [comments]
Hi, everyone. What is a good free tool for dataset analysis?
submitted by /u/CodeStackDev
[link] [comments]
Finishing a data visualization class, I need to find two separate but related datasets. One has to have at least 300 records and 4 fields; the other has to have 100 records and 3 fields. I have to show something happening over time, along with a geographical component. I’ve been searching for hours and am obviously not creative enough. Any help is deeply appreciated.
submitted by /u/ConclusionOld5538
[link] [comments]
Preferably categorized by level of sleep debt or number of hours slept.
I would appreciate it, as I have not been able to find any that are publicly available.
I am not looking for fatigue-detection datasets, which are mainly what I have found.
Thanks so much!
submitted by /u/One_Tonight9726
[link] [comments]
I deal in import-export data and have direct sources with customs, allowing me to provide accurate and verified data based on your specific needs.
You can get a sample dataset, based on your product or HSN code. This will help you understand what kind of information you’ll receive. If it’s beneficial, I can then share the complete data as per your requirement—whether it’s for a particular company, product, or all exports/imports to specific countries.
This data is usually expensive due to its value, but I offer it at negotiable prices based on the number of rows your HSN code fetches in a given month.
If you want a clearer picture, feel free to DM. I can also search specific companies: who they exported to, in what quantities, and to which countries.
Let me know how you’d like to proceed; let’s grow our business together.
I pay huge yearly fees for the import-export data for my own company and thought I could recover a small part of that by helping others, and provide the service in a win-win.
submitted by /u/Outside_Eagle_5527
[link] [comments]
I am trying to train a T5 model to learn and generate data structure questions, but I am not sure whether the data I scraped is formatted correctly. I’ve trained it without context and it’s generating questions that are barebones, not properly formatted, and don’t make sense. What do I need to do to fix this problem?
I’m training my model with this code:
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import Dataset
import json

def main():
    global tokenizer
    with open('./datasets/final.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    dataset = Dataset.from_list(data)
    dataset = dataset.train_test_split(test_size=0.1)

    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

    tokenized = dataset.map(tokenize, batched=True)
    tokenized_train = tokenized["train"].shuffle(seed=42)
    tokenized_eval = tokenized["test"].shuffle(seed=42)

    training_args = Seq2SeqTrainingArguments(
        output_dir="./outputs_T5",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=10,
        save_strategy="epoch",
        learning_rate=5e-5,
        predict_with_generate=True,
        logging_dir="./logs_T5",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate())

def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    # Labels are padded with -100 so the loss ignores padding; replace
    # -100 with the pad token id before decoding, or batch_decode fails.
    labels = [[tok if tok != -100 else tokenizer.pad_token_id for tok in seq]
              for seq in labels]
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact_matches = sum(p.strip() == l.strip()
                        for p, l in zip(decoded_preds, decoded_labels))
    return {"accuracy": exact_matches / len(decoded_preds)}

def tokenize(examples):
    model_inputs = tokenizer(examples["input_text"], max_length=128,
                             truncation=True, padding="max_length")
    # text_target replaces the deprecated as_target_tokenizer() context manager.
    labels = tokenizer(text_target=examples["target_text"], max_length=128,
                       truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

if __name__ == "__main__":
    main()
and here’s what my dataset currently looks like:
{ "input_text": "Topic: Graph\nDifficulty: Easy\nContext: The kth power of an adjacency matrix gives the number of paths of length k between any two vertices in a graph. Each entry A^k[i][j] equals the number of such paths from i to j.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.", "target_text": "Let A be an adjacency matrix of a graph G. The ijth entry in the matrix A^K gives, Choices: ['A\nThe number of paths of length K from vertex Vi to vertex Vj.', 'B\nShortest path of K edges from vertex Vi to vertex Vj.', 'C\nLength of a Eulerian path from vertex Vi to vertex Vj.', 'D\nLength of a Hamiltonian cycle from vertex Vi to vertex Vj.'], Answer: B\nShortest path of K edges from vertex Vi to vertex Vj." },
{ "input_text": "Topic: Tree\nDifficulty: Easy\nContext: In an AVL tree, after inserting a node, the balance factor of nodes along the path to the root may need to be updated. This ensures the tree remains balanced.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.", "target_text": "To restore the AVL property after inserting an element, we start at the insertion point and move towards the root of that tree. Is this statement true?\na) true\nb) false\n\n\nAnswer: a" },
{ "input_text": "Topic: Tree\nDifficulty: Easy\nContext: AA-Trees and Red-Black Trees are both self-balancing binary search trees. They have similar properties and performance characteristics.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.", "target_text": "Which of the following trees is similar to that of an AA-Tree?\na) Splay Tree\nb) B+ Tree\nc) AVL Tree\nd) Red-Black Tree\n\n\nAnswer: d" },
{ "input_text": "Topic: Theory\nDifficulty: Easy\nContext: In hashing theory, probe sequences like linear and quadratic probing determine how collisions are resolved. Expression evaluation and conversion also fall under theory topics, such as converting infix to postfix using stacks.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.", "target_text": "What would be the Prefix notation for the given equation?\n\na) ^^^ABCD\nb) ^A^B^CD\nc) ABCD^^^\nd) AB^C^D\n\nAnswer: b" },
{ "input_text": "Topic: Theory\nDifficulty: Easy\nContext: Linked list manipulations require careful updates of pointers. The given code removes the first node in a circular list and returns its value.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.", "target_text": "What is the functionality of the following code? Choose the most appropriate answer.\n\npublic int function() {\n if(head == null) return Integer.MIN_VALUE;\n int var;\n Node temp = head;\n while(temp.getNext() != head) temp = temp.getNext();\n if(temp == head) {\n var = head.getItem();\n head = null;\n return var;\n }\n temp.setNext(head.getNext());\n var = head.getItem();\n head = head.getNext();\n return var;\n}\n\na) Return data from the end of the list\nb) Returns the data and deletes the node at the end of the list\nc) Returns the data from the beginning of the list\nd) Returns the data and deletes the node from the beginning of the list\n\nAnswer: d" },
{ "input_text": "Topic: Array\nDifficulty: Easy\nContext: Breadth First Traversal (BFS) is implemented using a queue. This data structure allows level-order traversal in graphs or trees.\nTask: Generate a multiple-choice question on the given topic and difficulty using the provided context.", "target_text": "The data structure required for Breadth First Traversal on a graph is?\na) Stack\nb) Array\nc) Queue\nd) Tree\n\n\nAnswer: c" },
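A quick machine check of the scraped examples can catch malformed records before training; a minimal sketch (field names match the sample above, but the validity rules are illustrative guesses at what "well-formed" means, not a canonical format):

```python
import json
import re

def validate_example(ex):
    """Return a list of problems found in one {input_text, target_text} record."""
    problems = []
    for field in ("input_text", "target_text"):
        if not ex.get(field, "").strip():
            problems.append(f"missing {field}")
    if "Answer:" not in ex.get("target_text", ""):
        problems.append("no answer key in target_text")
    # Expect either a)...d) options or a Choices: list in MCQ targets.
    if not re.search(r"(a\)|Choices:)", ex.get("target_text", "")):
        problems.append("no answer options found")
    return problems

# In practice: data = json.load(open("./datasets/final.json", encoding="utf-8"))
data = [
    {"input_text": "Topic: Array\nTask: ...",
     "target_text": "BFS uses?\na) Stack\nb) Queue\n\nAnswer: b"},
    {"input_text": "Topic: Tree\nTask: ...",
     "target_text": "Is this true?"},  # malformed: no options, no answer
]
for i, ex in enumerate(data):
    for p in validate_example(ex):
        print(f"example {i}: {p}")
```

Running a pass like this before training makes it much easier to tell whether bad generations come from bad data or from the model.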
submitted by /u/Loud-Dream-975
[link] [comments]
Not sure if this is the right sub to ask, but we’re going for it anyway.
I’m looking for a tool that can get us customer review and comment data from e-commerce sites (Amazon, walmart.com, etc.), third-party review sites like Trustpilot, and social media sources. Looking to have it loaded into a Snowflake data warehouse or an Azure Blob container for Snowflake ingestion.
Let me know what you have, what you like, what you don’t… I’m starting from scratch.
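For context, the landing shape I have in mind is newline-delimited JSON, which Snowflake can load from an Azure stage via COPY INTO into a VARIANT column; a minimal sketch of the kind of records I mean (field names are invented for illustration):

```python
import json

# Hypothetical review records, normalized across sources.
reviews = [
    {"source": "amazon", "rating": 4, "text": "Works as described.",
     "reviewed_at": "2024-05-01"},
    {"source": "trustpilot", "rating": 2, "text": "Slow shipping.",
     "reviewed_at": "2024-05-03"},
]

# One JSON object per line (NDJSON): Snowflake ingests this directly,
# one row per line, without any pre-parsing.
with open("reviews.ndjson", "w", encoding="utf-8") as f:
    for r in reviews:
        f.write(json.dumps(r) + "\n")

print(open("reviews.ndjson").read())
```

Whatever tool does the collection, having it emit this shape keeps the Snowflake side trivial.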
submitted by /u/Apprehensive-Ad-80
[link] [comments]
I am trying to create a books database and need an API that provides chapter data for books. I tried the Open Library and Google Books APIs, but neither of them offers consistent chapter data, it seems to be hit or miss. Is there any reliable source to get this data, especially for nonfiction books? I would appreciate any advice.
submitted by /u/Snorlax_lax
[link] [comments]
I’m looking for a dataset with easy English dialogues for beginner language learning -> basic topics like greetings, shopping, etc.
Any suggestions?
submitted by /u/Reasonable_Set_1615
[link] [comments]
Hi, I’m a bit stuck on an assignment where I have to use a dataset comprising at least 1000 rows and at least 5 columns, containing both categorical (at least 2) and numerical (at least 3) variables. I also have to cite the source. It would be great if you guys could help me out…
submitted by /u/OkDark1310
[link] [comments]
Large language models often lack pathfinding and reasoning capabilities. With the development of reasoning models this has improved, but we are missing datasets to quantify these skills. Improving LLMs in this domain can be useful for robotics, where an LLM often has to create an action plan to solve specific tasks. Therefore, we created the Spatial Pathfinding and Reasoning Challenge (SPaRC) dataset, based on the game “The Witness”. The task requires the LLM to create a path from a given start point to an end point on a 2D grid while satisfying specific rules placed on the grid.
More details, an interactive demonstration and the paper for the dataset can be found under: https://sparc.gipplab.org
In the paper, we compared the capabilities of current SOTA reasoning models with a human baseline:
This shows that there is still a large gap between humans and the capabilities of reasoning models.
Each of these puzzles is assigned a difficulty score from 1 to 5. While humans solve 100% of level 1 puzzles and 94.5% of level 5 puzzles, LLMs struggle much more: o4-mini solves 47.7% of level 1 puzzles, but only 1.1% of level 5 puzzles. Additionally, we found that these models fail to increase their reasoning time proportionally to puzzle difficulty. In some cases, they use less reasoning time, even though the human baseline requires a stark increase in reasoning time.
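As a toy illustration of the constraint checking involved, here is a simplified path validator; this is a sketch only, using cell-based orthogonal movement rather than the full SPaRC rule set:

```python
def is_valid_path(path, width, height, start, end):
    """Check a path on a width x height grid: it must start and end at the
    given cells, stay in bounds, and take one orthogonal step at a time."""
    if not path or path[0] != start or path[-1] != end:
        return False
    for (x, y), (nx, ny) in zip(path, path[1:]):
        if not (0 <= nx < width and 0 <= ny < height):
            return False
        if abs(nx - x) + abs(ny - y) != 1:  # Manhattan distance 1 = one step
            return False
    return True

print(is_valid_path([(0, 0), (0, 1), (1, 1)], 2, 2, (0, 0), (1, 1)))  # True
print(is_valid_path([(0, 0), (1, 1)], 2, 2, (0, 0), (1, 1)))  # diagonal: False
```

The real puzzles add rule symbols on the grid (and paths run along grid lines), but even this reduced check is the kind of verifier that makes the task automatically gradable.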
submitted by /u/Sral248
[link] [comments]
Hi all,
I’m working with NLSY97 and ran into something that’s confusing me. I’ve built employment status spells (based on weekly employment status) and job spells (based on start/end dates and employer IDs), and then merged them to see how things line up.
Most of it looks great. The job spells and employment spells match up really well. But in a few places a person is marked as “employed” for a week, yet there’s no corresponding job record (gaps ranging from 3 days up to 2-3 weeks): no start date, no end date, no employer ID.
Is this normal in NLSY97? Could it have something to do with how the interviews were conducted, like status being carried over between interviews, or data being lagged?
I’ve checked my code and raw event files, and it doesn’t seem like I’m dropping rows or mismatching things. The issue only shows up occasionally, which makes me wonder if it’s just part of how the data is structured rather than an error on my end.
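To illustrate, here is roughly how I flag the mismatched weeks; a simplified pandas sketch with made-up column names (not actual NLSY97 variables):

```python
import pandas as pd

# Hypothetical toy data: weekly employment status and job spells.
status = pd.DataFrame({
    "id": [1] * 6,
    "week": pd.to_datetime(["2001-01-01", "2001-01-08", "2001-01-15",
                            "2001-01-22", "2001-01-29", "2001-02-05"]),
    "employed": [1, 1, 1, 1, 0, 1],
})
jobs = pd.DataFrame({
    "id": [1],
    "start": pd.to_datetime(["2001-01-01"]),
    "end": pd.to_datetime(["2001-01-20"]),
})

# A week is "covered" if any job spell for the same person contains it.
merged = status.merge(jobs, on="id", how="left")
merged["covered"] = (merged["week"] >= merged["start"]) & (merged["week"] <= merged["end"])
covered = merged.groupby(["id", "week"])["covered"].any().reset_index()

# Orphans: weeks marked employed with no job spell covering them.
orphans = status.merge(covered, on=["id", "week"])
orphans = orphans[(orphans["employed"] == 1) & (~orphans["covered"])]
print(orphans[["id", "week"]])
```

In this toy case the weeks of 2001-01-22 and 2001-02-05 come out as orphans, which is exactly the pattern I’m seeing in the real data.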
If anyone has seen this or knows how to handle it, I’d really appreciate your thoughts. I’m happy to share code snippets if that helps.
Thanks so much in advance!
submitted by /u/Exciting-Skin3341
[link] [comments]
Hi everyone,
I’m starting a side project where I compile and transform time series data from different sources. I’m looking for interesting datasets or APIs with the following characteristics:
Here’s an example of something I really liked:
🔗 Queue Times API — it provides live and historical queue times for theme parks.
Some ideas I had (but haven’t found sources for yet):
Basically, I’m after uncommon but fun time series datasets—things you wouldn’t usually see in mainstream data science projects.
Any suggestions, links, or ideas to explore would be hugely appreciated. Thanks!
submitted by /u/JdeHK45
[link] [comments]
Hi, do you know of any datasets containing users’ song histories?
I found one, but it doesn’t include information about which user is listening to which songs—or whether it’s just data from a single user.
submitted by /u/Moistlos
[link] [comments]
I am looking for something like this: given a species, there should be recorded ages of animals belonging to that species.
submitted by /u/Exciting_Point_702
[link] [comments]
I recall that a long time back you could download the Reddit comment dataset; it was huge. I lost my hard drive to gravity a few weeks ago and was hoping someone knew where I could get my hands on another copy?
submitted by /u/CarbonAlpine
[link] [comments]
Hey community,
I’m Sagar, co-founder of VideoSDK.
I’ve been working in real-time communication for years, building the infrastructure that powers live voice and video across thousands of applications. But now, as developers push models to communicate in real-time, a new layer of complexity is emerging.
Today, voice is becoming the new UI. We expect agents to feel human, to understand us, respond instantly, and work seamlessly across web, mobile, and even telephony. But developers have been forced to stitch together fragile stacks: STT here, LLM there, TTS somewhere else… glued with HTTP endpoints and prayer.
So we built something to solve that.
Today, we’re open-sourcing our AI Voice Agent framework, a real-time infrastructure layer built specifically for voice agents. It’s production-grade, developer-friendly, and designed to abstract away the painful parts of building real-time, AI-powered conversations.
We are live on Product Hunt today and would be incredibly grateful for your feedback and support.
Product Hunt Link: https://www.producthunt.com/products/video-sdk/launches/voice-agent-sdk
Most importantly, it’s fully open source. We didn’t want to create another black box. We wanted to give developers a transparent, extensible foundation they can rely on, and build on top of.
Here is the Github Repo: https://github.com/videosdk-live/agents
(Please do star the repo to help it reach others as well)
This is the first of several launches we’ve lined up for the week.
I’ll be around all day, would love to hear your feedback, questions, or what you’re building next.
Thanks for being here,
Sagar
submitted by /u/videosdk_live
[link] [comments]
Hey everyone, I recently started learning data analysis. Right now I’m going through Excel, SQL, and Python (Pandas is confusing but interesting).
I come from a non-tech background, so everything feels new. Some days are frustrating, but I’m slowly getting the hang of it.
If anyone here has tips for beginners or good free resources, I’d really appreciate it. Also, if you’ve switched careers into data — how was your journey?
Thanks in advance
submitted by /u/ManufacturerFar2134
[link] [comments]
I am currently working on an ALPR (Automatic License Plate Recognition) system, but it is made exclusively for UK traffic, as the number plates follow a specific coding system. As I don’t live in the UK, can someone help me obtain the dataset needed for this?
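In the meantime, the current UK coding system itself is easy to validate or generate synthetically; a sketch that covers only post-2001 plates (two area letters, two age-identifier digits, three letters), not older or personalised formats:

```python
import re

# Current-style (post-2001) UK plate, e.g. "AB12 CDE".
# Older formats and personalised plates are intentionally not matched.
UK_PLATE = re.compile(r"^[A-Z]{2}[0-9]{2}\s?[A-Z]{3}$")

def looks_like_uk_plate(text):
    """True if the text matches the current UK plate pattern."""
    return bool(UK_PLATE.match(text.upper().strip()))

print(looks_like_uk_plate("AB12 CDE"))  # True
print(looks_like_uk_plate("1234 XYZ"))  # False
```

A check like this is also handy for filtering noisy OCR output once the recognition stage is running.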
submitted by /u/Moonwolf-
[link] [comments]
Demo video: https://www.reddit.com/r/SideProject/comments/1ltlzk8/tool_built_a_web_crawling_tool_for_public_data/
Our data crawling platform has added Wikipedia integration with advanced filtering, metadata extraction, and bulk export capabilities. Ideal for NLP research, knowledge graph construction, and linguistic analysis.
Each collected article provides comprehensive structured data:
Target: Text classification model for scientific articles
Method: Category-based collection from "Category:Science"
Output: 10,000+ labeled scientific articles
Applications: Domain-specific language models, scientific text analysis
Target: Topic-based representation analysis in encyclopedic content
Method: Systematic document collection from specific subject areas
Output: Structured document sets showing topical perspectives
Applications: Topic modeling, knowledge gap identification
Target: How knowledge representation changes over time
Method: Edit history analysis with systematic sampling
Output: Longitudinal dataset of article evolution
Applications: Knowledge dynamics, collaborative editing patterns
Random Sampling: [Leave empty for unbiased collection]
Topic-Specific: "Machine Learning" or "Climate Change"
Category-Based: "Category:Artificial Intelligence"
URL Processing: Direct Wikipedia URL processing
This Wikipedia dataset crawler enables researchers to create high-quality, well-documented datasets suitable for peer-reviewed research. The combination of systematic collection methods, rich metadata extraction, and flexible export options makes it ideal for:
Ready to build your next research dataset? Start systematic, reproducible, and scalable Wikipedia data collection for serious academic research at pick-post.com.
submitted by /u/PerspectivePutrid665
[link] [comments]
Hi all, I’m working on a data cleaning project and I was wondering if I could get some feedback on this approach.
Step 1: Recommendations are given for the data type of each variable and for which columns are useful. The user must confirm which columns should be analyzed and the type of each variable (numeric, categorical, monetary, dates, etc.).
Step 2: The chatbot gives recommendations on missingness, impossible values (think dates far in the future or homes being priced at $0 or $5), and formatting standardization (think different currencies or similar names such as New York City or NYC). User must confirm changes.
Step 3: User can preview relevant changes through a before and after of summary statistics and graph distributions. All changes are updated in a version history that can be restored.
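To make Step 2 concrete, here is a simplified pandas sketch of rule-based flagging and name standardization; columns, thresholds, and the canonical map are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [250000, 0, 5, 310000],
    "listed": pd.to_datetime(["2023-04-01", "2023-05-10",
                              "2090-01-01", "2023-06-02"]),
    "city": ["New York City", "NYC", "Boston", "new york"],
})

# Impossible values are flagged, not silently dropped, so the user
# can confirm each change (homes priced at $0 or $5, far-future dates).
flags = pd.DataFrame({
    "price_impossible": df["price"] < 1000,
    "date_in_future": df["listed"] > pd.Timestamp.now(),
})
print(flags.any(axis=1))

# Standardization: map name variants onto one canonical label.
canonical = {"nyc": "New York City", "new york": "New York City"}
df["city"] = df["city"].str.strip().apply(
    lambda c: canonical.get(c.lower(), c))
print(df["city"].unique())
```

Deterministic rules like these make a good first pass; the chatbot then only has to explain and confirm them rather than invent them.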
Thank you all for your help!
submitted by /u/Academic_Meaning2439
[link] [comments]
Hey everyone!
A little while ago, I released a conversation dataset on Hugging Face (linked if you’re curious), and to my surprise, it’s become the most downloaded one of its kind on the platform. A lot of people have been using it to train their LLMs, which is exactly what I was hoping for!
Now I’m at a bit of a crossroads — I’d love to keep improving it or even spin off new variations, but I’m not sure what the community actually wants or needs.
So, a couple of questions for you all:
Would really appreciate any input. I want to make stuff that’s genuinely useful to the data community.
submitted by /u/ready_ai
[link] [comments]
We’ve started a Startup Catalyst Program at Future AGI for early-stage AI teams working on things like LLM apps, agents, or RAG systems – basically anyone who’s hit the wall when it comes to evals, observability, or reliability in production.
This program is built for high-velocity AI startups looking to:
The program includes:
It’s free for selected teams – mostly aimed at startups moving fast and building real products. If it sounds relevant for your stack (or someone you know), here’s the link: Apply here: https://futureagi.com/startups
submitted by /u/bubbless__16
[link] [comments]
Hello !
I’m Anjan Boro, a Biomedical Engineer and freelance Imaging‑AI specialist. I’ve curated a 500 GB collection of de‑identified DICOM CT scans—complete with voxel‑accurate, technician‑validated segmentations of mandible, maxilla, teeth, and sinuses.
• Comment below or DM me for sample previews under NDA
• Or email: [anjanbme@gmail.com](mailto:anjanbme@gmail.com)
submitted by /u/B4R069
[link] [comments]
I put together a simple API that lets you access Google Trends data — things like keyword interest over time, trending searches by country, and related topics.
Nothing too fancy. I needed this for a personal project and figured it might be useful to others here working with datasets or trend analysis. It abstracts the scraping and formatting, so you can just query it like any regular API.
It’s live on RapidAPI here (has a free tier): https://rapidapi.com/shake-chillies-shake-chillies-default/api/google-trends-insights
Let me know if you’ve worked on something similar or if you think any specific endpoint would be useful.
submitted by /u/Small-Hope-9388
[link] [comments]