submitted by /u/Fabulous-Rub-7301
Category: Datatards
(Not self-promotion, unless you count open-sourcing a tool as self promotion. This is a free resource, an attempt to make a government service more available, and I don’t make any money from it.)
Howdy folks,
I wanted to share a project I’ve been working on, called Lance-TS.
It’s an opinionated TypeScript client for the U.S. Census Geocoder API, which is a free resource for geocoding U.S.-based addresses based on the TIGER/Line census geospatial database. It has no posted rate limits that I can find, and handles single and batch-address geocoding. Currently it handles address-to-coords, and I’ll implement coordinates-to-geography shortly.
My repo for this tool is attached, and the package can be installed from the npm registry with:
npm i lance-ts
pnpm add lance-ts
yarn add lance-ts
Happy Geocoding! I’ve been working with map data a lot as I build a platform for my company, and thought I would make this resource easy to access for more people.
Kindly submit any issues or edge cases you encounter while using LANCE, and I will fix them ASAP. Cheers!
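For anyone curious what the underlying service looks like, here is a minimal Python sketch of a single-address lookup against the Census Geocoder's public `onelineaddress` endpoint. It is not Lance-TS's API: the helper names are hypothetical, and the response shape (`result.addressMatches[].coordinates` with `x` as longitude and `y` as latitude) is an assumption that should be checked against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public single-address endpoint of the U.S. Census Geocoder.
BASE_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"


def build_geocode_url(address: str, benchmark: str = "Public_AR_Current") -> str:
    """Build the request URL for a one-line address lookup."""
    query = urlencode({"address": address, "benchmark": benchmark, "format": "json"})
    return f"{BASE_URL}?{query}"


def extract_coordinates(payload: dict) -> list[tuple[float, float]]:
    """Return (latitude, longitude) pairs from a decoded response.

    Assumes matches live under result.addressMatches, each with a
    coordinates object holding x (longitude) and y (latitude).
    """
    matches = payload.get("result", {}).get("addressMatches", [])
    return [(m["coordinates"]["y"], m["coordinates"]["x"]) for m in matches]


def geocode(address: str) -> list[tuple[float, float]]:
    """Call the live service (needs network access) and parse the matches."""
    with urlopen(build_geocode_url(address), timeout=30) as response:
        return extract_coordinates(json.load(response))
```

An empty list means the service returned no match for the address, which is worth handling explicitly rather than treating as an error.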
submitted by /u/Fredrickjonjones
I’m trying to build my own research / signal pipeline and I’m looking for something closer to Unusual Whales but without paying for a full subscription.
What I want is less dashboards and more raw data access.
Ideally:
Options / unusual flow / F&O activity
Insider trades
Politician disclosures
Hedge fund / 13F data
Dark pool / institutional signals
Near real-time or at least updated frequently
API / CSV / exportable data
Free or generous free tier
Right now I’m testing Finnhub and Tastytrade API but they don’t feel complete enough for this use case.
My goal is basically:
Raw data → Claude / custom filtering → synthesis → useful signals
Curious what people here actually use to assemble this stack. Open datasets, APIs, GitHub repos, hidden gems, anything.
submitted by /u/AVFrinkler
Hey everyone,
I’m Sameer, a Business Analytics graduate currently building my data portfolio. I’m offering one free project to anyone who has messy or disorganized data they’ve been meaning to fix.
Here’s what I can do for you, completely free:
Clean and organize your Excel/CSV data (remove duplicates, fix formats, fill gaps)
Build a simple Power BI or Excel dashboard so you can actually see what’s in your data
Deliver everything back to you in a clean, usable format
All I ask in return is a short testimonial once we’re done.
Ideal if you’re a small business owner, logistics/supply chain manager, or anyone sitting on data they don’t know what to do with.
Drop a comment or DM me if you’re interested. I’ll respond quickly.
submitted by /u/No_Cranberry6808
Sharing a data angle in case it’s useful.
US public companies disclose disaggregated revenue (by product and by geography) in their 10-K/10-Q/20-F filings, tagged as XBRL dimensional facts. It’s all free and public on SEC EDGAR, but it’s genuinely hard to use raw:
the geography axis is tagged inconsistently (some filers use ISO country codes, some US state codes, some their own “rest of world” catch-alls), companies mix subtotals and leaves on the product axis, and 10-Qs report cumulative half-year/nine-month figures instead of standalone quarters.
If you’re assembling this yourself, the things that bit me: keep single-axis facts only (the filings rarely tag product×geography as one crossed fact), preserve subtotal members rather than pruning them, and reconstruct standalone quarters by subtracting the cumulative periods. Period-classify each fact against the company’s real fiscal-year end, not the calendar.
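The quarter reconstruction can be sketched in a few lines of Python. This is an illustration only, not the StockFit implementation: the period labels (`Q1`, `H1`, `9M`, `FY`) and the function name are hypothetical, and the facts are assumed to be already classified against the company's own fiscal year.

```python
from decimal import Decimal

# Year-to-date periods in fiscal order, paired with the standalone
# quarter each one closes. The labels are illustrative, not XBRL names.
CUMULATIVE_ORDER = ["Q1", "H1", "9M", "FY"]
QUARTER_LABELS = ["Q1", "Q2", "Q3", "Q4"]


def standalone_quarters(cumulative: dict[str, Decimal]) -> dict[str, Decimal]:
    """Derive standalone quarters from year-to-date values.

    Each quarter is the current cumulative figure minus the previous one.
    Derivation stops at the first missing period, because every later
    quarter would need it as the subtrahend.
    """
    result: dict[str, Decimal] = {}
    previous = Decimal("0")
    for ytd_key, quarter in zip(CUMULATIVE_ORDER, QUARTER_LABELS):
        if ytd_key not in cumulative:
            break
        result[quarter] = cumulative[ytd_key] - previous
        previous = cumulative[ytd_key]
    return result
```

Run it per company, fiscal year, and dimension member, so that a subtotal member is only ever subtracted from the same subtotal member. `Decimal` avoids float rounding on reported figures.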
I maintain a cleaned-up version of this as the StockFit API, but the underlying data is all on EDGAR if you want to parse it yourself with Arelle.
Happy to answer any questions.
submitted by /u/Either_Door_5500
The city where I live has recently decided to quadruple public transport fares, and my university friend group and I are studying the consequences of rapid fare increases. We hope to get a credible correlation model, or at least a heuristic. We have already compiled a list of 106 cities with similar population density, and now we need data on the price history of their public transport fares to see which ones have had a comparable increase. Any additional advice is welcome.
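Once fare histories are collected, screening for "comparable" cities could look like the Python sketch below. The function names, the one-year window, and the 3x threshold are all assumptions chosen for illustration, and the fares are treated as nominal prices with no inflation adjustment.

```python
from datetime import date


def largest_fare_jump(history: list[tuple[date, float]], max_gap_days: int = 365) -> float:
    """Return the largest new/old fare ratio between any two observations
    that are at most max_gap_days apart (1.0 if fares never rose)."""
    points = sorted(history)
    best = 1.0
    for i, (start_day, start_fare) in enumerate(points):
        for end_day, end_fare in points[i + 1:]:
            if (end_day - start_day).days > max_gap_days:
                break  # later observations are even further away
            if start_fare > 0:
                best = max(best, end_fare / start_fare)
    return best


def comparable_cities(
    histories: dict[str, list[tuple[date, float]]], min_ratio: float = 3.0
) -> list[str]:
    """List cities whose sharpest increase reaches min_ratio, sorted by name."""
    return sorted(
        city for city, history in histories.items()
        if largest_fare_jump(history) >= min_ratio
    )
```

Restricting the comparison to a window keeps slow, decades-long drift from being counted as a rapid increase.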
submitted by /u/Least-Example-9308
I’ve been spending more time thinking about the dataset side of AI development and wondering where most teams encounter the biggest challenges.
A lot of discussions focus on model architecture and training techniques, but many production issues seem to trace back to the data itself:
• inconsistent annotations between labelers
• difficulty collecting rare edge cases
• balancing dataset diversity without introducing noise
• maintaining quality as datasets grow larger
• keeping training data aligned with real deployment environments
For those who work with datasets regularly:
• What is your biggest bottleneck today?
• How do you measure annotation quality?
• At what scale do dataset management problems become significant?
Interested in hearing real-world experiences from people dealing with data collection, labeling, and dataset maintenance.
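On the annotation-quality question, one common starting point is inter-annotator agreement. The Python sketch below computes Cohen's kappa for two labelers over the same items. It is a generic textbook formula, not a description of any particular team's process, and it only covers the two-annotator, categorical-label case.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Share of items where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement implied by each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Tracking this per label class and per annotator pair over time tends to show where guidelines are ambiguous, which raw accuracy against a gold set can hide.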
submitted by /u/Vane1st
Explanation and link to more datasets there. Actual data is at https://huggingface.co/datasets/ThGaskin/Migration_flows
submitted by /u/cavedave
All right, I’ve read ten articles on this and I still don’t think I understand it.
I have a small project to scrape product prices from some sites.
Nothing crazy, just e-commerce stuff. Someone told me I need residential proxies, but they’re like 5x the price of datacenter and I don’t understand why I would need them for something this basic. Like, what actually happens if I just use datacenter? Will I be blocked right away or is it fine for most normal sites? And what the hell is an ISP proxy? Is that different again?
I’m just trying to not spend money on something I don’t need. Any help appreciated.
submitted by /u/aaru101
Looking for free English audio datasets that I can use for transcription testing.
I have searched Hugging Face but didn’t find anything useful; most had audio clips shorter than 10 seconds.
I have created a transcription tool and want to test it on longer audio (around 5 minutes), ideally with multiple speakers so I can test diarization as well.
Any help is appreciated.
submitted by /u/FallEnvironmental330
I’ve spent the last year maintaining a public longitudinal self-tracking archive covering wearables, sleep, recovery, training, body composition, biomarkers, and weekly reporting.
The repository includes:
– raw and processed datasets
– longitudinal sleep and wearable records
– weekly reports
– audit trails
– prediction tracking and model-error analysis
– changelog and governance documentation
My goal isn’t optimization as much as documenting what long-term observation of a single subject looks like when treated like a data project.
I’m particularly interested in feedback on:
– dataset structure
– governance
– reproducibility
– longitudinal analysis opportunities
– potential blind spots in methodology
Current archive size: ~1 year of daily observations, weekly reports, wearable records, biomarker snapshots, and prediction-tracking artifacts.
Repository:
submitted by /u/Intelligent-Arm-9001
hi everyone!
i want to share with you a little project i created a few months ago to solve a problem i was having with function calling. whenever i needed a good-quality, specific dataset to train my models on function calling, i couldn’t find a good repo for generating one. i wanted a dataset that teaches the model not only how to call the tool but also when, in different contexts. i also wanted maniacal control over the results: how many tools appear in each convo, when the tool is called, and errors in tool calls. in particular, i wanted something flexible enough to include *PERSONALIZED* tools with personalized mock answers!!!
for example you can find some tools i made for the sample below in the repo under
synthfc/tools/eng
and
synthfc/tools/ita
i also wanted a way to check the results and auto-correct the pieces of data that have problems. here is the repo:
https://github.com/pierpierpy/FC-synth
here are some examples i created with an open source model:
https://huggingface.co/datasets/pierjoe/function-calling-synthetic-2000
hope you find it useful!
happy tool calling!
submitted by /u/Logical_Delivery8331
hey guys,
i am currently building a TDABC (time-driven activity-based costing) model for an aluminium extrusion company, and i want to model a company’s practical number of employees, machines, production time, the time each machine takes, etc. where could i find data to model this, so i can check whether the model works in an industrial setting?
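For context on which inputs such a dataset would need, the core TDABC arithmetic is small: a capacity cost rate (cost of capacity supplied divided by practical capacity) multiplied by the time each activity consumes. The Python sketch below assumes practical capacity is 80% of theoretical capacity, a common rule of thumb, and every figure in the usage example is made up for illustration.

```python
def capacity_cost_rate(
    total_cost: float, theoretical_minutes: float, practical_share: float = 0.8
) -> float:
    """Cost per minute of practical capacity.

    practical_share is an assumed fraction of theoretical time that is
    really available (breaks, maintenance, changeovers excluded).
    """
    practical_minutes = theoretical_minutes * practical_share
    return total_cost / practical_minutes


def order_cost(rate_per_minute: float, minutes_by_activity: dict[str, float]) -> float:
    """Cost of one order: the rate times the total minutes it consumes."""
    return rate_per_minute * sum(minutes_by_activity.values())


# Illustrative numbers only: 96,000 in quarterly cost for a department
# with 100,000 theoretical minutes gives a rate of 1.2 per minute.
rate = capacity_cost_rate(total_cost=96_000, theoretical_minutes=100_000)
example = order_cost(rate, {"extrusion": 30, "stretching": 10, "cutting": 10})
```

So the data to look for is per-department cost, available working time for people and machines, and unit times per activity; order volumes then drive everything else.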
#dataset
submitted by /u/Curiosity9147
[disclosure – I work for Synthera, but as the datasets are free to download, posting here as there may be some interest]
Following my other post, we have made available for download the datasets produced by the cloud version of the editor for the included sample scenarios.
These are richly annotated, including matching:
- RGB images
- 2d/3d bounding boxes
- Segmentation
- Masks (Instance Segmentation)
- Distance/Depth information
- Surface Normals
- Keypoint information for skeleton, hand and face
It could be of interest to anyone who wants to experiment with different multi-modal/sensor models. We also use it as the basis for input to Stable Diffusion and NVIDIA Cosmos for further adaptation.
I’d love any comments.
submitted by /u/Syrup1971
Hi!
I’m in the process of trying to calculate power for an analysis that I am planning on running.
I have 4 continuous DVs (related to each other), and then I get a bit lost as to what to put into G*Power.
For IVs: I have 5 variables (continuous, subtests of one construct), and then two covariates (age – continuous, gender identity – 3 categories).
Does anyone know how I input that information into G*Power to calculate power? I’ve tried reading through online guides and YouTube videos but I’m still a bit stuck!
submitted by /u/SnooPeripherals1239
I just refreshed a free dataset I’ve been maintaining of federal enforcement records (OSHA, WHD, NLRB, EPA, SAM) joined to SEC parent-company financials. The Q3 cut covers about 104,000 US establishments across 1,826 publicly traded companies, with each row carrying its parent’s latest revenue, net income, and total assets.
Website: https://www.fastdol.com/datasets/public-company-federal-compliance/data.csv
Hugging Face: https://huggingface.co/datasets/FastDOL/public-companies-federal-compliance_q3
Disclaimer: The dataset is built on top of FastDOL, a project I run that pulls federal enforcement records from 15 agencies into queryable employer profiles. I publish free, new datasets every week at https://www.fastdol.com/datasets
If you’d like to try querying programmatically, sign up to receive a free API key at https://www.fastdol.com/signup. Keys with no limits are available to journalists for free, just shoot me an email: ben@fastdol.com
Let me know if you have any questions or feedback!
submitted by /u/chill-botulism
I’m looking for a dataset that includes order data (order ID, products within each order, order date) over 3+ years. It’s difficult to find datasets that meet these requirements and span a large date range.
submitted by /u/nicktron10
Disclosure – I work for Synthera, but I’m posting this because I believe it is of genuine interest to the CV community, and we do offer a free version with no credit card details needed.
We have released a preview version of our editor that, whilst somewhat limited, should give you an idea of whether it is worth downloading our free Chameleon software.
We will add more features over time, and plan to release a full cloud version in the near future.
Let me know what you think, or if you need any help generating some useful data.
submitted by /u/Syrup1971
Congressional trading data is relatively commoditized, but I couldn’t find any open-source version with the features I wanted.
The data is lagged (median 28 days from trade to disclosure, and 19% miss the disclosure deadline), but there are still interesting patterns to explore.
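Lag figures like these can be reproduced from any list of (trade date, disclosure date) pairs. The Python sketch below assumes a 45-day filing window as the deadline. The function name and input layout are hypothetical and are not taken from the dataset's actual schema.

```python
from datetime import date
from statistics import median

# Assumption: filings made more than 45 days after the trade count as late.
DEADLINE_DAYS = 45


def disclosure_lag_stats(
    trades: list[tuple[date, date]], deadline_days: int = DEADLINE_DAYS
) -> dict[str, float]:
    """Summarize the trade-to-disclosure lag for (traded, filed) date pairs."""
    if not trades:
        raise ValueError("need at least one trade")
    lags = [(filed - traded).days for traded, filed in trades]
    late = sum(lag > deadline_days for lag in lags)
    return {"median_lag_days": median(lags), "late_share": late / len(lags)}
```

Any backtest on this kind of data should key on the disclosure date rather than the trade date, since that is when the information became public.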
I think it should be easy-to-access public data, so I built a fully open-source dataset for it.
Live app: https://congress.kadoa.com
submitted by /u/madredditscientist
Where can I get an unannotated dataset of side-view images of the inside of a tank barrel?
submitted by /u/Sufficient_Ad8058