Are there any datasets for tracking OpenAPI or Swagger specifications – ideally with some semantic analysis and usages?
submitted by /u/CurveSoft799
[link] [comments]
SEC data is submitted in a format called Standard Generalized Markup Language (SGML). An SGML submission may contain many different files; for example, this Form 4 contains xml and txt files. This isn’t really important unless you want to work with a lot of data, e.g. the entire SEC corpus.
If you do want to work with a lot of SEC data, your choice is either to buy the parsed SGML data or get it from the SEC’s website.
Scraping the data is slow. The SEC rate limits you to 5 requests per second for extended durations. There are about 16,000,000 submissions, so this takes a while. A much faster approach is to download the bulk data files here. However, these files are in SGML form.
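A minimal sketch of the polite approach, assuming the standard EDGAR quarterly full-index URL layout and the 5-requests-per-second figure from the post (verify both against the SEC’s current fair-access guidance before running at scale):

```python
import time
from itertools import product

# Quarterly master index URL pattern for EDGAR full-index (assumed layout).
EDGAR_INDEX = "https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{q}/master.idx"

def index_urls(start_year, end_year):
    """Yield one master-index URL per quarter in [start_year, end_year]."""
    for year, q in product(range(start_year, end_year + 1), range(1, 5)):
        yield EDGAR_INDEX.format(year=year, q=q)

class Throttle:
    """Fixed-interval throttle: at most `rate` calls per second."""
    def __init__(self, rate=5):
        self.interval = 1.0 / rate
        self.last = 0.0

    def wait(self):
        # Sleep just long enough to keep successive calls `interval` apart.
        now = time.monotonic()
        sleep_for = self.last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()

urls = list(index_urls(2022, 2023))  # 8 quarterly indexes
```

In a real downloader you would call `throttle.wait()` before each HTTP request and set a descriptive `User-Agent`, which the SEC requires.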
I’ve written a fast SGML parser, available here under the MIT License. The parser has been tested on the entire corpus with > 99.99% correctness. This is about as good as it gets, as the remaining errors are mostly due to issues on the SEC’s side; some files contain errors, especially those filed before 2001.
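For a sense of what such a parser does, here is a toy sketch of the first step: splitting a full-submission file into its member documents using the `<DOCUMENT>`/`<TYPE>`/`<FILENAME>`/`<TEXT>` tags EDGAR wraps files in. This is an illustration only; the linked parser handles many edge cases (uuencoded binaries, pre-2001 quirks) that this regex does not:

```python
import re

# One <DOCUMENT>...</DOCUMENT> block per member file in the submission.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
# Header fields are unclosed SGML tags followed by the value on the same line.
FIELD_RE = re.compile(r"<(TYPE|FILENAME)>([^\n<]+)")

def split_submission(sgml_text):
    """Return a list of {'TYPE': ..., 'FILENAME': ..., 'body': ...} dicts."""
    docs = []
    for m in DOC_RE.finditer(sgml_text):
        chunk = m.group(1)
        fields = dict(FIELD_RE.findall(chunk))
        text = re.search(r"<TEXT>(.*?)</TEXT>", chunk, re.DOTALL)
        fields["body"] = text.group(1).strip() if text else ""
        docs.append(fields)
    return docs
```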
Some stats about the corpus:
| File Type | Total Size (Bytes) | File Count | Average Size (Bytes) |
|---|---|---|---|
| htm | 7,556,829,704,482 | 39,626,124 | 190,703.23 |
| xml | 5,487,580,734,754 | 12,126,942 | 452,511.5 |
| jpg | 1,760,575,964,313 | 17,496,975 | 100,621.73 |
|  | 731,400,163,395 | 279,577 | 2,616,095.61 |
| xls | 254,063,664,863 | 152,410 | 1,666,975.03 |
| txt | 248,068,859,593 | 4,049,227 | 61,263.26 |
| zip | 205,181,878,026 | 863,723 | 237,555.19 |
| gif | 142,562,657,617 | 2,620,069 | 54,411.8 |
| json | 129,268,309,455 | 550,551 | 234,798.06 |
| xlsx | 41,434,461,258 | 721,292 | 57,444.78 |
| xsd | 35,743,957,057 | 832,307 | 42,945.64 |
| fil | 2,740,603,155 | 109,453 | 25,039.09 |
| png | 2,528,666,373 | 119,723 | 21,120.97 |
| css | 2,290,066,926 | 855,781 | 2,676.0 |
| js | 1,277,196,859 | 855,781 | 1,492.43 |
| html | 36,972,177 | 584 | 63,308.52 |
| xfd | 9,600,700 | 2,878 | 3,335.89 |
| paper | 2,195,962 | 14,738 | 149.0 |
| frm | 1,316,451 | 417 | 3,156.96 |
Links: the SGML parsing package, stats on processing the corpus, and a convenience package for SEC data.
submitted by /u/status-code-200
[link] [comments]
Does anyone know where I can get a complete dataset of historical S&P 500 additions and deletions?
Something that includes:
Date of change
Company name and ticker
Replaced company (if any)
Or if someone already has such a dataset in CSV or JSON format, could you please share it?
Thanks in advance!
submitted by /u/mohit-patil
[link] [comments]
Can anyone point me to a free, open-source dataset on lead-acid batteries? I want to build a predictive maintenance model for lead-acid batteries!
#dataset #leadacid #predictivemaintenance
submitted by /u/Fearless_Addendum_31
[link] [comments]
Any good sources of datasets for creating dashboards for my resume, other than Kaggle?
submitted by /u/ClimateBeautiful2615
[link] [comments]
In healthcare, data isn’t just numbers—it’s people. Every click, interaction, or response reflects someone’s health journey. When we build dashboards or models, we’re not just tracking KPIs—we’re supporting better care. The question isn’t “what’s performing?” but “who are we helping—and how?” Because real impact starts when we put patients at the center of our insights. Let’s not lose the human in the data.
submitted by /u/facele007
[link] [comments]
For as long as I can remember I have been obsessed with the problem of event search online: despite solving so many problems with common technology, from operating systems to geo-mapping to general knowledge and technical Q&A (Stack Exchange), we have not solved the problem of knowing what is happening around us in the physical world.
This has meant that huge numbers of consumer startups that wanted to orient us away from screens and towards the real world have failed, and the whole space got branded by startup culture as a “tarpit”. Everyone has a cousin or someone in their network working on a “Meetup alternative” or “travel planner” for some naive “meet people that share your interests” vision, fundamentally misunderstanding that they all fail for lack of a shared dataset for events, like OpenStreetMap for geography.
The best we have, ActivityPub, has failed to gain traction, because event organisers post wherever their audience is, and it would take huge amounts of man-hours to manually curate this data, which exists in a variety of languages, media formats, and apps, so that anyone looking for something to do can find it in a few clicks, confident they are not missing anything just because they are not in the right network or app.
All of that has changed, because commercial LLMs and open-source models can tell the difference between a price, a date, and a time across all of the various formats that exist around the world, cutting through unstructured data like a knife through butter.
I want to work on this: to build an open-source software tool that will create a shared dataset like OpenStreetMap, requiring minimal human intervention. I’m not a developer, but I can lead the project and contribute technically, although it will require a senior software architect. Full disclosure: I am working on my own startup that needs this to exist, so I will build the tooling into my own backend if I cannot find people who are willing to contribute and help me build it the way it should be built, on a federated architecture.
Below is a Claude-generated white paper. I have read it and it is reasonably solid as a draft, but if you’re not interested in reading AI-generated content and are a senior software architect or someone who wants to muck in just skip it and dive into my DMs.
This is very very early, just putting feelers out to find contributors, I have not even bought the domain mentioned below (I don’t care about the name).
I also have a separate requirements doc for the event scouting system, which I can share.
If you want to work on something massive that fundamentally re-shapes the way people interact online, something that thousands of people have tried and failed to do because the timing was wrong, something that people dreamed of doing in the 90s and 00s, let’s talk. The phrase “changes everything” is thrown around too much, but this really would have huge downstream positive societal impacts compared to the social internet we have today, which is optimised for increasing screen addiction rather than human fulfilment.
Do it for your kids.
Version 1.0
Date: June 2025
PublicSpaces.io is an open dataset of real-world events that are open to the public, comparable to OpenStreetMap.
For the first time in history, large language models and generative AI have made it economically feasible to automatically extract structured event data from the chaotic, unstructured information scattered across the web. This breakthrough enables a fundamentally new approach to building comprehensive, open event datasets that was previously impossible.
The event discovery space has been described as a “startup tar pit” where countless consumer-oriented companies have failed despite obvious market demand. The fundamental issue is the lack of an open, comprehensive event dataset comparable to OpenStreetMap for geographic data, combined with the massive manual overhead required to curate event information from unstructured sources.
PublicSpaces.io is only possible now because ubiquitous access to LLMs—both open-source models and commercial APIs—has finally solved the data extraction problem that killed previous attempts. PublicSpaces.io creates a decentralized network of AI-powered nodes that collaboratively discover, curate, and share public event data through a token-based incentive system, transforming what was once prohibitively expensive manual work into automated, scalable intelligence.
Unlike centralized platforms that hoard data for competitive advantage, PublicSpaces.io creates a commons where participating nodes contribute computational resources and human curation in exchange for access to the collective dataset. This approach transforms event discovery from a zero-sum competition into a positive-sum collaboration, enabling innovation in event-related applications while maintaining data quality through distributed verification.
The event discovery space is littered with failed startups, earning it the designation of a “tar pit” in entrepreneurial circles. Event startups from Songkick to IRL.com have burned through billions of dollars in venture capital attempting to solve event discovery. The pattern is consistent:
Until recently, the fundamental blocker was data extraction. Event information exists everywhere—venue websites, social media posts, PDF flyers, images of posters, government announcements, email newsletters—but existed in unstructured formats that defied automation.
Traditional approaches failed because:
LLMs solve this extraction problem:
This is not an incremental improvement—it’s a phase change that makes the impossible suddenly practical.
The fundamental issue is infrastructural. Geographic applications succeeded because OpenStreetMap provided open, comprehensive geographic data. Wikipedia enabled knowledge applications through open, collaborative content curation. Event discovery lacks this foundational layer.
Existing solutions are inadequate:
Event data presents unique challenges compared to geographic or encyclopedic information, but the critical limitation was always the extraction bottleneck:
Pre-LLM Technical Barriers:
Compounding Problems:
The paradigm shift: LLMs eliminate the extraction bottleneck, making comprehensive event discovery economically viable for the first time.
PublicSpaces.io is specifically designed around the capabilities that LLMs and generative AI enable:
Automated Data Extraction: AI scouts can process any format—web pages, PDFs, images, social media posts—and extract structured event data with human-level accuracy.
Contextual Understanding: LLMs understand that “this Saturday” in a February blog post refers to a specific date, that “$25 advance, $30 door” indicates pricing tiers, and that venue descriptions can be matched to OpenStreetMap locations.
Quality Assessment: AI can evaluate whether event descriptions seem legitimate, venues exist, dates are reasonable, and information is internally consistent.
Multilingual and Cultural Adaptability: Modern LLMs handle international date formats, currencies, and cultural event description patterns without custom programming.
Cost Effectiveness: What previously required human teams now costs fractions of a penny per event processed.
PublicSpaces.io is a federated network of AI-powered nodes that collaboratively discover, curate, and share public event data. Each node runs standardized backend software that:
Rather than building another centralized platform, PublicSpaces.io adopts a federated model similar to email or Mastodon. This provides:
- Resilience: No single point of failure or control
- Scalability: Computational load distributed across participants
- Incentive Alignment: Participants benefit directly from network growth
- Innovation Space: Multiple interfaces and applications can build on shared data
- Regulatory Flexibility: Distributed architecture reduces regulatory burden
Each event receives a unique identifier composed of:
event_id = {osm_venue_id}_{start_date}_{last_update_timestamp}
Example: way_123456789_2025-07-15_1719456789
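A minimal sketch of this identifier scheme; the right-split handling is an assumption about how the fields should be recovered, since OSM ids like `way_123456789` themselves contain underscores:

```python
def make_event_id(osm_venue_id, start_date, last_update_ts):
    """Compose the event_id from the template above."""
    return f"{osm_venue_id}_{start_date}_{last_update_ts}"

def parse_event_id(event_id):
    """Split from the right so underscores inside the OSM id survive."""
    venue, start_date, ts = event_id.rsplit("_", 2)
    return {"venue": venue, "start_date": start_date, "updated": int(ts)}

eid = make_event_id("way_123456789", "2025-07-15", 1719456789)
```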
This identifier enables:
When a node receives conflicting data for an existing event, it can:
Nodes participate in a point-based economy where contributions earn tokens for data access. This ensures that active contributors receive proportional benefits while preventing free-riding.
Earning Points:
Spending Points:
Auto-Payment System: Nodes can establish automatic payment arrangements to access more data than they contribute:
```json
{
  "access_token": "temp_token_xyz",
  "known_events": [
    {"id": "way_123_2025-07-15_1719456789", "timestamp": 1719456789},
    {"id": "way_456_2025-07-20_1719456790", "timestamp": 1719456790}
  ],
  "filters": {
    "geographic_bounds": "bbox=-73.9857,40.7484,-73.9857,40.7484",
    "date_range": {"start": "2025-07-01", "end": "2025-08-01"},
    "categories": ["music", "technology"],
    "trust_threshold": 0.7
  }
}
```
```json
{
  "events": [
    {
      "id": "way_789_2025-07-25_1719456791",
      "venue_osm_id": "way_789",
      "title": "Open Source Conference 2025",
      "start_datetime": "2025-07-25T09:00:00Z",
      "end_datetime": "2025-07-25T17:00:00Z",
      "description": "Annual gathering of open source developers",
      "source_confidence": 0.9,
      "verification_status": "human_verified",
      "tags": ["technology", "software", "conference"],
      "last_updated": 1719456791,
      "source_node": "node_university_abc"
    }
  ],
  "usage_summary": {
    "events_provided": 25,
    "points_charged": 25,
    "remaining_balance": 475
  }
}
```
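The delta-sync step implied by this request/response pair can be sketched as follows. Field names mirror the JSON examples; the comparison logic is an assumption about how a node would filter what to send back:

```python
def delta_sync(local_events, known_events):
    """Return local events that are new to the requester or fresher
    than its copy.

    local_events: dicts with 'id' and 'last_updated' (this node's data).
    known_events: dicts with 'id' and 'timestamp' (from the request).
    """
    known = {e["id"]: e["timestamp"] for e in known_events}
    return [
        ev for ev in local_events
        if ev["last_updated"] > known.get(ev["id"], -1)
    ]
```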
When a node receives an event it has already published to the network:
False or fraudulent events receive severe penalties:
Node Trust Ratings: Each node maintains trust scores for peers based on data quality history
Blacklist Sharing: Nodes can share labeled problematic events:
```json
{
  "event_id": "way_123_2025-07-15_1719456789",
  "labels": ["fake", "spam", "illegal"],
  "confidence": 0.95,
  "reporting_node": "node_city_officials",
  "evidence": "Event conflicts with official city calendar"
}
```
Content Filtering: Receiving nodes can pre-filter based on:
A central aggregation node maintained by the foundation provides:
Building on the original requirements, PublicSpaces.io implements an AI scout system for automated event discovery:
- Web Scouts: Monitor websites, social media, and official sources for event announcements
- RSS/API Scouts: Pull from structured data sources like venue calendars and event APIs
- Social Scouts: Track social media platforms for event-related content
- Government Scouts: Monitor official sources for public events and announcements
Each node configures sources with associated trust levels:
```json
{
  "source_id": "venue_official_calendar",
  "url": "https://venue.com/events.json",
  "scout_type": "api",
  "trust_level": 0.9,
  "check_frequency": 3600,
  "validation_rules": ["requires_date", "requires_venue", "minimum_description_length"]
}
```
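Applying a source's `validation_rules` to a candidate event before it enters the pipeline might look like this sketch. The rule names come from the config example above; the check implementations and the 20-character description threshold are assumptions:

```python
# Each named rule maps to a predicate over a candidate event dict.
RULES = {
    "requires_date": lambda ev: bool(ev.get("start_datetime")),
    "requires_venue": lambda ev: bool(ev.get("venue_osm_id")),
    "minimum_description_length": lambda ev: len(ev.get("description", "")) >= 20,
}

def validate(event, rule_names):
    """Return the list of rule names the event fails (empty list = accepted)."""
    return [name for name in rule_names if not RULES[name](event)]
```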
Discovered events flow through action pipelines for processing:
Core functionality exposed through RESTful API:
- /events – CRUD operations for event data
- /sources – Manage data sources and scouts
- /network – Peer node discovery and communication
- /verification – Human review queue and verification tools
- /analytics – Usage statistics and quality metrics

Minimal web interface for:
Nodes are expected to build custom interfaces for:
PublicSpaces.io operates under a non-profit foundation similar to the OpenStreetMap Foundation:
Responsibilities:
Funding Sources:
Open Source Development: All software released under AGPL license requiring contributions back to the commons
Community Standards: Developed through open process similar to IETF RFCs
Dispute Resolution: Multi-tier system from peer mediation to foundation arbitration
Technical Evolution: Protocol changes managed through community consensus process
PublicSpaces.io shares some architectural concepts with Nostr (Notes and Other Stuff Transmitted by Relays) but differs in key ways:
Similarities:
Differences:
PublicSpaces.io’s federation model resembles social networks like Mastodon but optimizes for structured data sharing rather than social interaction.
While these systems enable distributed file sharing, PublicSpaces.io focuses on real-time structured data with quality verification rather than content distribution.
PublicSpaces.io represents an opportunity to solve one of the internet’s most persistent infrastructure gaps. The event discovery problem affects millions of people daily and constrains innovation in location-based services, social applications, and civic engagement tools.
Contribution Opportunities:
Universities, civic organizations, venues, and businesses have immediate incentives to participate:
- Universities: Aggregate campus events while accessing city-wide calendars
- Venues: Share their calendars while discovering nearby events for cross-promotion
- Civic Organizations: Improve community engagement through comprehensive event discovery
- Businesses: Build innovative applications on reliable, open event data
PublicSpaces.io succeeds only with community adoption and stewardship. The network becomes more valuable as more participants contribute data, verification, and development effort.
Getting Started:
PublicSpaces.io addresses a fundamental infrastructure gap that has limited innovation in event discovery for decades. By creating a federated network with proper incentive alignment, quality control, and community governance, we can build the missing foundation that enables the next generation of event-related applications.
The technical challenges are solvable with current AI and distributed systems technology. The economic model provides sustainability without compromising the open data mission. The community governance approach has been proven successful by projects like OpenStreetMap and Wikipedia.
Success requires coordinated effort from developers, organizations, and communities who recognize that public event discovery is too important to be controlled by any single entity. PublicSpaces.io offers a path toward an open, comprehensive, and reliable public event dataset that serves everyone’s interests.
The question is not whether such a system is possible – it is whether we have the collective will to build it.
License: This white paper is released under Creative Commons Attribution-ShareAlike 4.0
submitted by /u/CiaranCarroll
[link] [comments]
Can anyone recommend a comprehensive API dataset? Ideally a collection of OpenAPI specs or Swagger files across as many services as possible.
submitted by /u/apinference
[link] [comments]
Hello!
I’m looking for free trials or free datasets of real ESG data for EU corporations.
Any recommendations would be useful!
Thanks!
submitted by /u/Exciting_Badger
[link] [comments]
Electric vehicles (EVs) are becoming some of the most data-rich hardware products on the road, collecting more information about users, journeys, driving behaviour, and travel patterns.
I’d say they collect more data on users than mobile phones do.
If anyone has access to, or knows of, datasets extracted from EVs (anonymised telematics, trip logs, user interactions, or in-vehicle sensor data), I’d be really interested to see what’s been collected, how it’s structured, and in what formats it typically exists.
I’d appreciate any links, sources, research papers, or insightful comments.
submitted by /u/Winter-Lake-589
[link] [comments]
Hi everyone,
I’m working on a research project that requires a large dataset of Instagram and TikTok usernames. Ideally, it would also include metadata like follower count, or account creation date – but the usernames themselves are the core requirement.
Does anyone know of:
Public datasets that include this information
Licensed or commercial sources
Projects or scrapers that have successfully gathered this at scale
Any help or direction would be greatly appreciated!
submitted by /u/rockweller
[link] [comments]
I tried some of the official sites, but most are only updated through 2023. I want to make a small climate change prediction project of some kind, so I’d appreciate the help.
submitted by /u/FastCommission2913
[link] [comments]
I’m working on a project where I need to identify all U.S. public companies listed on NYSE, NASDAQ, etc. that have over $5 million in annual revenue and operate in the following industries:
I’ve already completed Step 1, which was mapping out all relevant 2022 NAICS/SIC codes for these sectors (over 80 codes total, spanning manufacturing, mining, logistics, and R&D).
Now for Step 2, I want to build a dataset of companies that:
submitted by /u/Hour_Presentation657
[link] [comments]
Hi, I want to build a project where I can train a model to look at video footage of past UCL matches from before VAR was introduced and flag a play as offside or a foul according to modern rules and VAR standards. Does anyone know where I can find this dataset?
submitted by /u/GiftBrilliant6983
[link] [comments]
Hi Dataseters
I’ve asked LLMs and scoured GitHub etc. for projects to no avail. Ideally, I’m after a fact/dimension-style open-source schema model (not unlike the BMC/ServiceNow logical CDM data models) with dimensions pre-populated with typical vendors/makes/models on both the hardware and software sides. Ideally in Postgres/MariaDB, but Oracle etc. is fine too; easy conversion.
Anyone who has Snow/Flexera/ServiceNow might have built such a skeleton frame with custom tables for midrange/networking, with UNSPSC codes etc.
Sure, I can subscribe to the big ITSM vendors, but ideally I’d just fork something the community has already built, then ETL/ELT facts in for our own use. DIY feels like reinventing the wheel; I’m sure many of you have already built this…
It’s a shot in the dark, but I’m just seeing if anyone has come across useful projects.
thanks in advance
submitted by /u/Laymans_Perspective
[link] [comments]
The dataset is here – https://www.statista.com/statistics/1420818/attendance-music-events-netherlands/
I would like to perform basic EDA on it, but Statista datasets are locked behind an insane paywall. Does anyone here have a Statista account and is willing to help me out a bit? Much appreciated!
submitted by /u/VovaViliReddit
[link] [comments]
I used to mix these up, but here’s the quick takeaway: BI is about overall business reporting, usually for execs and finance. Product analytics focuses on how users actually use the product and helps teams improve it.
Wrote a post that breaks it down more if you’re interested:
How do you separate them in your work?
submitted by /u/Still-Butterfly-3669
[link] [comments]
Further adding to the/my Ousia Bloom an attempt to catalog not just what I think, but what and how I did so! It’s for sure not a real thing
submitted by /u/JboyfromTumbo
[link] [comments]
I need Polymarket user data (PnL, %PnL, trades, markets traded) if it is available. I see a lot of websites that analyze these data but no API to download them.
submitted by /u/Actual_Doubt5778
[link] [comments]
Hi r/datasets ,
I’m looking for datasets, either paid or unpaid, to create a benchmark for a specialised extraction pipeline.
Criteria:
Document types:
I’ve already seen: Atticus and UCSF Industry Document Library (which is the origin of Adam Harley’s dataset). I’ve seen a few posts below but they aren’t what I’m looking for. I’m honestly so happy to pay for the information and the datasets; dm me if you want to strike a deal.
submitted by /u/phililisaveslives
[link] [comments]
I am trying to adapt an object detection model to classify the components of a PCB (resistors, capacitors, etc.), but I am having trouble finding a dataset of PCBs from a bird’s-eye view to train the model on. Would anyone happen to have one or know where to find one?
submitted by /u/s0rryari1101
[link] [comments]
Hey everyone! I’m working on a university project where we’re trying to predict the direction of football penalty kicks based on the shooter’s body movement. To do that, we’re using pose estimation and machine learning on real-world footage.
Right now, I’m building a dataset of penalty shootouts — but I specifically need videos where the camera is placed behind the player, like the rear broadcast angle you usually see in World Cup coverage.
I already have all the penalty shootouts from the 2022 World Cup, but I’d love to collect more of this kind — from other tournaments or even club games. If you remember any videos (on YouTube or elsewhere) with that camera angle, please drop them here 🙏
Thanks in advance — you’d be helping a lot!
submitted by /u/tiagonob
[link] [comments]
Would love to see some examples of quality prompts, maybe something structured with Meta prompting. Does anyone know a place from where to download those? Or maybe some of you can share your own creations?
submitted by /u/Winter-Lake-589
[link] [comments]
hello! I wanted to share a tool that I created for making hand-written fine-tuning datasets. I originally built this for myself when I was unable to find conversational datasets formatted the way I needed while fine-tuning Llama 3 for the first time, and hand-typing JSON files seemed like some sort of torture, so I built a simple little UI to auto-format everything for me.
I built this back when I was a beginner, so it is very easy to use with no prior dataset creation/formatting experience, but it also has a bunch of added features I believe more experienced devs will appreciate!
I have expanded it to support :
– many formats: chatml/chatgpt, alpaca, and sharegpt/vicuna
– multi-turn dataset creation, not just pair-based
– token counting for various models
– custom fields (instructions, system messages, custom ids)
– auto-saves, and every format type is written at once
– formats like alpaca need no additional data besides input and output; default instructions are auto-applied (customizable)
– goal tracking bar
I know it seems a bit crazy to be manually typing out datasets, but hand-written data is great for customizing your LLMs and keeping them high quality. I wrote a 1k-interaction conversational dataset with this within a month during my free time, and it made the process much more mindless and easy.
I hope you enjoy! I will be adding new formats over time depending on what becomes popular or asked for
Here is the demo to test out on Hugging Face
(not the full version/link at bottom of page for full version)
submitted by /u/abaris243
[link] [comments]
I need a dataset that’s neither too complex nor too simple to test a multi-agent data science system that builds models for classification and regression.
I need to do some analytics, visualizations, and pre-processing, so if you know of any data that could help me, please share.
Thank you !
submitted by /u/No_Parking9675
[link] [comments]
Hi!
I’m trying to find a database containing a current scrape of all Rotten Tomatoes movies along with audience reviews and genres. I took a look online and could only find some incomplete datasets. Does anyone have any more recent pulls?
submitted by /u/Jankowski576
[link] [comments]