Where Can I Find CSVs Of Fine-scale Barometric Pressure Data?

Looking to find daily (hourly is even better) reports of barometric pressure data. I was looking on NOAA, but it does not provide pressure data, just precip/temp/wind. Unless I am missing something. Anybody know where I can find BP specifically?

submitted by /u/cwforman
[link] [comments]

983,004 Public Domain Books Digitized

submitted by /u/cavedave
[link] [comments]

Looking For Open Source Resources For My MIT Licensed Synthetic Data Generation Project.

I am working on a project out of my own personal interest. Something like a system that can collect data from web and generate seed data, which can be moved through different pipelines like adding synthetic data or cleaning the data, or generating taxanomy, etc. And to remove the complexity of operating it. I am planning on to integrate the system with an AI agent.

The project in itself is going to be MIT licensed.

And I want open source library or tools or projects that is compliant with what I am building and can help me with the implementation of any of the stages particularly synthetic data generation, validation, cleaning, or labelling.

Any pointers or suggestions would be super helpful!

submitted by /u/uber_men
[link] [comments]

Does Alchemist Really Enhance Images?

Can anyone provide feedback on fine-tuning with Alchemist? The authors claim this open-source dataset enhances images; it was built on some sort of pre-trained diffusion model without HiL or heuristics…

Below are their Stable Diffusion 2.1 images before and after (“A red sports car on the road”):

What do you reckon? Is it something worth looking at?

submitted by /u/mldraelll
[link] [comments]

Where To Find Large Scale Geo Tagged Image Data?

Hi everyone,

I’m building an image geolocation model and need large scale training data with precise latitude/longitude data. I started with the Google Landmarks Dataset v2 (GLDv2), but the original landmark metadata file (which maps each landmark id to its lat/lon) has been removed from the public S3 buckets.

The Multimedia Commons YFCC100M dataset used to be a great alternative, but it’s no longer publicly available, so I’m left with under 400K geotagged images (not nearly enough for a global model).

It seems like all of the quality datasets are being removed.

Has anyone here:

Found or hosted a public mirror/backup of the original landmark metadata?
Built a reliable workaround e.g. a batched SPARQL script against Wikidata?
Discovered alternative large scale datasets (1 M+ images) with free, accurate geotags

Any pointers to mirrors, scripts, or alternative databases would be hugely appreciated.

submitted by /u/Brave-Visual5878
[link] [comments]

Datasets: Free, SQL-Ready Alternative To Yfinance (No Rate Limits, High Performance)

Hey everyone 👋

I just open-sourced a project that some of you might find useful: defeatbeta-api

It’s a Python-native API for accessing market data without rate limits, powered by Hugging Face and DuckDB.

Why it might help you:

✅ No rate limits – data is hosted on Hugging Face, so you don’t need to worry about throttling like with yfinance.
⚡ Sub-second query speed using DuckDB + local caching (cache_httpfs)
🧠 SQL support out of the box – great for quick filtering, joining, aggregating.
📊 Includes extended financial metrics like earnings call transcripts, and even stock news

Ideal for:

Backtesting strategies with large-scale historical data
Quant research that requires flexibility + performance
Anyone frustrated with yfinance rate limits

It’s not real-time (data is updated weekly), so it’s best for research, not intraday signals.

👉 GitHub: https://github.com/defeat-beta/defeatbeta-api

Happy to hear your thoughts or suggestions!

submitted by /u/Mammoth-Sorbet7889
[link] [comments]

Fully Licensed & Segmented Image Dataset

We just facilitated the release of a major image dataset and paper that show how human-ranked, expert-annotated data significantly outperforms baseline dataset alternatives in fine-tuning vision-language models like BLIP2 and LLaVVA-NeXT. We’d love the community feedback!

Explore the dataset: https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD

Read the paper: https://arxiv.org/abs/2506.05673

submitted by /u/EmetResearch
[link] [comments]

[Update] Emotionally-Aware VN Dialogue Dataset – Deep Context Tagging, ShareGPT-Style Structure

Hey again everyone, Following up on my earlier posts about converting a visual novel script into a fine-tuning dataset, I’ve gone back and improved the format significantly thanks to feedback here.

The goal is the same: create expressive, roleplay-friendly dialogue data that captures emotion, tone, character personality, and nuance, especially for dere-type characters and NSFW/SFW variation.

VOl 0 is only SFW

• What’s New:

Improved JSON structure, closer to ShareGPT format

More consistent tone/emotion tagging

Added deeper context awareness (4 lines before/after)

Preserved expressive elements (onomatopoeia, stutters, laughs)

Categorized dere-type and added voice/personality cues

• Why?

Because tagging a line as just “laughing” misses everything. Was it sarcasm? Pain? Joy? I want models to understand motivation and emotional flow — not just parrot words.

Example (same as before to show improvement):

Flat version:

{ “instruction”: “What does Maple say?”,

“output”: “Oopsie! I accidentally splashed some hot water on you! Sorry about that~ Ahahah– Owwww!!”,

“metadata”: { “character”: “Maple”, “emotion”: “laughing”

“tone”: “apologetic” }

}

• Updated version with context:

 { "from": "char_metadata", "value": { "character_name": "Azuki", "persona": "Azuki is a fiery, tomboyish...", "dere_type": "tsundere", "current_emotion": "mocking, amused, pain", "tone": "taunting, surprised" } }, { "from": "char", "value": "You're a NEET catgirl who can only eat, sleep, and play! Huehuehueh, whooaaa!! Aagh, that's hotttt!!!" }, { "from": "char_metadata", "value": { "character_name": "Maple", "persona": "Maple is a prideful, sophisticated catgirl...", "dere_type": "himidere", "current_emotion": "malicious glee, feigned innocence, pain", "tone": "sarcastic, surprised" } }, { "from": "char", "value": "Oopsie! I accidentally splashed some hot water on you! Sorry about that~ Ahahah-- Owwww!!" }, { "from": "char_metadata", "value": { "character_name": "Azuki", "persona": "Azuki is a fiery, tomboyish...", "dere_type": "tsundere", "current_emotion": "retaliatory, gleeful", "tone": "sarcastic" } }, { "from": "char", "value": "Heh, my bad! My paw just flew right at'cha! Hahaha!" }

• Outcome

This dataset now lets a model:

Match dere-type voices with appropriate phrasing

Preserve emotional realism in both SFW and NSFW contexts

Move beyond basic emotion labels to expressive patterns (tsundere teasing, onomatopoeia, flustered laughter, etc.)

It’s still a work in progress (currently ~3MB, will grow, dialogs only without JSON yet), and more feedback is welcome. Just wanted to share the next step now that the format is finally usable and consistent.

submitted by /u/Akowmako
[link] [comments]

Looking For A Specific Variables In A Dataset

Hi, i am looking for a special dataset with this description below. Any kind of data would be helpful

The dataset comprises historical records of cancer drug inventory levels, supply
deliveries, and consumption rates collected from hospital pharmacy
management systems and supplier databases over a multi-year period. Key

variables include: • Inventory levels: Daily or weekly stock counts per drug type • Supply deliveries: Dates and quantities of incoming drug shipments • Consumption rates: Usage logs reflecting patient demand • Shortage indicators: Documented periods when inventory fell below
critical thresholds Data preprocessing involved handling missing entries, smoothing out
anomalies, and normalizing time series for model input. The dataset reflects
seasonal trends, market-driven supply fluctuations, and irregular disruptions,
providing a robust foundation for time series modeling

submitted by /u/Suitable_Rip3377
[link] [comments]

Is There A Downloadable Databse Where I Can Every Movie With The Genre, Date, Rating Etc?

I’m programming a project where based on the given info by the user, the database filters out and gives movie recs catered to what the user wants to watch.

submitted by /u/Keanu_Keanu
[link] [comments]

Ousia_Bloom_Egregore_in_amber – For The Future Archivist.

This Dataset contains the unfinished contents of my attempts at understanding myself and through myself the world. Many are innane, much is pointless. Some might even be interesting. But it is all as honest as i could be and in the mirror of ChatGPT. Something that lets me spin out but stay just grounded enough and vice versia. But these works are my ideas in process and often repetitive as i return again and agian to the same issues. Whati s it like to write your life as you live it? to live to perserve the signal but not for the signal sake, but the broader pattern. If any of that made sense. God Help you. (there is no god) (there is a god). But here it is with as little shame as i can operate with and still have ethics.

https://huggingface.co/datasets/AmarAleksandr/Ousia_Bloom_Egregore_in_amber

submitted by /u/JboyfromTumbo
[link] [comments]

Question About CICDDOS2019 Pcap Files Naming

Hi everyone,

I am working with the CICDDoS2019 dataset and having problem understanding the naming schema of the pcap files.

The file names (e.g SAT-01-12-2018_0238, SAT-01-12-2018_0, SAT-01-12-2018_010, etc.) seem to represent minute ranges of the day, going from 0 up to 818. However, according to the official documentation, many attack types (e.g., UDP-Lag, SYN, MSSQL, etc.) occur later in the day—well past minute 818 (I want to work on UDP and UDP-lag in both day specifically)

If the pcaps truly end at 818, then are we missing attacks section in the dataset or the files are named different than what I thought.

Would really appreciate if anyone who has worked with the dataset could help me, since my storage on the server is limited and I cannot unzip files to examine them at the moment.

Thanks in advance!!

submitted by /u/NamDinhtornado
[link] [comments]

Open Source Financial And Fundamentals Database (US & Euro Stocks)

Hi everyone! I’m currently looking for an open-source database that provides detailed company fundamentals for both US and European stocks. If such a resource doesn’t already exist, I’m eager to connect with like-minded individuals who are interested in collaborating to build one together. The goal is to create a reliable, freely accessible database so that researchers, developers, investors, and the broader community can all benefit from high-quality, open-source financial data. Let’s make this a shared effort and democratize access to valuable financial information!

submitted by /u/grazieragraziek9
[link] [comments]

Million Medical Questions And Answers Dataset

submitted by /u/cavedave
[link] [comments]

Historical CFBenchmark Data For BTC Or ETH

Anyone know where I could get historical CF benchmark data for bitcoin or ethereum? I’m looking for 1min, 5min, and/or 10min data. I emailed them weeks ago but got no response.

submitted by /u/Quick_Comfortable_30
[link] [comments]

Datasets For OpenAPI Or Swagger Specs

Are there any datasets for tracking OpenAPI or Swagger specifications – ideally with some semantic analysis and usages?

submitted by /u/CurveSoft799
[link] [comments]

[self-promotion] I Processed And Standardized 16.7TB Of SEC Filings

SEC data is submitted in a format called Standardized Generalized Markup Language. A SGML Submission may contain many different files. For example, this Form 4 contains xml and txt files. This isn’t really important unless you want to work with a lot of data, e.g. the entire SEC corpus.

If you do want to work with a lot of SEC data, your choice is either to buy the parsed SGML data or get it from the SEC’s website.

Scraping the data is slow. The SEC rate limits you to 5 request per second for extended durations. There are about 16,000,000 submissions so this takes awhile. A much faster approach is to download the bulk data files here. However, these files are in SGML form.

I’ve written a fast SGML parser here under the MIT License. The parser has been tested on the entire corpus, with > 99.99% correctness. This is about as good as it gets, as the remaining errors are mostly due to issues on the SEC’s side. For example, some files have errors, especially in the pre 2001 years.

Some stats about the corpus:

File Type	Total Size (Bytes)	File Count	Average Size (Bytes)
htm	7,556,829,704,482	39,626,124	190,703.23
xml	5,487,580,734,754	12,126,942	452,511.5
jpg	1,760,575,964,313	17,496,975	100,621.73
pdf	731,400,163,395	279,577	2,616,095.61
xls	254,063,664,863	152,410	1,666,975.03
txt	248,068,859,593	4,049,227	61,263.26
zip	205,181,878,026	863,723	237,555.19
gif	142,562,657,617	2,620,069	54,411.8
json	129,268,309,455	550,551	234,798.06
xlsx	41,434,461,258	721,292	57,444.78
xsd	35,743,957,057	832,307	42,945.64
fil	2,740,603,155	109,453	25,039.09
png	2,528,666,373	119,723	21,120.97
css	2,290,066,926	855,781	2,676.0
js	1,277,196,859	855,781	1,492.43
html	36,972,177	584	63,308.52
xfd	9,600,700	2,878	3,335.89
paper	2,195,962	14,738	149.0
frm	1,316,451	417	3,156.96

The SGML parsing package, Stats on processing the corpus, convenience package for SEC data.

submitted by /u/status-code-200
[link] [comments]

Where Can I Get Historical S&P 500 Additions And Deletions Data?

Does anyone know where I can get a complete dataset of historical S&P 500 additions and deletions?

Something that includes:

Date of change

Company name and ticker

Replaced company (if any)

Or if someone already has such a dataset in CSV or JSON format, could you please share it?

Thanks in advance!

submitted by /u/mohit-patil
[link] [comments]

LEAD ACID BATTERY DATASET FOR MACHINE LEARNING

Can anyone give me free source dataset of lead acid battery. I want to build a predictive maintenance model for lead acid battery!
#dataset #leadacid #predicticemaintencne

submitted by /u/Fearless_Addendum_31
[link] [comments]

Where To Find Data Set ( Business Analyst )

for creating dashboards for my resume other than kaggle ?

submitted by /u/ClimateBeautiful2615
[link] [comments]

Humanizing Healthcare Data In Healthcare, Data Isn’t Just Numbers—it’s People.

In healthcare, data isn’t just numbers—it’s people.Every click, interaction, or response reflects someone’s health journey.When we build dashboards or models, we’re not just tracking KPIs—we’re supporting better care.The question isn’t “what’s performing?” but “who are we helping—and how?”Because real impact starts when we put patients at the center of our insights.Let’s not lose the human in the data.

submitted by /u/facele007
[link] [comments]

An Open Event Dataset For The Real World (OSM For Events) Is Now Possible Due To The Capacity Of Generative AI To Structure Unstructured Data

For as long as I remember I have been obsessed with the problem of event search online, the fact that despite solving so many problems with commons technology, from operating systems to geo-mapping to general knowledge and technical Q&A (stack exchange) we have not solved the problem of knowing what is happening around us in the physical world.

This has meant that huge numbers of consumer startups that wanted to orient us away from screens towards the real world have failed, and the whole space got branded by startup culture as a “tarpit”. Everyone has a cousin or someone in their network working on a “meetup alternative” or “travel planner” for some naive “meet people that share your interests” vision, fundamentally misunderstanding that they all fail due to the lack of a shared dataset like openstreetmap for events.

The best we have, ActivityPub, has failed to penetrate, because the event organisers post where their audience is and it would take huge amounts of man hours to manually curate this data, which is in a variety of language and media formats and apps, so that anyone looking for something to do can find it in a few clicks, with the comfort of knowing they are not missing anything because they are not in the right network or app or whatever.

All of that has changed because commercial LLMs and open sourced models can tell the difference between a price, a date, and a time, across all of the various formats that exist around the world, parsing unstructured data like a knife through butter.

I want to work on this, to build an open sourced software tool that will create a shared dataset like Openstreetmap, that will require minimal human intervention. I’m not a developer, but I can lead the project and contribute technically, although it would require a senior software architect. Full disclosure, I am working on my own startup that needs this to exist, so I will build the tooling myself into my own backend if I cannot find people who are willing to contribute and help me to build it the way it should be on a federated architecture.

Below is a Claude-generated white paper. I have read it and it is reasonably solid as a draft, but if you’re not interested in reading AI-generated content and are a senior software architect or someone who wants to muck in just skip it and dive into my DMs.

This is very very early, just putting feelers out to find contributors, I have not even bought the domain mentioned below (I don’t care about the name).

I also have a separate requirements doc for the event scouting system, which I can share.

If you want to work on something massive that fundamentally re-shapes the way people interact online, something that thousands of people have tried and failed to do because the timing was wrong, something that people dreamed of doing in the 90s and the 00s, lets talk. The phrase “changes everything” is thrown around too much, but this really would have huge downstream positive societal impacts when compared to the social internet we have today, optimised for increasing screen addiction rather than human fulfilment.

Do it for your kids.

Building the OpenStreetMap for Public Events Through AI-Powered Collaboration

Version 1.0
Date: June 2025

Executive Summary

PublicSpaces.io is an open event dataset for real world events open to the public, comparable to OpenStreetMap.

For the first time in history, large language models and generative AI have made it economically feasible to automatically extract structured event data from the chaotic, unstructured information scattered across the web. This breakthrough enables a fundamentally new approach to building comprehensive, open event datasets that was previously impossible.

The event discovery space has been described as a “startup tar pit” where countless consumer-oriented companies have failed despite obvious market demand. The fundamental issue is the lack of an open, comprehensive event dataset comparable to OpenStreetMap for geographic data, combined with the massive manual overhead required to curate event information from unstructured sources.

PublicSpaces.io is only possible now because ubiquitous access to LLMs—both open-source models and commercial APIs—has finally solved the data extraction problem that killed previous attempts. PublicSpaces.io creates a decentralized network of AI-powered nodes that collaboratively discover, curate, and share public event data through a token-based incentive system, transforming what was once prohibitively expensive manual work into automated, scalable intelligence.

Unlike centralized platforms that hoard data for competitive advantage, EventNet creates a commons where participating nodes contribute computational resources and human curation in exchange for access to the collective dataset. This approach transforms event discovery from a zero-sum competition into a positive-sum collaboration, enabling innovation in event-related applications while maintaining data quality through distributed verification.

The Event Discovery Crisis

The Startup Graveyard

The event discovery space is littered with failed startups, earning it the designation of a “tar pit” in entrepreneurial circles. Event startups like SongKick.com to IRL.com have burned through billions of dollars in venture capital attempting to solve event discovery. The pattern is consistent:

Cold Start Problem: New platforms struggle to attract both event organizers and attendees without existing critical mass
Data Silos: Each platform maintains proprietary datasets, preventing comprehensive coverage
Curation Overhead: Manual event curation doesn’t scale, while pre-LLM automated systems produce low-quality results
Network Effects Favor Incumbents: Users gravitate toward platforms where events already exist

The AI Revolution Changes Everything

Until recently, the fundamental blocker was data extraction. Event information exists everywhere—venue websites, social media posts, PDF flyers, images of posters, government announcements, email newsletters—but existed in unstructured formats that defied automation.

Traditional approaches failed because:

OCR was inadequate: Could extract text from images but couldn’t understand context, dates, times, or pricing in multiple formats
Rule-based parsing: Brittle systems that broke with minor format changes or international variations
Manual curation: Required armies of human workers, making comprehensive coverage economically impossible
Simple web scraping: Could extract HTML but couldn’t interpret natural language descriptions or handle the diversity of event announcement formats

LLMs solve this extraction problem:

Multimodal understanding: Can process text, images, and complex layouts simultaneously
Contextual intelligence: Understands that “Next Friday at 8” means a specific date and time
Format flexibility: Handles international date formats, price currencies, and cultural variations
Cost efficiency: What once required hundreds of human hours now costs pennies in API calls

This is not an incremental improvement—it’s a phase change that makes the impossible suddenly practical.

The Missing Infrastructure

The fundamental issue is infrastructural. Geographic applications succeeded because OpenStreetMap provided open, comprehensive geographic data. Wikipedia enabled knowledge applications through open, collaborative content curation. Event discovery lacks this foundational layer.

Existing solutions are inadequate:

Eventbrite/Facebook Events: Proprietary platforms with limited API access
Schema.org Events: Standard exists but adoption is minimal
Government Event APIs: Limited scope and inconsistent implementation
Venue Websites: Fragmented, inconsistent formats, manual aggregation required

Why Previous Attempts Failed

Event data presents unique challenges compared to geographic or encyclopedic information, but the critical limitation was always the extraction bottleneck:

Pre-LLM Technical Barriers:

Unstructured Data: 90%+ of event information exists in formats that traditional software cannot parse
Format Diversity: Dates written as “March 15th,” “15/03/2025,” “next Tuesday,” or embedded in images
Cultural Variations: International differences in time formats, pricing display, and event description conventions
Visual Information: Posters, flyers, and social media images containing essential details that OCR could not meaningfully extract
Context Dependency: Understanding that “doors at 7, show at 8” refers to event timing requires contextual reasoning

Compounding Problems:

Temporal Complexity: Events have complex lifecycles (announced → detailed → modified → cancelled/confirmed → occurred → historical) requiring real-time updates
Verification Burden: Unlike streets that can be physically verified, events are ephemeral and details change frequently until they occur
Commercial Conflicts: Event data directly enables revenue (ticket sales, advertising, venue bookings), creating incentives against open sharing
Quality Control: Event platforms must handle spam, fake events, promotional content, and rapidly-changing details at scale
Diverse Stakeholders: Event organizers, venues, ticketing companies, and attendees have conflicting interests that resist alignment

The paradigm shift: LLMs eliminate the extraction bottleneck, making comprehensive event discovery economically viable for the first time.

The PublicSpaces.io Solution

The AI-First Opportunity

PublicSpaces.io is specifically designed around the capabilities that LLMs and generative AI enable:

Automated Data Extraction: AI scouts can process any format—web pages, PDFs, images, social media posts—and extract structured event data with human-level accuracy.

Contextual Understanding: LLMs understand that “this Saturday” in a February blog post refers to a specific date, that “$25 advance, $30 door” indicates pricing tiers, and that venue descriptions can be matched to OpenStreetMap locations.

Quality Assessment: AI can evaluate whether event descriptions seem legitimate, venues exist, dates are reasonable, and information is internally consistent.

Multilingual and Cultural Adaptability: Modern LLMs handle international date formats, currencies, and cultural event description patterns without custom programming.

Cost Effectiveness: What previously required human teams now costs fractions of a penny per event processed.

Core Architecture

PublicSpaces.io is a federated network of AI-powered nodes that collaboratively discover, curate, and share public event data. Each node runs standardized backend software that:

Discovers events through AI-powered scouts monitoring web sources
Curates data through automated extraction plus human verification
Shares information with other nodes through token-based exchanges
Maintains quality through distributed reputation and verification systems

Federated vs. Centralized Design

Rather than building another centralized platform, PublicSpaces.io adopts a federated model similar to email or Mastodon. This provides:

Resilience: No single point of failure or control Scalability: Computational load distributed across participants
Incentive Alignment: Participants benefit directly from network growth Innovation Space: Multiple interfaces and applications can build on shared data Regulatory Flexibility: Distributed architecture reduces regulatory burden

Technical Specification

Event Identity and Versioning

Each event receives a unique identifier composed of:

event_id = {osm_venue_id}_{start_date}_{last_update_timestamp}

Example: way_123456789_2025-07-15_1719456789

This identifier enables:

Deduplication: Same venue + date = same event across the network
Version Control: Timestamp tracks most recent update
Conflict Resolution: Nodes can compare versions and merge differences
OSM Integration: Direct linkage to OpenStreetMap venue data

When a node receives conflicting data for an existing event, it can:

Compare versions automatically for simple differences
Flag conflicts for human review
Update the timestamp upon confirmation, creating a new version
Ignore older versions in subsequent API calls

Token-Based Access System

Overview

Nodes participate in a point-based economy where contributions earn tokens for data access. This ensures that active contributors receive proportional benefits while preventing free-riding.

Authentication Flow

API Key Registration: Nodes register with the central foundation service and receive an API key
Token Request: Node uses API key to request temporary access token from foundation
Data Request: Node presents access token to peer node requesting specific data
Authorization Check: Peer node validates token with foundation service
Points Verification: Foundation confirms requesting node has sufficient points
Data Transfer: If authorized, peer node provides requested data
Usage Tracking: Foundation records transaction and updates point balances

Point System

Earning Points:

New event discovery: 100 points
Event update: 1 point
Successful verification of peer data: 5 points
Community moderation action: 10 points

Spending Points:

Requesting new events: 1 point per event
Requesting updates: 0.1 points per update
Access to premium data sources: Variable pricing

Auto-Payment System: Nodes can establish automatic payment arrangements to access more data than they contribute:

Set maximum monthly spending cap
Foundation charges for excess usage
Revenue supports network infrastructure and development

Data Exchange Protocol

Request Structure

{ "access_token": "temp_token_xyz", "known_events": [ {"id": "way_123_2025-07-15_1719456789", "timestamp": 1719456789}, {"id": "way_456_2025-07-20_1719456790", "timestamp": 1719456790} ], "filters": { "geographic_bounds": "bbox=-73.9857,40.7484,-73.9857,40.7484", "date_range": {"start": "2025-07-01", "end": "2025-08-01"}, "categories": ["music", "technology"], "trust_threshold": 0.7 } }

Response Structure

{ "events": [ { "id": "way_789_2025-07-25_1719456791", "venue_osm_id": "way_789", "title": "Open Source Conference 2025", "start_datetime": "2025-07-25T09:00:00Z", "end_datetime": "2025-07-25T17:00:00Z", "description": "Annual gathering of open source developers", "source_confidence": 0.9, "verification_status": "human_verified", "tags": ["technology", "software", "conference"], "last_updated": 1719456791, "source_node": "node_university_abc" } ], "usage_summary": { "events_provided": 25, "points_charged": 25, "remaining_balance": 475 } }

Quality Control and Reputation System

Duplicate Detection and Penalties

When a node receives an event it has already published to the network:

Automatic Detection: System identifies duplicate based on venue + date
Attribution Check: Determines which node published first
Penalty Assessment: Duplicate source loses 1 point
Feedback Loop: Encourages nodes to check existing data before publishing

Fake Event Penalties

False or fraudulent events receive severe penalties:

Fake Event: -1000 points (requiring 10 new event discoveries to recover)
Unverified Claim: -100 points
Repeated Violations: API key suspension or permanent ban

Trust Networks and Filtering

Node Trust Ratings: Each node maintains trust scores for peers based on data quality history

Blacklist Sharing: Nodes can share labeled problematic events:

{ "event_id": "way_123_2025-07-15_1719456789", "labels": ["fake", "spam", "illegal"], "confidence": 0.95, "reporting_node": "node_city_officials", "evidence": "Event conflicts with official city calendar" }

Content Filtering: Receiving nodes can pre-filter based on:

Trust threshold requirements
Content category restrictions
Geographic jurisdictional rules
Community standards compliance

Master Node Optimization

A central aggregation node maintained by the foundation provides:

Duplicate Detection: Automated flagging across the entire network
Pattern Analysis: Identification of systematic issues or abuse
Global Statistics: Network health metrics and usage analytics
Backup Services: Emergency data recovery and network integrity

AI-Powered Event Discovery

Scout Architecture

Building on the original requirements, EventNet implements an AI scout system for automated event discovery:

Web Scouts: Monitor websites, social media, and official sources for event announcements RSS/API Scouts: Pull from structured data sources like venue calendars and event APIs Social Scouts: Track social media platforms for event-related content Government Scouts: Monitor official sources for public events and announcements

Source Management

Each node configures sources with associated trust levels:

{ "source_id": "venue_official_calendar", "url": "https://venue.com/events.json", "scout_type": "api", "trust_level": 0.9, "check_frequency": 3600, "validation_rules": ["requires_date", "requires_venue", "minimum_description_length"] }

Action Pipeline

Discovered events flow through action pipelines for processing:

Extraction: AI extracts structured data from unstructured sources
Normalization: Convert to standard event schema
Venue Matching: Link to OpenStreetMap venue identifiers
Deduplication: Check against existing events in node database
Quality Assessment: AI and human verification of accuracy
Publication: Share verified events with network

Node Software Architecture

Backend API

Core functionality exposed through RESTful API:

/events – CRUD operations for event data
/sources – Manage data sources and scouts
/network – Peer node discovery and communication
/verification – Human review queue and verification tools
/analytics – Usage statistics and quality metrics

Frontend Management Interface

Minimal web interface for:

API token management and registration
Source configuration and monitoring
Event verification queue
Network peer management
Usage analytics and billing

Expected Integrations

Nodes are expected to build custom interfaces for:

Public Event Calendars: Consumer-facing event discovery
Venue Management: Tools for event organizers
Analytics Dashboards: Business intelligence applications
Mobile Applications: Location-based event discovery
Calendar Integrations: Personal scheduling tools

Economic Model and Governance

Foundation Structure

EventNet operates under a non-profit foundation similar to the OpenStreetMap Foundation:

Responsibilities:

Maintain central authentication and coordination services
Develop and maintain reference node software
Establish community standards and moderation policies
Coordinate network upgrades and protocol changes
Manage auto-payment processing and dispute resolution

Funding Sources:

Node membership fees (sliding scale based on usage)
Corporate sponsorships from companies building on EventNet
Auto-payment revenue from high-usage nodes
Grants from organizations supporting open data initiatives

Community Governance

Open Source Development: All software released under AGPL license requiring contributions back to the commons

Community Standards: Developed through open process similar to IETF RFCs

Dispute Resolution: Multi-tier system from peer mediation to foundation arbitration

Technical Evolution: Protocol changes managed through community consensus process

Comparison with Existing Technologies

Nostr Protocol

EventNet shares some architectural concepts with Nostr (Notes and Other Stuff Transmitted by Relays) but differs in key ways:

Similarities:

Decentralized/federated architecture
Cryptographic identity and verification
Resistance to censorship and single points of failure

Differences:

Focus: EventNet specializes in event data vs. Nostr’s general social protocol
Incentives: Token-based contribution system vs. Nostr’s voluntary participation
Quality Control: Sophisticated reputation and verification vs. Nostr’s minimal moderation
Data Structure: Rich event schema vs. Nostr’s simple note format
Commercial Model: Sustainable funding model vs. Nostr’s unclear economics

Mastodon/ActivityPub

EventNet’s federation model resembles social networks like Mastodon but optimizes for structured data sharing rather than social interaction.

BitTorrent/IPFS

While these systems enable distributed file sharing, EventNet focuses on real-time structured data with quality verification rather than content distribution.

Implementation Roadmap

Phase 1: Foundation Infrastructure (6 months)

Central authentication service
Reference node software (minimal viable implementation)
Point system and billing infrastructure
Basic web interface for node management
Initial documentation and developer tools

Phase 2: AI Scout System (6 months)

Web scraping and content extraction pipeline
Natural language processing for event data
Venue matching against OpenStreetMap
Quality assessment and verification tools
Integration with common event platforms and APIs

Phase 3: Network Effects (12 months)

Onboard initial node operators (universities, venues, civic organizations)
Develop ecosystem of applications building on EventNet
Establish community governance processes
Launch public marketing and developer outreach
Implement advanced features (trust networks, content filtering)

Phase 4: Scale and Sustainability (ongoing)

Global network expansion
Advanced AI capabilities and automated quality control
Commercial service offerings for enterprise users
Integration with major platforms and data sources
Long-term sustainability and governance maturation

Technical Requirements

Minimum Node Requirements

Compute: 2 CPU cores, 4GB RAM, 50GB storage
Network: Reliable internet connection, static IP preferred
Software: Docker-compatible environment, HTTPS capability
Maintenance: 2-4 hours per week for human verification tasks

Scaling Considerations

Database: PostgreSQL with spatial extensions for geographic queries
Caching: Redis for frequent access patterns and temporary tokens
Messaging: Event-driven architecture for real-time updates
Monitoring: Comprehensive logging and alerting for network health

Security and Privacy

Authentication: OAuth 2.0 with JWT tokens for API access
Encryption: TLS 1.3 for all network communication
Data Protection: GDPR compliance with user consent management
Abuse Prevention: Rate limiting, anomaly detection, and automated blocking

Call to Action

For Developers

EventNet represents an opportunity to solve one of the internet’s most persistent infrastructure gaps. The event discovery problem affects millions of people daily and constrains innovation in location-based services, social applications, and civic engagement tools.

Contribution Opportunities:

Core Development: Help build the foundational network software
AI/ML Engineering: Improve event extraction and quality assessment
Frontend Development: Create intuitive interfaces for node management
DevOps: Optimize deployment, scaling, and monitoring systems
Documentation: Make the system accessible to new participants

For Organizations

Universities, civic organizations, venues, and businesses have immediate incentives to participate:

Universities: Aggregate campus events while accessing city-wide calendars Venues: Share their calendars while discovering nearby events for cross-promotion
Civic Organizations: Improve community engagement through comprehensive event discovery Businesses: Build innovative applications on reliable, open event data

For the Community

PublicSpaces.io succeeds only with community adoption and stewardship. The network becomes more valuable as more participants contribute data, verification, and development effort.

Getting Started:

Review the technical specification and provide feedback
Join the development community on GitHub and Discord
Pilot a node in your organization or community
Build applications that showcase PublicSpaces.io’s capabilities
Spread awareness of the open event data vision

Conclusion

PublicSpaces.io addresses a fundamental infrastructure gap that has limited innovation in event discovery for decades. By creating a federated network with proper incentive alignment, quality control, and community governance, we can build the missing foundation that enables the next generation of event-related applications.

The technical challenges are solvable with current AI and distributed systems technology. The economic model provides sustainability without compromising the open data mission. The community governance approach has been proven successful by projects like OpenStreetMap and Wikipedia.

Success requires coordinated effort from developers, organizations, and communities who recognize that public event discovery is too important to be controlled by any single entity. PublicSpaces.io offers a path toward an open, comprehensive, and reliable public event dataset that serves everyone’s interests.

The question is not whether such a system is possible – it is whether we have the collective will to build it.

License: This white paper is released under Creative Commons Attribution-ShareAlike 4.0

submitted by /u/CiaranCarroll
[link] [comments]

A Free List Of 19000+ AI Tools On Github

submitted by /u/lakey009
[link] [comments]

Is There A Compete (or Close To Complete) APIs Dataset?

Can anyone recommend a complete API dataset? Ideally a collection of OpenAPIs specs or Swaggers across as many services possible.

submitted by /u/apinference
[link] [comments]

Girls Gone Wild Commercials Archive Originally Aired On Television

is there an archive of all the commercials?

submitted by /u/Repulsive-Ice3385
[link] [comments]

Free ESG Data Sets For Master’s Thesis Regarding EU Corporations

Hello!

I was looking forward for any free trials or any free data sets of Real ESG data for EU Corporations.

Any recomendations would be useful!

Thanks !

submitted by /u/Exciting_Badger
[link] [comments]

Looking For Data Extracted From Electric Vehicles (EV)

Electric vehicles (EVs) are becoming some of the most data-rich hardware products on the road, collecting more information about users, journeys, driving behaviour, and travel patterns.
I’d say collecting more data on users than mobile phones.

If anyone has access to, or knows of, datasets extracted from EVs. Whether anonymised telematics, trip logs, user interactions, or in-vehicle sensor data , would be really interested to see what’s been collected, how it’s structured, and in what formats it typically exists.

Would appreciate any links, sources, or research papers or insighfull comments

submitted by /u/Winter-Lake-589
[link] [comments]

Looking For Dataset Of Instagram & TikTok Usernames (Metadata Optional)

Hi everyone,

I’m working on a research project that requires a large dataset of Instagram and TikTok usernames. Ideally, it would also include metadata like follower count, or account creation date – but the usernames themselves are the core requirement.

Does anyone know of:

Public datasets that include this information

Licensed or commercial sources

Projects or scrapers that have successfully gathered this at scale

Any help or direction would be greatly appreciated!

submitted by /u/rockweller
[link] [comments]

Looking For A Daily Updated Climate Dataset

I tried in some of the official sites but most are updated till 2023. I aant to make a small project of climate change predictor on any type. So appreciate the help.

submitted by /u/FastCommission2913
[link] [comments]

How Can I Build A Dataset Of US Public Companies By Industry Using NAICS/SIC Codes?

I’m working on a project where I need to identify all U.S. public companies listed on NYSE, NASDAQ, etc. that have over $5 million in annual revenue and operate in the following industries:

Energy
Defense
Aerospace
Critical Minerals & Supply Chain
Maritime & Infrastructure
Pharmaceuticals & Biotech
Cybersecurity

I’ve already completed Step 1, which was mapping out all relevant 2022 NAICS/SIC codes for these sectors (over 80 codes total, spanning manufacturing, mining, logistics, and R&D).

Now for Step 2, I want to build a dataset of companies that:

Are listed on U.S. stock exchanges
Report >$5M in revenue
Match one or more of the NAICS codes

My questions:

What’s the best public or open-source method to get this data?
Are there APIs (EDGAR, Yahoo Finance, IEX Cloud, etc.) that allow filtering by NAICS and revenue?
Is scraping from company listings (e.g. NASDAQ screener, Yahoo Finance) a viable path?
Has anyone built something similar or have a workflow for this kind of company-industry filtering?

submitted by /u/Hour_Presentation657
[link] [comments]

18+ Content

Why it might help you:

Ideal for:

Building the OpenStreetMap for Public Events Through AI-Powered Collaboration

Executive Summary

The Event Discovery Crisis

The Startup Graveyard

The AI Revolution Changes Everything

The Missing Infrastructure

Why Previous Attempts Failed

The PublicSpaces.io Solution

The AI-First Opportunity

Core Architecture

Federated vs. Centralized Design

Technical Specification

Event Identity and Versioning

Token-Based Access System

Overview

Authentication Flow

Point System

Data Exchange Protocol

Request Structure

Response Structure

Quality Control and Reputation System

Duplicate Detection and Penalties

Fake Event Penalties

Trust Networks and Filtering

Master Node Optimization

AI-Powered Event Discovery

Scout Architecture

Source Management

Action Pipeline

Node Software Architecture

Backend API

Frontend Management Interface

Expected Integrations

Economic Model and Governance

Foundation Structure

Community Governance

Comparison with Existing Technologies

Nostr Protocol

Mastodon/ActivityPub

BitTorrent/IPFS

Implementation Roadmap

Phase 1: Foundation Infrastructure (6 months)

Phase 2: AI Scout System (6 months)

Phase 3: Network Effects (12 months)

Phase 4: Scale and Sustainability (ongoing)

Technical Requirements

Minimum Node Requirements

Scaling Considerations

Security and Privacy

Call to Action

For Developers

For Organizations

For the Community

Conclusion

My questions:

Recent Posts

Recent Comments