[self-promotion] I Scraped ~70k Geopolitical Risk Events From Public Feeds. Only About A Quarter Made The News. (Parquet + CSV On HF/Kaggle)

For the past few months I’ve been building an open dataset of geopolitical and supply chain risk events scraped from public feeds (GDELT, ACLED, GDACS, NASA FIRMS, WHO DON). It’s at around 70k events at this point. The thing that surprised me when I cross-checked against mainstream news coverage: only about 28% of those events have any major-outlet article attached.

The other ~72% are silent: flagged in at least one public feed but never picked up by major news. I’d assumed those would all be low-severity noise (small protests, minor weather flags, single-source rumors). They’re not. Roughly a quarter of the silent set is still rated critical or high severity by the source feed itself, which works out to ~14k events that no major outlet covered. ACLED specifically dominates the silent set: local conflict events that don’t make English-language outlets.

The cross-check has obvious limits worth flagging up front: my “news coverage” is a Google News fetch (so paywalled or non-English coverage gets undercounted), and the severity is graded after the fact by an LLM step (so it can misjudge ambiguous events). Both are best-effort. But the headline gap (~28% news overlap) is just a SQL join, not LLM-dependent. Events are geocoded by region and contain no PII. Actor names from ACLED are excluded per their license.
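For anyone who wants to sanity-check that kind of number, here is a minimal sketch of a coverage join using Python’s built-in `sqlite3`. The table and column names (`events`, `articles`, `event_id`, `source_feed`, `url`) are illustrative assumptions, not the published schema, and the rows in the usage note are toy data.

```python
import sqlite3


def coverage_share(events, articles):
    """Return (covered, total, share) for events with at least one linked article.

    events:   iterable of (event_id, source_feed) rows  -- hypothetical layout
    articles: iterable of (event_id, url) rows          -- hypothetical link table
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, source_feed TEXT)")
    con.execute("CREATE TABLE articles (event_id TEXT, url TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?)", events)
    con.executemany("INSERT INTO articles VALUES (?, ?)", articles)

    # LEFT JOIN keeps events with no article (the "silent" ones).
    # COUNT(DISTINCT ...) skips NULLs and avoids double-counting events
    # that have several articles attached.
    covered, total = con.execute(
        """
        SELECT COUNT(DISTINCT a.event_id), COUNT(DISTINCT e.event_id)
        FROM events e
        LEFT JOIN articles a ON a.event_id = e.event_id
        """
    ).fetchone()
    con.close()

    share = covered / total if total else 0.0
    return covered, total, share
```

With four toy events and two articles that both point at the same event, `coverage_share(...)` counts one covered event out of four. The silent share is just one minus that ratio.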

The deduplicated event/chokepoint/entity tables are up on Hugging Face and Kaggle as Parquet plus a 10% CSV sample, under CC-BY-NC-SA. A browsable map version is at tremorwatch.com if you want to poke at individual events first. Curious if anyone has tried something similar at this scale and how you’d refine the coverage definition (different news source mix, embedding-based fuzzy match, etc.).
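To make the fuzzy-match idea concrete, here is a rough sketch of a looser coverage test. It uses plain string similarity from the standard library (`difflib.SequenceMatcher`) as a stand-in for embeddings, so it only catches near-identical wording, not paraphrases. The function names and the 0.6 threshold are arbitrary assumptions for illustration, not what the dataset currently does.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(text.lower().split())


def is_covered(event_title, headlines, threshold=0.6):
    """True if any headline is at least `threshold` similar to the event title.

    `threshold` is a hypothetical cutoff; it would need tuning on labeled pairs.
    """
    target = normalize(event_title)
    return any(
        SequenceMatcher(None, target, normalize(h)).ratio() >= threshold
        for h in headlines
    )
```

Swapping `SequenceMatcher` for cosine similarity over sentence embeddings would keep the same shape: normalize, score each candidate headline, and compare against a cutoff.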

Disclosure: I built this — part of an early-stage startup (Volt AI). Dataset is free under CC-BY-NC-SA, no paid tier exists yet. Posting under r/datasets self-promo guidelines; happy to adjust format if mods prefer.

submitted by /u/Latter_Panda4439