Hi everyone,
I’m sharing a metadata-only dataset of 7,000 news articles (sampled from a larger 700k-article core) designed specifically for NLP feature engineering and media intelligence. Instead of stopping at standard sentiment (Positive/Negative), I’ve focused on what I call “Narrative Alpha”: structural signals that quantify how a story is being told.
Why this is useful: If you’re building news classifiers, bias detectors, or financial sentiment models, raw text alone often isn’t enough. This set provides deterministic linguistic metrics that a plain scrape doesn’t give you.
What’s Inside (22 Columns):
- Structural Metrics: Passive Voice Ratio, Sentence/Word Counts.
- Narrative Signals: Hedging Rate (uncertainty cues), Claim Density per 1k words.
- Credibility & Alignment: Headline-Body Alignment Score, Primary Source Ratio (attribution).
- Traditional Labels: Topic, Political Orientation, Bias Strength, Credibility Level.
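
For anyone wondering what metrics like these look like in practice, here is a minimal sketch of a hedging rate and a per-1k-words normalization. This is not the dataset's actual pipeline: the cue list `HEDGE_CUES` and the function names are assumptions made up for illustration, and a real lexicon would be much larger.

```python
import re

# Hypothetical cue list for illustration only; the dataset's real
# hedging lexicon is not described in this post.
HEDGE_CUES = {"may", "might", "could", "reportedly", "allegedly",
              "possibly", "appears", "suggests"}


def tokenize(text):
    """Return lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def hedging_rate(text):
    """Share of tokens that are hedging cues (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in HEDGE_CUES for t in tokens) / len(tokens)


def per_1k_words(count, word_count):
    """Normalize a raw count (e.g. number of claims) to a rate per 1,000 words."""
    return 1000.0 * count / word_count if word_count else 0.0


sample = "Officials said the plan may fail. The minister reportedly resigned."
print(hedging_rate(sample))   # 2 cues out of 10 tokens
print(per_1k_words(3, 600))   # 3 claims in a 600-word article
```

Both functions are deterministic, so the same text always yields the same value.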
Technical Specs:
- Format: Tabular CSV (clean; no article text is included, to avoid legal/copyright issues).
- Usability: 10.0/10.0 on Kaggle (fully documented columns).
- License: CC BY 4.0 (Open for research/commercial use).
Link: Kaggle
AMA about the methodology or the pipeline!
submitted by /u/Queasy_System9168