[self-promotion] Free 20-record samples (CSV + JSON) of 20 dev/AI datasets — npm, MCP servers, Hugging Face models, Homebrew, etc.

Hi r/datasets — disclosure first: I sell a paid version of these on Gumroad ($34, 83% off at launch). I’m posting the free 20-record samples here because they’re genuinely useful on their own, and the mod rules ask that self-promotion be labeled.

What’s in the free samples:

20 niche datasets, each with 20 fully enriched records as CSV + JSON. The paid version has ~55,000 records total (54,958 as of today). Topics:

  • ai-tools, ai-agents, ai-prompts, ai-models-pricing (13 paid Llama 3.3 70B providers compared)
  • public-apis, mcp-servers (2,971), developer-tools, vscode-extensions
  • self-hosted-software, open-source-alternatives, no-code-lowcode
  • design-resources, cybersecurity-tools
  • npm-packages (top by weekly downloads), homebrew-formulae
  • huggingface-models (top 4,000 by downloads), huggingface-datasets (2,600+)
  • vector-db / RAG ecosystem, ai-agent-frameworks (1,324 records — grew 6.6x in 8 days)

Why I built them:

I kept needing structured, queryable lists of “all the X tools” for filterable directory builds. Awesome lists and READMEs are great for browsing but useless for jq, SQL, or search infrastructure. So I curate, normalize, validate (zero invalid records), enrich with stars/downloads/installs, and refresh on a schedule.
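To show the kind of query the structured format enables (versus scraping a README), here is a minimal Python sketch. The records and field names (`name`, `category`, `stars`) are made up for illustration; the actual schemas vary per dataset.

```python
# Hypothetical records shaped like the JSON samples; real field names may differ.
records = [
    {"name": "tool-a", "category": "vector-db", "stars": 1200},
    {"name": "tool-b", "category": "vector-db", "stars": 45000},
    {"name": "tool-c", "category": "rag", "stars": 800},
]

# The same filter you'd express in jq or SQL: select vector-db tools,
# then rank them by stars, highest first.
top = sorted(
    (r for r in records if r["category"] == "vector-db"),
    key=lambda r: r["stars"],
    reverse=True,
)
print([r["name"] for r in top])  # → ['tool-b', 'tool-a']
```

The equivalent jq one-liner would be a `select` + `sort_by` pipeline over the raw JSON file; the point is that none of this works against a free-text awesome-list.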

Per-record fields are typed, and a categorizationTier field rates each record’s categorization specificity at 87–100% (versus a vague “tool” label). Open question for the sub: how do you handle tier-of-specificity in your own dataset categorization work? My current rubric is config-driven per dataset, but I’m curious what others do.
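For concreteness, here is a toy sketch of what a config-driven specificity rubric could look like. The signal names and weights are entirely invented; the post doesn’t describe the actual rubric, only that it is per-dataset and config-driven.

```python
# Hypothetical per-dataset rubric: weight the signals that make a category
# label specific rather than a generic "tool" bucket. Names/weights invented.
RUBRIC = {
    "has_subcategory": 40,
    "has_use_case_tag": 35,
    "has_ecosystem_link": 25,
}

def categorization_tier(record: dict) -> int:
    """Score 0-100: how specific this record's categorization is."""
    return sum(weight for signal, weight in RUBRIC.items() if record.get(signal))

record = {"has_subcategory": True, "has_use_case_tag": True, "has_ecosystem_link": False}
print(categorization_tier(record))  # → 75
```

Swapping the `RUBRIC` dict per dataset is what “config-driven” would mean under this sketch; a fixed floor (e.g. rejecting records below some threshold) would explain a bounded range like 87–100%.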

Free samples (CSV + JSON, MIT-style permissive): https://github.com/futdevpro/niche-datasets-free

Includes mega-sample.json (5 random records from each of the 20 datasets, 100 records total).
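A quick sketch of slicing mega-sample.json back into its 20 source datasets, assuming each record carries a field naming its source dataset (the field name `dataset` here is an assumption, and the inline data is a fabricated three-record stand-in for the real 100-record file):

```python
import json
from collections import Counter

# Fabricated slice standing in for mega-sample.json; the real file holds
# 100 records (5 per dataset), and the tagging field name is an assumption.
mega_sample = json.loads("""
[
  {"dataset": "npm-packages", "name": "left-pad"},
  {"dataset": "npm-packages", "name": "lodash"},
  {"dataset": "mcp-servers", "name": "fs-server"}
]
""")

# Count records per source dataset.
counts = Counter(r["dataset"] for r in mega_sample)
print(counts)  # → Counter({'npm-packages': 2, 'mcp-servers': 1})
```

Against the real file, every count should come out as 5.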

Paid version on Gumroad — $34 launch price (83% off the $198 list price), with a monthly refresh for AI Models Pricing (because OpenRouter changes weekly) and quarterly refreshes for the rest. It’s linked from the GitHub README if anyone wants the full thing.

Happy to answer questions about the catalog, methodology, or specific datasets.

submitted by /u/Jhonny_Ronnie
