Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Where Can I Find Recent Free Data For The Brazilian Série A Or The Premier League?

Hi everyone! I’m building some dashboards to practice my skills and I wanted to use data from something I really enjoy. I love football, and since I’m Brazilian, I’d really like to use data from the Campeonato Brasileiro Série A — but I haven’t been able to find this data anywhere.

If nobody knows where to find Brazilian league data, could someone help me find Premier League data instead? I’m looking for datasets that include things like:

  • match results
  • lineups
  • yellow/red cards
  • match date, time, and location
  • and anything else that might be interesting to download and analyze
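
If you do track down a results CSV (football-data.co.uk publishes free per-season Premier League files in roughly this shape), a small parser gets a dashboard started. The rows below are invented; only the column names follow that site's convention, so adjust them to whatever file you actually find:

```python
import csv
import io
from collections import defaultdict

# Columns follow the football-data.co.uk convention (FTHG/FTAG = full-time
# home/away goals, FTR = full-time result H/D/A); these rows are made up.
SAMPLE = """Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
16/08/2025,Arsenal,Everton,2,0,H
16/08/2025,Chelsea,Spurs,1,1,D
17/08/2025,Everton,Chelsea,0,3,A
"""

def league_points(csv_text):
    """Tally league points from match results: 3 for a win, 1 for a draw."""
    points = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        home, away = row["HomeTeam"], row["AwayTeam"]
        if row["FTR"] == "H":
            points[home] += 3
        elif row["FTR"] == "A":
            points[away] += 3
        else:  # draw
            points[home] += 1
            points[away] += 1
    return dict(points)

print(league_points(SAMPLE))
```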

Thanks in advance for any pointers!

submitted by /u/EdScavalier
[link] [comments]

Flight Tracking API For Small-Scale Commercial Use… What's Actually Worth It?

Hey all – working on a dispatch system for a small airport shuttle service. One of the components is adjusting pickup times based on flight delays/early arrivals.

I’ve been researching flight tracking APIs and so far I’ve come across:

– AeroDataBox (~$15-30/mo on RapidAPI)

– Airlabs ($49/mo for 25K queries)

– FlightAware AeroAPI ($100/mo minimum)

– FlightStats/Cirium (enterprise pricing, way out of budget)

We’re only tracking maybe 30-40 domestic arrivals per day at one airport (PHX). Not looking for anything fancy – just arrival ETAs, delay notifications, and maybe gate/terminal info if available.

Push notifications/webhooks would be awesome so we’re not wasting API queries polling, but polling would be doable if the price is right.
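
Whichever provider you pick, the dispatch-side logic looks the same: take the scheduled arrival, apply whatever delay the API reports, and pad with a buffer for deplaning and baggage. A minimal sketch — the function name and the 20-minute buffer are assumptions, not any particular API's schema:

```python
from datetime import datetime, timedelta

# Time for deplaning + baggage claim; tune per airport/terminal.
BUFFER = timedelta(minutes=20)

def adjusted_pickup(scheduled_arrival, delay_minutes):
    """Shift the shuttle pickup by the reported delay plus a fixed buffer.

    `delay_minutes` would come from whichever flight API you settle on --
    providers generally expose some form of estimated vs scheduled time,
    though the field names differ.
    """
    eta = scheduled_arrival + timedelta(minutes=delay_minutes)
    return eta + BUFFER

sched = datetime(2025, 6, 1, 14, 30)
print(adjusted_pickup(sched, 25))  # flight running 25 min late
```

At 30-40 arrivals/day, even polling every 5 minutes per tracked flight stays in the low thousands of queries per day, which is worth checking against each provider's quota before paying for webhooks.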

Anyone else working with flight data at a small scale? Something cheaper/better that I’m missing? Open to scrappy solutions too – just needs to be stable enough for a real business.

submitted by /u/zues8
[link] [comments]

I Spent 200+ Hours Building A Forensic Financial Database From 1.48M DOJ Epstein EFTA Files. Here's Where $1.96 Billion Went.

I’m a finance professional with a background in data science and cybersecurity. Over the past two weeks I built a 6.9GB forensic database from 1,476,377 DOJ EFTA files across 19 datasets — then ran a 24-phase extraction pipeline to trace wire transfers through the Epstein trust network.

Key results:

• $1.964B in financial activity extracted (104.6% of the $1.878B FinCEN SAR benchmark)

• 382 audited wire transfers in the master ledger

• 4-tier shell trust hierarchy mapped with dollar flows on every edge

• 43 shell-to-shell transfers identified

• 9 contamination bugs caught and corrected during the pipeline (including $311M in chain-hop inflation I subtracted from my own numbers)

• 11.4 million entities extracted, 734K unique persons identified

I traced $51.9M flowing through a brokerage shell (Jeepers Inc.) into Epstein’s personal account across 21 wires. I found Plan D LLC disbursing $18M to Leon Black with near-zero inflow. I found an entity called “Gratitude America” sending 88% of its money to investment accounts and 7% to charity.
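
For anyone wondering what "chain-hop inflation" means in practice: when $10M moves outside bank → shell A → shell B → personal account, it shows up as three wires, so a naive sum triples it. A toy illustration of one possible correction (invented numbers and a simplified rule, not the OP's actual methodology):

```python
# Each wire is (source, destination, amount). Summing every wire
# triple-counts the single $10M flow; counting only money that enters
# the network from outside ignores the internal hops.
WIRES = [
    ("OutsideBank", "ShellA", 10_000_000),
    ("ShellA", "ShellB", 10_000_000),        # internal hop
    ("ShellB", "PersonalAcct", 10_000_000),  # internal hop
]
SHELLS = {"ShellA", "ShellB", "PersonalAcct"}

naive_total = sum(amt for _, _, amt in WIRES)
external_inflow = sum(
    amt for src, dst, amt in WIRES if src not in SHELLS and dst in SHELLS
)

print(naive_total, external_inflow)  # 30000000 10000000
```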

Everything is marked (Unverified) — this is automated extraction, not an audit opinion. I documented every limitation, every bug, and every methodological decision. The methodology, findings, compliance statement, and the 382-wire master ledger are all published.

To my knowledge, this is the first project to systematically reconstruct the financial infrastructure from the EFTA corpus using quantitative forensic methods rather than narrative document review.

GitHub:

https://github.com/randallscott25-star/epstein-forensic-finance

Built solo. For the girls.

submitted by /u/Specialist_Rip5492
[link] [comments]

Has Anyone Successfully Contacted The Seagull Dataset Team?

I’m trying to get access to the Seagull Dataset (the UAV maritime surveillance dataset from VisLab). Their page says the data is available “upon request,” but I haven’t received any reply after reaching out.

Has anyone here managed to contact them recently or gotten access?
If so, how long did it take, and which email or method worked for you?

Any insight would be appreciated!

submitted by /u/Due_Radio2866
[link] [comments]

The ENTIRE Epstein Files Dataset Is Now Fully Viewable

So, I was going through Hugging Face and got to wondering: did someone upload the ENTIRE new Epstein Files? It turns out nobody did. Nobody uploaded the complete set, and even worse, nobody processed them well.

So, Ladies and Gents, here is the full dataset, easily processable for everyone. If you want to recreate what I did, here is the GitHub: GitHub Link

What does the dataset include? Audio files, videos, images, PDF texts… even Excel files?

Questions? Just ask.
Compliments? Just give me some.
Love y’all ❤️

submitted by /u/itsnikity
[link] [comments]

Made A Fast Go Downloader For Massive Files (beats Aria2 By 1.4x)

Hey guys, we’re a couple of CS students who got annoyed with slow single-connection downloads, so we built Surge. Figured the datasets crowd might find it handy for scraping huge CSVs or image directories.

It’s a TUI download manager, but it also has a headless server mode which is perfect if you just want to leave it running on a VPS to pull data overnight.

  • It splits files and maximizes bandwidth by using parallel chunk downloading.
  • It's far more stable and faster than downloading through a browser like Chrome or Firefox.
  • You can use it remotely (over LAN, for something like a home lab).
  • You can deploy it easily via Docker compose.
  • We benched it against standard tools and it beat aria2c by about 1.38x, and was over 2x faster than wget.

Check it out if you want to speed up your data scraping pipelines.
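
The core trick behind parallel chunk downloading is just splitting the file into contiguous byte ranges and fetching each one with an HTTP `Range: bytes=a-b` header on its own connection. A sketch of the range math (an illustration of the idea, not Surge's actual code):

```python
def chunk_ranges(size, n):
    """Split a file of `size` bytes into `n` contiguous (start, end) byte
    ranges, inclusive on both ends -- the same ranges a parallel downloader
    would put in HTTP `Range: bytes=start-end` request headers."""
    base, extra = divmod(size, n)
    ranges, start = [], 0
    for i in range(n):
        # Spread the remainder over the first `extra` chunks.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges

print(chunk_ranges(100, 3))  # [(0, 33), (34, 66), (67, 99)]
```

Each range is then downloaded concurrently and the pieces are written into the file at their offsets; the server must advertise `Accept-Ranges: bytes` for this to work.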

GH: github.com/surge-downloader/surge

submitted by /u/SuperCoolPencil
[link] [comments]

I Analyzed 25M+ Public Records To Measure Racial Disparities In Sentencing, Traffic Stops, And Mortgage Lending Across The US

I built three investigations using only public government data:

Same Crime, Different Time — 1.3M federal sentencing records (USSC, 2002-2024). Black defendants receive 3.85 months longer sentences than white defendants for the same offense, controlling for offense type, criminal history, and other factors.

Same Stop, Different Outcome — 8.6M traffic stops across 18 states (Stanford Open Policing Project). Black and Hispanic drivers are searched at 2-4x the rate of white drivers, yet contraband is found less often.

Same Loan, Different Rate — 15.3M mortgage applications (HMDA, 2018-2023). Black borrowers pay 7.1 basis points more and Hispanic borrowers 9.7 basis points more in interest rate spread, even after OLS regression controls.
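
For readers wondering what "after OLS regression controls" buys you, here is a toy version on fully synthetic data (simulated numbers, not drawn from the real datasets): the coefficient on the group indicator recovers the gap that remains after adjusting for the control variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
group = rng.integers(0, 2, n)    # 1 = comparison group
control = rng.normal(0, 1, n)    # stand-in for e.g. criminal-history score

# Synthetic truth: a 3.85-unit gap after controlling for `control`.
y = 10 + 3.85 * group + 2.0 * control + rng.normal(0, 1, n)

# OLS via least squares: intercept, group effect, control effect.
X = np.column_stack([np.ones(n), group, control])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[1])  # adjusted disparity, close to 3.85
```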

All data is public, all code is open source, and the interactive sites are free:

• samecrimedifferenttime.org (http://samecrimedifferenttime.org/)

• samestopdifferentoutcome.org (http://samestopdifferentoutcome.org/)

• sameloandifferentrate.org (http://sameloandifferentrate.org/)

Happy to answer questions about methodology.

submitted by /u/justiceindexhub
[link] [comments]

How Do MTGTop8 / Tcdecks And Others Actually Get Their Decklists? (noob Here)

Hello guys,

I’m looking into building a small tournament/decklist aggregator (just a personal project, nothing too ambitious), and I’m curious about the data sourcing behind the big sites like MTGTop8, Tcdecks, Mtgdecks, Mtggoldfish and others.

I doubt these sites are manually updated by people typing in lists 24/7, so can you help me understand how they work?

Where do these sites “pull” their lists from? Is there an API for tournament results (besides the official MTGO ones), or is it 100% web scraping?

Does a public archive/database of historical decklists (from years ago) exist, or is everyone just sitting on their own proprietary data?

Is there a standard way/format to programmatically receive updated decklists from smaller organizers?

If anyone has experience with MTG data engineering or knows of any open-source scrapers/repos, any help is really appreciated.
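
On the scraping route, the usual pattern is: fetch a tournament index page, pull out the deck links, then fetch each deck page. A stdlib-only sketch against made-up HTML — real sites' markup differs, so inspect it first (and check robots.txt/ToS before scraping):

```python
from html.parser import HTMLParser

# Hypothetical markup; no real site is guaranteed to look like this.
PAGE = """
<div class="deck"><a href="/deck/123">Mono Red Burn</a></div>
<div class="deck"><a href="/deck/124">Azorius Control</a></div>
"""

class DeckLinkParser(HTMLParser):
    """Collect (deck name, href) pairs from <div class="deck"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_deck = False
        self.current_href = None
        self.decks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "deck":
            self.in_deck = True
        elif tag == "a" and self.in_deck:
            self.current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_deck = False

    def handle_data(self, data):
        if self.in_deck and self.current_href and data.strip():
            self.decks.append((data.strip(), self.current_href))
            self.current_href = None

parser = DeckLinkParser()
parser.feed(PAGE)
print(parser.decks)
```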

thank you guys

submitted by /u/Dariospinett
[link] [comments]

Alternatives To The UDC (Universal Decimal Classification) Knowledge Taxonomy

I’ve been looking for a general taxonomy with breadth and depth, somewhat similar to the Dewey-Decimal, or UDC taxonomies.

I can’t find an open expression of the Dewey Decimal system (and tbh it’s probably fairly out of date now), and while the UDC offers a widely available 2,500-concept summary version, it doesn’t go into enough detail for practical use. The master reference file runs to ~70k entries, but costs >€350 a year to license.

Are there any openly available, broad and deep taxonomic datasets that I can easily download, that are reasonably well accepted as standards and do a good job of defining a range of topics, themes or concepts I can use to help classify documents and other written resources?

One minute I might be looking at a document that provides technical specifications for a data-processing system, the next, a summary of some banking regulations around risk-management, or a write-up of the state of the art in AI technology. I’d like to be able to tag each of these different documents within a standard scheme of classifications.
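
Whatever taxonomy you land on, the tagging step itself can start very simply, e.g. keyword overlap against each concept. The codes and keyword sets below are illustrative stand-ins, not real UDC notation:

```python
# Hand-rolled slice of a UDC-like scheme: code -> indicative keywords.
# Both the codes and the keyword sets here are invented for illustration.
TAXONOMY = {
    "004 Computing / data processing": {"data", "processing", "system", "ai"},
    "336 Finance / banking": {"banking", "risk", "regulation", "capital"},
}

def classify(text):
    """Return the taxonomy code with the most keyword overlap, or None."""
    words = set(text.lower().split())
    scores = {code: len(words & kws) for code, kws in TAXONOMY.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("A summary of banking regulation and risk management"))
```

In practice you would swap the keyword sets for embeddings or a trained classifier, but the shape of the problem — documents in, codes from a fixed scheme out — stays the same.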

submitted by /u/ResidentTicket1273
[link] [comments]

“Why Does Our Scraping Pipeline Break Every Two Weeks?”

Most enterprise teams consider only the costs of proxy APIs and cloud servers, overlooking the underlying issue.

Senior Data Engineers, who command salaries of $150,000 or more, spend up to 30% of their time addressing Cloudflare blocks and broken DOM selectors. From a capital allocation perspective, assigning top engineering talent to manage website layout changes is inefficient when web scraping is not your core product.

The solution is not to purchase better scraping tools, but to shift from building infrastructure to procuring outcomes.

Forward-thinking enterprises are adopting Fully Managed Data-as-a-Service. In practice, this approach offers the following benefits:

Engineers are no longer required to fix broken scripts. The managed partner employs autonomous AI agents to handle layout changes and anti-bot systems seamlessly.

Instead of purchasing code, you secure a contract. If a target site undergoes a complete redesign overnight, the partner’s AI adapts, ensuring your data is delivered on time.

Extraction costs are capped, allowing your engineering team to focus on developing features that drive revenue.

A more reliable data supply chain is needed, not just a better scraper.

Is your engineering team focused on building your core product, or are they managing broken pipelines?

Multiple solutions are available; choose the one that best fits your needs.

submitted by /u/3iraven22
[link] [comments]

Lowest Level Of Geospatial Demographic Dataset

Where can I get block-level demographic data that I can clip to just the area I want, without any “casualties” (i.e. pulling in the full totals from an adjoining block group or ZIP code just because a small part of it overlaps my area of interest)?

PS: I’ve tried the Census Bureau and NHGIS and they don’t give me anything I like. The Census Bureau site is near useless, btw. I don’t mind paying one of those data-broker websites that charge like $20, but which ones are credible? Please help.
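
One way to avoid those “casualties” regardless of source is area-weighted apportionment: scale each block group's count by the fraction of its area inside your boundary instead of keeping or dropping it whole. A toy sketch with axis-aligned rectangles standing in for real polygons (with actual shapefiles you would use geopandas `overlay`/`clip` for the geometry):

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned rects (xmin, ymin, xmax, ymax)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def rect_area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

AOI = (0, 0, 10, 10)  # area of interest
blocks = [
    {"pop": 400, "geom": (0, 0, 10, 5)},   # fully inside -> counts 100%
    {"pop": 200, "geom": (5, 5, 15, 15)},  # 25% inside -> counts 25%
]

# Apportion each block's population by its overlap fraction.
est = sum(
    b["pop"] * overlap_area(b["geom"], AOI) / rect_area(b["geom"])
    for b in blocks
)
print(est)  # 400 + 200 * 0.25 = 450.0
```

The implicit assumption is uniform population density within each block, which is why starting from the smallest geography available (census blocks rather than block groups or ZIPs) keeps the error small.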

submitted by /u/owuraku_ababio
[link] [comments]