Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Anyone Have Any Experience Downloading League Of Legends Data Sets Like Na.op.gg?

Hi everyone.

I was wondering if anyone on this sub had experience working with/downloading solo- and duo-queue League of Legends data. Is it possible to export it from na.op.gg, or does Riot have an API I can get it from?

Ideally I would like to wrangle the data in a way where I could separate my soloq games from my duoq games, to get some stats and expose my duoq partner.

Anyone have experience with this, or think it's possible?

EDIT: I can use Python with libraries like pandas and NumPy for some simple data wrangling and analysis.
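Riot does offer a developer API (the Match endpoints return full participant lists per game), and once the matches are in a DataFrame the soloq/duoq split is a simple filter. A minimal pandas sketch, where the column names, match IDs, and summoner names are all placeholders rather than Riot's actual schema:

```python
import pandas as pd

# Hypothetical match history: one row per ranked game, with the names of the
# teammates in that game. In practice these rows would come from Riot's API.
games = pd.DataFrame({
    "match_id": ["NA1_001", "NA1_002", "NA1_003"],
    "win": [True, False, True],
    "teammates": [
        ["DuoPartner", "P1", "P2", "P3"],
        ["P4", "P5", "P6", "P7"],
        ["DuoPartner", "P5", "P6", "P7"],
    ],
})

DUO_PARTNER = "DuoPartner"  # placeholder for the partner's summoner name

# A game counts as duoq if the partner appears among the teammates
games["is_duoq"] = games["teammates"].apply(lambda t: DUO_PARTNER in t)
duoq = games[games["is_duoq"]]
soloq = games[~games["is_duoq"]]

print(f"duoq win rate: {duoq['win'].mean():.0%}")
print(f"soloq win rate: {soloq['win'].mean():.0%}")
```

From there, any per-queue stat (KDA, win rate by champion, etc.) is a groupby away.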

submitted by /u/ebscodingjourney
[link] [comments]

Scraping Google Trends Data In 2023?

The famous 429 error when mass-scraping Google Trends has me stuck. I have a list of around 30k keywords I want data on, but I don't want to wait out the timeouts.

I'm using pytrends and have tried rotating proxies, but the high traffic brings my rental costs up way too high. I tried multiprocessing with a unique Tor circuit for each keyword, but I get authentication errors from Google. Those can be sorted out by including some identity headers, which then quickly become invalid due to rate limiting.

Does anyone have a workaround or working code for this? Maybe multiple Google accounts with programmatic login, grabbing the headers from there and injecting them into pytrends requests? I'd be grateful if you could share your experiences. Thanks!
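This doesn't solve the proxy-cost problem, but for anyone hitting 429s less aggressively, the usual skeleton is batching plus exponential backoff. A sketch with a stub in place of the real pytrends call (the batch size of 5 matches pytrends' per-payload keyword limit; the delay values are guesses):

```python
import time
from itertools import islice

def batched(keywords, n):
    """Yield successive lists of up to n keywords (pytrends accepts 5 per payload)."""
    it = iter(keywords)
    while chunk := list(islice(it, n)):
        yield chunk

def fetch_with_backoff(fetch, batch, max_retries=5, base_delay=60.0, sleep=time.sleep):
    """Call fetch(batch); on a rate-limit error, wait and retry with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fetch(batch)
        except RuntimeError:  # stand-in for catching a 429 from pytrends
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up on {batch} after {max_retries} rate-limited attempts")

# Demo with a stub that fails twice before succeeding, so nothing hits Google here
calls = {"n": 0}
def stub_fetch(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429")
    return {kw: "ok" for kw in batch}

waits = []
result = fetch_with_backoff(stub_fetch, ["keyword a"], sleep=waits.append)
print(result, waits)
```

In real use, `fetch` would wrap `pytrends.build_payload(...)` plus `interest_over_time()`, and the per-proxy/per-identity rotation would sit inside it.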

submitted by /u/thefoque
[link] [comments]

Need Scientific Computing Power For Your Research? Got A Big Dataset To Iterate Over? BOINC Can Get You Teraflops Computing Power Absolutely Free!

For those unfamiliar with it, BOINC is the Berkeley Open Infrastructure for Network Computing. It is free software and a volunteer computing infrastructure focused on science, with over 15 active projects. There are teraflops of computing power available to you absolutely free. If you are working on problems that can be solved in a distributed or parallel manner, YSK about it.

The BOINC server software works with any app you have (such as a protein simulator) and can handle all the workunit creation/delivery/validation. You can run the server as a Docker container and distribute your app as a pre-compiled binary or inside a VirtualBox image to instantly work across platforms. BOINC supports not only 32- and 64-bit Windows/OS X/Linux hosts, but ARM and Android as well, plus GPU acceleration on both Nvidia and AMD cards. It's also open-source, so you can modify it to suit your use case. For small projects, you can run the BOINC server on a $10/month VPS or a spare laptop in a closet; for larger projects, the memory and storage needs will obviously scale with complexity.

Once you have your server up (or beforehand, if you need to secure a guarantee of computation before investing development resources), you can approach Science United and Gridcoin for guaranteed computation ("crunching"). Neither of these mechanisms requires you to be affiliated with a university or other institution; they just require that you are doing interesting scientific research.

Science United is a platform run by the BOINC developers which connects volunteer computing participants to BOINC projects. Once they add you to their list, thousands of volunteers around the globe will immediately start crunching data for your project, giving you many teraflops of power. Science United is particularly good for smaller projects which don't have large ongoing workloads, or which have sporadic work.

Gridcoin is a cryptocurrency (founded in 2013, not affiliated with the BOINC developers) which incentivizes people to crunch workunits for you. It currently incentivizes most active BOINC projects (with their permission) and hands out approximately $500 (USD equivalent) per month to your "crunchers". The actual value of the computation you receive is much higher than this. All of this happens without you ever needing to do anything aside from running a BOINC server. There are some requirements you must meet, such as having a large amount of work to be done (being an ongoing project), but they can direct petaflops of power your way and have a procedure to "pre-approve" your project before it's done being developed.

BOINC can also be used to harvest under-utilized compute resources on your campus or in your company. It can be installed across machines and set to compute only while a machine is idle, so it doesn't slow anything down while in use.

Famous research institutes and major universities across the world use BOINC. World Community Grid, the Large Hadron Collider, Rosetta, University of Texas, and the University of California are a handful of the big names that use BOINC for work distribution.

Relevant links:

/r/BOINC4Science

http://boinc.berkeley.edu

submitted by /u/makeasnek
[link] [comments]

[self-promo] Aviation Safety Network (ASN) Dataset

If you're looking for reliable and up-to-date information on civil aviation accidents and incidents, the Aviation Safety Network (ASN) dataset may be just what you're looking for. This global database holds information on more than 100,000 accidents and incidents that have happened since 1919. You can download the dataset as a CSV file for further analysis. The CSV file has the following columns:

Date – Date of the accident
Type – Type of aircraft
registration – Registration of the aircraft
operator – Operator of the aircraft
fatalities – Number of fatalities
location – Location of the accident
country – Country of the accident
cat – Category of the accident described by ASN
year – Year of the accident

It is available for download at the GitHub link below:
https://github.com/alsonpr/Aviation-Safety-Network-Dataset
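Once downloaded, the CSV loads straight into pandas. A minimal sketch using an inline stand-in for the file (the rows below are made-up placeholders, but the column names follow the listing above):

```python
import io
import pandas as pd

# Tiny inline stand-in for the downloaded CSV; values are illustrative only
sample = io.StringIO(
    "Date,Type,registration,operator,fatalities,location,country,cat,year\n"
    "1/1/1950,Type A,REG-1,Operator X,10,Someplace,Country Y,A1,1950\n"
    "2/2/1950,Type B,REG-2,Operator Z,0,Elsewhere,Country Y,A2,1950\n"
    "3/3/1960,Type A,REG-3,Operator X,5,Someplace,Country W,A1,1960\n"
)
df = pd.read_csv(sample)

# Example analysis: total fatalities per year
fatalities_by_year = df.groupby("year")["fatalities"].sum()
print(fatalities_by_year)
```

With the real file, replace `sample` with the path to the downloaded CSV.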

submitted by /u/woolly-mamoth
[link] [comments]

Old GAN Site – Thispersondoesnotexist

There used to be a site, thispersondoesnotexist.com, which generated artificial human face images with a GAN (originally a project at NVIDIA). That site has been replaced by another one – https://this-person-does-not-exist.com/en – which adds watermarks, etc.

Does anyone have the dataset of those AI-generated images (1024×1024 px)? I found a few on Kaggle, but they are not of the same resolution as the original images generated by the site. If you have it, can you please share links to the dataset?

submitted by /u/pythoslabs
[link] [comments]

How To Represent Large Categorical Data?

I have 10 large numerical datasets, each with 3 generic categories. Each row contains unique data, and the last column of each row holds the category label. The categories are not tied to particular rows, so any row may carry any of the 3 categories.

e.g.

Date      Value    Category
1/1/2010  1.11111  Alpha
2/1/2010  2.11111  Beta
3/1/2010  2.00009  Alpha
4/1/2010  0.00000  Charlie

But the 10 datasets have different volumes of data: e.g. dataset A may have 10K rows, dataset B around 100K, dataset C 1 million, etc.

I couldn't process all the data, as it's too large.

What would be the best way to sample each dataset? I'd like the sample to contain a fair representation of the 3 categories.
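A fair representation of the 3 categories is exactly what stratified sampling gives: sample the same fraction within each category, so the sample preserves the categories' proportions regardless of each dataset's size. A sketch with a made-up stand-in for one of the ten datasets:

```python
import pandas as pd

# Stand-in dataset; the 50/30/20 category split is arbitrary
df = pd.DataFrame({
    "value": range(100),
    "category": ["Alpha"] * 50 + ["Beta"] * 30 + ["Charlie"] * 20,
})

FRACTION = 0.1  # keep 10% of each category, preserving proportions
sample = df.groupby("category").sample(frac=FRACTION, random_state=0)
print(sample["category"].value_counts())
```

Applying the same `FRACTION` to all 10 datasets keeps them comparable; alternatively, pass `n=` instead of `frac=` for a fixed count per category if the rarest category must always be represented.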

submitted by /u/runnersgo
[link] [comments]

Cannot Find The I2R Dataset On The Internet.

I have been studying a paper and noticed that they were using video from a dataset called I2R. I tried searching for this dataset but wasn't able to find it. Does it have a different name, or is this dataset not publicly available?

Specifically, the paper mentioned the WaterSurface, Campus, Waving Trees, Fountain, Curtain, and Switch Light datasets.

I am looking for these datasets to apply a background/foreground separation algorithm.

submitted by /u/Curious_Analyst986
[link] [comments]

Daily Cash And Debt Operations Of The U.S. Treasury 2005-2023

The Daily Treasury Statement (DTS) dataset contains a series of tables showing the daily cash and debt operations of the U.S. Treasury. The data includes operating cash balance, deposits and withdrawals of cash, public debt transactions, federal tax deposits, income tax refunds issued (by check and electronic funds transfer (EFT)), short-term cash investments, and issues and redemptions of securities. All figures are rounded to the nearest million.

Source: https://fiscaldata.treasury.gov/datasets/daily-treasury-statement/deposits-and-withdrawals-of-operating-cash

Explore the data online: https://app.gigasheet.com/spreadsheet/U-S–Treasury-Daily-Cash-Debt–Oct-2005–Apr-2023-/820a1527_c8f0_4ae6_a8a6_b841d327c093
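The fiscaldata.treasury.gov site also exposes this dataset through a REST API, so you can pull it programmatically rather than downloading files. A sketch of building such a query; the endpoint path and parameter names here are assumptions based on the API's documented conventions, so check them against the dataset page before relying on them:

```python
from urllib.parse import urlencode

# Base URL of the Treasury Fiscal Data API; the DTS endpoint path below
# is an assumption and should be verified on the dataset's API page
BASE = "https://api.fiscaldata.treasury.gov/services/api/fiscal_service"
ENDPOINT = "/v1/accounting/dts/operating_cash_balance"

params = {
    "filter": "record_date:gte:2023-01-01",  # only rows from 2023 onward
    "sort": "-record_date",                  # newest first
    "page[size]": "100",                     # rows per page
}
url = BASE + ENDPOINT + "?" + urlencode(params)
print(url)
# The JSON payload could then be fetched with, e.g., requests.get(url).json()
```

Responses are paginated JSON, so large date ranges require walking the pages.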

submitted by /u/n1nja5h03s
[link] [comments]

Creating A Network Of Reddit 2013 & 2023

Hello, I am working on a project for graduate school on Reddit as a social network from 2013 to 2023. I am using a previous database of 2,500 subreddits with the top 1,000 posts from each in 2013, and I am recollecting it for 2023. For each post I have the uploader, the post score, the list of all commenters, and each commenter's collective score in that post.

Each node will be a subreddit and the ties will be based on the commenters they have in common. How should I measure this?

Option 1: Each tie is undirected and weighted by the number of commenters who have ever left comments on both subreddits.

Option 2: Each tie is undirected and weighted by the total score of all comments that those shared commenters have posted in either subreddit.

^ This one sounds more substantial, but it raises a few concerns, such as: what if Sub A is a huge subreddit and Sub B is relatively small? In Sub A the same commenter might have, say, 2K upvotes, while in Sub B they have 300 upvotes, which is more than anyone else on that sub.
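One common way to soften that size-imbalance problem is to normalize the raw overlap, e.g. with a Jaccard weight (shared commenters divided by the union of both commenter sets), which discounts ties that exist only because one sub is enormous. A minimal sketch with placeholder commenter sets:

```python
from itertools import combinations

# Placeholder commenter sets per subreddit node
commenters = {
    "SubA": {"u1", "u2", "u3", "u4", "u5", "u6"},  # the "huge" subreddit
    "SubB": {"u5", "u6", "u7"},                    # the small one
    "SubC": {"u8"},
}

edges = {}
for a, b in combinations(commenters, 2):
    shared = commenters[a] & commenters[b]
    if shared:
        union = commenters[a] | commenters[b]
        edges[(a, b)] = {
            "shared": len(shared),                # raw overlap weight
            "jaccard": len(shared) / len(union),  # size-normalized weight
        }
print(edges)
```

The score-based variant (Option 2) can be normalized the same way, e.g. by dividing the shared commenters' total score by each pair's combined comment score.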

submitted by /u/admaciaszek
[link] [comments]

[self-promo] Sales & Ads Data Benchmarks For Shopify

This Shopify Benchmarks data includes a cohort of Shopify store sales, website engagement, and advertising metrics at the store category and subcategory level. This eCommerce data is made up of aggregated sales and web analytics for thousands of Shopify stores globally. Additionally, the dataset includes stores’ total Google Ad spend on search ads, embedded display ads, and more from Google Ad Manager.

Sales and engagement metrics:

Revenue
Transaction count
Website sessions
Website page views

Advertising metrics:

Ad spending
Ad clicks
Ad views (impressions)

https://app.snowflake.com/marketplace/listing/GZTSZAS2KDH/cybersyn-inc-shopify-sales-advertising-benchmarks-by-category

Free trial available if you have a Snowflake account.

submitted by /u/aiatco2
[link] [comments]