Category: Other Nonsense & Spam

Public Datasets With Interesting Patterns In NULL/missing Data

I’m working on a project focused on missing data. Does anyone know of interesting datasets with the following criteria?

- Publicly available for download, in a tractable format
- Data arrives over time (e.g. a new batch every day/week/month, or at least new rows added from time to time)
- Some columns have missing values
- Ideally, missing values show interesting patterns of some kind (e.g. “column X is sometimes missing when column Y == A, but never when column Y == B,” or “the percentage of missing values in column Z is much higher on weekends”)

I’m willing to wade through a fair amount of EDA to find interesting patterns. Really, anything you can point me to would be helpful.
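
To make the EDA concrete, here is a minimal sketch of the kind of check the last criterion describes, using pandas. The column names and values are hypothetical stand-ins, not from any real dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical example: does missingness in column X depend on column Y?
df = pd.DataFrame({
    "Y": ["A", "A", "B", "B", "A", "B"],
    "X": [1.0, np.nan, 2.0, 3.0, np.nan, 4.0],
})

# Group the missingness indicator of X by the candidate driver column Y.
rates = df["X"].isna().groupby(df["Y"]).mean()
print(rates)  # missing rate of X per value of Y
```

The same idea extends to time: group the missingness indicator by weekday to test the “much higher on weekends” pattern.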

submitted by /u/grumpy_greybox
[link] [comments]

In Need Of: NASCAR Cup Series Dataset(s)

Hello, all. I am working on a statistical analysis of NASCAR Cup Series drivers in the modern era (1972 to present) and need data. I can currently access the information I need through a few different channels, but wanted to see whether an already-compiled database exists that would cut down the time this is taking.

The most cumbersome fields to gather are dates of birth and the number of race starts before 1972. I am also using fields like driver name, finishing position, condition of car at finish, team, manufacturer, etc., but those are simple enough to get right now.

If there is a dataset with all this information, or multiple datasets that would encompass all this, I would really appreciate being able to access them to use for this project.

Thank you all in advance for any help you can offer!

submitted by /u/tarvusdreytan
[link] [comments]

How To Choose The Right Off-the-Shelf AI Training Data Provider?

Choosing the right off-the-shelf AI training data provider can be a daunting task, especially with the large number of options available. Here are some factors to consider when selecting an AI training data provider:

- Quality: One of the most critical factors to consider is the quality of the training data. The provider should have high-quality data that accurately reflects the real-world scenarios that the AI system will encounter.
- Diversity: It is also essential to ensure that the provider offers a diverse range of data sets that cover a wide variety of scenarios. This will ensure that the AI model is trained on a comprehensive dataset that reflects the real world.
- Customizability: The provider should offer customizable data sets that allow you to select the specific data that best suits your needs.
- Data Security: The provider should have robust data security measures in place to ensure that your data remains secure and confidential.
- Scalability: The provider should be able to provide a scalable solution that can grow with your business’s needs.
- Cost: Finally, consider the cost of the data sets and ensure that it is within your budget. Be wary of providers that offer data sets at an unusually low price, as this may indicate low-quality data.

By considering these factors, you can choose the right off-the-shelf AI training data provider that will provide you with the best possible training data for your AI system.

submitted by /u/Shaip111
[link] [comments]

Is It Legal To Scrape Data From RedFin Using Selenium?

I’ve been learning web scraping recently and wanted to do a project to post on Kaggle. I’ve searched and can’t find anywhere that Redfin grants express permission to scrape their site. I wanted to scrape their rental data (the for_sale and sold data are already available in CSV files, but rentals aren’t). Can anyone link me to permission or something legal that I can include in my project? This world of scraping legality is new to me, so apologies for any ignorance on my part.

Edit: I emailed them and asked and they said they don’t allow scraping. I was under the impression that if it’s publicly available data then it’s not illegal to scrape?
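
I’m not a lawyer, and a robots.txt file is a technical signal rather than legal permission, but checking it is a common first step. Python’s standard library can parse one; the rules below are made up for illustration (check the site’s actual robots.txt, not these):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules -- read the site's real robots.txt
# before drawing any conclusions about what is allowed.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /rentals/",
])

print(rp.can_fetch("*", "https://example.com/rentals/123"))  # disallowed path
print(rp.can_fetch("*", "https://example.com/about"))        # allowed path
```

Even when robots.txt allows a path, a site’s terms of service (like the one they pointed you to by email) can still forbid scraping, so the two checks are separate questions.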

submitted by /u/bingopajamma
[link] [comments]

[REQUEST] MITRE ATT&CK Annotated Cyber Attack Trees

Interested in any cyber-incident data that links MITRE ATT&CK labels to the time of detection or to position in the attacker kill chain, such as annotated cyber incident timelines. Particularly interested in mapping progress through the kill chain to draw out the most common attack paths.

I know much of this data will be commercially sensitive or IP for incident response companies; any suggestions or direction would be greatly welcomed.
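
For anyone wondering what “drawing out common attack paths” could look like once such timelines exist, here is a small sketch. The ATT&CK technique IDs are real, but the incident timelines are invented for illustration:

```python
from collections import Counter

# Hypothetical incident timelines: ordered MITRE ATT&CK technique IDs
# observed per incident (the IDs are real techniques, the incidents are made up).
incidents = [
    ["T1566", "T1059", "T1055", "T1486"],  # phishing -> scripting -> injection -> ransomware
    ["T1566", "T1059", "T1486"],
    ["T1190", "T1059", "T1055", "T1486"],
]

# Count technique-to-technique transitions to surface the most common paths.
transitions = Counter(
    (a, b)
    for path in incidents
    for a, b in zip(path, path[1:])
)
print(transitions.most_common(3))
```

With real annotated timelines, the same bigram counting generalizes to a Markov-chain view of attacker progression through the kill chain.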

submitted by /u/swivel_chair_jockey
[link] [comments]

International Beerio Kart Championships Of The World: Power Rankings Development Help!

TL;DR: My friends and I have a stupid hobby that’s getting out of control and I need your help spiraling it further. Please help me create a fair power rankings system (using the attached spreadsheet for reference) for the Beerio Kart tournaments we host.

https://docs.google.com/spreadsheets/d/1CS5pWnmgS8wIZAvFQL4cc_jHWbTZ_khS/edit?usp=sharing&ouid=114408781303577995971&rtpof=true&sd=true

Dear members of the Statistics community,

I call humbly upon the statisticians, mathematicians, programming aficionados, excel experts, sports analysts, and power rankings enthusiasts of this great community to assist me with a vital task — creating a fair and representative power ranking formula for the International Beerio Kart Championships of the World.

A little background: my buddies and I were stuck at home over Thanksgiving of ’21 for a fourteen-day COVID quarantine. We were saddened by a missed opportunity to see our families, but with competitive spirit running through our veins and a surplus of leftover PBR from a party we threw (which was undoubtedly what gave us COVID), we found solace in roughly two weeks straight of fierce competition in the best drinking/video game pairing to ever exist: Beerio Kart. For the uninitiated: Beerio Kart is Mario Kart, except you need to finish your beer before the end of each race, and you can’t drink and drive (i.e. chug and control your character simultaneously). Our version of the game has many extra rules and sub-rules, but that’s the basic premise.

After two weeks of this, we needed an outlet to determine who was truly the best of us, and thusly the International Beerio Kart Championships of the World were born. It started with a modest eight competitors, but interest has increased steadily over the past three years, and in recent events we’ve had as many as 58 competitors fighting to compete in a 32-person bracket (surplus competitors play in Play-in Prixs for entry into the main bracket). We’ve now had 75 people play in official brackets and obtain power rankings, and close to 100 participate in the events overall. For a little context into how the tournaments are run: four competitors participate in each Grand Prix, and the top two advance from each round until the championship. In the preliminary rounds, players must drink a beer on races two and four of each Grand Prix, and in the finals all four races are drinking rounds; thusly the final four competitors must drink a minimum of 10 beers to win the tournament.

As tournaments got larger and more intricate (and people started complaining that they were seeded unfairly), we realized we needed an objective ranking system to seed players, so that the Prixs leading up to the championship were fair and quantitative. This background brings me to the hallowed undertaking I beseech your help with: please help me figure out how to do this.

We’ve tried a few formulas, but we are but amateur statisticians and none have felt like they effectively capture a player’s skill level.

First we tried the following formula: ibkc power ranking = 0.33t/60n + 0.33z/60 + 0.33y/60, where:

- 60 = the maximum number of possible points scored in any given grand prix
- t = total points accrued over all past tournaments attended
- n = total number of grands prix held in all official tournaments
- z = average points scored per prix, per tournament, in all tournaments attended
- y = average points scored per prix, per tournament, in all tournaments attended this calendar year

It was a good start, but it unfairly favored players who had played in more tournaments and wasn’t an accurate reflection of current skill level. It would be like baseball power rankings putting the Yankees at the top because they’re an ancient ball club that has won 27 World Series, even though the last time they won was 2009, or putting the Astros low down because they didn’t win their first Series until 2017, even though they’ve won twice in the past five years.

We then created a formula based on Pythagorean expectation, where a player’s skill level is calculated by averaging their (points accrued in a prix)/(points accrued in a prix + total number of possible points in a prix). Each round of a tournament was weighted more heavily than the last, and tournaments with four rounds carry more weight than tournaments with three rounds. The player’s Pythagorean expectation was then averaged over all tournaments they’ve participated in, over the last four tournaments held, and over the last two tournaments held. Their power score was then calculated by averaging these three numbers together, with the intention that more recent tournaments would be weighted more heavily than older ones. This is the formula that the attached spreadsheet uses.
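
For reference, the per-prix score described above can be sketched in a few lines. The round weights (linearly increasing) and the example scores are my own illustrative assumptions, not the exact values from the spreadsheet:

```python
# Sketch of the per-prix score described above: points / (points + max points),
# with later rounds weighted more heavily. Weights and scores are made up.
def prix_expectation(round_scores, max_points=60):
    # round_scores: points a player scored in each round of one tournament,
    # in order; later rounds get linearly increasing weight (1, 2, 3, ...).
    weights = range(1, len(round_scores) + 1)
    return sum(
        w * (pts / (pts + max_points))
        for w, pts in zip(weights, round_scores)
    ) / sum(weights)

print(prix_expectation([30, 40, 55]))  # a weighted score between 0 and 1
```

Note that because points never exceed max_points, this score is capped at 0.5 for a perfect prix, which is harmless for ranking but worth knowing when reading the numbers.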

This new formula was better than the first but has the inverse problem: it weighs recent tournaments too heavily and doesn’t account for any rank decay from missing tournaments. For example, you can see that BAT has won 6 of 8 tournaments, but after a huge upset in the semis, BAT did not make the finals of the last tournament and was booted from first place overall to third. All the while, Squirt4Boyz advanced from second place overall to first, even though Squirt4Boyz didn’t even participate in the last tournament.

There are all sorts of hidden columns and rows in this spreadsheet, so please DM me with any questions you might have. But please, I beg of you, fine and glorious proprietors of the world’s most stressful game: help me create a ranking system that makes sense. Ultimately we need a system that reflects how many points a player is expected to score; considers that player’s tournament wins, podium finishes, and finals appearances; accounts for rank decay; and, like global tennis or golf rankings, has some bias toward recent events.
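
One standard way to get both recency bias and rank decay in a single mechanism is an exponentially decayed average in which missed tournaments still carry weight. This is a sketch, not a recommendation: the decay factor, the choice to score a missed event as 0.0 (a deliberately harsh option; you could instead decay toward a league-average prior), and the example scores are all my assumptions:

```python
# Sketch: an exponentially decayed average in which a missed tournament
# counts as a score of 0.0, so rankings decay when players skip events.
# The decay factor (0.8) and the example scores are illustrative assumptions.
def power_score(scores, decay=0.8):
    # scores: per-tournament expectation in [0, 1], oldest first; None = missed.
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest weight = 1.0
    weighted = sum(w * (s if s is not None else 0.0) for w, s in zip(weights, scores))
    return weighted / sum(weights)

# Skipping the most recent event pulls a strong record down:
print(power_score([0.6, 0.7, 0.7, None]))
print(power_score([0.6, 0.7, 0.7, 0.7]))  # same record with the event played
```

Tuning decay controls the trade-off you ran into: values near 1.0 behave like your first formula (history-heavy), values near 0 behave like your second (recency-heavy).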

Thank you, friends.

Your servant,

The International Beerio Kart Championships of the World League Commissioner

submitted by /u/zakarm22
[link] [comments]

Briefly Describing What A Titty Feels Like After Touching One Only Once In Life

Soft, vibrant, has a certain warm temperature, good grip. Titty is soft but, when grabbed, has great resistance. Sense of awe highly present, somewhat like being starstruck and not being able to hold back a smile or state of excitement. Time was experienced very quickly. Hard to believe. The situation itself becomes isolated; the environment seems to be in a lower dimension. Titty is confirmed 3D. My recollection of touching both of them with two hands is too blurred, but the possibility currently lies at 51.3%. Looking forward to doing it again if the opportunity is given. Sending new query to titty dispatch.

Help Finding An Actual Research And Dataset That Uses Distributions.

I need to find research done by someone where they use a dataset and apply distributions such as the normal distribution, the t distribution, the F distribution used in ANOVA, etc., and then I need to show my understanding of it. It doesn’t have to be very complicated, as I’m just a fresher (undergrad); all I need to do is show the use of any of these distributions in real-life research. Any links or ideas about such research papers or actual real-life uses?
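
If it helps to see the mechanics before hunting for a paper, here is the kind of calculation applied studies build on, using only Python’s standard library. The height model (mean 170 cm, SD 10 cm) is a made-up illustration:

```python
from statistics import NormalDist

# Made-up example: adult heights modeled as Normal(mu=170 cm, sigma=10 cm).
heights = NormalDist(mu=170, sigma=10)

# Probability of observing someone taller than 190 cm under this model --
# the kind of tail probability that p-values in papers are built from.
p_tall = 1 - heights.cdf(190)
print(round(p_tall, 4))  # about 0.0228, i.e. roughly 2.3%
```

Almost any paper reporting a t-test, z-test, or ANOVA is doing a version of this: comparing an observed statistic against a distribution’s tail.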

Thanks in advance

submitted by /u/youredumbaflol
[link] [comments]

Best Ways To Analyze Data, Useful For NBA Stats

Hello all, just wondering: if I have a massive set of data that I want to compare or analyze for trends, is there a good way to do this through a website, or should I look for the trends manually? Another question: how could I easily spot trends or important figures within my set of data? Thanks!
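
Rather than a website, a few lines of pandas usually get you further. A common first pass for spotting trends in noisy per-game stats is a rolling average; the points-per-game numbers below are hypothetical:

```python
import pandas as pd

# Hypothetical points-per-game for one player over ten games.
ppg = pd.Series([22, 25, 19, 30, 28, 31, 27, 33, 29, 35])

# A rolling mean smooths game-to-game noise so the underlying trend stands out.
trend = ppg.rolling(window=3).mean()
print(trend.tolist())  # first two entries are NaN until the window fills
```

From there, `df.describe()`, `df.corr()`, and group-by aggregations cover most “spot the important figures” questions without any manual scanning.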

submitted by /u/floppy11
[link] [comments]

Mountain Goats Are Goats Who Ascended To 5D

They have escaped the goat matrix. I think this is very important to know for all who have nothing left to lose.

There are also mountain GOATs (greatest of all time). These are usually mountain Buddhas located on the peak of a mountain who practice transcendence.

Magic: The Gathering Dashboard | Check The API / Dataset Behind It | Feedback Welcome

Hi everyone,

I am fairly new to this, having been learning Python since December 2022 and coming from a non-tech background. I took part in the DataTalksClub Zoomcamp and started using the tools in this project in January 2023.

Project link: GitHub repo for Magic: The Gathering

Project background:

- I used to play Magic: The Gathering a lot back in the 90s
- I wanted to understand the game from a meta perspective and tried to answer questions that I was interested in

Technologies used:

- Infrastructure via Terraform, with GCP as the cloud
- Read card data from the Scryfall API
- Push it to my storage bucket
- Push the needed data points to BigQuery
- Transform the data there with dbt
- Visualize the final dataset with Looker
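
For readers curious about the extraction step, here is a minimal sketch of flattening a Scryfall card object down to the fields a BigQuery table might want. The sample dict mirrors fields I believe the Scryfall card API returns, but verify against the official Scryfall docs before relying on them:

```python
# Sketch of the extraction step: keep only the needed fields from a
# Scryfall card object before loading downstream. Field names should be
# checked against the Scryfall API documentation.
def extract_card(card):
    return {
        "name": card.get("name"),
        "set": card.get("set"),
        "rarity": card.get("rarity"),
        "cmc": card.get("cmc"),
        "type_line": card.get("type_line"),
    }

sample = {
    "name": "Lightning Bolt",
    "set": "lea",
    "rarity": "common",
    "cmc": 1.0,
    "type_line": "Instant",
    "oracle_text": "Lightning Bolt deals 3 damage to any target.",
}
print(extract_card(sample))
```

Dropping unneeded fields (like oracle_text here) before loading keeps the BigQuery schema stable even when the upstream API adds attributes.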

I am somewhat proud to have finished this, as I never would have thought I could learn all this. I put a lot of long evenings, early mornings and weekends into it. In the future I plan to do more projects and apply for a Data Engineering or Analytics Engineering position – preferably at my current company.

Please feel free to leave constructive feedback on code, visualization or any other part of the project.

Thanks 🧙🏼‍♂️ 🔮

submitted by /u/binchentso
[link] [comments]

What Are The Essential SQL Skills For Senior Business Analysts?

Hello everyone,

I am currently pursuing a career as a Senior Business Analyst, and I know that having a strong understanding of SQL is essential for this role. However, there are so many aspects of SQL to learn, and I’m not sure where to focus my attention.

I would like to know from those who work as Senior Business Analysts, or those who have experience working with them, what are the best aspects of SQL to learn for this position? Which SQL skills do you use the most in your day-to-day work, and which ones have been the most valuable for you?

I appreciate any insights or advice you can offer, and I look forward to learning from your experiences. Thank you!

submitted by /u/LampRunner
[link] [comments]

Looking For A Dataset To Train A Chatbot For Recommending Solutions To Java Application Log Errors

Hello everyone,

I am currently working on creating a chatbot that can recommend solutions to log errors that occur in Java applications. To do this, I need a dataset that contains examples of log errors along with their corresponding solutions. I am hoping to find a dataset that is large enough to train a machine learning model to accurately suggest solutions based on the log error message.
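
Before training anything, a regex baseline that maps the exception class in a log line to canned advice can both validate the idea and help label a dataset. The exception names below are real Java classes, but the advice strings are placeholders of my own, not authoritative fixes:

```python
import re

# Baseline before any ML: map the exception class found in a log line to
# canned advice. Exception names are real Java classes; the advice strings
# are illustrative placeholders.
ADVICE = {
    "java.lang.NullPointerException": "Check for uninitialized objects before use.",
    "java.lang.OutOfMemoryError": "Inspect heap settings (-Xmx) and look for leaks.",
}

EXCEPTION_RE = re.compile(r"\b((?:java|javax)\.[\w.]+(?:Exception|Error))\b")

def suggest(log_line):
    match = EXCEPTION_RE.search(log_line)
    if match:
        return ADVICE.get(match.group(1))
    return None

line = 'Exception in thread "main" java.lang.NullPointerException at App.main(App.java:12)'
print(suggest(line))
```

A baseline like this also gives you an automatic labeler: run it over raw logs to bootstrap the (error, solution) pairs a trained model would need.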

If anyone knows of a dataset that would be helpful for this project or has any suggestions on where to find one, I would greatly appreciate it. Any information or assistance would be extremely valuable to me.

Thank you for your time and consideration.

submitted by /u/Farjou69
[link] [comments]

How To Treat Features Of Different Types

Hello there, I have a medical dataset in which some features are numeric while others are categorical. By “categorical” I mean that these features are natively encoded with ordinal integer encoding, so that every possible value is represented as an incremental integer. It is important for you to know that this dataset was obtained as part of a survey, so each categorical value refers to a type of answer such as “never”, “sometimes”, “a lot of the time” and so on. I have to apply an MLP to this kind of data, and I know that in order to do so I first need to scale it. The question is: should I scale all features without regard to the categorical ones, or should I apply one-hot encoding to the categorical columns and standardize only the numerical variables? Or can I leave the categorical columns as they are?
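
The usual answer for an MLP is to standardize numeric columns and one-hot encode categorical ones, since standardizing an ordinal code would impose distances the answers don’t really have. A minimal sketch with scikit-learn’s ColumnTransformer, using toy stand-in data (not your survey):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the survey data: column 0 is numeric (e.g. age),
# column 1 is an ordinal-coded answer (0 = "never", 1 = "sometimes", 2 = "a lot").
X = np.array([[63, 0],
              [45, 2],
              [51, 1],
              [70, 0]], dtype=float)

# Standardize the numeric column; one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
])
X_t = preprocess.fit_transform(X)
print(X_t.shape)  # 1 scaled column + 3 one-hot columns
```

If the answers have a genuine order and roughly even spacing, keeping the ordinal code (scaled) is a defensible alternative; one-hot is simply the safer default when that spacing assumption is doubtful.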

submitted by /u/NathanDrake27
[link] [comments]