Category: Other Nonsense & Spam

Find All Utility And Public Works Buildings For Three States?

Finding all utility and public works addresses in three states?

How might I go about finding the locations above? Is there a big dataset out there? I attempted using OpenStreetMap with BigQuery, but I can’t say whether I wrote the query correctly. I also tried a place query with the Esri geocoder, city by city for each of the states, but that was a disaster. I have 6 years of GIS experience and am semi-proficient in Python and other coding languages.
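One route that might work is the Overpass API instead of BigQuery. Below is a minimal sketch; the state name ("Ohio") and the tag filters (man_made=water_works, power=substation) are assumptions about how such facilities are tagged in OSM, so treat it as a starting point rather than a complete inventory:

    import requests

    # Overpass QL: find water works and substations inside one state.
    # "Ohio" and the tag filters are illustrative assumptions.
    query = """
    [out:json][timeout:180];
    area["name"="Ohio"]["admin_level"="4"]->.state;
    (
      nwr["man_made"="water_works"](area.state);
      nwr["power"="substation"](area.state);
    );
    out center;
    """

    resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query})
    resp.raise_for_status()
    for el in resp.json()["elements"]:
        tags = el.get("tags", {})
        # Nodes carry lat/lon directly; ways/relations get a "center" from `out center`.
        loc = el.get("center") or {"lat": el.get("lat"), "lon": el.get("lon")}
        print(tags.get("name", "(unnamed)"), loc)

Public-works facilities are tagged inconsistently in OSM, so expect to iterate on the filters and repeat per state.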

submitted by /u/Different_Camp4002

WebScraping Specific Zip Code Data From Zillow

Hello, I have a data science project I’m interested in doing. I want to scrape housing data from the Zillow website within a 15-mile radius of a potential career location. I don’t have much experience with web scraping, but I know I need to use Selenium (an automated browser) and Python’s Beautiful Soup library to execute this part of my project. Does anyone have experience scraping Zillow’s website specifically? Any advice or YouTube videos to help me get started?

P.S. I was informed to check whether Zillow has an API. I checked, and it looks like the best I’ll be able to get from an API is through RapidAPI: 40 records of data per GET request, with a one-month limit of 20 GET requests (800 records).
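For reference, here is the generic Selenium + Beautiful Soup pattern as I understand it. The URL and the CSS selector are placeholders, since the real markup has to be discovered by inspecting the page (and heavy scraping may violate the site’s terms of service):

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)

    # Placeholder search URL; real query parameters depend on the site.
    driver.get("https://www.zillow.com/homes/Some-City/")
    html = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    # "article.property-card" is a hypothetical selector; inspect the page for the real one.
    for card in soup.select("article.property-card"):
        print(card.get_text(" ", strip=True))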

submitted by /u/juangui37

CleanVision: Audit Your Image Datasets For Better Computer Vision

To all my computer vision friends working on real-world applications with messy image data, I just open-sourced a Python library you may find useful!

CleanVision audits any image dataset to automatically detect common issues such as images that are blurry, under/over-exposed, oddly sized, or near duplicates of others. It’s just 3 lines of code to discover what issues lurk in your data before you dive into modeling, and CleanVision can be used for any image dataset — regardless of whether your task is image generation, classification, segmentation, object detection, etc.

    from cleanvision.imagelab import Imagelab

    imagelab = Imagelab(data_path="path_to_dataset")
    imagelab.find_issues()
    imagelab.report()

As leaders like Andrew Ng and OpenAI have repeatedly emphasized: models can only be as good as the data they are trained on. Before diving into modeling, quickly run your images through CleanVision to make sure they are OK; it’s super easy!

GitHub: https://github.com/cleanlab/cleanvision

Disclaimer: I am affiliated with Cleanlab.

submitted by /u/jonas__m

Scrape Thousands Of Records Of Housing Data Using Python [Self-Promotion]

Hey r/datasets,

I originally posted this library earlier this week, but it got downvoted once within 10 minutes and was never heard from again. And I get it: this is a place for posting and requesting datasets.

So, here’s an actual dataset of CA housing data I generated using the RedfinScraper library. Scraping these 47,000 records took just over 3 minutes.

While this data may be useful today, the fact is, it will only be useful for about a week longer. The high-velocity nature of housing data means that datasets need to be updated frequently.

This issue was the driving force for sharing this library publicly: to allow users to quickly scrape the latest housing data at their leisure.

I hope you find this library useful, and I am excited to see what you create with it.

submitted by /u/ryan_s007

How Should Features With A Strong Correlation Be Treated?

Hey guys,

I am having some difficulty cleaning my dataset. Originally I had 80 features; after applying some ML rules (for example, removing features that contain only null values), I am now down to 67 features.

I decided to apply correlation analysis, and I have many feature pairs with a strong correlation (above +0.9 or below −0.9). I saw that I can remove features with a strong correlation.

But I could not find whether there is a rule about which of these features I should remove.

For example, if I have features A, B, C, and D, and A×B and A×C have a strong correlation, should I remove A? Or B and C?
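For concreteness, the common heuristic I found looks like this minimal pandas sketch (keep the upper triangle of the absolute correlation matrix and drop one feature from each highly correlated pair); I am unsure whether this is the right rule:

    import numpy as np
    import pandas as pd

    def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
        # Absolute correlations; keep only the upper triangle so each pair is seen once.
        corr = df.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        # Drop any column strongly correlated with an earlier (kept) column.
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return df.drop(columns=to_drop)

On the A, B, C example above, this keeps A and drops B and C; but keeping B and C and dropping A might preserve more information, which is exactly my doubt.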

If someone could kindly give some help or point me to some documents about it, I would be more than glad.

Thank you.

submitted by /u/No_Bee_9081

Nuclear Fusion Dataset, Is It Too Experimental/Secret?

So, I’m almost done with my data science course. The final project, which is supposed to take a month, will be due May 4th.

I was wondering: where does an average Joe like me get his hands on some nuclear fusion datasets? I have no clue what I’d be doing with it, but I think nuclear fusion is fascinating, and if I can do something with it, why not?

I’ve tried Google, Kaggle, and Hugging Face, but couldn’t find much.

I know everything is in development right now; it’s cutting-edge technology, pushing the boundary of our knowledge. And now I’m wondering: would those datasets be considered top secret?

Well, anyway, thanks for reading and for any help you can provide.

submitted by /u/RngdZ

Magic: The Gathering Deck Lists Scraped From MtgTop8

Magic: the Gathering deck dataset

I scraped deck lists from a competitive deck sharing platform called MtgTop8 for a project I’m working on.

Decks are separated by format into the following:

– standard
– modern
– pioneer
– historic
– explorer
– pauper
– legacy
– vintage

They’re stored as Apache Feather files, which can easily be converted to either pickle or CSV files.
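For example, converting with pandas looks roughly like this (the filename is a placeholder):

    import pandas as pd

    df = pd.read_feather("standard.feather")  # placeholder filename; requires pyarrow
    df.to_csv("standard.csv", index=False)
    df.to_pickle("standard.pkl")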

Feel free to use them for whatever purpose.

Here’s the link

submitted by /u/ArmyOfCorgis

GIS Data For A Project. I Apologize For The Banality Of My Request And For My English.

Hi all, I’m new to the community and also new to the world of data.

In a postgraduate course, I was assigned an exercise with the QGIS software: represent a specific data model on a map. The goal is to give us practice, and the topic is free.

Where can I get open data suitable for QGIS? I apologize for the banality of my request and for my English.

Thank you all 🥲

submitted by /u/Scarraf1

Online Sales As % Of Total Sales, By Category And By Year

Hello Reddit! I’m working on a project for an economics class, and one of the pieces I’m missing is a dataset of online sales as a percentage of total retail sales. Ideally these would be sorted by year and by industry category (I’m imagining some sort of histogram). It sounds simple, but it has been deceptively hard to find. Geographical distribution is unimportant. Does anyone have any idea where I could look, how I could phrase my search more effectively, or how I could build something like this myself?

submitted by /u/ciofs

4682 Episodes Of The Alex Jones Show (15875 Hours) Transcribed [self-promotion?]

I’ve spent a few months running OpenAI Whisper on the available episodes of The Alex Jones Show, and I was pointed to this subreddit by u/UglyChihuahua. I used the medium English model, as that’s all I had GPU memory for, but switched to whisper.cpp and the large model when the medium model got confused.
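For anyone who wants to build something similar, the core of the openai-whisper Python package is just a few lines, roughly like this (the audio filename is a placeholder):

    import whisper

    # "medium.en" is the English-only medium model mentioned above.
    model = whisper.load_model("medium.en")
    result = model.transcribe("episode.mp3")  # placeholder audio file

    # Each segment carries start/end timestamps plus the transcribed text.
    for seg in result["segments"]:
        print(f'{seg["start"]:8.1f}s  {seg["text"]}')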

It’s about 1.2GB of text with timestamps.

I’ve added all the transcripts to a GitHub repository, and also created a simple website with search, simple stats, and links into the relevant audio clips.

submitted by /u/fudgie

Literature Review – How To Filter Out Redundant Search Results From Similar Search Iterations?

Hey all, I’ve got sort of an unusual research question. Basically, I’d like to perform a comprehensive review of all the literature on a particular topic. To do this, I’d like to use combinations of search terms: for example, I’d conduct a search using terms “A” and “B”, then another search using terms “A” and “C”, then again using “A” and “D”, and so on. The problem is that there is a decent amount of overlap among the results of these different combinations, and there are thousands of results for each combination, so I want to minimize redundancy as much as possible to save time. Is there a way to conduct an initial search (e.g., A + B) and then have each subsequent search (A + C, A + D, etc.) show only results that are NOT included in the initial A + B search?

I’m using Ovid MEDLINE as the search database, but I’d be open to any general workaround as well. From my limited knowledge, one possible solution I imagined is to export all the search results, paste them as a list into a column in Excel, and then use the Excel feature that highlights duplicate values. This would let me avoid redundant results from each search iteration. It isn’t an elegant solution, imo; the most ideal solution would be for the database to filter out redundant search results for me automatically.
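A rough alternative I imagined to the Excel route is a minimal pandas sketch that drops records already seen in earlier searches. It assumes each search was exported as a CSV with a unique identifier column; the filenames and the "PMID" column name are assumptions about what the export contains:

    import pandas as pd

    # Exports from successive searches (placeholder filenames).
    batches = ["search_A_B.csv", "search_A_C.csv", "search_A_D.csv"]

    seen = set()
    for path in batches:
        df = pd.read_csv(path)
        # Keep only records whose identifier has not appeared in an earlier batch.
        new = df[~df["PMID"].astype(str).isin(seen)]
        seen.update(new["PMID"].astype(str))
        new.to_csv(path.replace(".csv", "_deduped.csv"), index=False)
        print(f"{path}: {len(new)} new of {len(df)} records")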

I can explain or clarify the problem further if that’s helpful. Thank you for any help or suggestions with this problem!!

submitted by /u/pantaloonsss

Any Publicly Available Flawed Datasets?

Hey guys,

Is there any dataset with flaws (missing/corrupted values) that is publicly available?

I need to do data cleansing, deal with outliers, and be able to apply visualization techniques.

To further the analysis, I will need to pass it through data mining algorithms.
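If nothing ready-made turns up, I suppose I could manufacture the flaws myself. A minimal, purely illustrative sketch that injects missing values and outliers into any clean numeric dataset:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Stand-in for a clean numeric dataset; substitute any real one.
    df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("ABCDE"))

    # Blank out ~5% of all cells at random positions.
    df = df.mask(rng.random(df.shape) < 0.05)

    # Push a handful of values in column "A" far outside the normal range.
    rows = rng.choice(df.index, size=10, replace=False)
    df.loc[rows, "A"] = df["A"].mean() + 10 * df["A"].std()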

Thanks in advance.

submitted by /u/Chuchu123DOTexe

Large Dataset Of Mixed Frequency Economic Variables

I am working on a nowcasting application for US macroeconomic indicators. I could assemble my own set of variables from FRED, but I am hoping someone knows of an existing dataset (ideally of FRED indicators) used in the literature that I could start from, mainly because variable selection is more easily defensible when it has been used elsewhere. I have yet to find much in the way of mixed-frequency panels, as the literature in this field is much smaller.

I am aware of FRED-MD and FRED-QD, but these are obviously not mixed frequency, which is the whole point here. My ideal would be a dataset spanning daily, weekly, monthly, and quarterly variables across a wide cross-section of macro topics.
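In the meantime, here is a minimal sketch of how I can pull a small mixed-frequency panel straight from FRED with pandas-datareader; the series codes are common examples at each frequency, not a curated nowcasting set:

    import pandas_datareader.data as web

    # Illustrative FRED series at different native frequencies:
    # DGS10 (daily 10-year Treasury yield), ICSA (weekly initial claims),
    # INDPRO (monthly industrial production), GDPC1 (quarterly real GDP).
    series = ["DGS10", "ICSA", "INDPRO", "GDPC1"]

    panel = web.DataReader(series, "fred", start="2010-01-01")
    print(panel.tail())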

submitted by /u/thehallmarkcard