Online Sales As % Of Total Sales, By Category And By Year

Hello reddit! I’m working on a project for an economics class, and one of the pieces I’m missing is a dataset of online sales as a percentage of total retail sales. Ideally these would be sorted by year and by industry category (I’m imagining some sort of histogram). It sounds simple, but it’s been deceptively hard to find. Geographical distribution is unimportant. Does anyone have any idea of where I could look, how I could phrase my search more effectively, or how I could build something like this myself?

submitted by /u/ciofs

4682 Episodes Of The Alex Jones Show (15875 Hours) Transcribed [self-promotion?]

I’ve spent a few months running OpenAI Whisper on the available episodes of The Alex Jones Show, and was pointed to this subreddit by u/UglyChihuahua. I used the medium English model, as that’s all I had GPU memory for, but switched to Whisper.cpp and the large model when the medium model got confused.

It’s about 1.2GB of text with timestamps.

I’ve added all the transcripts to a GitHub repository, and also created a simple website with search, simple stats, and links to the relevant audio clips.

submitted by /u/fudgie

Literature Review – How To Filter Out Redundant Search Results From Similar Search Iterations?

Hey all, I’ve got sort of an unusual research question. Basically, I’d like to perform a comprehensive review of all the literature on a particular topic. To do this, I’d like to use combinations of search terms. For example, I’d conduct a search using terms “A” and “B”, then another using terms “A” and “C”, then again using “A” and “D”, etc. The problem is that there’s a decent amount of overlap among the results of these different combinations, and there are thousands of results for each combination, so I want to minimize redundancy as much as possible to save time. Is there a way to conduct an initial search (e.g., A + B) and then have each subsequent search (A + C, A + D, etc.) show only results that are NOT included in the initial A + B search?

I’m using OVID Medline as the search database, but I’d be open to any general workaround as well. From my limited knowledge, one possible workaround would be to export all the search results, copy them as a list into a column in Excel, and then use the Excel function that highlights duplicate values. This would let me skip redundant results from each search iteration. It isn’t an elegant solution imo. The ideal solution would be for the database to filter out redundant search results for me automatically.
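The export-and-deduplicate idea above can also be done outside Excel. Here is a minimal Python sketch, assuming each exported record carries some unique identifier; the field name `id`, the function name `new_results`, and the sample records are all hypothetical, and the actual export format of the database is not assumed.

```python
def new_results(seen_ids, results, key="id"):
    """Return only records whose identifier has not been seen before.

    `seen_ids` is updated in place, so later calls also exclude
    everything returned by earlier calls.
    """
    fresh = []
    for record in results:
        rid = record[key].strip()
        if rid not in seen_ids:
            seen_ids.add(rid)
            fresh.append(record)
    return fresh


# Hypothetical exports from three searches (A+B, A+C, A+D).
search_ab = [{"id": "101", "title": "Paper one"},
             {"id": "102", "title": "Paper two"}]
search_ac = [{"id": "102", "title": "Paper two"},
             {"id": "103", "title": "Paper three"}]
search_ad = [{"id": "101", "title": "Paper one"},
             {"id": "104", "title": "Paper four"}]

seen = set()
first = new_results(seen, search_ab)   # everything is new
second = new_results(seen, search_ac)  # drops the overlap with A+B
third = new_results(seen, search_ad)   # drops the overlap with A+B and A+C

print([r["id"] for r in second])
print([r["id"] for r in third])
```

With real exports, the lists could be read from CSV files with `csv.DictReader`, as long as the identifier column is consistent across searches.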

I can explain or clarify the problem further if that’s helpful. Thank you for any help or suggestions with this problem!!

submitted by /u/pantaloonsss

Any Publicly Available Flawed Datasets?

Hey guys,

Is there any dataset with flaws (missing/corrupted values) that is publicly available?

I need to do data cleansing, deal with outliers, and apply visualization techniques.

To further the analysis, I will need to pass it through data mining algorithms.

Thanks in advance.

submitted by /u/Chuchu123DOTexe

Large Dataset Of Mixed Frequency Economic Variables

I am working on a nowcasting application for US macroeconomic indicators. I could build my own set of variables from FRED, but I am hoping someone is aware of an existing dataset (ideally of FRED indicators) used in the literature that I could start from, mainly because the variable selection is more easily defensible when it’s been used elsewhere. I have yet to find much in the way of mixed-frequency panels, as the literature in this field is much smaller.

I am aware of FRED-MD and FRED-QD, but these are obviously not mixed frequency, which is the purpose here. Ideally, I’d have a dataset spanning daily, weekly, monthly, and quarterly variables across a wide cross-section of macro topics.

submitted by /u/thehallmarkcard

Help, Does Anyone Know If A Site Like This Exists?

Hello everyone, I don’t know if this is the right sub to ask, but I’ll try.

I’m searching for a site that collects statistics on the usage of certain themes in media (cinema, literature, TV shows, etc.) over time. Does anyone know if such a thing exists? I searched the internet, but I couldn’t find it.

submitted by /u/dadenelo

Ideas For Flight Delay Analysis (Specifically For Airlines)

I’m working on a project for a class where we have to come up with unique ways to analyze and then visualize data. The data set my team has chosen is a giant list of flights in the US across a five-year period. We each have our own areas/elements of the data that we are supposed to explore, and then we come back together with ideas of what we could potentially do for an analysis/visualization.

My assignment is to focus on the airlines. I’m looking for ideas or suggestions on any kind of interesting features/patterns/relationships that I could try to explore between the airlines themselves, or between airlines and other factors in the data (delays, origin/destination, cancellations, diversions, etc.). I have a few ideas of my own but would love to get some from others.

submitted by /u/diyage

Finding Large NBA Dataset For School Project

I am struggling to find a file larger than 3 MB to scrape through in any of these formats: .txt, .csv, .tsv, .html, .xml, .xlsx. I would like the file to have every player’s stats for each game through the current season, but any large file would be great!

Does anyone have any advice on where to look? BBReference has great info, but it does not have the larger files that I am looking for.

submitted by /u/Organic-Prune7965

Need Help, EBB! Dataset. Source/How To Get?

Hello, I am reading some papers on synthetic bokeh for a project, and several of them mention an EBB! dataset that I cannot obtain in full (I am only getting 200 training files out of 5k), even after registering for an official competition hosted a couple of years ago on CodaLab, which is apparently also the only link on EBB!’s Papers with Code page. Does anyone know of a source for it? It would be really helpful, thanks.

https://paperswithcode.com/dataset/ebb

submitted by /u/abx05

Music Charts, AT40, Country, Latin, British Charts

I am looking for a complete/comprehensive list of songs and their positions on the American Top 40 (or 100), Country, Latin, R&B, and British music charts since inception. I’m looking for weekly rankings showing current position, previous week’s position, song title, and artist. Does anyone know where I can find this, if it is available?

submitted by /u/TexasBound1973

Auto Claims Dataset [Insurance Dataset]

I am looking for a text + columnar dataset related to auto insurance claims; an ideal dataset would have customer data and insurance claim data. For the insurance data, the lifecycle would run from the first notice of loss made by the customer to the insurance company paying out or rejecting the claim.

It need not be real, a synthetic dataset would also do.

submitted by /u/willing-Stres

Mobile Vs Desktop/Laptop Internet Traffic? [Looking For A Dataset]

Hi all,

I’m looking for a dataset that details mobile vs desktop (or laptop) internet traffic. This can be global or specific to a country (global would be best, but I’m being a bit of a beggar with this, so anything would do).

I’d like to use it to try and do some sort of time-series forecasting.

If anyone knows where I could find a dataset like that, I’d massively appreciate it!

submitted by /u/EddieDemo

Databoutique.com, A Marketplace For Web Data

Hi 👋 all! We’re building a marketplace for web data (https://www.databoutique.com).

If you need web data for training models or app development, you can ask the community for it. The goal is to save time and cut down on scraping costs.

The basic idea is that most of the time you’ll need data that someone is already scraping, so it’s faster and easier to ask for it than to do the scrape again yourself.

We’re in an early phase, so any feedback is welcome. We hope this helps lower the barriers to data.

submitted by /u/Pigik83

Dealing With Missing Standard Deviation Due To Only 1 Observation

Hi,

I have the following problem: I need the standard deviation as part of my regression, so I restrict the data to at least 3 observations per category for a specific year. However, I also want to include the data with only 1 or 2 observations, but of course for 1 there isn’t a standard deviation, and for 2 it is kind of pointless. The standard deviation is only a control variable, but it is vital for the result.

Does anyone know how I could handle that so I can still include these years for the categories with only 1 or 2 observations and not ruin my regression?
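One option sometimes used for a control variable like this is a missing-indicator setup: keep the thin groups, fill their standard deviation with a constant, and add a 0/1 flag so the regression can absorb the difference. The Python sketch below only illustrates the bookkeeping; the function name `group_sd_with_flag`, the fill value of 0.0, and the sample rows are assumptions, and this approach can bias estimates, so it is not presented as the right answer for this regression.

```python
from collections import defaultdict
from statistics import stdev


def group_sd_with_flag(rows, min_obs=3):
    """Map (category, year) to (sd, missing_flag).

    Groups with at least `min_obs` observations get their sample
    standard deviation and flag 0. Smaller groups get a placeholder
    sd of 0.0 and flag 1, to be used as an extra regressor.
    """
    groups = defaultdict(list)
    for category, year, value in rows:
        groups[(category, year)].append(value)

    out = {}
    for key, values in groups.items():
        if len(values) >= min_obs:
            out[key] = (stdev(values), 0)
        else:
            out[key] = (0.0, 1)  # placeholder value, flagged as missing
    return out


# Hypothetical observations: (category, year, value).
rows = [
    ("A", 2020, 1.0),
    ("A", 2020, 2.0),
    ("A", 2020, 3.0),
    ("B", 2020, 5.0),  # only one observation for this group
]

result = group_sd_with_flag(rows)
for key, (sd, flag) in sorted(result.items()):
    print(key, sd, flag)
```

Both columns would then enter the regression together, so the coefficient on the flag picks up the groups where the standard deviation could not be computed.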

submitted by /u/Basilis988

Chinese Outward Foreign Direct Investment Data

Hi!

My first post here asked whether anyone knew how to access the Chinese OFDI data sorted by country that some researchers frequently seem to use. I can now finally share where exactly the data comes from, and hopefully save another poor soul from spending hours looking for it.

Unsurprisingly, you have to search in Chinese to find the data:

XX 年度中国对外直接投资统计公报 (XX for the year you are searching)

sample: http://images.mofcom.gov.cn/hzs/201810/20181029160118046.pdf

Hope this helps and have fun!

submitted by /u/BlueApple12