Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

How Does One Create A Dataset For An LLM AI Based Off Specific Content From A Website?

I’ve started playing around with custom AI models because I was bored and it looked fun from things I’ve seen on YouTube. I’ve created characters, tested different models, and had loads of fun learning and playing. But now I want to “fine tune” the local model I’m using on specific data for it to pull from.

The overall goal is to have this chatbot assist me in writing wiki articles and events for an online roleplay thing. I want it to have access to all 7,567 articles that the community has already created, so it can pull information from them and enhance my writing and suggestions with canon responses.

How... how would I do that? As in, how do I get the data and put it in a format that could be used for fine tuning? The YouTube tutorials I’ve seen generally focus on “reverse engineering” Midjourney prompts or medical questions.
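One possible shape for the fine-tuning data, as a rough sketch rather than a recipe: many fine-tuning tools accept JSON Lines, with one training example per line. The field names `prompt` and `completion`, the helper `articles_to_jsonl`, and the sample articles are all assumptions made for illustration; the exact format should be checked against the documentation of whichever trainer is used.

```python
import json


def articles_to_jsonl(articles, max_chars=2000):
    """Turn (title, body) pairs into JSON Lines, one training example per line.

    The "prompt"/"completion" field names are an assumption; rename them to
    whatever the fine-tuning tool in use expects.
    """
    lines = []
    for title, body in articles:
        text = " ".join(body.split())  # collapse whitespace left over from scraping
        if not text:
            continue  # skip empty or stub articles
        record = {
            "prompt": f"Write the wiki article titled: {title}",
            "completion": text[:max_chars],  # crude character cap, not token-aware
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical sample data standing in for the scraped wiki pages.
sample = [
    ("Example Town", "Example Town is a   settlement\nin the north."),
    ("Empty Stub", "   "),
]
jsonl_text = articles_to_jsonl(sample)
print(jsonl_text)
```

Getting the article text out of the wiki in the first place is a separate step (an export feature or a scraper, depending on the wiki software) and is not covered by this sketch.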

submitted by /u/Jakob4800

How Can I Get Data From StatCan With Characteristics?

I usually work with data that is already on the server, so I have no idea how to get it from StatCan. I read their website, but perhaps because I’m not a dev, I still have no clue at all.

For instance, I want the report on persistence and graduation of doctoral degree students within Canada, by student characteristics (including sex, age, marital status, father’s/mother’s occupation, scholarship, funding, location, household income, etc.) for a given period.

Where can I get all the tables I need? I would prefer flat CSV files.

I downloaded files from the website, but the data is not the same as what I got from Kaggle.

TIA!

submitted by /u/Whatswrongwithman

IMDB Vs TMDB – Advice For Recommender System

I’m building a film recommendation system. I have a large CSV file with film data scraped from the IMDB dataset, which I plan to use to build the machine learning model. At the same time, I’m using the TheMovieDB API to get some extra film details, like the plot summary.

I’m using around 300,000 films from IMDB, and some records are missing certain data, like editor, cinematographer, etc. I’m not sure how much more data each dataset has on a film compared to the other.

Would it be better to consistently use the TMDB API to display film data on the frontend and only use IMDB to build the ML model, or to use the IMDB CSV throughout my system, both for the model and for displaying film details? Alternatively, I could cross-reference both sources, but I’m wary of conflicting data between the two datasets.

Any advice is appreciated.
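For the cross-referencing option, a minimal sketch of what merging could look like, assuming both records can be matched on the same IMDb id and that all values are strings (as they would be when read from a CSV). The helper `merge_film`, the field names, and the sample values are hypothetical. It fills blanks from the other source and lists the fields where the two sources disagree, so conflicts can be inspected rather than silently overwritten.

```python
def merge_film(imdb_row, tmdb_row, prefer="imdb"):
    """Combine two records for the same film.

    Returns (merged, conflicts). Blank fields are filled from the other source;
    fields where both sources hold different non-blank values are listed in
    `conflicts` and resolved in favour of `prefer`.
    """
    if prefer == "imdb":
        primary, secondary = imdb_row, tmdb_row
    else:
        primary, secondary = tmdb_row, imdb_row
    merged, conflicts = {}, []
    for key in sorted(set(primary) | set(secondary)):
        a = (primary.get(key) or "").strip()
        b = (secondary.get(key) or "").strip()
        if a and b and a.lower() != b.lower():
            conflicts.append(key)  # both sources have a value and they differ
        merged[key] = a or b  # preferred source wins, otherwise fall back
    return merged, conflicts


# Hypothetical rows for the same film (all values are made up).
imdb = {"imdb_id": "tt0000001", "title": "Example Film", "editor": "", "year": "1999"}
tmdb = {"imdb_id": "tt0000001", "title": "Example Film", "editor": "A. Editor",
        "year": "2000", "overview": "A made-up plot summary."}
merged, conflicts = merge_film(imdb, tmdb)
print(merged)
print(conflicts)
```

Counting how often each field shows up in `conflicts` across the whole catalogue would give a rough measure of how much the two sources actually disagree.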

submitted by /u/wobowizard

In Need Of A Dataset That Has Over 1000 Rows

I’m currently doing a school project that requires a dataset with over 1000 rows that I can download into Google Sheets. I’m on a Mac, so I was wondering if anyone could reply with links to datasets that can also be opened on this device. Thanks!

submitted by /u/formithica

Need Help To Access The IAM Handwriting Dataset

I need help with the IAM handwriting dataset, as I cannot access it from anywhere. I don’t even have an account from which I could download it remotely.

Can anybody please provide me with a working link to that dataset (gdrive, mega, anything)? If you have ever downloaded it and have it in your drive, can you please share it?

This is the link to the dataset: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database

submitted by /u/Chiragrvijay

Will A Tool Like This Help You In Visualisation?

We are working on Mokkup.ai:

https://www.mokkup.ai/

It is a dashboard wireframing tool that helps create high fidelity wireframes in minutes, even for people with no design acumen.

We are targeting data analysts, PMs, developers, HR, other business teams, and stakeholders. It’s super simple to use, with drag and drop elements and 150+ pre-built templates spanning industries and several custom use cases.

Will a tool like this help you create a dashboard to translate your ideas before moving on to working with real data sets?

I’d love to hear your reviews and thoughts about what we are creating! This year we have geared up to do some mad business, so your every insight and comment would be incredibly valuable. Thank you!

submitted by /u/Hamburgerleader

Looking For Multivariate Data For Assignment About Microbiology?

Hi everyone, for my doctoral training, I am following a multivariate statistics course. For the exam we need to complete an assignment in which we analyse a multivariate dataset of our choice using different methods (such as PCA, discriminant analysis, factor analysis, biplot, cluster analysis, ...).

Do you have recommendations for interesting data sets to analyse that are available online? It would be cool if it could be about microbiology (or bacteriophage research), since this is what my doctoral research is about.

Many thanks and happy new year!

submitted by /u/Subject-Extent5978

Is There A Dataset Of Fake/fraudulent/pseudoscientific Illnesses And Medical Conditions?

There’s a system that allows users to add their medical conditions from a list. I found that there are some non-existent conditions in the list, things like autistic enterocolitis.

I need a list of conditions that have been claimed to exist but are not recognised by mainstream medicine, so I can make a script to detect the overlap.

Does such a list exist?
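Assuming such a reference list can be found, the overlap check itself could be sketched as below. The helpers `normalize` and `find_overlap` and the sample lists are hypothetical; only "autistic enterocolitis" comes from the post. Names are normalized first so that differences in case, punctuation, or spacing do not hide a match.

```python
import re


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def find_overlap(system_conditions, unrecognised_conditions):
    """Return entries from the system list whose normalized name is in the reference list."""
    reference = {normalize(c) for c in unrecognised_conditions}
    return [c for c in system_conditions if normalize(c) in reference]


# Hypothetical lists for illustration.
system_list = ["Asthma", "Autistic  Enterocolitis", "Type 2 diabetes"]
reference_list = ["autistic enterocolitis"]
flagged = find_overlap(system_list, reference_list)
print(flagged)
```

Exact matching after normalization will miss synonyms and misspellings; fuzzy matching would be a further step and is not shown here.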

submitted by /u/Defiant-Snow8782

TSA – Time Series Analysis Decomposition

I am studying a daily dataset from a game which contains information about the peak number of players from 2013 to 2023. This is my first attempt at applying decomposition to see how the time series components behave. Later I would like to do some forecasting with a few models; I have applied the ADF test, and it indicated that the series is stationary. I have some questions about which value would fit best for the period parameter with daily data.

This is the series over the whole time:
https://i.stack.imgur.com/jlsSm.png

Additive model:
https://i.stack.imgur.com/jlsSm.png

Multiplicative model:
https://i.stack.imgur.com/3MAoB.png

Based on different types of time series data such as annual, monthly and daily, how should the choice of period be made?
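As a rough illustration of one way to approach the question: the period is the number of observations per seasonal cycle, so for daily data the usual candidates are 7 (weekly) and about 365 (yearly). A quick check is to compare the autocorrelation of the series at each candidate lag. The sketch below uses a synthetic series, not the data from the post, and the helpers `autocorr` and `rank_periods` and the candidate list are assumptions for illustration.

```python
import math


def autocorr(series, lag):
    """Sample autocorrelation of `series` at a positive `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def rank_periods(series, candidates=(7, 30, 365)):
    """Sort candidate periods by autocorrelation at that lag, strongest first."""
    scored = [(p, autocorr(series, p)) for p in candidates if p < len(series)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Synthetic daily series with a pure weekly cycle, three years long.
days = 3 * 365
demo = [100 + 10 * math.sin(2 * math.pi * d / 7) for d in range(days)]
ranking = rank_periods(demo)
print(ranking)
```

On real data a trend can inflate the autocorrelation at every lag, so it may help to detrend or difference the series before comparing candidates.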

submitted by /u/Dota_curious

Dataset For A Healthcare Triage System

Hey everyone! I had a personal side project idea and am looking for feedback on whether there’s a dataset that people might know of that could be applied to this project:

I’m looking to build an AI model that has the capability of telling a user which type of healthcare facility they should go to, depending on their symptoms.

More specifically, I was planning to have the user input information relating to certain factors that would be used as model features, such as:

Age

Gender

Symptoms

Underlying conditions

The model would then tell the user which type of healthcare facility is most suitable for them out of these options:

Hospital

Pediatrics

Clinic

Pharmacy

Long-Term Care (for older people)

Specialized Care (For non-emergency situations that require invasive procedures)
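The inputs and outputs listed above could be captured as one labelled record per case. The sketch below shows one possible schema; the class names, the `to_row` helper, and the sample record are hypothetical, and a real dataset would need labels assigned by qualified people.

```python
from dataclasses import dataclass, field
from enum import Enum


class Facility(Enum):
    HOSPITAL = "Hospital"
    PEDIATRICS = "Pediatrics"
    CLINIC = "Clinic"
    PHARMACY = "Pharmacy"
    LONG_TERM_CARE = "Long-Term Care"
    SPECIALIZED_CARE = "Specialized Care"


@dataclass
class TriageExample:
    age: int
    gender: str
    label: Facility  # target class the model would learn to predict
    symptoms: list = field(default_factory=list)
    underlying_conditions: list = field(default_factory=list)


def to_row(example):
    """Flatten one example into a dict suitable for csv.DictWriter."""
    return {
        "age": example.age,
        "gender": example.gender,
        "symptoms": ";".join(sorted(example.symptoms)),
        "underlying_conditions": ";".join(sorted(example.underlying_conditions)),
        "label": example.label.value,
    }


# Made-up record for illustration only.
row = to_row(TriageExample(age=6, gender="F", label=Facility.PEDIATRICS,
                           symptoms=["fever", "cough"]))
print(row)
```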

I’ve begun looking for datasets that have this type of information, but haven’t found any usable ones so far.

Does anyone know if there are possible datasets available that I could use to train this type of model? Would I have to create my own dataset? Or is there no available dataset?

submitted by /u/CharacterAlbatross16

[Part-Synthetic] “Generative AI For Math: Part I — MathPile: A Billion-Token-Scale Pretraining Corpus For Math”

Paper: https://arxiv.org/abs/2312.17120

Datasets: https://huggingface.co/datasets/GAIR/MathPile

Code: https://github.com/GAIR-NLP/MathPile/

Project page: https://gair-nlp.github.io/MathPile/

Abstract:

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of “less is more”, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field.

submitted by /u/APaperADay