Better Way Of Preparing Datasets For Finetuning With Large Text In Each Example???

Better way to prepare datasets ?

I have my datasets in format :

text : length 19k

extracted entity 1 : list of entity 1 extracted

extracted entity 2 : list of entity 2 extracted

Does anyone have idea on how to finetune opensource model with this kind of data .

Is finetuning better option becuase the model(llm) have to learn to extract items from the text and length of text is so large ?

Example : I have train a llm model to look at whole book text and extract author name, place name, people name Now I have 100 of books data how can I proeare datsets to fine-tune llm to be very good at extracting also consider I have supervised data of book text with extracted author, people name place name from whole text……
How can I finetune a good model let me know

submitted by /u/Guilty-Tea6607
[link] [comments]

0

Where To Find Data For A Regression Analysis In R?

I would like to run a regression analysis and am looking for data for crop yield, temperature, rainfall.

I am interested in soybeans.

submitted by /u/Consistent-Ganache88
[link] [comments]

0

Looking For Historical Dataset(s) On Monthly Gas And Electricity Prices And Caps In Bristol (or UK Regions)

Hi, I am doing a University Project, and part of it is creating a bill prediction service and I am having a really tough time finding good sources for what I need. I’m focusing on Bristol at the moment to help with initial development, but if there are datasets based on regions that would work too.

I need the average monthly cost for electricity and gas (separately) in Bristol dating back to at least 2018/19 to 2024, ideally with upper and lower values. I also need the price cap data (unit rates, standing charges) for those periods of time, which typically change every three months and have been posted by Ofgem, however I cannot seem to find any sources for previous years – only the current year.

I’d really appreciate any help, as I’ve said, I am really struggling to find valid datasets.

submitted by /u/JamesCompSci
[link] [comments]

0

Historic Pollen Count For The Carolinas?

Hello, I’m trying to find historic daily pollen count for full year 2023 and YTD 2024 for North and South Carolina. I think Pollen.com only goes back 30 days, so would love to know if anyone has a promising lead. Thanks!

submitted by /u/Revolutionary_Data93
[link] [comments]

0

Help Finding Bottle Deposit Information (USA) Database

Hello! I’m trying to add bottle deposit data to my e-commerce store.

Does anyone know where I can find Bottle deposit information? I’d prefer something with UPC values to cross-reference against my products.

submitted by /u/omegal0l420
[link] [comments]

0

California Median Income (and Possibly Other Economic Characteristics) By Zip Code Tabulation Shapefile

Title. Doing a school project in R Studio and want to have it set up where I can plot a gradient to see higher vs. lower income areas at a glance.

submitted by /u/BonelessHS
[link] [comments]

0

Looking For Trained Model Weights…

Has anyone trained Swin Transformer model in the research paper – “When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection”. If so plz share weights of trained model or any other suggestion to train faster.

Link to paper : https://arxiv.org/abs/2202.11911

submitted by /u/Quiet-Ad-3909
[link] [comments]

0

Looking For Twitter (X) Dataset Containing Real And Fake(impersonating Someone Else’s Profile) Accounts

I am looking for a dataset that has various features and a label identifying a real(an original account that may or may not be verified) and fake (an account impersonating someone else that has a twitter account), so far I have only been able to find the ones that identify bot accounts. Thanks

submitted by /u/TamunoImim
[link] [comments]

0

Seeking Data For Analyzing Niches And Growth Trends In The Data Analytics Industry

Hey Everyone!

I’m currently working on a project focused on analyzing different niches and growth trends within the data analytics industry. My objective is to gain insights into emerging trends, market opportunities, and career prospects within various niche segments of the data analytics field.

I’m reaching out to this community to seek assistance in gathering relevant datasets for my analysis. Specifically, I’m looking for datasets that include information on:

Market size and growth rates of different niche segments within the data analytics industry. Job demand, postings, and salary trends for various data analytics roles. Emerging technologies, tools, and applications in specialized areas of data analytics. Industry reports, research studies, or surveys providing insights into niche markets and trends.

I’m open to suggestions and recommendations for reliable sources or datasets that could contribute to my analysis. Any publicly available datasets, research reports, or academic publications related to the data analytics industry would be greatly appreciated.

Your assistance in finding suitable datasets for this project would be invaluable to my research efforts. Thank you in advance for your help and contributions.

submitted by /u/vizwha
[link] [comments]

0

Help With CRM Datasets For A Data Engineering Project

Hi everyone!

Where can i find a really messy CRM dataset? I have been told that I’ll be working with CRM data in about a month, so looking for similar datasets to practice on.

submitted by /u/Wide_Action8979
[link] [comments]

0

Looking For Datasets On Environmental Health

My project partner and I would like to analyze the association between air pollution, floods, and other environmental concerns and health outcomes like respiratory diseases, prenatal health, premature birth, etc. I’ve been looking for datasets for this specific aim but haven’t found one. There are multiple studies on this topic, but I can’t seem to access the datasets.

submitted by /u/Introvertedwin
[link] [comments]

0

MySQL Error While Importing Data (importing Csv But Getting Error)

I am trying to import a csv file to my mysql localhost server, but this error is coming:
Unhandled exception: ‘charmap’ codec can’t decode byte 0x8d in position 4887: character maps to <undefined>
I’ll link the csv file too, please do try to import it, if you are successful then PLEASE HELPPPPPP!
link: https://drive.google.com/file/d/16s54EfGnKFeedkD0Z-JItt_piqKPA370/view?usp=sharing

submitted by /u/Swat_Sam2
[link] [comments]

0

Looking For A Mortality And EHR/health Condition Dataset

Hello, this is a bit of a specific request. I’m wondering if anyone has any suggestions for finding a dataset with patient data including all (or at least most) medical conditions the patient had and their birth/death dates.

I’m having difficulty finding this looking around on various databases.

submitted by /u/my_cat_is_a_fed
[link] [comments]

0

Help I Need Datasets For My Stats Class!

I am stuck on part 2! I am unable to find similar datasets with at least 100 values. I’m hoping for 150 to 200. Please Help! I can do parts 3 & 4 I just can’t find that data! at this point, I don’t care what the data pertains to!

The Project (Outlined Below)

The project must be submitted as a single PDF file after completing the following tasks.

Only one group member needs to submit the report. Note that if you submit a DOCX

or XLSX file for your project, you will receive a score of 0.

Download two data sets you can compare containing similarly quantifiable information (such as stock prices, economic indicators, sports analytics, and weather forecasts) that have at least 100 data values each. If you downloaded a .csv file, save it as a .xlsx file. You can find data sets on dataset search. research. google.com, data.gov, or simply Googling “public data sets”.

Set up the file with two data sets of equal size (at least 100 data values each).

Create a frequency distribution table and frequency polygons of both data sets.

Use the minimum value in the data set as your lowest class limit.

Compute the mean, median, variance, standard deviation, coefficient of variation

of each dataset.

submitted by /u/Amokittenss
[link] [comments]

0

Are There Any Romanized Sanskrit Corpora Out There?

I’m looking for some sort of Sanskrit corpus where the words are in a Romanized script for a research paper. Does anyone have any suggestions? I’ve already found this: https://sanskritdocuments.org/dict/dictall.pdf, but I want to know if there are any other corpora out there that are a bit more credible/reputable.

submitted by /u/Erotic-Career-7342
[link] [comments]

0

Help With Data Analysis Project (mysql Online Server Help)

I have to create a power BI project with a data which should be present in MySQL online hosted server But the problem is that the data which i have is 2 tables with 130k rows each (csv files), and i made a mysql server on freemysqlhosting.net but there are 2 problems, firstly it has a 5mb limit for the database Secondly each row takes about 4 seconds to upload And on this speed i think itll take 6 days to just upload 1 table

Is there any other way to do this? Maybe something like, i could make the database in the local mysql server with the tables which doesn’t take much time and then i could maybe set up this server to be accessible to publoc somehow Please help🥲

submitted by /u/Swat_Sam2
[link] [comments]

0

Datasets Or Pre-trained Models For Banner Ad / Marketing Text Classification?

I am trying to find good datasets for classifying web images as ads, so that I can use it to train an image classification model for filtering out ads and only downloading useful image content from websites. I would also be interested in sets for classifying marketing/ad text to help with filtering out ad captions as well. I’m suspecting that there might be issues with copyright that are preventing people from releasing ad sets publicly, but I’m hoping that something is out there.
I found this dataset on PapersWithCode, and several sets that use old banner ads from the 90s/early 2000s, but I am wondering if there are any other publicly available web ad datasets with more recent data.
Does anyone have suggestions on good quality public datasets or preexisting classification models for ad detection?

submitted by /u/jferments
[link] [comments]

0

Any Tips On Healthy Lymph Node WSI Image Dataset?

Hi all, hope you guys are doing well!

I am doing a project on lymphoma detection using WSI images of lymph node tissues. I am a bit stuck as I cannot find any control dataset for this project. I am looking for a dataset which contains WSI images of healthy Lymph node tissues which can help me in the classification model.

Please leave any tips or suggestions that can be helpful

submitted by /u/gauravvvvvv
[link] [comments]

0

Dataset For Non-Technical Loss Of Eletrical Energy

What the title says. I would like to get some data about non-technical loss of Eletrical Energy for work with machine learning for my masters program.

submitted by /u/TioMir
[link] [comments]

0

How To Download Dataset From Baidu Cloud If Not Having Account

Hey, i wanna reproduce some paper and it requires this dataset:
https://pan.baidu.com/s/1rnUoDm7IxxmX1n1LmtXNXw#list/path=%2F,
can anyone help how to download it? It requires baidu account which requires chinese phone number to authenticate.

submitted by /u/zelenadinja0111
[link] [comments]

0

Bringing Home Your Very First Data Product

submitted by /u/growth_man
[link] [comments]

0

Need Help Finding Insurance Claim Dataset

Hi everyone,

I’m working on a project to create a dashboard for visualizing and analyzing insurance claims processing efficiency, and I’m in search of a suitable dataset to fuel this endeavor.

I’m aiming to develop a comprehensive dashboard that tracks metrics such as claims cycle time, processing costs, and customer satisfaction scores. To achieve this, I need a dataset containing diverse information including individual insurance claims data, policyholder demographics, adjuster reports, customer feedback, and operational performance metrics.

Does anyone know where I can find such a dataset or recommend reliable sources for insurance claims processing data?
Any suggestions or leads would be greatly appreciated! Thank you in advance for your help.”

submitted by /u/No_Track9088
[link] [comments]

0

Does Commons Crawl Include Youtube Metadata?

Does anyone know if Commons Crawl include Youtube videos metadata?
If yes, which metadata does it include? Subtitles?

submitted by /u/panqueca_frita
[link] [comments]

0

Dataset That Lists The Showrunners Of Each Season Of A TV Series

What the title says. IMDb doesn’t list showrunners.

submitted by /u/EmonMusk
[link] [comments]

0

Dataset For Books Published By Genre Over Time

Hello, hoping to identify a dataset that shows the number of books published by year by genre (e.g., 100K fantasy books published in 2018 vs 90K in 2017), or another proxy for popularity (e.g., sales). Particularly indexed on the (1) Fantasy and (2) Romance genres.

I have tried a few angles:

Library Datasets – Seattle Public Library reports checkouts by year by title, however this seems to be the exception and other major libraries do not report this same data ISBNDB – Based on ‘database’ page, it does not appear to include genre in the dataset (closest is Dewey decimal for select rows)

Fine with leveraging a paid database / report to improve approachability of the dataset.

Thank you for any guidance you can provide.

submitted by /u/Acrobatic_Scheme4448
[link] [comments]

0

Diabetic Friendly/healthy Meal Recipes Dataset?

Hi I’m looking for datasets to train an LLM model. Hopefully someone could recommend a dataset with healthy/diabetic friendly meal recipes so I could make a chatbot to recommend meals

submitted by /u/jrvbwr34bhcmdl
[link] [comments]

0

How Do I Publish A Dataset Made Up Off Posts From Facebook?

I have a large dataset with facebook post that I would like to make public for educational purposes. How would I go about doing that and is there any legal issues?

I am based in the EU.

submitted by /u/fjender
[link] [comments]

0

Need Dataset For Drug Addiction Detection

I’m currently planning on starting a project to detect or classify drug addicts based on the way they talk or text. Is there any dataset that contains the texts of drug addicts?

submitted by /u/karthic2811
[link] [comments]

0

Looking For A National Budget Dataset

Hi everybody, I am writing a paper about the effects of politics on military spending and found a website with an amazing excel spreadsheet that had each country and data from the 1940s to present. It had various tabs with GDP, national budget, military spending, etc. I used it for my data sets in STATA, but found it on a library computer and forgot to save the link or write down the website and now am looking everywhere for it to cite in my bibliography and cannot find it. If anyone knows what spreadsheet I’m talking about or could help me find it I would be extremely grateful!

submitted by /u/Responsible_Ear_279
[link] [comments]

0

Where Can I Find BTC/USD Daily Dataset With Features That Are Essential For Predicting Close Price?

If you have any ideas or have a dataset like this please help me

submitted by /u/YigitTheResearcher
[link] [comments]

0

Category: Datatards