Domain-tagged/specific Text Generation Datasets For Language Models

I want to investigate parameter-efficient fine-tuning (PEFT) methods (LoRA, bottleneck adapters, etc.) for generative LLMs across different domains. I started reading the PEFT literature to find established benchmarks for my project and saw people using datasets like SQuAD, the E2E dataset, and XSum. Although these datasets span multiple domains, the individual samples are not tagged with their domain, and I need that information for my project. I could treat one dataset as one domain, but the datasets I found usually do not cover a single specific domain; they mix samples from many domains. To summarize, I need datasets that

require a generative model (e.g. question answering with open answers, not multiple-choice)

cover a specific domain (sports, medicine, science, law, etc.) or contain this information as a feature for every sample
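For the second requirement, a dataset that exposes a per-sample domain feature could be split into per-domain subsets, one per fine-tuning run. Here is a minimal sketch of that grouping step, assuming each record carries a `domain` field alongside an open-ended question/answer pair (the field name and the sample records are hypothetical, not taken from any particular dataset):

```python
from collections import defaultdict

# Hypothetical records with a per-sample "domain" tag and an
# open-answer QA pair (the generative setting described above).
samples = [
    {"domain": "medicine", "question": "What enzyme does aspirin inhibit?",
     "answer": "Cyclooxygenase."},
    {"domain": "law", "question": "What is habeas corpus?",
     "answer": "A writ requiring a detainee to be brought before a court."},
    {"domain": "medicine", "question": "What does MRI stand for?",
     "answer": "Magnetic resonance imaging."},
]

def split_by_domain(records, key="domain"):
    """Group records into per-domain subsets, e.g. for separate PEFT runs."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)
    return dict(buckets)

per_domain = split_by_domain(samples)
# per_domain maps each domain tag to its list of samples,
# e.g. keys "medicine" and "law" for the records above.
```

The same grouping would apply whether the domain tag is an explicit dataset feature or something you assign yourself (e.g. by treating each single-domain dataset as one domain).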

submitted by /u/beanswithoutjeans
