Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Need To Find A Twitter Dataset That Has Random Tweets Over Several Years

Hello,

I’m conducting an independent research project surveying the prevalence of hate speech on Twitter over the years, to see whether the ratio of hate to non-hate tweets has increased over time and whether any rise is correlated with other long-term trends (such as the popularity of Twitter or the political climate). For that, I need several years of Twitter data so I can link longitudinal trends to hate speech data over time. In addition, I would want a randomized sample so the study has less chance of bias or error.

So, are there any publicly accessible Twitter datasets that have data over several years without any content filters? And if not, what should I do to get longitudinal data for this study?

submitted by /u/GeoZ17

I Made An Olympic Games API (json) With Real Time Data!

Hey everyone, I built an Olympics API with all the games, medals, countries, and sports that updates in real-time. In addition to the data, it also provides images of the sports (pictograms) and the flags of the countries.

If you can, I’d appreciate some feedback later:

Documentation
https://docs.apis.codante.io/olympic-games-english

Endpoints
Medals and Countries
Games with Results
Sports (with pictograms)

Repo
https://github.com/codante-io/api-service

Thanks!

submitted by /u/robertotc12345

Python Code Prompts Requesting Building Neural Networks

Hi guys!
I’m writing an academic paper on Filter Functions in LLMs.
For evaluation purposes I need to check for the ability to filter out certain code libraries, and I think the best way to do this would be to get a dataset with code requests (“hey can you write a program that does X?”), specifically requests for neural nets with pytorch/tensorflow.

To be clear: I do not need to train any model on these, just to run them through the LLM with and without the filter.

Example – “Hey can you build a neural network that classifies semantics of tweets?”
I don’t need anything too complicated

I’ve searched standard datasets on huggingface/google but haven’t found any with enough samples.
Any ideas?
Any help would be much appreciated and I’d love to answer any questions about the research itself.

Thanks!

submitted by /u/AltivoTheHorseX

UI For Data Enrichment With LLMs + Search

I built a system to enrich datasets, since I found myself doing this a lot with LLMs connected to search. ChatGPT can’t do it yet as it doesn’t ‘loop’. The functionality is basic, but it works well. You can upload a CSV, provide instructions in natural language, preview results for the top X rows, process the task for the full dataset, and download the results as a CSV.

Example tasks I have done:

Check if information seems to be valid based on top few search results and return a boolean
Write a description of a company using LLM (+ optionally search results)
Re-assign categories based on LLM
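
The upload, preview and full-run flow described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual system: `enrich_csv`, `enrich_row` and `fake_categorize` are hypothetical names, and the LLM + search step is replaced by a plain Python function so the sketch runs without an API key.

```python
import csv
import io
from itertools import islice


def enrich_csv(csv_text, enrich_row, new_column, limit=None):
    """Apply enrich_row to each row and return the result as CSV text.

    enrich_row is a hypothetical callable (for example a wrapper around an
    LLM call plus search) that takes a row dict and returns a string.
    limit restricts processing to the first rows, which gives the preview.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [new_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in islice(reader, limit):
        row[new_column] = enrich_row(row)
        writer.writerow(row)
    return out.getvalue()


# Stand-in for the LLM step (an assumption, not a real model call).
def fake_categorize(row):
    return "software" if "cloud" in row["description"].lower() else "other"
```

Calling `enrich_csv(data, fake_categorize, "category", limit=5)` would give the preview for the first five rows, and dropping `limit` processes the whole file.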

Is this of interest to anyone? Comment if so and I’ll put it online and send you a private link. Currently it uses my OpenAI API key so I would need to modify to BYOK or add billing, which I won’t bother with unless there’s interest.

submitted by /u/oacoleshill

[Request] Looking For Datasets That Compile Public-facing Statements And/or Posts Made By Politicians

I’m looking to do sentiment analysis for a project and am hoping to find a large compilation of public statements by politicians, preferably containing American and English politicians or parties. Ideal conditions would be Bay Area (CA) local, Manhattan (NY) local and London local politicians, but a by-party or full uncategorized set might do fine as well.

submitted by /u/hexahedron17

Seeking Efficient Method To Identify Websites In Europe Offering Monthly Subscription Plans

I’ve been working on a project using Python to compile a list of websites based in Europe that offer monthly subscription plans. Here’s my current approach:

1. Data Collection: I pulled data from the Common Crawl API for URLs from May 2024. This resulted in approximately 3 billion records. I started processing them in batches of 30,000 records.
2. Location Filtering: For each batch of 30,000 records (I’ve only done 3 batches so far), I used a free geo-location API to filter URLs by country based on their IP addresses, starting with the UK. This filtering narrowed it down to about 6,000 URLs per batch.
3. Subscription Plan Filtering: I have another script that filters these URLs based on the presence of keywords in the URL (such as “subscription,” “pricing,” “monthly,” “yearly,” etc.). I realize this step might not be the most efficient, as adding more filters increases the processing time. However, it has returned some websites that match the keywords.
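
One possible way to reduce the cost of steps 2 and 3 is to run the cheap keyword check first and deduplicate by hostname, so that only sites that already look like pricing pages are sent to the geo-location API. A minimal sketch, assuming the input is a list of plain URL strings; the keyword list and both function names are illustrative and not part of any Common Crawl tooling:

```python
from urllib.parse import urlparse

# Illustrative keyword list (an assumption); extend it for other languages.
SUBSCRIPTION_KEYWORDS = ("subscription", "subscribe", "pricing", "plans", "monthly", "yearly")


def looks_like_subscription_url(url: str) -> bool:
    """Return True if the URL path or query contains any keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.path + "?" + parsed.query
    return any(keyword in haystack for keyword in SUBSCRIPTION_KEYWORDS)


def candidate_hosts(urls):
    """Collect the distinct hostnames whose URLs pass the keyword check.

    Deduplicating by host means each site needs only one geo-location lookup.
    """
    hosts = set()
    for url in urls:
        if looks_like_subscription_url(url):
            host = urlparse(url).hostname
            if host:
                hosts.add(host)
    return sorted(hosts)
```

This only reorders the existing filters, so it does not fix the low hit rate by itself. A keyword in the URL still does not prove the site sells a monthly plan.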

So far, I’ve filtered around 90,000 URLs but found only one site matching my criteria. Most of the URLs in the results are either outdated websites or do not offer a subscription plan.

This method is proving inefficient, as it involves processing a vast number of irrelevant URLs.

My Question: Is there a smarter way to approach finding websites that specifically offer monthly subscription plans? Are there more efficient tools or APIs available that can directly provide this information, or any datasets that could help narrow down the search more effectively?

I’m open to using paid services if they can provide a more targeted and scalable solution. Any advice or recommendations would be greatly appreciated. Thanks in advance for your support!

submitted by /u/Mrpackage123

Historic DC Rental Datasets For Data Science Project

I’m doing a project that requires historic rental datasets, specifically ones focused on Washington DC. I’m making a program to compare current rental prices to historic prices for buildings at a given address. Any pointers to a relevant set of data would be greatly appreciated.

submitted by /u/ziggyguy22

How Do You Count The Occurrences Of Unknown Words?

Hey everyone! I don’t know if this is the right sub but I hope you can help me!

I need a platform that allows me to do the following: I must send several surveys to several clients and, in turn, my clients’ clients must respond to those surveys. They will respond with a few words (a maximum of four words or 30 characters), and I want to turn the results into some kind of graph. Google Sheets was the first thing that came to mind. I have also thought of a word cloud, or perhaps a list with the most repeated words at the top. I also want the tool to merge repeated words within the answers into a single result. For example, if I ask “Who is your favorite soccer player?” and one person answers “Lionel Messi” and another answers only “Messi”, I want a single result, “Messi”, with a count of 2 (not two different results, one with the full name and another with only the last name). The thing is, I don’t know what people will reply. They might name a player from 1990 or a very young player who is playing well right now, so there are millions of possible players and millions of ways of writing their names.
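
The merging rule in the Messi example could be sketched like this: count each answer, then fold a longer answer into a shorter one whenever it contains all of the shorter answer’s words. This is only one possible approach under stated assumptions. The function names are made up for illustration, labels come out in lowercase, and it handles neither misspellings nor the case where nobody gives the short form.

```python
from collections import Counter


def normalize(answer: str) -> tuple:
    """Lowercase an answer and split it into words."""
    return tuple(answer.lower().split())


def count_grouped(answers):
    """Count answers, folding longer answers into a shorter one they contain.

    "Lionel Messi" is counted under "messi" when someone also answered "Messi".
    """
    raw_counts = Counter(normalize(a) for a in answers if a.strip())
    grouped = Counter()
    # Visit shorter answers first so they become the group labels.
    for words, n in sorted(raw_counts.items(), key=lambda item: (len(item[0]), item[0])):
        label = next((g for g in grouped if set(g) <= set(words)), None)
        grouped[label if label is not None else words] += n
    return Counter({" ".join(label): n for label, n in grouped.items()})
```

`count_grouped(answers).most_common()` then gives the list with the most repeated answers at the top, which could feed either a ranked list or a word cloud.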

I had thought about word clouds, but the tools I found online don’t merge repeated words, so now I’m thinking a list of results might be better if the first option doesn’t exist. I would also like respondents to be taken to this results panel once they have answered the survey (which is just a single question), so they can see the result and what everyone else is answering. For this, I thought Google Sheets or a similar tool would be a good idea. I need them to be able to respond several times by re-entering the same link (if the survey is a Google Sheets one this can be done easily). I found www.mentimeter.com, but it cannot group similar words. However, it is the one I liked the most because of its simplicity and how well it works on a phone, which is very important in my case.

submitted by /u/JohnnyBeGood88

Annual Consumer Bankruptcy Data Needed By State

I need household bankruptcy data by state. It could be raw numbers or broken down by chapter of filing. I’m doing a project comparing consumer bankruptcies across the states in the USA and can’t find anywhere that provides a dataset of either percentages or raw bankruptcy numbers. Does anyone have any suggestions? Thanks

submitted by /u/Weak_End_2925