Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Help Me With This: I’m New To Coding

Using data from the Excel file and coding in Python, you should now estimate the following: for each ETF, estimate the sensitivity of ETF flows to past returns. a. Write down the main regression specification, and estimate at least five regression models based on it (e.g., by varying the number of lags). Then, present the regression output for one ETF of choice, including coefficients with t-stats, R-squared, and number of observations.
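A minimal sketch of one possible specification, flow_t = a + b_1 r_{t-1} + ... + b_K r_{t-K} + e_t, estimated by OLS with conventional (homoskedastic) standard errors. The layout of the Excel file is not shown in the post, so the data below is synthetic, and the names `ols_lagged`, `flows` and `returns` are assumptions made for illustration only.

```python
import numpy as np

def ols_lagged(flows, returns, n_lags):
    """Regress flows[t] on returns[t-1], ..., returns[t-n_lags] by OLS."""
    flows = np.asarray(flows, dtype=float)
    returns = np.asarray(returns, dtype=float)
    T = len(flows)
    y = flows[n_lags:]
    # The k-th entry of `lags` holds the return lagged by k periods.
    lags = [returns[n_lags - k : T - k] for k in range(1, n_lags + 1)]
    X = np.column_stack([np.ones(T - n_lags)] + lags)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n_obs, n_params = X.shape
    # Conventional OLS covariance: sigma^2 * (X'X)^-1.
    sigma2 = resid @ resid / (n_obs - n_params)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return {"beta": beta, "t": t_stats, "r2": r2, "n_obs": n_obs}

# Synthetic monthly data (assumption): flows respond to last month's return.
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.05, size=120)
noise = rng.normal(0.0, 0.01, size=120)
flows = np.empty(120)
flows[0] = noise[0]
flows[1:] = 0.5 * returns[:-1] + noise[1:]

# Five models, varying the number of lags from 1 to 5.
results = {k: ols_lagged(flows, returns, k) for k in range(1, 6)}
for k, res in results.items():
    print(k, np.round(res["beta"], 3), np.round(res["t"], 2),
          round(res["r2"], 3), res["n_obs"])
```

Each model loses one observation per added lag, which is why the number of observations should be reported alongside the coefficients. Running the same function once per ETF and keeping `beta[1:]` gives the betas to save for later steps.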

a. Estimate the OLS regression from (2a) for each ETF and save the betas. Then, conduct cluster analysis using k-means clustering with different variables, but for a start, try these two dimensions: i. Flow-performance sensitivity (i.e., betas from point (2)) vs fund size (AUM). ii. Propose at least one other dimension, and perform the cluster analysis again. What did you learn? iii. Now, instead of clustering, analyse fund types, and see whether flow-performance sensitivity varies by fund type.
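The clustering step could be sketched as below, using plain Lloyd's algorithm written in NumPy rather than a library implementation. The betas and AUM values are synthetic stand-ins (an assumption, since the real data is not in the post), and the `kmeans` helper is hypothetical. Both dimensions are standardized first so that AUM, which is on a much larger scale, does not dominate the distance.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Synthetic example (assumption): 20 small, flow-sensitive funds
# and 20 large, insensitive funds.
rng = np.random.default_rng(1)
betas = np.concatenate([rng.normal(0.8, 0.05, 20), rng.normal(0.1, 0.05, 20)])
log_aum = np.concatenate([rng.normal(18.0, 0.3, 20), rng.normal(23.0, 0.3, 20)])
features = np.column_stack([betas, log_aum])

# Standardize each column to zero mean and unit variance.
z = (features - features.mean(axis=0)) / features.std(axis=0)
labels, centroids = kmeans(z, k=2)
print(labels)
```

Swapping in another column (for example an expense ratio or fund age, if the data has one) only requires changing how `features` is built.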

DM me so that I can send you the cleaned-up data.

submitted by /u/Spiritual_Key_2204

Access IEA World Energy Outlook 2024 Extended Data Set

Hi everyone,

Any ideas on how I could get access to the IEA’s World Energy Outlook 2024 extended data set (without paying 23k€)? I am doing research on storage solutions and need their data on pumped hydro, batteries (behind the meter and utility scale), and others, for their NZE, STEPS and APS scenarios. Thanks for the help!

submitted by /u/Vulgar_Eros

Sample Bank Account Data For Compliance

I am looking for official sample bank account data for compliance testing. I looked at the FDIC and the Office of the Comptroller of the Currency and saw lots of regulations, which is great, but no sample data I could use. This doesn’t have to be great data, just realistic enough that scenarios can be run.

I know that if you’re working with a bank you will get this data. However, it would be nice to run some sample data before I approach a bank so I can test things out.

submitted by /u/Proper-Store3239

Need Help Gathering Data For Bot Detection Models

Hi! I am trying to build an ML model to detect Reddit bots (I know many people have attempted and failed, but I still want to try). I have already gathered quite a bit of data about bot accounts. However, I don’t have much data about human accounts.

Could you please send me a private message if you are a real user? I would like to include your account data in the training of the model.

Thanks in advance!

submitted by /u/SheepherderOk3463

[Dataset] Countries & Cities With Arabic Translations And Population — CSV, Excel, JSON, SQL

Hi everyone,

I’m sharing a dataset I built while working on a recent project where I needed a list of countries and cities with accurate Arabic translations and population data.

I checked out several GitHub repositories but found most were:

  • Incomplete or had incorrect translations
  • Missing population info
  • Not consistently formatted
  • Labeled incorrectly — many included states but called them cities

So I decided to gather and clean the data myself using trusted sources like Wikidata, and I’m making it publicly available in case it helps others too.

What’s included:

  • Countries
  • Cities
  • Arabic and English names
  • Population data (where available)

Available formats:

  • CSV
  • Excel (.xlsx)
  • JSON
  • JSONL
  • SQL insert script

All files are open-source and available here:

🔗 https://github.com/jamsshhayd/world-cities-translations

Hopefully this saves other developers and data engineers some time. Let me know if you’d like to see additional formats or data fields added!

submitted by /u/jamsshhayd

In Search Of A Dataset Of 1-to-1 Chats For Sentiment Analysis

I would like to train a model to estimate the mood of a 1-to-1 chat. A good starting point would be a classic sentiment analysis dataset that labels each message as positive, negative, or neutral, or, even better, assigns each message a “positiveness” score, for example in the range [-1, 1]. Ideally, though, the perfect dataset for my goal would consist of full conversations: every data point should be a series of N messages from both sides that all share the same context. For example, if I message a friend asking for his opinion about a movie, the single data point should contain all the messages we send each other, from my question until we stop talking and go do something else. Does anyone know of a free dataset of either type?
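To make the two dataset shapes concrete, here is a small sketch of how a conversation-level data point with per-message scores could be represented. The class and field names are hypothetical, and taking the mean of the message scores as the conversation mood is just one simple aggregation, not something the post specifies.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Message:
    sender: str
    text: str
    score: float  # per-message sentiment label in [-1, 1]

    def __post_init__(self):
        if not -1.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [-1, 1]")

@dataclass
class Conversation:
    """One data point: all messages from both sides sharing one context."""
    messages: list[Message]

    def mood(self) -> float:
        # Simplest possible aggregate: average of the message scores.
        return mean(m.score for m in self.messages)

chat = Conversation([
    Message("me", "What did you think of the movie?", 0.0),
    Message("friend", "Pretty good, better than I expected.", 0.5),
    Message("me", "Great, I loved it too!", 1.0),
])
print(chat.mood())  # 0.5
```

A per-message sentiment dataset would supply only the `Message` rows; the conversation-level dataset additionally needs the grouping that `Conversation` represents.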

submitted by /u/samas69420

Looking For A Dataset Of Telemedicine Companies And Their CEOs

Hello Reddit,

I’m currently conducting research and am looking for a comprehensive dataset or source that lists telemedicine companies or startups along with the names of their CEOs and websites. Ideally, I’d prefer a structured format such as CSV, Excel, or a Google Sheet, but even a reliable list or database would be helpful.

If anyone has compiled this information or knows where I could find it (public databases, APIs, industry reports, etc.), your guidance would be greatly appreciated.

Thank you in advance!

submitted by /u/WhizCanadian

An Alternative Cloudflare AutoRAG MCP Server

I built an MCP server that works a little differently from the Cloudflare AutoRAG MCP server. It offers control over the match threshold and the maximum number of results. It also doesn’t provide an AI-generated answer, but rather a basic search or an AI-ranked search. My logic was that if you’re using AutoRAG through an MCP server, you are already using your LLM of choice, and you might prefer to let your own LLM generate the response from the chunks rather than the Cloudflare LLM, especially since in Claude Desktop you have access to larger, more powerful models than what you can run in Cloudflare.
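As an illustration only (this is not the server’s code, and the `score` and `text` keys are assumptions about what a search result looks like), the two controls mentioned above amount to filtering retrieved chunks by a minimum score and capping how many are returned:

```python
def select_chunks(chunks, match_threshold=0.5, max_results=3):
    """Keep chunks scoring at least match_threshold, best first, capped at max_results."""
    kept = [c for c in chunks if c["score"] >= match_threshold]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:max_results]

# Hypothetical search results with similarity scores.
chunks = [
    {"text": "a", "score": 0.9},
    {"text": "b", "score": 0.2},
    {"text": "c", "score": 0.7},
    {"text": "d", "score": 0.6},
]
selected = select_chunks(chunks, match_threshold=0.5, max_results=2)
print([c["text"] for c in selected])  # ['a', 'c']
```

The selected chunks would then be handed to the client’s own LLM, which writes the answer itself.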

submitted by /u/brass_monkey888

Newly Uploaded Dataset On Subdomains Of Huge Tech Companies

I have always wondered how large companies arrange their subdomains into patterns! After yesterday’s efforts, I have uploaded a dataset to Kaggle containing the subdomains of top tech companies. It could help aspiring internet startups analyse subdomain patterns and adopt them to save precious time. The link to the dataset is below. Any feedback is much appreciated. Thanks.
Link – https://www.kaggle.com/datasets/jacob327/subdomain-dataset-for-top-tech-companies

submitted by /u/stardep

Datasets Relevant To Hurricanes Katrina And Rita

I am responsible for data acquisition for a work project in which we are assessing the impacts of Hurricanes Katrina and Rita.

We are interested in impacts relevant to coastal and environmental health, healthcare, education, and the economy. I have already found FBI crime data, and am using the rfema package in RStudio to get additional data from FEMA.

Any other suggestions? I have checked out USGS already and can’t seem to find a dataset that is especially helpful.

Thanks!

submitted by /u/elifted

Finally Built The Dataset Generator Thing I Mentioned Earlier

hey! just wanted to share an update, a while back I posted about a tool I was building to generate synthetic datasets. I had said I’d share it in 2–3 days, but ran into a few hiccups, so sorry for the delay. finally got a working version now!

right now you can:

  • give a query describing the kind of dataset you want
  • it suggests a schema (you can fully edit — add/remove fields, tweak descriptions, etc.)
  • it shows a list of related subtopics (also editable — you can add, remove, or even nest subtopics)
  • generate up to 30 sample rows per subtopic
  • download everything when you’re done

there’s also another section I’ve built (not open yet — it works, just a bit resource-heavy and I’m still refining the deep research approach):

  • upload a file (like a PDF or doc) — it generates an editable schema based on the content, then builds a dataset from it
  • paste a link — it analyzes the page, suggests a schema, and creates data around it
  • choose “deep research” mode — it searches the internet for relevant information, builds a schema, and then forms a dataset based on what it finds
  • there’s also a basic documentation feature that gives you a short write-up explaining the generated dataset

this part’s closed for now, but I’d really love to chat and understand what kind of data stuff you’re working on — helps me improve things and get a better sense of the space.

you can book a quick chat via Calendly, or just DM me here if that’s easier. once we talk, I’ll open up access to this part also

try it here: datalore.ai

submitted by /u/Interesting-Area6418

AI To Clean Up Names In CSV Lead List

I’m having such a difficult time dealing with edge cases while cleaning up 50k leads to be imported into our CRM. I’ve tackled this with multiple Python scripts, but the accuracy is still too low and too many edge cases are left for manual correction. Is there an AI that can simply look at a name and decide whether it’s a company or a person?
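One way to shrink the manual pile before reaching for a model is a rule-based first pass that labels only the obvious rows and flags everything else. The sketch below is a baseline under stated assumptions, not a solution: the marker list is illustrative and incomplete, the word pattern only handles ASCII letters, and `classify_name` is a hypothetical helper.

```python
import re

# Illustrative, incomplete list of words that usually signal a business.
COMPANY_MARKERS = re.compile(
    r"\b(inc|llc|ltd|corp|co|gmbh|plc|group|holdings|solutions|services|associates)\b\.?",
    re.IGNORECASE,
)
# A name-like word: ASCII letters plus apostrophes, dots and hyphens.
NAME_WORD = re.compile(r"[A-Za-z][A-Za-z'.\-]*")

def classify_name(name):
    """Return 'company', 'person', or 'unknown' for rows that need review."""
    cleaned = name.strip()
    if COMPANY_MARKERS.search(cleaned) or re.search(r"[0-9&@]", cleaned):
        return "company"
    words = cleaned.split()
    if 2 <= len(words) <= 3 and all(NAME_WORD.fullmatch(w) for w in words):
        return "person"
    return "unknown"

for lead in ["Acme Holdings LLC", "Jane Doe", "Smith & Sons", "Northwind"]:
    print(lead, "->", classify_name(lead))
```

Only the rows returned as `unknown` would then need a classifier or a human, which keeps any model calls to the genuinely ambiguous cases.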

submitted by /u/Boullionaire

Need Help With Manufacturing Data Set

Good evening, I need one comprehensive data set for a manufacturing facility to perform the following in an academic project:

1- Forecasting (Exponential Smoothing)

2- Aggregate Planning

3- Material Requirements Planning (MRP)

4- Inventory Management

Could anyone help?
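For item 1, simple exponential smoothing needs nothing more than a per-period demand series, which helps when judging whether a candidate data set is sufficient. A minimal sketch, assuming the first forecast is initialised with the first observation (one common convention among several) and using made-up demand figures:

```python
def exponential_smoothing(demand, alpha):
    """Simple exponential smoothing: F[t+1] = alpha * D[t] + (1 - alpha) * F[t].

    Returns one forecast per period plus the forecast for the next period.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Initialise the first forecast with the first observation.
    forecasts = [float(demand[0])]
    for actual in demand:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts

# Made-up monthly demand for illustration.
demand = [100, 120, 80]
forecasts = exponential_smoothing(demand, alpha=0.5)
print(forecasts)  # [100.0, 100.0, 110.0, 95.0]
```

The last value is the forecast for the period after the data ends, which is the number that would feed aggregate planning and MRP.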

submitted by /u/Bl00djunkie

Looking For Datasets Of Small Businesses (like Bakeries) With EDA – Any Suggestions?

Hey everyone,

I’m working on a project that involves analyzing small/local businesses, specifically bakeries, cafés, and similar retail setups. I’m looking for datasets that include granular operational data, such as:

  • Every sale and transaction
  • Product-level data (what was sold, when, and how often)
  • Pricing information
  • Inventory levels or stock movement
  • Possibly some historical trends or time-series data

It’d be great if any of this comes with some initial exploratory data analysis (EDA) or summaries to help get oriented.

Does anyone know where I can find this kind of dataset, either free or reasonably priced? Also, if you’ve worked on similar data, which providers would you recommend that are reliable and affordable for R&D or prototyping?

Thanks in advance! Really appreciate any leads, tips, or suggestions.

submitted by /u/69sheeesh420