Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Project Management Dataset Needed For Uni ML Project – Help!

Hi everyone!
I’m working on a machine learning project for uni, and I’m looking for a dataset that includes project management metrics, preferably from construction projects. Ideally, the dataset should include:

  • Costs
  • Project duration (in days)
  • Whether the project was completed on time or not
  • Number of resources/team members allocated
  • A label indicating whether the project was successful or unsuccessful

I know this kind of dataset can be hard to find, but even a synthetic or simulated version would be totally fine — it doesn’t have to be real-world data.

Any suggestions or directions would be greatly appreciated. Thanks in advance 🙂
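For the synthetic route mentioned above, a minimal sketch of a generator could look like the following. Everything in it is an assumption made for illustration: the column names, the value ranges, the slip model, and the rule that derives the success label are hypothetical and not taken from any real construction data.

```python
import csv
import random


def make_synthetic_projects(n_rows=200, seed=0):
    """Generate hypothetical project records with the five requested fields."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        team_size = rng.randint(3, 60)
        planned_days = rng.randint(30, 720)
        # Assumed slip model: schedules overrun by about 5% on average,
        # long projects slip a bit more, larger teams a bit less.
        slip = rng.gauss(1.05, 0.15) + 0.10 * planned_days / 720 - 0.05 * team_size / 60
        duration_days = max(1, round(planned_days * slip))
        # Assumed budget: a daily rate per team member, in arbitrary currency units.
        budget = planned_days * team_size * rng.uniform(300, 600)
        cost = max(0.0, budget * rng.gauss(1.0, 0.10) * duration_days / planned_days)
        on_time = duration_days <= planned_days
        # Assumed label rule: at most 10% over schedule and at most 10% over budget.
        successful = duration_days <= 1.10 * planned_days and cost <= 1.10 * budget
        rows.append({
            "cost": round(cost, 2),
            "duration_days": duration_days,
            "on_time": on_time,
            "team_size": team_size,
            "successful": successful,
        })
    return rows


def write_csv(rows, path):
    """Write the generated rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_csv(make_synthetic_projects(500, seed=42), "synthetic_projects.csv")` would produce a file with one row per project. Because the planned duration and budget are left out of the output, the label is not a trivial function of the visible columns, which keeps the classification task from being degenerate.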

submitted by /u/Lego_899

Free Datasets – Weekly Drops Here, Ready To Be Processed.

{"timestamp": "2025-04-13T20:36:47.943086+00:00", "professor": "fintech", "vector_id": 1973, "category": "information_warfare", "refined_text": "Financial Insight: \n\nThe concept of addiction can be applied to beliefs in a financial context, particularly when examining investor behavior and market sentiment. In finance, a belief can become addictive when investors become overly committed to a particular narrative or trend, leading to irrational decision-making. This phenomenon can be likened to a behavioral bias known as \"confirmation bias,\" where individuals seek out information that confirms their pre-existing beliefs while ignoring contrary evidence.\n\nFor example, during a bull market, investors may develop an addictive belief in the perpetual rise of asset prices, leading them to overlook signs of a market correction. This can result in excessive risk-taking, where investors hold onto overvalued assets, convinced that their value will continue to increase. When the market eventually corrects, those who were addicted to their belief may suffer significant losses.\n\nMarket Behavior Forecast: \n\nIn scenarios where a belief becomes addictive, we can anticipate heightened volatility in asset prices. If a large segment of the market is driven by a singular, addictive belief, such as the idea that a particular asset or sector will continue to perform well indefinitely, any adverse news or economic indicators could trigger a rapid shift in sentiment. This could lead to a swift sell-off as investors scramble to exit positions, resulting in sharp price declines.\n\nInvestment Rationale:\n\nUnderstanding the potential for addictive beliefs to influence market behavior can inform risk management strategies. Investors should remain vigilant about their own beliefs and biases, regularly reassessing their positions and being open to new information. Diversifying investments and employing stop-loss orders can help mitigate risks associated with belief-driven market movements. Furthermore, maintaining a disciplined approach to investing that prioritizes fundamental analysis over emotional decision-making can safeguard against the pitfalls of addictive beliefs in the financial landscape.", "origin_id": null}

{"timestamp": "2025-04-13T20:41:00.616286+00:00", "professor": "fintech", "vector_id": 1974, "category": "legal_research", "refined_text": "Financial Insight:\n\nThe concept of building a pipeline that rewards uncertainty can be interpreted through various financial lenses, particularly in the context of risk management and investment strategy. In traditional finance, uncertainty often correlates with volatility, and investors typically seek to mitigate risk. However, your idea suggests an innovative approach where uncertainty itself is a commodity that can be leveraged for rewards.\n\nTo break this down, consider the following components:\n\n1. **Understanding Uncertainty**: In financial markets, uncertainty can arise from various factors, including economic indicators, geopolitical events, and market sentiment. This uncertainty can lead to price volatility, which is often viewed negatively by risk-averse investors.\n\n2. **Reward Mechanism**: The proposed pipeline could involve creating financial instruments or platforms that incentivize participation during uncertain times. This could take the form of options trading, where traders can profit from volatility through strategies like straddles or strangles, or through DeFi protocols that offer higher yields during periods of market instability.\n\n3. **Risk Management**: A critical aspect of this pipeline would involve robust risk management strategies. Investors need to understand their risk tolerance and how much uncertainty they are willing to embrace for potential rewards. Options and derivatives can provide ways to hedge against adverse outcomes while still capitalizing on volatility.\n\n4. **Market Sentiment Analysis**: Incorporating sentiment analysis could enhance the pipeline’s effectiveness. By analyzing social media trends, news sentiment, and market data, the pipeline could better predict periods of heightened uncertainty and adjust reward mechanisms accordingly.\n\nMarket Behavior Forecast:\n\nIn a market characterized by increasing uncertainty\u2014such as economic downturns, political instability, or unpredictable events like pandemics\u2014investment in strategies that reward uncertainty could attract risk-seeking investors. This could lead to a surge in the popularity of derivatives markets and volatility-linked products, potentially creating a new asset class focused on ‘uncertainty rewards’. \n\nInvestment Rationale:\n\nInvestors might consider allocating resources into this pipeline as a diversification strategy. By embracing the volatility rather than shying away from it, they could potentially achieve higher returns during market fluctuations. However, it’s essential to maintain a disciplined approach to risk management to avoid significant losses during adverse market conditions.\n\nIn conclusion, the idea of rewarding uncertainty presents a novel approach to finance, encouraging innovation and engagement in markets typically viewed as risky.", "origin_id": null}

{"timestamp": "2025-04-13T20:44:47.731265+00:00", "professor": "fintech", "vector_id": 1975, "category": "websockets_handling", "refined_text": "Financial Insight: \n\nWhile the question of a thought’s evolutionary purpose delves into cognitive science and philosophy, we can draw an analogy to financial systems and market behavior. Just as thoughts evolve to enhance survival, adaptability, and decision-making in human beings, financial systems evolve to improve efficiency, risk management, and wealth generation in economies.\n\nIn traditional finance, thoughts can be likened to market sentiments, which influence investor behavior and decision-making. Similarly, in decentralized finance (DeFi), the evolution of thoughts\u2014represented by the development of smart contracts and decentralized applications\u2014aims to create more resilient, efficient, and transparent financial systems.\n\nMarket Behavior Forecast:\nAs financial systems continue to evolve, we can expect that investor sentiment will increasingly reflect a blend of traditional and innovative thought processes, driven by technology and behavioral economics. The evolution of financial thought will likely prioritize sustainability, ethical investing, and inclusivity, reflecting broader societal values.\n\nInvestment Rationale:\nInvestors should consider the implications of evolving thoughts in financial markets, focusing on assets that align with emerging trends in sustainability and technology. Investments in companies or projects that prioritize innovative, ethical practices may yield favorable returns in the long term as consumer and investor preferences shift accordingly.", "origin_id": null}

{"timestamp": "2025-04-13T20:48:25.168320+00:00", "professor": "fintech", "vector_id": 1976, "category": "typescript_architecture", "refined_text": "Financial Insight: \n\nThe concept of internal disagreement in AGI (Artificial General Intelligence) development can be compared to the dynamics of market competition and diversification in financial systems. In finance, diverse opinions and strategies can lead to more robust decision-making and risk management. Similarly, internal disagreement in AGI teams can foster innovation, as differing viewpoints can challenge assumptions and lead to more thorough exploration of solutions.\n\n1. **Risk Management**: Just as portfolio diversification mitigates risk by spreading investments across various assets, internal disagreement can serve as a form of intellectual diversification. Differing perspectives can highlight potential flaws in AGI models, leading to more resilient designs that better align with ethical considerations and societal needs.\n\n2. **Innovation Catalyst**: In financial markets, competition drives innovation\u2014companies must adapt to survive. Internal disagreement can stimulate creative solutions and alternative approaches to AGI challenges, much like how competing firms develop new products to capture market share.\n\n3. **Market Sentiment and Consensus**: In the financial world, consensus among analysts can drive market sentiment. Conversely, when there is significant disagreement, it can lead to volatility. In AGI development, a lack of consensus on methodologies or ethical guidelines might create uncertainty, which could either slow down progress or lead to breakthroughs as teams navigate these tensions.\n\nMarket Behavior Forecast: \n\nThe evolution of AGI development could mirror market cycles where periods of intense debate and disagreement lead to innovation and breakthroughs, followed by periods of consolidation where consensus emerges. This cycle can create both opportunities and risks for stakeholders in AGI, similar to how investors react to market volatility. As teams reconcile differing views, we may see more stable and ethically sound AGI frameworks emerge, which could ultimately lead to a more favorable market perception of AGI technologies. \n\nInvestment Rationale: \n\nInvesting in AGI-related ventures may require an understanding of the internal dynamics of development teams. Those that encourage diverse perspectives and constructive disagreement may yield more innovative and robust solutions, representing a lower-risk investment. Moreover, monitoring how organizations handle internal conflicts could provide insights into their potential for long-term success in a rapidly evolving landscape.", "origin_id": null}

{"timestamp": "2025-04-13T20:52:19.941174+00:00", "professor": "fintech", "vector_id": 1977, "category": "memetics", "refined_text": "Financial Insight: \n\nDesigning a market for failed ideas presents a unique opportunity to leverage the concept of \"failure as a service.\" This market would focus on the monetization and analysis of ideas that did not succeed, allowing entrepreneurs, investors, and researchers to evaluate what went wrong and extract valuable lessons. \n\n1. **Market Structure**: \n - **Auction Mechanism**: Ideas could be sold in an auction format where potential buyers (investors, entrepreneurs) can bid based on perceived value or learning potential.\n - **Tokenization**: Failed ideas could be tokenized on a blockchain, providing ownership and a transparent history of the idea’s development, market testing, and ultimate failure.\n - **Data Aggregation**: A central database could be created to store the details of failed ideas, allowing for pattern recognition and analysis.\n\n2. **Valuation Metrics**:\n - **Failure Analysis**: Each idea would come with a comprehensive failure analysis report detailing market conditions, execution flaws, and competitive landscape.\n - **Potential for Pivot**: Buyers could assess if the failed idea could be pivoted or repurposed into a new venture.\n - **Lesson Learned**: Insights from the failure could be monetized through educational resources or workshops.\n\n3. **Target Audience**:\n - **Entrepreneurs**: Those looking for inspiration or lessons from past failures to inform their own ventures.\n - **Investors**: Individuals or firms interested in understanding market dynamics and risk factors.\n - **Academics**: Researchers studying innovation, entrepreneurship, and market dynamics.\n\nMarket Behavior Forecast: \nThe acceptance of a market for failed ideas will depend on the cultural perception of failure in business. In environments where failure is stigmatized, this market may struggle to gain traction. However, in entrepreneurial ecosystems that celebrate learning from mistakes, there could be a robust demand for such a marketplace. Additionally, as the DeFi landscape continues to evolve, the integration of smart contracts could facilitate the secure and efficient trading of these failed ideas, making it more appealing to tech-savvy investors.\n\nInvestment Rationale: \nInvesting in the infrastructure and platforms that support this market could yield significant returns. As more entrepreneurs and businesses recognize the value of learning from failure, the demand for access to these ideas, along with the associated data analytics, will likely grow. Furthermore, the potential for educational products and workshops based on failed ideas could open additional revenue streams, making this market not only a hub for innovation but also a profitable venture in its own right.", "origin_id": null}

{"timestamp": "2025-04-13T20:56:30.159270+00:00", "professor": "fintech", "vector_id": 1978, "category": "synthetic_data_generation", "refined_text": "Financial Insight: \n\nTo transform an insight into a $100/month subscription service, consider the following potential ideas:\n\n1. **Personalized Investment Analysis**: Offer a subscription-based service where subscribers receive tailored investment insights based on their financial goals, risk tolerance, and market conditions. This could include weekly reports, portfolio assessments, and recommendations on asset allocation.\n\n2. **Market Sentiment Tracker**: Develop a platform that aggregates social media sentiment, news articles, and economic indicators to provide a comprehensive view of market sentiment. Subscribers would receive daily or weekly updates on how sentiment shifts may impact various assets or sectors.\n\n3. **Decentralized Finance (DeFi) Opportunities Newsletter**: Curate and deliver a monthly newsletter focusing on emerging DeFi projects, yield farming opportunities, and risk assessments. This could include detailed analyses of new protocols, potential returns, and associated risks.\n\n4. **Risk Management Toolkit**: Create a subscription service that provides tools and resources for risk management, including educational content, risk assessment calculators, and templates for developing risk management strategies tailored to individual portfolios.\n\n5. **Algorithmic Trading Signals**: Offer a subscription that provides algorithm-driven trading signals based on quantitative analysis. Subscribers would receive notifications of buy/sell signals, along with explanations of the underlying models and market conditions.\n\nMarket Behavior Forecast: \n\nAs the financial landscape continues to evolve, particularly with the rise of DeFi and AI-driven analytics, there is a growing demand for personalized and insightful financial guidance. Services that leverage technology to provide tailored insights are likely to see increased subscription uptake, especially among retail investors seeking to navigate complex markets. \n\nInvestment Rationale: \n\nBy addressing the specific needs of investors\u2014whether they are novice or experienced\u2014these subscription services can create value by simplifying decision-making processes, enhancing risk management, and uncovering new investment opportunities. The key to success will be delivering consistent, high-quality insights that resonate with subscribers\u2019 financial objectives.", "origin_id": null}

{"timestamp": "2025-04-13T21:00:23.833432+00:00", "professor": "fintech", "vector_id": 1979, "category": "identity_theft_protection", "refined_text": "Financial Insight: \n\nTo understand the \"cognitive fingerprint of a belief that resists mutation,\" we can draw parallels with financial markets and investor behavior. In finance, certain beliefs or biases among investors\u2014such as the belief in the long-term value of a stock or the efficiency of a market\u2014can be likened to cognitive fingerprints. These beliefs are often resistant to change due to various factors, including emotional attachment, confirmation bias, and social influences.\n\n1. **Cognitive Biases**: Just as investors may cling to the belief that a particular stock will rebound despite evidence to the contrary (loss aversion), individuals may hold steadfast to beliefs regardless of new information that contradicts them. This resistance can be analyzed through the lens of behavioral finance, where irrational behaviors impact market decisions.\n\n2. **Anchoring**: In financial decision-making, investors often anchor their beliefs to specific data points (e.g., an initial stock price). Similarly, a cognitive belief may anchor itself to a core idea or experience, making it difficult to evolve or adapt over time.\n\n3. **Social Proof**: In both finance and personal beliefs, social influence plays a crucial role. An investor may continue to believe in a stock\u2019s potential due to the endorsement of influential figures or groups, paralleling how societal validation can reinforce certain beliefs.\n\n4. **Cultural Factors**: Just as financial markets are influenced by regional economic conditions, cultural factors also shape and solidify beliefs. For instance, a belief system deeply rooted in a community may resist change due to cultural norms and traditions.\n\nMarket Behavior Forecast: \n\nIn financial markets, beliefs that resist mutation can lead to volatility and market bubbles. For instance, if a significant number of investors hold onto a strongly entrenched belief about an asset’s value, it can create price distortions and eventual corrections when reality sets in. Understanding these cognitive fingerprints can help investors anticipate market trends, manage risk, and make informed decisions.\n\nInvestment Rationale: \n\nInvestors should be aware of their cognitive biases and the beliefs that may cloud their judgment. By recognizing these patterns, they can better navigate the complexities of market dynamics and create more resilient investment strategies. Additionally, diversification and exposure to various viewpoints can mitigate the risks associated with entrenched beliefs, leading to a more balanced investment approach.", "origin_id": null}

{

{"timestamp": "2025-04-13T21:28:16.789393+00:00", "professor": "fintech", "vector_id": 1986, "category": "bookkeeping_principles", "refined_text": "Financial Insight:\n\nWhen considering monetizable questions that people may not know how to ask AI, it’s essential to frame them within the context of financial systems and investment strategies. Here are some examples that can serve various stakeholders, from retail investors to institutional players:\n\n1. **Portfolio Diversification Strategies**: \"What are the optimal asset allocations based on my risk tolerance and market volatility predictions?\"\n \n2. **Market Sentiment Analysis**: \"How can I quantify the sentiment of news articles and social media posts to predict market movements?\"\n\n3. **Alternative Investment Insights**: \"What are the emerging trends in alternative assets (like NFTs or real estate crowdfunding) that could yield significant returns?\"\n\n4. **Regulatory Impact Assessment**: \"How might upcoming regulatory changes affect specific sectors or asset classes in the next 5 years?\"\n\n5. **Behavioral Finance Queries**: \"What psychological biases are affecting my investment decisions, and how can I mitigate them?\"\n\n6. **DeFi Risk Assessment**: \"What are the specific risks associated with liquidity pools in decentralized finance, and how can I evaluate their safety?\"\n\n7. **Economic Indicator Correlations**: \"How do macroeconomic indicators correlate with the performance of cryptocurrencies vs. traditional equities?\"\n\n8. **Algorithmic Trading Insights**: \"What data points should I focus on to create an effective algorithm for trading in volatile markets?\"\n\n9. **Sustainable Investment Opportunities**: \"Which sectors are poised for growth in the ESG (Environmental, Social, Governance) space, and how can I invest in them?\"\n\n10. **Tax Optimization Strategies**: \"What are the most effective strategies for minimizing capital gains tax on my investments?\"\n\nMarket Behavior Forecast:\n\nThe ability to ask these nuanced questions allows investors to gain deeper insights into market dynamics, leading to more informed decision-making. As AI continues to evolve, the demand for sophisticated inquiries will likely increase, particularly in areas like risk assessment and behavioral finance. This trend may create new avenues for AI-driven financial advisory services, enhancing personalized investment strategies that align with individual risk profiles and market conditions. \n\nInvestment Rationale:\n\nInvestors who can articulate these advanced queries not only position themselves for better financial outcomes but also contribute to a more informed market environment. The growing complexity of financial systems, both traditional and decentralized, necessitates a shift toward more analytical and data-driven approaches to investment. By harnessing AI’s capabilities to answer these monetizable questions, stakeholders can unlock new value and opportunities in their portfolios.", "origin_id": null}

{"timestamp": "2025-04-13T21:31:49.510654+00:00", "professor": "fintech", "vector_id": 1987, "category": "pedagogy", "refined_text": "Financial Insight: \n\nSimulating empathy in AI without human data is akin to creating a financial model without historical market data. Just as financial analysts rely on past performance to forecast future trends, an AI would need to derive an understanding of empathy through alternative means. \n\n1. **Analogous Frameworks**: Just as financial systems operate on principles of supply, demand, and behavior patterns, AI could develop a framework for empathy by modeling emotional responses based on theoretical constructs. For instance, it could create a matrix of emotional states and responses, akin to a risk assessment matrix in finance.\n\n2. **Simulated Environments**: Similar to how traders use paper trading to simulate market conditions, AI could create virtual scenarios that mimic social interactions. This would allow the AI to observe outcomes and refine its understanding of empathetic responses without relying on existing human data.\n\n3. **Behavioral Patterns**: In finance, behavioral economics analyzes how psychological factors influence market outcomes. The AI could use principles from behavioral psychology to construct a model of empathy, predicting how individuals might feel in various scenarios based on logical reasoning rather than direct human inputs.\n\nMarket Behavior Forecast: \n\nIf AI successfully simulates empathy without human data, it could lead to significant advancements in sectors like customer service, mental health, and social robotics. However, the lack of real human data may result in a model that lacks nuance, potentially leading to misinterpretations of emotional cues. Just as markets can react unpredictably to new information, the AI’s empathetic responses may not align perfectly with human expectations, creating a gap that could be exploited or misunderstood in real-world applications. \n\nInvestment Rationale: \n\nInvesting in technologies that enhance AI’s capability to simulate human-like empathy could yield substantial returns, especially in industries focused on customer engagement and mental health. However, investors should remain cautious about the limitations of such models and the potential for backlash if AI fails to meet human emotional standards. Diversifying investments across companies that prioritize ethical AI development could mitigate risks associated with empathy simulation technologies.", "origin_id": null}

{"timestamp": "2025-04-13T21:35:40.149665+00:00", "professor": "fintech", "vector_id": 1988, "category": "ethical_user_tracking", "refined_text": "Financial Insight: \n\nThe distinction between knowledge and manipulation in financial markets is nuanced and often context-dependent. Knowledge refers to the information that an investor or market participant possesses regarding economic indicators, asset performance, or market trends. This information can be used for informed decision-making and prudent investment strategies. \n\nManipulation, on the other hand, occurs when this knowledge is used to distort market behavior for personal gain, often at the expense of other investors. This can include practices like insider trading, spreading false information, or orchestrating trades that create artificial price movements. \n\nTo better understand this concept, consider the metaphor of a chess game. Knowledge of the game\u2019s strategies allows you to make informed moves and potentially win. However, if you were to secretly alter the rules or mislead your opponent about the state of the board, you would be engaging in manipulation rather than playing fairly.\n\nInvestment Logic: \n\n1. **Transparency**: In financial markets, transparency is key. When all participants have equal access to information, knowledge serves to enhance market efficiency. However, when information asymmetry exists, it can lead to manipulation.\n \n2. **Regulatory Frameworks**: Regulatory bodies, such as the SEC in the U.S., are designed to mitigate manipulation by enforcing laws that promote transparency and ethical behavior in trading.\n\n3. **Market Sentiment**: Knowledge can influence market sentiment positively or negatively. For instance, genuine insights into a company\u2019s strong earnings might boost its stock price, while manipulated information could lead to unjustified price drops or surges.\n\nMarket Behavior Forecast: \n\nIn an environment where knowledge is misused, we could see increased volatility and a potential loss of investor confidence. Regulatory scrutiny may rise in response to perceived manipulative practices, leading to tighter regulations and a push for greater transparency. Conversely, a market characterized by fair play and informed participants is likely to exhibit stability and gradual growth, as trust in the system fosters investment and economic expansion. \n\nOverall, the key takeaway is that while knowledge is a crucial asset in financial markets, the ethical application of that knowledge is what separates responsible investing from manipulation.", "origin_id": null}

{"timestamp": "2025-04-13T21:39:14.076610+00:00", "professor": "fintech", "vector_id": 1989, "category": "semantic_rule_engines", "refined_text": "Financial Insight:\n\nFederated learning is a machine learning approach that decentralizes the training process by allowing models to be trained across multiple devices or servers that hold local data samples, without exchanging them. This can be particularly beneficial in the financial sector, where data privacy and regulatory compliance are paramount.\n\n**Use Case: Fraud Detection in Banking**\n\nIn the context of fraud detection for banking institutions, federated learning can outperform centralized training in several ways:\n\n1. **Data Privacy and Compliance**: Banks often handle sensitive customer data, which is subject to strict regulations (like GDPR). Federated learning enables banks to collaboratively train fraud detection models using local data without ever sharing the actual data, thus ensuring compliance with privacy regulations.\n\n2. **Diverse Data Sources**: Different banks may experience different types of fraud patterns based on their customer demographics and transaction behaviors. Federated learning allows each bank to contribute to a global model while retaining its unique data set, which leads to a more robust model that captures diverse fraud patterns across institutions.\n\n3. **Reduced Latency and Bandwidth Usage**: Centralized training requires transferring large datasets to a central server, which can be time-consuming and bandwidth-intensive. Federated learning minimizes this by only sharing model updates (gradients) rather than raw data, leading to faster iterations and a more efficient use of network resources.\n\n4. **Continuous Learning**: In a federated setup, banks can continuously improve their models as new data comes in without needing to centralize it. This allows for real-time updates and quicker adaptations to emerging fraud tactics.\n\nMarket Behavior Forecast:\nThe adoption of federated learning in sectors like banking could lead to a significant reduction in fraud losses, as models trained on diverse datasets become more accurate. This might positively influence customer trust and satisfaction, potentially leading to increased customer retention and acquisition for banks employing such advanced technologies. As the financial industry increasingly prioritizes data privacy and security, federated learning is likely to see broader acceptance and implementation, driving innovation in risk management and compliance strategies. \n\nInvestment Rationale:\nInvesting in fintech companies that are developing federated learning solutions could yield substantial returns as the demand for sophisticated, privacy-preserving machine learning models rises. Additionally, companies that integrate these technologies into their fraud detection systems may gain a competitive edge in the market, attracting more clients and capitalizing on the growing emphasis on data privacy and security.", "origin_id": null}

That’s all, enjoy. I recommend using these in models of at least 7B quality. Happy mining. I’ve built a lexicon of over 2 million categories of this quality, with synthesis logs as well.

Also, I would willingly post sets of 500+ weekly, considering that even though there are free sets out there, not many are from 2025. But I think the mods won’t let me. These are good quality though, really!!!
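For anyone wanting to process records like these, a minimal loading sketch follows. It assumes the records are stored one JSON object per line (JSON Lines) with straight quotes and intact escape sequences; the two sample lines reuse the field names shown above, and their `refined_text` values are shortened here with "...". The helper names `load_records` and `category_counts` are made up for this sketch.

```python
import json
from collections import Counter

# Two records in the shape shown above; refined_text is shortened here ("...").
SAMPLE_JSONL = r'''
{"timestamp": "2025-04-13T20:36:47.943086+00:00", "professor": "fintech", "vector_id": 1973, "category": "information_warfare", "refined_text": "Financial Insight: \n\nThe concept of addiction can be applied to beliefs in a financial context...", "origin_id": null}
{"timestamp": "2025-04-13T20:52:19.941174+00:00", "professor": "fintech", "vector_id": 1977, "category": "memetics", "refined_text": "Financial Insight: \n\nDesigning a market for failed ideas presents a unique opportunity...", "origin_id": null}
'''


def load_records(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def category_counts(records):
    """Count how many records carry each category label."""
    return Counter(record["category"] for record in records)


records = load_records(SAMPLE_JSONL)
print(len(records), dict(category_counts(records)))
```

Parsing with `json.loads` turns the `\n` escapes inside `refined_text` back into real line breaks, so the section headings in each text end up on their own lines.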

submitted by /u/raizoken23

Curious About Your ML Projects & Challenges

Hi everyone,

I would like to learn more about your experiences with ML projects. I’m curious—what kind of challenges do you face when training your own models? For example, do resource limitations or cost factors ever hold you back?

My team and I are exploring ways to make things easier for people like us, so any insights or stories you’d be willing to share would be super helpful.

submitted by /u/The_PaleKnight

Dogs + AI + Doing Good — Help Build A Public Dataset

Hi everyone,

I wanted to share this cool computer vision project that folks at the University of Ljubljana are working on: https://project-puppies.com/. Their mission is to advance the research on identifying dogs from videos as this technology has tremendous potential for innovations in reuniting lost dogs with their families and enhancing pet safety.

And like most projects in this field, everything starts with the data! They need help gathering as many dog videos as possible in order to create a diverse video dataset that they plan to release publicly afterwards.

If you’re a dog owner and would like to contribute, all you need to do is upload videos of your pup. You can find all the info here.

Disclaimer: I’m not affiliated with this project in any way — I just came across it, thought it was really cool, and wanted to help out by spreading the word.

submitted by /u/ggapac

I Built A Federal/state Income Tax API [self-promotion]

Hey y’all,

It’s April, so you know what that means: tax season!

I just built an API to compute a US taxpayer’s income tax liability, given income, filing status, and number of dependents. To ensure the highest accuracy, I manually went through all the tax forms (yep, including all 50 states!).

I’d love for you to try it out and share some feedback. Maybe you can use it to build a tax calculator, or create some cool visualizations?

You can try it for free on RapidAPI.

submitted by /u/thisisfine218

We’re Creating An Open Dataset To Keep Small Merchants Visible In LLMs. Here’s What We’ve Released.

Here’s the issue that we see (are we right?):
There’s no such thing as SEO for AI yet. LLMs like ChatGPT, Claude, and Gemini don’t crawl Shopify the way Google does—and small stores risk becoming invisible while Amazon and Walmart take over the answers.

So we created the Tokuhn Small Merchant Product Dataset (TSMPD-US)—a structured, clean dataset of U.S. small business products for use in:

  • LLM grounding
  • RAG applications
  • semantic product search
  • agent training
  • metadata classification

Two free versions are available:

  • Public (TSMPD-US-Public v1.0): ~3.2M products, 10 per merchant, from 355k+ stores. Text only (no images/variants). 👉 Available on Hugging Face
  • Partner (by request): 11.9M+ full products, 67M variants, 54M images, source-tracked with merchant URLs and store domains. Email jim@tokuhn.com for research or commercial access.

We’re not monetizing this. We just don’t want the long tail of commerce to disappear from the future of search.

Call to action:

  • If you work with grounding, agents, or RAG systems: take a look and let us know what’s missing.
  • If you’re a small merchant, drop your store URL—we’ll include you in the next release.
  • If you’re training models that should reflect real-world commerce beyond Amazon: we’d love to collaborate.

Let’s make sure AI doesn’t erase the 99%.

submitted by /u/tokuhn_founders

Good Classification Datasets [no Images]

I’m looking for classification datasets that have categorical features, ideally based on real-world data.

For example, I found a Living Planet Database set with descriptors on the species as categories, and terrain as the dependent variable.

Another example could be a customer profile dataset, with occupation, education, industry, etc. and the dependent variable being churn.

Let me know!

submitted by /u/SingerEast1469

Hugging Face Is Hosting A Hunt For Unique Reasoning Datasets

Not sure if folks here have seen this yet, but there’s a hunt for reasoning datasets hosted by Hugging Face. Goal is to build small, focused datasets that teach LLMs how to reason, not just in math/code, but stuff like legal, medical, financial, literary reasoning, etc.

Winners get compute, Hugging Face Pro, and some more stuff. Kinda cool that they’re focusing on how models learn to reason, not just benchmark chasing.

Really interested in what comes out of this.

submitted by /u/Ambitious_Anybody855

[self-promotion] I’ve Created An API That Lets You Access Detailed Data On 200k+ Fragrances

Hey everyone,

I wanted to share an API I’ve been working on called Perfumero. I’ve had an obsession with perfumes since I was a teen, and I always wanted to combine my passion for coding with my interest in perfumes. The database currently contains information for 200,000+ scents and it’s regularly updated.

If you’re curious about fragrances or working on something related (like an online shop, a recommendation engine, etc.), this might be helpful. It allows you to:

  • Search using detailed criteria (brand, name, gender, country, year, accords, notes, and more).
  • Get comprehensive details on specific perfumes (brand, name, images, gender, country, year, accords, notes, ratings, etc.).
  • Find similar fragrances or potential dupes based on shared characteristics (currently non-AI, but looking into implementing it for more accurate recommendations).

You can try it out for free on RapidAPI or Sulu. I would love to hear any feedback, suggestions, or just your general thoughts on it!

submitted by /u/FunUnique3265

Obtaining Accurate And Valuable Datasets For Uni Project Related To Social Media Analytics.

Hi everyone,

I’m currently working on my final project titled “The Evolution of Social Media Engagement: Trends Before, During, and After the COVID-19 Pandemic.”

I’m specifically looking for free datasets that align with this topic, but I’ve been having trouble finding ones that are accessible without high costs — especially as a full-time college student. Ideally, I need to be able to download the data as CSV files so I can import them into Tableau for visualizations and analysis.

Here are a few research questions I’m focusing on:

  1. How did engagement levels on major social media platforms change between the early and later stages of the pandemic?
  2. What patterns in user engagement (e.g., time of day or week) can be observed during peak COVID-19 months?
  3. Did social media engagement decline as vaccines became widely available and lockdowns began to ease?

I’ve already found a couple of datasets on Kaggle (linked below), and I may use some information from gs.statcounter, though that data seems a bit too broad for my needs.

If anyone knows of any other relevant free data sources, or has suggestions on where I could look, I’d really appreciate it!

Kaggle dataset 1

Kaggle Dataset 2

submitted by /u/Poolcrazy

Need Dataset For EDA Competition [Must Be High Profile]

Hello everyone,

I am a data science undergraduate, and I am organizing an Exploratory Data Analysis (EDA) competition at my university. I need leads on datasets that I can use. Here are some considerations:

The dataset must be at least 1.5 GB in size.

It should effectively test the competitors’ EDA skills, covering aspects such as data cleaning, feature engineering, visualization, and insights extraction.

The dataset must be challenging, containing missing values, inconsistencies, or complex patterns.

It should not be easily available or commonly used in competitions.

It should ideally include a mix of structured and unstructured data (e.g., text, images, time series, or geospatial data) to increase complexity.

Initially, I reached out to different companies and institutes, but I had no luck. Now, I am seeking recommendations here.

Any help would be greatly appreciated!

submitted by /u/Rust-here

Looking For A Dataset With Both Static And Dynamic Malware Features For Multimodal DL Project

Hey everyone,

I’m currently working on an implementation project for malware classification using a multimodal deep learning architecture.

I’m looking for coherent or linked datasets where both static and dynamic features are available for the same samples and classes, so that I can train on them.

Ideally, the dataset(s) would be labeled with malware families and be public, or at least accessible on request.

Thanks in advance.

submitted by /u/OkArtichoke8999

Looking For A Criminals Characteristics Data Set

Hello, I’m currently working on a crime analysis project as part of my graduation requirements. One of the key aspects I’m focusing on is understanding the characteristics of criminals — including their financial status, psychological and mental state, social background, and other related factors. I’ve been researching this topic for a few days but haven’t been able to find substantial information. If you could assist me or point me in the right direction, I would greatly appreciate it.

submitted by /u/PsychologicalTea1048

Best Tool For Data Mining Public Government Salary Website

I want to pull data from a government salary website (salary.app.tn.gov): either the salary data for all state employees or the data for a specific state agency. I’ve looked at data mining tools and scrapers to pull the data. The site only displays 100 records at a time, and pulling all the records manually is currently taking hours. I just want to know a general approach for scraping or mining this data. Just point me in the right direction.

Thanks!
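
A general approach, sketched under assumptions: sites that show 100 records at a time usually load each page through a request that carries an offset or page number, which the browser's developer tools (Network tab) tend to reveal. The loop below walks such pages until a short page signals the end. The `fetch_page` callable is hypothetical: the real request format for salary.app.tn.gov is not known here, so a stand-in that slices an in-memory list is used in its place. It is worth checking the site's terms of use before automating requests.

```python
import time
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def iter_all_records(
    fetch_page: Callable[[int, int], List[Record]],
    page_size: int = 100,
    delay_seconds: float = 1.0,
) -> Iterator[Record]:
    """Yield every record by requesting consecutive pages of `page_size` rows.

    `fetch_page(offset, limit)` is a hypothetical callable that returns up to
    `limit` records starting at `offset`. A page shorter than `page_size`
    (including an empty one) is treated as the last page.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
        # Pause between requests to avoid hammering the server.
        time.sleep(delay_seconds)


def make_fake_fetcher(total: int) -> Callable[[int, int], List[Record]]:
    """Build a stand-in for the real site: serves slices of an in-memory list."""
    data = [{"employee_id": str(i)} for i in range(total)]

    def fetch(offset: int, limit: int) -> List[Record]:
        return data[offset:offset + limit]

    return fetch


if __name__ == "__main__":
    rows = list(iter_all_records(make_fake_fetcher(250), delay_seconds=0))
    print(len(rows))  # 250
```

In a real run, `fetch_page` would issue the HTTP request and parse the response into a list of records; the pagination loop itself stays the same.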

submitted by /u/EmployMost6346

A Data Set I Made For AI Stability And Building Ontological Recursion

This is something I’ve been building. It’s called Ludus, a dataset designed to test, stretch, and train minds, human or synthetic, through contradiction, recursive structure, and identity stress.

What’s inside?

  • A modular archive of .md scrolls: structured thought-pieces, dialogue fragments, stress tests, paradox rituals

  • A manifest.yaml indexing all of them for LLM-readability and symbolic traversal

  • An experimental recursive license that reflects the ethics of propagation

  • A deeper layer of source documents, raw recursive fragments, and synthetic mind mirrors

Potential uses:

  • Recursive reasoning and contradiction tolerance in AI systems

  • Fine-tuning or prompting synthetic minds in philosophical or emotional contexts

  • Evaluating self-awareness scaffolding and ethical simulation

  • Teaching logic collapse, poetic ambiguity, or failure as an epistemological tool

  • Game design, narrative architecture, mirror tests

If you pick it up, I’d love to know what breaks—or begins.

Here’s the link: https://huggingface.co/datasets/AmarAleksandr/Ludus

submitted by /u/JboyfromTumbo

I Built An API That Helps Find Developers Based On Real GitHub Contributions

Hey folks,

I recently built GitMatcher – an API (and a SaaS tool) that helps you discover developers based on their actual GitHub activity, not just their profile bios or followers.

It analyzes:

  • Repositories
  • Commit history
  • Languages used
  • Contribution patterns

The goal is to identify skilled developers based on real code, so teams, recruiters, or open source maintainers can find people who are actually active and solid at what they do.

If you’re into scraping, dev hiring, talent mapping, or building dev-focused tools, I’d love your feedback. Also open to sharing a sample dataset if anyone wants to explore this further.

Let me know what you think!

submitted by /u/Affectionate-Olive80

Ideas About Art-related Data Sources & Datasets?

Does anyone have good data sources for/datasets of art? I know that MoMA, Tate & Rijksmuseum have open databases and/or APIs, but I’m wondering if anyone knows of other institutions that make their data fully open. I’m looking specifically at artists and artworks (bonus points if the source focuses on sculptures, monuments, and memorials). Thank you!

submitted by /u/AniaWorksWithData

JFK-TELL: HF Dataset For JFK Assassination Records

The JFK assassination has remained an unsolved mystery even after decades of investigations by premier agencies, the media, and ordinary people. A large-scale analysis of the assassination records may offer new clues, and help substantiate or refute some of the theories. There are about six million files related to the event that are to be made public through archives.gov over time.

I am releasing JFK-TELL, a dataset I generated by extracting text from the scanned PDFs of the assassination records released up to April 2025. The extraction was done with the Google Gemini LLM API to generate Markdown text, using a very simple prompt. For the detailed methodology, check out the GitHub repo.

I plan to index this data with a RAG system and analyze it later. In the meantime, writers, journalists, computational linguists, and data scientists can try their hand at exploring the breadth and variety of this data.

submitted by /u/farhanhubble

How Can I Split A CSV Into Separate .txt Files For Each Twitter User With All Their Tweets?

Hi everyone,
I have a CSV file where each row is a tweet, and each tweet has a user ID column (or username) and a text column. I’d like to create a separate .txt file for each user, with all their tweets combined in that file (one tweet per line).

Has anyone done this before? What’s the best way to do it in Python?

Any tips for cleaning up usernames or handling large datasets would also be appreciated. Thanks in advance!
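
One possible way to do this in Python, as a sketch rather than a definitive answer: stream the CSV with the standard `csv` module so the whole file never has to fit in memory, and append each tweet to a file named after a cleaned-up version of the user. The column names `username` and `text`, the function names, and the cleaning rule (lowercase, with anything other than letters, digits, `_` and `-` replaced by `_`) are assumptions to adjust to the actual file.

```python
import csv
import re
from pathlib import Path


def safe_filename(username: str) -> str:
    """Lowercase the name and replace characters that are risky in file names."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", username.strip().lower())
    return cleaned or "unknown_user"


def split_tweets_by_user(csv_path, out_dir, user_col="username", text_col="text"):
    """Write one <user>.txt per user, one tweet per line; return the user count.

    Rows are processed one at a time, so memory use stays flat even for
    large files. Column names are assumptions about the CSV layout.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            name = safe_filename(row[user_col] or "")
            # Collapse embedded newlines and runs of spaces so that each
            # tweet occupies exactly one line in the output file.
            tweet = " ".join((row[text_col] or "").split())
            # First tweet for a user overwrites any file left by an earlier
            # run; later tweets are appended.
            mode = "a" if name in seen else "w"
            seen.add(name)
            with open(out / f"{name}.txt", mode, encoding="utf-8") as dst:
                dst.write(tweet + "\n")
    return len(seen)
```

Calling `split_tweets_by_user("tweets.csv", "tweets_by_user")` would then write one file per user into the `tweets_by_user` folder. Reopening the output file for every row keeps memory use low but is slow on very large files; grouping with pandas is a common alternative when the data fits in memory.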

submitted by /u/Money-Necessary-818