Hi everyone!
I’m part of an interdisciplinary team (Sociology + Engineering) at Universidad Alberto Hurtado (Chile). We are developing Tuküyen, a non-profit app designed to foster self-regulation and resilience in university students.
Our project is backed by the Science, Technology, and Society (STS) Research Center. We are moving away from “Black Box” commercial AIs because we want to fight Surveillance Capitalism and the “Somatic Gap” (the physiological dysregulation caused by addictive UI/UX).
The Goal: Build a Retrieval-Augmented Generation (RAG) system using a corpus of ~500 high-quality academic papers in Sociology and Psychology (specifically focusing on somatic regulation, identity transition, and critical tech studies).
The Technical Challenge: We need to move from a manually curated set of 50 papers to an automated pipeline of 500+. We’re aiming for a “White Box AI” where every response is traceable to a specific paragraph of a peer-reviewed paper.
I’m looking for feedback on:
- Sourcing & Scraping: What’s the most efficient way to programmatically access SciELO, Latindex, and Scopus without hitting paywalls or violating their terms of service? Any Python libraries you’d recommend for harvesting academic PDFs?
- PDF-to-Text “Cleaning”: Many older Sociology papers are messy scans. Beyond standard OCR, how do you handle the removal of “noise” (headers, footers, 10-page bibliographies) so they don’t pollute the embeddings?
- Semantic Chunking for Social Science: Academic prose is dense. Does anyone have experience with Recursive Character Text Splitting vs. Semantic Chunking for complex theoretical texts? How do you keep the “sociological context” alive in a 500-character chunk?
- Vector DB & Costs: We’re on a student/research budget (~$3,500 USD total for the project). We need low latency for real-time “Somatic Interventions.” Pinecone? Milvus? Or just stick to FAISS/ChromaDB locally?
- Ethical Data Handling: Since we deal with student well-being data (GAD-7/PHQ-9 scores), we’re implementing Local Differential Privacy. Any advice on keeping the RAG pipeline secure so the LLM doesn’t “leak” user context into the global prompt?
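For the PDF-cleaning question, one cheap trick before any fancy layout analysis: running headers, footers, and page numbers repeat across pages, so you can detect them statistically and strip them. A minimal sketch, assuming you’ve already extracted per-page text (e.g. with PyMuPDF’s `page.get_text()`); the function name and thresholds are illustrative, not from any library:

```python
import re
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.6, edge_lines=3):
    """Remove header/footer lines that repeat across most pages.

    `pages` is a list of per-page text strings. A line counts as
    boilerplate if it appears within the first/last `edge_lines` lines
    of at least `min_ratio` of the pages. Digits are normalized to '#'
    so page numbers like '12' and '13' match each other.
    """
    def norm(line):
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        lines = page.splitlines()
        edges = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({norm(l) for l in edges if l.strip()})

    threshold = min_ratio * len(pages)
    boiler = {l for l, c in counts.items() if c >= threshold}

    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if norm(l) not in boiler]
        cleaned.append("\n".join(kept))
    return cleaned
```

Bibliographies are harder; a heuristic that works reasonably well is truncating everything after the last “References”/“Referencias”/“Bibliografía” heading, with a sanity check that you’re not cutting more than, say, the final third of the document.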
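On chunking: before reaching for full semantic chunking, a plain recursive splitter that falls back from paragraph breaks to sentence breaks to whitespace already keeps most of the discourse structure intact. A dependency-free sketch of the idea behind recursive character splitting (simplified, no overlap; all names are mine):

```python
def recursive_split(text, max_len=500, seps=("\n\n", "\n", ". ", " ")):
    """Split text hierarchically: try paragraph breaks first, then
    lines, then sentences, then words, so chunks respect discourse
    boundaries instead of cutting mid-argument."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # Hard cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= max_len:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            if len(part) > max_len:
                # This piece is still too big: recurse with finer separators.
                chunks.extend(recursive_split(part, max_len, rest))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
```

To keep the “sociological context” alive, store the paper title, authors, and section heading as metadata on every chunk and prepend them at embedding time, so a 500-character chunk still “knows” it comes from, e.g., the theoretical framework of a specific paper. That metadata is also exactly what you need for the paragraph-level traceability you want in a “White Box” system.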
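On the vector DB: at your scale (~500 papers, likely tens of thousands of chunks), brute-force cosine search over a NumPy matrix runs in milliseconds, so it’s worth benchmarking that baseline before paying for a managed service. An illustrative sketch, assuming you already have an embedding matrix:

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=5):
    """Brute-force cosine-similarity search: normalize both sides,
    take dot products, return the k best chunk indices and scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]
```

If and when that gets slow, FAISS or ChromaDB running locally are the natural next step on your budget; a hosted DB like Pinecone mostly buys you ops convenience, not quality, at this corpus size.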
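On local DP: one standard approach is the Laplace mechanism applied on-device, so the raw GAD-7/PHQ-9 score is perturbed before it ever leaves the phone. A stdlib-only sketch (function name and parameters are mine; the range 0–21 is GAD-7’s, adjust `hi` for PHQ-9):

```python
import random

def ldp_perturb(score, epsilon, lo=0.0, hi=21.0):
    """Laplace mechanism for local differential privacy.

    The noise scale is (hi - lo) / epsilon, i.e. the sensitivity of a
    single bounded score. Laplace(0, b) noise is sampled as the
    difference of two independent Exp(1/b) draws.
    """
    scale = (hi - lo) / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return score + noise
```

Individual noisy scores are unreliable by design, but averages over many users remain unbiased, so you can still do population-level research. For the prompt-leakage concern, the complementary move is architectural: keep user well-being data out of the retrieval corpus entirely, and only ever inject it into the local prompt context, never into anything logged or embedded.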
Background/Theory: We are heavily influenced by Shoshana Zuboff (Surveillance Capitalism) and Jonathan Haidt (The Anxious Generation). We believe AI should be a tool for autonomy, not a new form of “zombification” or behavioral surplus extraction.
Any advice, repo recommendations, or “don’t do this” stories would be gold! Thanks from the South of the world! 🇨🇱
submitted by /u/Spare-Customer-506