I’m working with a fairly large CSV dataset (~3 crore / 30 million rows). Due to memory and compute limits (I’m currently using Google Colab), I can’t load the entire dataset into memory at once.
What I’ve done so far:
- Randomly sampled ~1 lakh (100k) rows
- Performed EDA on the sample to understand distributions, correlations, and basic patterns
However, I’m concerned that sampling may lose important data context, especially:
- Outliers or rare events
- Long-tail behavior
- Rare categories that may not appear in the sample
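To put a rough number on the rare-category concern, here is a back-of-the-envelope sketch. It assumes a simple uniform random sample and treats the draws as independent, which only approximates sampling without replacement. The category sizes are made-up examples, not values from my data.

```python
# Rough chance that a rare category never shows up in a random sample.
# Assumption: draws are treated as independent (an approximation of
# sampling without replacement, reasonable when the sample is a small
# fraction of the data).
TOTAL_ROWS = 30_000_000   # ~3 crore rows in the full CSV
SAMPLE_ROWS = 100_000     # ~1 lakh sampled rows


def prob_missed(category_rows: int) -> float:
    """Approximate probability that none of the category's rows are sampled."""
    return (1 - category_rows / TOTAL_ROWS) ** SAMPLE_ROWS


# Hypothetical category sizes, purely for illustration.
for rows in (30, 300, 3_000):
    print(f"{rows:>5} rows in full data -> P(missed) ~ {prob_missed(rows):.4f}")
```

Under these assumptions, a category with only 30 rows is missed by the 100k sample roughly 90% of the time, while one with 3,000 rows is almost always caught.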
So I’m considering an alternative approach using pandas chunking:
- Read the data with chunksize=1_000_000
- Define separate functions for:
  - preprocessing
  - EDA/statistics
  - feature engineering
- Apply these functions to each chunk
- Store the processed chunks in a list
- Concatenate everything at the end into a final DataFrame
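For concreteness, the plan above would look roughly like the sketch below. The column names (`amount`, `category`), the three helper functions, and the tiny in-memory CSV are all placeholders I made up so the example runs on its own. With the real file, the `read_csv` call would take the file path and `chunksize=1_000_000`.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for the real CSV; the columns are hypothetical.
csv_text = "amount,category\n10.0, A\n,b\n5.0,a\n100.0,B\n2.5,c\n"


def preprocess(chunk: pd.DataFrame) -> pd.DataFrame:
    # Row-local cleaning: drop rows without an amount, normalize labels.
    return chunk.dropna(subset=["amount"]).assign(
        category=lambda d: d["category"].str.strip().str.lower()
    )


def chunk_stats(chunk: pd.DataFrame) -> dict:
    # Partial statistics that can be merged exactly across chunks.
    return {
        "n": len(chunk),
        "total": chunk["amount"].sum(),
        "min": chunk["amount"].min(),
        "max": chunk["amount"].max(),
        "counts": chunk["category"].value_counts(),
    }


def add_features(chunk: pd.DataFrame) -> pd.DataFrame:
    # Row-wise feature: depends only on the values in the same row.
    return chunk.assign(log_amount=np.log1p(chunk["amount"]))


processed, stats = [], []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk = preprocess(chunk)
    stats.append(chunk_stats(chunk))
    processed.append(add_features(chunk))

# Merge the per-chunk statistics into global ones.
total_rows = sum(s["n"] for s in stats)
global_mean = sum(s["total"] for s in stats) / total_rows
global_min = min(s["min"] for s in stats)
global_max = max(s["max"] for s in stats)
category_counts = pd.concat([s["counts"] for s in stats]).groupby(level=0).sum()

# Final step of the plan: one DataFrame holding every processed chunk.
final_df = pd.concat(processed, ignore_index=True)
```

Note that the last step still requires the full processed DataFrame to fit in memory, which is part of what I’m unsure about.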
My questions:
- Is this chunk-based approach actually safe and scalable for ~30M rows in pandas?
- Which types of preprocessing / feature engineering are not safe to do chunk-wise due to missing global context?
- If sampling can lose data context, what’s the recommended way to analyze and process such large datasets while still capturing outliers and rare patterns?
- Specifically for Google Colab, what are best practices here?
  - Multiple passes over the data?
  - Storing intermediate results to disk (Parquet/CSV)?
  - Using Dask/Polars instead of pandas?
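To illustrate what I mean by "missing global context" and "multiple passes": standardizing a column chunk by chunk would use a different mean and standard deviation in every chunk, so my assumption is that it needs something like the two-pass sketch below. The `amount` column and the in-memory CSV are placeholders. The sum-of-squares variance formula is used only for brevity and can be numerically unstable on real data.

```python
import io
import math

import pandas as pd

# Tiny in-memory stand-in for the real CSV; the column is hypothetical.
csv_text = "amount\n1.0\n2.0\n3.0\n4.0\n5.0\n6.0\n"


def read_chunks():
    # Re-open the source for each pass; with a real file this would be
    # pd.read_csv(path, chunksize=1_000_000).
    return pd.read_csv(io.StringIO(csv_text), chunksize=4)


# Pass 1: accumulate global count, sum, and sum of squares.
n, total, total_sq = 0, 0.0, 0.0
for chunk in read_chunks():
    col = chunk["amount"]
    n += len(col)
    total += col.sum()
    total_sq += (col ** 2).sum()

mean = total / n
std = math.sqrt(total_sq / n - mean ** 2)  # population standard deviation

# Pass 2: apply the same global mean/std to every chunk.
scaled = []
for chunk in read_chunks():
    scaled.append(chunk.assign(amount_z=(chunk["amount"] - mean) / std))

result = pd.concat(scaled, ignore_index=True)
```

In the second pass, each scaled chunk could presumably be written to disk instead of being collected in a list, which is where the Parquet/CSV question comes in.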
I’m trying to balance:
- Limited RAM
- Correct statistical behavior
- Practical workflows (not enterprise Spark clusters)
Would love to hear how others handle large datasets like this in Colab or similar constrained environments.
submitted by /u/insidePassenger0