How To Represent Large Categorical Data?

I have 10 large numerical datasets, each with the same 3 generic categories. Each row contains unique data. The last column of each dataset holds the category label. The categories repeat across rows, so any row may belong to any of the 3 categories.

e.g.

Date      Value    Category
1/1/2010  1.11111  Alpha
2/1/2010  2.11111  Beta
3/1/2010  2.00009  Alpha
4/1/2010  0.00000  Charlie

The 10 datasets differ a lot in size, though. E.g. dataset A may have 10K rows, dataset B around 100K, dataset C 1 million, etc.

I can’t process all of the data because it’s too large.

What would be the best way to sample each dataset? I’d like each sample to contain a fair representation of the 3 categories.
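To make the question concrete, here is a rough sketch of one possible approach: stratified sampling by category, done in a single pass with reservoir sampling so the full dataset never has to sit in memory. This is only an illustration, not a recommendation of the best method. The function name `stratified_reservoir_sample`, the `k_per_category` parameter, and the assumption that the category sits in the third column (index 2) are all made up for this example.

```python
import random
from collections import defaultdict


def stratified_reservoir_sample(rows, k_per_category, category_index=2, seed=0):
    """Keep up to k_per_category uniformly chosen rows per category in one pass.

    rows can be any iterable of sequences, so it also works on a stream
    that is too large to load at once.
    """
    rng = random.Random(seed)          # seeded so the sample is reproducible
    reservoirs = defaultdict(list)     # category -> rows currently kept
    seen = defaultdict(int)            # category -> rows seen so far

    for row in rows:
        category = row[category_index]
        seen[category] += 1
        bucket = reservoirs[category]
        if len(bucket) < k_per_category:
            # Reservoir not full yet: keep the row unconditionally.
            bucket.append(row)
        else:
            # Reservoir sampling (Algorithm R): the new row replaces a kept
            # row with probability k_per_category / seen[category].
            j = rng.randrange(seen[category])
            if j < k_per_category:
                bucket[j] = row

    return dict(reservoirs)


# Tiny demo using the example rows from the post.
example_rows = [
    ("1/1/2010", 1.11111, "Alpha"),
    ("2/1/2010", 2.11111, "Beta"),
    ("3/1/2010", 2.00009, "Alpha"),
    ("4/1/2010", 0.00000, "Charlie"),
]
sample = stratified_reservoir_sample(example_rows, k_per_category=2)
for category, kept in sample.items():
    print(category, len(kept))
```

Note that this keeps the same number of rows per category (or fewer, if a category is rare), so the sample is balanced rather than proportional to the original category frequencies. If the original proportions matter, sampling the same fraction from each category would be the alternative. Because `rows` is just an iterable, it could be fed from `csv.reader` over a file, one dataset at a time.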

submitted by /u/runnersgo