How To Handle Missing Values In A Dataset?

I am working on a diabetes prediction model for my project, and I need help deciding how to handle missing values in the smoking history column of my structured tabular dataset.

My dataset has 100,000 rows, with around 35% of rows having “No Info” for smoking history. Since smoking history has a significant impact on diabetes, this column cannot be ignored.

The other values in this column are “Never”, “Current”, “Not current” and “Former”.

Key concerns:

Encoding: If I encode this column, how should “No Info” be treated? One-hot encoding adds unnecessary dimensionality, whereas ordinal encoding requires an order between the values, and I do not see a clear one.

Data Loss: Would dropping these rows (35%) lead to bias, or is it a valid approach?
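
To make the two concerns concrete, here is a minimal sketch in plain Python. The toy rows and the helper names (`one_hot`, `positive_rate`) are made up for illustration and are not from my real data. It keeps “No Info” as its own category, which gives only five indicator columns, and it compares the diabetes rate between “No Info” rows and the rest. If those rates differ noticeably on the real data, dropping the 35% would probably shift the label distribution.

```python
# Toy rows: (smoking_history, diabetes). Values are invented for illustration.
rows = [
    ("Never", 0), ("No Info", 0), ("Current", 1), ("Former", 1),
    ("No Info", 1), ("Not current", 0), ("Never", 0), ("No Info", 0),
]

# Keep "No Info" as its own level instead of dropping or imputing it.
categories = sorted({smoking for smoking, _ in rows})

def one_hot(value, levels):
    """Return a 0/1 indicator list with one position per level."""
    return [1 if value == level else 0 for level in levels]

encoded = [one_hot(smoking, categories) for smoking, _ in rows]

def positive_rate(subset):
    """Share of rows in the subset with diabetes == 1."""
    return sum(label for _, label in subset) / len(subset)

# Split by missingness to see whether "No Info" rows look different.
missing = [r for r in rows if r[0] == "No Info"]
known = [r for r in rows if r[0] != "No Info"]

print("columns after one-hot:", len(encoded[0]))
print("diabetes rate, No Info:", positive_rate(missing))
print("diabetes rate, known:  ", positive_rate(known))
```

With pandas, `pandas.get_dummies` on the column would produce the same kind of indicator columns.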

I would appreciate your personal insights on the best approach, since I have already searched online extensively without finding a clear answer.

submitted by /u/shaitaanbaluck