Taking care of missing data
- Remove rows with missing data
- Replace missing data with a suitable value (such as the mean or mode of the column)
| Country | Age | Salary | Purchased |
|---|---|---|---|
| France | 44 | 72000 | No |
| Spain | 27 | 48000 | Yes |
| Germany | 30 | 54000 | No |
| Spain | 38 | 61000 | No |
| Germany | 40 | | Yes |
| France | 35 | 58000 | Yes |
| Spain | | 52000 | No |
| France | 48 | 79000 | Yes |
| Germany | 50 | 83000 | No |
| France | 37 | 67000 | Yes |
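The first option, removing rows with missing data, can be sketched with pandas. This is a minimal illustration rather than part of the original walkthrough: the `df` and `complete_rows` names are assumptions, and the table above is retyped by hand with `np.nan` in the two empty cells (Salary for one Germany row, Age for one Spain row).

```python
import numpy as np
import pandas as pd

# The table above as a DataFrame; np.nan marks the two missing cells.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Drop every row that contains at least one missing value.
complete_rows = df.dropna()

print(len(df), len(complete_rows))  # prints "10 8"
```

Dropping rows is simple, but here it discards 2 of 10 observations, which is why replacing the missing values is often preferred on small datasets.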
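import numpy as np
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on all rows of the numerical columns (Age and Salary, indices 1 and 2)
imputer.fit(data[:, 1:3])
# Overwrite those columns with the imputed values
data[:, 1:3] = imputer.transform(data[:, 1:3])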
| Country | Age | Salary | Purchased |
|---|---|---|---|
| France | 44 | 72000 | No |
| Spain | 27 | 48000 | Yes |
| Germany | 30 | 54000 | No |
| Spain | 38 | 61000 | No |
| Germany | 40 | 63777.77777777778 | Yes |
| France | 35 | 58000 | Yes |
| Spain | 38.77777777777778 | 52000 | No |
| France | 48 | 79000 | Yes |
| Germany | 50 | 83000 | No |
| France | 37 | 67000 | Yes |
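The imputed values in the table above can be checked with a self-contained sketch. It assumes `data` is a NumPy object array holding the original table, retyped here by hand. The explicit `astype(float)` cast and the `numeric` name are illustrative additions, not part of the original snippet.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: Country, Age, Salary, Purchased. dtype=object lets strings and numbers mix.
data = np.array([
    ["France", 44, 72000, "No"],
    ["Spain", 27, 48000, "Yes"],
    ["Germany", 30, 54000, "No"],
    ["Spain", 38, 61000, "No"],
    ["Germany", 40, np.nan, "Yes"],
    ["France", 35, 58000, "Yes"],
    ["Spain", np.nan, 52000, "No"],
    ["France", 48, 79000, "Yes"],
    ["Germany", 50, 83000, "No"],
    ["France", 37, 67000, "Yes"],
], dtype=object)

# Work on a float copy of the numerical columns (Age, Salary).
numeric = data[:, 1:3].astype(float)

# Fit and transform in one step: each NaN becomes its column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
data[:, 1:3] = imputer.fit_transform(numeric)

print(data[6, 1])  # imputed Age: mean of the 9 known ages (349 / 9)
print(data[4, 2])  # imputed Salary: mean of the 9 known salaries (574000 / 9)
```

The means are computed only from the observed values in each column, so the missing cells do not affect them. For a categorical column, `strategy='most_frequent'` is the mode-based counterpart.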