Encoding categorical data
One Hot Encoding
- Let's say that in a `Country` column there are 3 unique values: [France, Germany, Spain].
- If we convert this data to integers, e.g. {France: 0, Germany: 1, Spain: 2}, an ML model can interpret the values as magnitudes or weights, i.e. it may treat Spain as having more effect on the dependent variable than France, even though that is not the case.
- So, we prefer converting this data as follows:
| France | Germany | Spain |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
- The process of converting categorical data into this form is known as "One Hot Encoding".
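
To make the difference concrete, here is a minimal sketch using only NumPy. The variable names (`countries`, `codes`, `one_hot`) are illustrative and not part of the original example; it simply builds both the integer mapping and the one-hot form for a few sample values.

```python
import numpy as np

# A few sample values from the Country column (illustrative)
countries = ["France", "Spain", "Germany", "Spain"]

# Integer mapping: France -> 0, Germany -> 1, Spain -> 2
categories = sorted(set(countries))
index = {name: i for i, name in enumerate(categories)}
codes = np.array([index[c] for c in countries])
print(codes)  # [0 2 1 2]

# One-hot form: pick row i of the identity matrix for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 0 1]]
```

With the integer codes, Spain (2) looks "twice as far" from France (0) as Germany (1) does, which implies an order that does not exist. In the one-hot form every row contains exactly one 1, so no country is numerically larger than another.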

Sample dataset (the feature matrix `X` used in the code that follows):

| Country | Age | Salary |
|---|---|---|
| France | 44 | 72000 |
| Spain | 27 | 48000 |
| Germany | 30 | 54000 |
| Spain | 38 | 61000 |
| Germany | 40 | 63777 |
| France | 35 | 58000 |
| Spain | 38 | 52000 |
| France | 48 | 79000 |
| Germany | 50 | 83000 |
| France | 37 | 67000 |
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# [0] is the column index (or list of column indices) that contains categorical data.
# In our case the first column, 'Country', is the categorical data column.
# remainder='passthrough' keeps the columns that are not transformed (Age, Salary).
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# X is the feature matrix with the columns Country, Age, Salary
X = np.array(ct.fit_transform(X))
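- In the resulting table, the first three columns are the one-hot encoded `Country` values. `OneHotEncoder` sorts the categories, so the columns appear in the order ['France', 'Germany', 'Spain'], followed by the passed-through Age and Salary columns.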
| France | Germany | Spain | Age | Salary |
|---|---|---|---|---|
| 1.0 | 0.0 | 0.0 | 44.0 | 72000.0 |
| 0.0 | 0.0 | 1.0 | 27.0 | 48000.0 |
| 0.0 | 1.0 | 0.0 | 30.0 | 54000.0 |
| 0.0 | 0.0 | 1.0 | 38.0 | 61000.0 |
| 0.0 | 1.0 | 0.0 | 40.0 | 63777.0 |
| 1.0 | 0.0 | 0.0 | 35.0 | 58000.0 |
| 0.0 | 0.0 | 1.0 | 38.0 | 52000.0 |
| 1.0 | 0.0 | 0.0 | 48.0 | 79000.0 |
| 0.0 | 1.0 | 0.0 | 50.0 | 83000.0 |
| 1.0 | 0.0 | 0.0 | 37.0 | 67000.0 |
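
The steps above can be put together as a self-contained sketch that rebuilds the sample dataset and checks the result. The names `rows`, `X_encoded`, and `categories` are illustrative. Passing `sparse_threshold=0` is an addition not in the original code: it asks `ColumnTransformer` to always return a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Sample dataset (Country, Age, Salary) from the notes
rows = [
    ["France", 44, 72000], ["Spain", 27, 48000], ["Germany", 30, 54000],
    ["Spain", 38, 61000], ["Germany", 40, 63777], ["France", 35, 58000],
    ["Spain", 38, 52000], ["France", 48, 79000], ["Germany", 50, 83000],
    ["France", 37, 67000],
]
X = np.array(rows, dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # encode column 0 (Country)
    remainder="passthrough",                           # keep Age and Salary unchanged
    sparse_threshold=0,                                # assumption: always return a dense array
)
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

# Category order that defines the first three output columns
categories = list(ct.named_transformers_["encoder"].categories_[0])
print(categories)    # ['France', 'Germany', 'Spain']
print(X_encoded[0])  # first row: France, 44, 72000
```

Inspecting `categories_` on the fitted encoder is one way to confirm which output column corresponds to which country rather than relying on the assumed order.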
Label Encoding
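- To convert a column with binary values (yes/no, true/false, etc.) into numbers, we use `LabelEncoder`. It assigns each distinct label an integer in sorted order, so 'No' becomes 0 and 'Yes' becomes 1.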
from sklearn.preprocessing import LabelEncoder

y = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
le = LabelEncoder()
y = le.fit_transform(y)
# y is now [0 1 0 0 1 1 0 1 0 1]
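
As a short sketch of how the learned mapping can be inspected and reversed, the fitted encoder exposes `classes_` and `inverse_transform`. The variable names `labels`, `encoded`, and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the labels in sorted order; the position is the assigned integer
print(le.classes_)  # ['No' 'Yes']
print(encoded)      # [0 1 0 0 1 1 0 1 0 1]

# inverse_transform maps the integers back to the original labels
decoded = le.inverse_transform(encoded)
```

Keeping the fitted `le` object around is useful when predictions (0 or 1) need to be translated back to 'No' or 'Yes' later.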