Encoding categorical data

One Hot Encoding

Let's say that in a country column there are 3 unique values: [France, Germany, Spain].
If we convert this data such that {France: 0, Germany: 1, Spain: 2},
an ML model can treat these codes as ordered magnitudes, i.e. as if Spain had more effect on the dependent variable than France, even though this is not the case.
So, we prefer converting this data as follows:

France	Germany	Spain
1	0	0
0	1	0
0	0	1

The process of converting categorical data into this form is known as "One Hot Encoding".
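The idea can be sketched by hand in plain Python before reaching for a library. This is only an illustration of the mapping shown in the table above; the helper name `one_hot` is hypothetical and not part of scikit-learn.

```python
# Hand-rolled one hot encoding, for illustration only.
# `one_hot` is a hypothetical helper, not a scikit-learn function.
def one_hot(values):
    # Sort the unique categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # one column per category
        row[index[value]] = 1         # mark the column of this value
        rows.append(row)
    return categories, rows

categories, rows = one_hot(["France", "Germany", "Spain"])
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so no category appears "larger" than another.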

Input data (X)

Country	Age	Salary
France	44	72000
Spain	27	48000
Germany	30	54000
Spain	38	61000
Germany	40	63777
France	35	58000
Spain	38	52000
France	48	79000
Germany	50	83000
France	37	67000

One Hot Encoding

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# Here [0] is the list of column indices that contain categorical data.
# In our case the first column, 'Country', is the categorical column.
# 'passthrough' keeps the columns that are not transformed instead of dropping them.
X = np.array(ct.fit_transform(X))

Output Data

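The first three columns of the following table are the one hot columns for ['France', 'Germany', 'Spain'] (categories are ordered alphabetically). The untouched columns, Age and Salary, follow them.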

France	Germany	Spain	Age	Salary
1.0	0.0	0.0	44.0	72000.0
0.0	0.0	1.0	27.0	48000.0
0.0	1.0	0.0	30.0	54000.0
0.0	0.0	1.0	38.0	61000.0
0.0	1.0	0.0	40.0	63777.0
1.0	0.0	0.0	35.0	58000.0
0.0	0.0	1.0	38.0	52000.0
1.0	0.0	0.0	48.0	79000.0
0.0	1.0	0.0	50.0	83000.0
1.0	0.0	0.0	37.0	67000.0
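To make the column order of this output concrete, here is a minimal plain-Python sketch that reproduces the first three rows of the table: the encoded columns come first, followed by the pass-through columns. The helper `encode_first_column` is hypothetical and only mimics the layout; it is not how scikit-learn is implemented.

```python
# Sketch of the output layout: one hot columns first, then the remaining columns.
# `encode_first_column` is a hypothetical helper written for this illustration.
def encode_first_column(rows):
    # Unique categories of column 0, sorted alphabetically.
    categories = sorted({row[0] for row in rows})
    encoded = []
    for row in rows:
        one_hot = [1.0 if row[0] == category else 0.0 for category in categories]
        # Append the untouched numeric columns (Age, Salary) as floats.
        encoded.append(one_hot + [float(value) for value in row[1:]])
    return categories, encoded

X_sample = [
    ["France", 44, 72000],
    ["Spain", 27, 48000],
    ["Germany", 30, 54000],
]
categories, encoded = encode_first_column(X_sample)
for row in encoded:
    print(row)
```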

Label Encoding

To convert a column with binary values (yes/no, true/false, etc.) into 0s and 1s, we use a label encoder.

Input Data

y = ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

Label Encoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Output Data

y = [0 1 0 0 1 1 0 1 0 1]
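The mapping behind this output can be sketched in plain Python: the distinct labels are sorted and each one is replaced by its position, so 'No' becomes 0 and 'Yes' becomes 1. The helper `fit_labels` below is hypothetical; with scikit-learn itself, the fitted encoder exposes the learned labels as `le.classes_` and can map codes back with `le.inverse_transform`.

```python
# Hand-rolled label encoding, for illustration only.
# `fit_labels` is a hypothetical helper, not a scikit-learn function.
def fit_labels(values):
    classes = sorted(set(values))                       # e.g. ['No', 'Yes']
    to_code = {label: i for i, label in enumerate(classes)}
    return classes, to_code

y_raw = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]
classes, to_code = fit_labels(y_raw)

encoded = [to_code[label] for label in y_raw]   # labels -> integer codes
decoded = [classes[code] for code in encoded]   # integer codes -> labels

print(classes)
print(encoded)
```

Keeping the sorted list of classes around is what makes it possible to translate predictions back into the original labels.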