The Python Book
 
pandas onehot
20160420

One-hot encode the categorical columns of a dataframe using the pandas get_dummies function.

Data:

from io import StringIO  # Python 3; the separate StringIO module was Python 2 only
import pandas as pd

data_strio = StringIO('''category   reason         species
Decline    Genuine        24
Improved   Genuine        16
Improved   Misclassified  85
Decline    Misclassified  41
Decline    Taxonomic      2
Improved   Taxonomic      7
Decline    Unclear        41
Improved   Unclear        117''')

df=pd.read_fwf(data_strio)

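One-hot encode 'category':

# dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False
cat_oh = pd.get_dummies(df['category'], dtype=int)
# prefix the lower-cased category names so the origin of each column stays visible
cat_oh.columns = ["cat__" + c.lower() for c in cat_oh.columns]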

cat_oh

   cat__decline  cat__improved
0             1              0
1             0              1
2             0              1
3             1              0
4             1              0
5             0              1
6             1              0
7             0              1
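
The renaming step can also be left to get_dummies itself through its prefix and prefix_sep parameters. The following is a minimal sketch on a shortened, hand-built copy of the 'category' column (an assumption made to keep the example self-contained). get_dummies keeps the original case of the values, so the lower-casing still has to be done afterwards.

```python
import pandas as pd

# Shortened stand-in for the 'category' column of the example data.
df = pd.DataFrame({"category": ["Decline", "Improved", "Improved", "Decline"]})

# prefix and prefix_sep build the column names; dtype=int gives 0/1 output.
cat_oh = pd.get_dummies(df["category"], prefix="cat", prefix_sep="__", dtype=int)

# Column names come out as 'cat__Decline', 'cat__Improved'; lower-case them.
cat_oh.columns = cat_oh.columns.str.lower()

print(cat_oh)
```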

Do the same for 'reason':

# dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False
reason_oh = pd.get_dummies(df['reason'], dtype=int)
reason_oh.columns = ["rsn__" + c.lower() for c in reason_oh.columns]

Combine

Combine the columns into a new dataframe:

ohdf= pd.concat( [ cat_oh, reason_oh, df['species']], axis=1)

Result:

ohdf

   cat__decline  cat__improved  rsn__genuine  rsn__misclassified  \
0             1              0             1                   0   
1             0              1             1                   0   
2             0              1             0                   1   
3             1              0             0                   1   
4             1              0             0                   0   
5             0              1             0                   0   
6             1              0             0                   0   
7             0              1             0                   0   

   rsn__taxonomic  rsn__unclear  species  
0               0             0       24  
1               0             0       16  
2               0             0       85  
3               0             0       41  
4               1             0        2  
5               1             0        7  
6               0             1       41  
7               0             1      117  
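
A quick sanity check on such a result: every row should have exactly one flag set within each group of dummy columns. The sketch below assumes the 'cat__' and 'rsn__' prefixes used in these notes and runs on a shortened, hand-built copy of the result. It also shows how the original label can be recovered with idxmax.

```python
import pandas as pd

# Shortened, hand-built stand-in for the combined one-hot dataframe.
ohdf = pd.DataFrame({
    "cat__decline":       [1, 0, 0, 1],
    "cat__improved":      [0, 1, 1, 0],
    "rsn__genuine":       [1, 1, 0, 0],
    "rsn__misclassified": [0, 0, 1, 1],
    "species":            [24, 16, 85, 41],
})

# For each prefix, select its columns and check that every row sums to 1.
checks = {
    p: bool((ohdf.filter(regex="^" + p).sum(axis=1) == 1).all())
    for p in ("cat__", "rsn__")
}
print(checks)

# Recover the label: idxmax returns the name of the column holding the 1.
category = (
    ohdf.filter(regex="^cat__")
    .idxmax(axis=1)
    .str.replace("cat__", "", regex=False)
)
print(category.tolist())
```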

Alternatively, drop the encoded columns from the original dataframe and concatenate whatever remains. This keeps every other column without listing each one by name:

ohdf= pd.concat( [ cat_oh, reason_oh, 
            df.drop(['category','reason'], axis=1) ], 
            axis=1)
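
The whole procedure can also be collapsed into a single get_dummies call on the dataframe by passing the columns to encode. This is a sketch on a shortened, hand-built copy of the data (an assumption for self-containment), not a replacement for the step-by-step version above. Note that the column order differs: the untouched columns come first, followed by the dummy columns.

```python
import pandas as pd

# Shortened stand-in for the example data.
df = pd.DataFrame({
    "category": ["Decline", "Improved", "Improved", "Decline"],
    "reason":   ["Genuine", "Genuine", "Misclassified", "Misclassified"],
    "species":  [24, 16, 85, 41],
})

# Encode both categorical columns at once; one prefix per encoded column.
ohdf = pd.get_dummies(
    df,
    columns=["category", "reason"],
    prefix=["cat", "rsn"],
    prefix_sep="__",
    dtype=int,
)

# get_dummies keeps the original case of the values; lower-case all names.
ohdf.columns = ohdf.columns.str.lower()

print(ohdf)
```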
 
Notes by Willem Moors. Generated on momo:/home/willem/sync/20151223_datamungingninja/pythonbook at 2019-07-31 19:22