The Python Book

Latest
All — By Topic
2019 2016 2015 2014

Tags:
3d aggregation angle archive argsort atan beautifulsoup binary bisect class clean collinear colsum comprehension count covariance csv cut_paste_cli datafaking dataframe datetime day_of_week delta_time deltatime df2sql distance doctest dotproduct dropnull exif file floodfill fold format formula frequency function garmin gaussian geocode geojson gps groupby html httpserver insert ipython is join kfold legend linalg links magic matrix max min namedtuple none null numpy onehot oo osm outer_product packaging pandas plot point quickies range read_data regex repeat reverse sample sample_data sample_words shortcut shorties sort split sqlite stack stemming string strip_html stripaccent tools track tuple visualization zen zip zipfile

dataframe pandas

20190202

Turn a dataframe into an array

eg. dataframe cn

cn

                     asciiname  population  elevation
128677                Crossett        5507         58
7990    Santa Maria da Vitoria       23488        438
25484                 Shanling           0        628
95882     Colonia Santa Teresa       36845       2286
38943                 Blomberg        1498          4
7409              Missao Velha       13106        364
36937                  Goerzig        1295         81

Turn into an arrary


cn.iloc[range(len(cn))].values


array([['Yuhu', 0, 15],
       ['Traventhal', 551, 42],
       ['Velabisht', 0, 60],
       ['Almorox', 2319, 539],
       ['Abuyog', 15632, 6],
       ['Zhangshan', 0, 132],
       ['Llica de Vall', 0, 136],
       ['Capellania', 2252, 31],
       ['Mezocsat', 6519, 91],
       ['Vars', 1634, 52]], dtype=object)

Sidenote: cn was pulled from city data: cn=df.sample(7)[['asciiname','population','elevation']].

dataframe dropnull

20190202

Filter a dataframe to retain rows with non-null values

Eg. you want only the data where the 'population' column has non-null values.

In short

df=df[df['population'].notnull()]

Alternative, replace the null value with something:

df['population']=df['population'].fillna(0)

In detail

import numpy as np
import pandas as pd

# setup the dataframe
data=[[ 'Isle of Skye',         9232, 124 ],
      [ 'Vieux-Charmont',     np.nan, 320 ],
      [ 'Indian Head',          3844,  35 ],
      [ 'Cihua',              np.nan, 178 ],
      [ 'Miasteczko Slaskie',   7327, 301 ],
      [ 'Wawa',               np.nan,   7 ],
      [ 'Bat Khela',           46079, 673 ]]

df=pd.DataFrame(data, columns=['asciiname','population','elevation'])
 
#display the dataframe
df

            asciiname  population  elevation
0        Isle of Skye      9232.0        124
1      Vieux-Charmont         NaN        320
2         Indian Head      3844.0         35
3               Cihua         NaN        178
4  Miasteczko Slaskie      7327.0        301
5                Wawa         NaN          7


# retain only the rows where population has a non-null value
df=df[df['population'].notnull()]

            asciiname  population  elevation
0        Isle of Skye      9232.0        124
2         Indian Head      3844.0         35
4  Miasteczko Slaskie      7327.0        301
6           Bat Khela     46079.0        673

pandas dataframe

20161004

Turn a pandas dataframe into a dictionary

eg. create a mapping of a 3 digit code to a country name

BEL -> Belgium
CHN -> China
FRA -> France
..

Code:

df=pd.io.parsers.read_table(
    '/u01/data/20150215_country_code_iso/country_codes.txt',
    sep='|')

c3d=df.set_index('c3')['country'].to_dict()

Result:

c3d['AUS']
'Australia'

c3d['GBR']
'United Kingdom'

pandas dataframe

20150302

Create an empty dataframe

# create 
df=pd.DataFrame(np.zeros(0,dtype=[
    ('ProductID', 'i4'),
    ('ProductName', 'a50')
    ]))

# append
df = df.append({'ProductID':1234, 'ProductName':'Widget'},ignore_index=True)

Other way

columns = ['price', 'item']
df2 = pd.DataFrame(data=np.zeros((0,len(columns))), columns=columns) 

dataframe csv

20141203

Add records to a dataframe in a for loop

The easiest way to get csv data into a dataframe is:

pd.read_csv('in.csv')

But how to do it if you need to massage the data a bit, or your input data is not comma separated ?

cols= ['entrydate','sorttag','artist','album','doi','tag' ] 
df=pd.DataFrame( columns= cols )

for ..:
    data = .. a row of data-fields separated by |
              with each field still to be stripped 
              of leading & trailing spaces
    df.loc[len(df)]=map(str.strip,data.split('|'))

dataframe

20141202

Dataframe quickies

Count the number of different values in a column of a dataframe

pd.value_counts(df.Age)

Drop a column

df['Unnamed: 0'].head()     # first check if it is the right one
del df['Unnamed: 0']        # drop it

pandas dataframe numpy

20141019

Add two dataframes

Add the contents of two dataframes, having the same index

a=pd.DataFrame( np.random.randint(1,10,5), index=['a', 'b', 'c', 'd', 'e'], columns=['val'])
b=pd.DataFrame( np.random.randint(1,10,3), index=['b', 'c', 'e'],columns=['val'])

a
   val
a    5
b    7
c    8
d    8
e    1

b
   val
b    9
c    2
e    5

a+b
   val
a  NaN
b   16
c   10
d  NaN
e    6

a.add(b,fill_value=0)
   val
a    5
b   16
c   10
d    8
e    6

Notes by Willem Moors. Generated on momo:/home/willem/sync/20151223_datamungingninja/pythonbook at 2019-07-31 19:22