The Python Book
 
packaging
20141228

Intro:

I admit that I was a huge fan of the Python setuptools library for a long time. There was a lot in there that resonated with how I thought software development should work. I still think the design of setuptools is amazing. Nobody would argue that setuptools was flawless, and it certainly failed in many regards. The biggest problem was probably that it was built on Python's idiotic import system, but there is only so much you can do about that. In general setuptools took the realistic approach to problem-solving: do the best you can by writing a piece of software that scratches the itch, without involving a committee or requiring language changes. That also somewhat explains the second often-cited problem of setuptools: that it's a monkeypatch on distutils.

Full article

Talks about: setuptools, distutils, .pth files, PIL, eggs, ..

tools
20141205

Digest of InfoWorld article

  • Beautiful Soup: Processing parse trees -- XML, HTML, or similarly structured data
  • Pillow: image processing (successor to PIL).
  • Gooey: turn a console-based Python program into one that sports a platform-native GUI.
  • Peewee: a tiny ORM that supports SQLite, MySQL, and PostgreSQL, with many extensions.
  • Scrapy: screen scraping and Web crawling.
  • Apache Libcloud: accessing multiple cloud providers through a single, consistent, and unified API.
  • Pygame: a framework for creating video games in Python.
  • Pathlib: handling filesystem paths in a consistent and cross-platform way, courtesy of a module that is now an integral part of Python.
  • NumPy: scientific computing and mathematical work, including statistics, linear algebra, matrix math, financial operations, and tons more.
  • Sh: calling any external program, in a subprocess, and returning the results to a Python program -- but with the same syntax as if the program in question were a native Python function.
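A quick taste of one of these: pathlib treats paths as objects instead of strings. A minimal sketch (the path is made up; PurePosixPath is used so it behaves the same on every platform):

```python
from pathlib import PurePosixPath

# a hypothetical path, purely for illustration
p = PurePosixPath('/home/willem/sync/notes.txt')

print(p.name)                  # notes.txt
print(p.suffix)                # .txt
print(p.parent)                # /home/willem/sync
print(p.with_suffix('.md'))    # /home/willem/sync/notes.md
```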
dataframe csv
20141203

Add records to a dataframe in a for loop

The easiest way to get csv data into a dataframe is:

pd.read_csv('in.csv')

But how do you do it if you need to massage the data a bit, or your input data is not comma separated?

cols= ['entrydate','sorttag','artist','album','doi','tag' ]
df=pd.DataFrame( columns=cols )

for data in lines:       # 'lines' stands for your input source; each line
                         # holds fields separated by '|', each field still
                         # to be stripped of leading & trailing spaces
    df.loc[len(df)]=map(str.strip,data.split('|'))
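Sidenote: growing a dataframe row by row via df.loc gets slow for big inputs. A sketch of the usual alternative (the input lines are made up): collect the rows first, then build the frame once:

```python
import pandas as pd

cols = ['entrydate', 'sorttag', 'artist', 'album', 'doi', 'tag']

# hypothetical input lines, stand-ins for the real '|'-separated file
lines = [
    ' 20141203 | beck | Beck | Morning Phase | 1 | rock ',
    ' 20141203 | cohen | Leonard Cohen | Popular Problems | 2 | folk ',
]

# strip each field, collect all rows, then construct the frame in one go
rows = [[field.strip() for field in line.split('|')] for line in lines]
df = pd.DataFrame(rows, columns=cols)
```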
dataframe
20141202

Dataframe quickies

Count the number of different values in a column of a dataframe

pd.value_counts(df.Age)

Drop a column

df['Unnamed: 0'].head()     # first check if it is the right one
del df['Unnamed: 0']        # drop it
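An alternative that does not modify the dataframe in place: drop() returns a new frame (tiny sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'Unnamed: 0': [0, 1, 2], 'Age': [25, 31, 25]})

# drop() returns a new frame; df itself is left untouched
df2 = df.drop('Unnamed: 0', axis=1)
```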
strip_html html
20141130

Strip HTML tags from a text.

# Python 2 (in Python 3 this module is html.parser)
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):       # called for each piece of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


txt='''
<span class="mw-headline" id="The_K.C3.B6ln_concert">The Köln concert </span>
<span class="mw-editsection"><span class="mw-editsection-bracket">
[</span><a href="/w/index.php?title=The_K%C3%B6ln_Concert&amp;action=edit&amp;section=1" 
title="Edit section: The Köln concert">edit</a><span class="mw-editsection-bracket">]</span>
</span></h2>
<p>The concert was organized by 17-year-old 
Vera Brandes, then Germany ’s youngest concert promoter.<sup id="cite_ref-5" class="reference">
<a href="#cite_note-5"><span>[</span>5<span>]</span></a></sup> At Jarrett's request, Brandes 
had selected a <a href="/wiki/B%C3%B6sendorfer" title="Bösendorfer">Bösendorfer</a> 
290 Imperial concert grand piano for the performance. 
'''

print strip_tags(txt)

Output:

The Köln concert 

[edit]

The concert was organized by 17-year-old 
Vera Brandes, then Germany ’s youngest concert promoter.
[5] At Jarrett's request, Brandes 
had selected a Bösendorfer 
290 Imperial concert grand piano for the performance. 

As found on: stackoverflow

day_of_week
20141128

Deduce the year from day_of_week

Suppose we know: it happened on Monday 17 November. Question: what year was it?

import datetime as dt

for i in [ dt.datetime(yr,11,17)  for yr in range(1970,2014)]:
    if i.weekday()==0: print i

1975-11-17 00:00:00
1980-11-17 00:00:00
1986-11-17 00:00:00
1997-11-17 00:00:00
2003-11-17 00:00:00
2008-11-17 00:00:00

Or suppose we want to know all mondays of November for the same year range:

for i in [ dt.datetime(yr,11,1) + dt.timedelta(days=dy)
           for yr in range(1970,2014) for dy in range(0,30)] :   # range(0,30), so November 1st is included too
    if i.weekday()==0: print i

1970-11-02 00:00:00
1970-11-09 00:00:00
1970-11-16 00:00:00
..
..
visualization
20141128

Visualizing distributions of data

This notebook demonstrates different approaches to graphically representing distributions of data, specifically focusing on the tools provided by the seaborn package.

stemming
20141118
pandas ipython
20141026

Quickies

You want pandas to print more data on your wide terminal window?

pd.set_option('display.line_width', 200)

You want to make the max column width larger?

pd.set_option('max_colwidth',80)
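Note: in later pandas versions 'display.line_width' was renamed. A sketch using the newer option names (assuming a recent pandas):

```python
import pandas as pd

# widen the printed representation ('display.width' replaced 'display.line_width')
pd.set_option('display.width', 200)

# cap how wide a single column may print
pd.set_option('display.max_colwidth', 80)
```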
datetime pandas numpy
20141025

Dataframe with date-time index

Create a dataframe df with a datetime index and some random values: (note: see 'simpler' dataframe creation further down)
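The creation code itself seems to have gone missing here; a sketch that matches the output below (it is essentially the 'simpler' version from the end of this entry, with made-up random values):

```python
import numpy as np
import pandas as pd

# one row per day of December 2009, with random integer values in [50, 100)
df = pd.DataFrame(np.random.randint(50, 100, 31),
                  index=pd.date_range('20091201', periods=31),
                  columns=['value'])
```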

Output:

    In [4]: df.head(10)
    Out[4]: 
                value
    2009-12-01     71
    2009-12-02     92
    2009-12-03     64
    2009-12-04     55
    2009-12-05     99
    2009-12-06     51
    2009-12-07     68
    2009-12-08     64
    2009-12-09     90
    2009-12-10     57
    [10 rows x 1 columns]

Now select a week of data, going from d1 to d2 (from the output below: d1 = 2009-12-10 and d2 = 2009-12-17).

Output: watch out, this selects 8 days, because a datetime slice includes both endpoints!

    In [235]: df[d1:d2]
    Out[235]: 
                value
    2009-12-10     99
    2009-12-11     70
    2009-12-12     83
    2009-12-13     90
    2009-12-14     60
    2009-12-15     64
    2009-12-16     59
    2009-12-17     97
    [8 rows x 1 columns]


    In [236]: df[d1:d1+dt.timedelta(days=7)]
    Out[236]: 
                value
    2009-12-10     99
    2009-12-11     70
    2009-12-12     83
    2009-12-13     90
    2009-12-14     60
    2009-12-15     64
    2009-12-16     59
    2009-12-17     97
    [8 rows x 1 columns]


    In [237]: df[d1:d1+dt.timedelta(weeks=1)]
    Out[237]: 
                value
    2009-12-10     99
    2009-12-11     70
    2009-12-12     83
    2009-12-13     90
    2009-12-14     60
    2009-12-15     64
    2009-12-16     59
    2009-12-17     97
    [8 rows x 1 columns]

Postscriptum: a simpler way of creating the dataframe

An index of a range of dates can also be created like this with pandas:

pd.date_range('20091201', periods=31)

Hence the dataframe:

df=pd.DataFrame(np.random.randint(50,100,31), index=pd.date_range('20091201', periods=31))
numpy magic sample_data
20141021

The magic matrices (a la octave).

magic3= np.array(
   [[8,   1,   6],
    [3,   5,   7],
    [4,   9,   2] ] )

magic4= np.array(
   [[16,    2,    3,   13],
    [ 5,   11,   10,    8],
    [ 9,    7,    6,   12],
    [ 4,   14,   15,    1]] ) 

magic5= np.array( 
   [[17,   24,    1,    8,   15],
    [23,    5,    7,   14,   16],
    [ 4,    6,   13,   20,   22],
    [10,   12,   19,   21,    3],
    [11,   18,   25,    2,    9]] )

magic6= np.array(
   [[35,    1,    6,   26,   19,   24],
    [ 3,   32,    7,   21,   23,   25],
    [31,    9,    2,   22,   27,   20],
    [ 8,   28,   33,   17,   10,   15],
    [30,    5,   34,   12,   14,   16],
    [ 4,   36,   29,   13,   18,   11]] )

magic7= np.array(
     [ [30,  39,  48,   1,  10,  19,  28],
       [38,  47,   7,   9,  18,  27,  29],
       [46,   6,   8,  17,  26,  35,  37],
       [ 5,  14,  16,  25,  34,  36,  45],
       [13,  15,  24,  33,  42,  44,   4],
       [21,  23,  32,  41,  43,   3,  12],
       [22,  31,  40,  49,   2,  11,  20] ] ) 

# no_more_magic

Sum column-wise (i.e. add up the elements of each column):

np.sum(magic3,axis=0)
array([15, 15, 15])

Sum row-wise (i.e. add up the elements of each row):

np.sum(magic3,axis=1)
array([15, 15, 15])

Okay, a magic matrix is maybe not the best way to show row/column wise sums. Consider this:

rc= np.array([[0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5]])

np.sum(rc,axis=0)           # axis=0: sum down the rows -> one total per column
[0,  3,  6,  9, 12, 15]

np.sum(rc,axis=1)           # axis=1: sum across the columns -> one total per row
[15, 
 15, 
 15]

np.sum(rc)                  # sum every element
45
links
20141019

Python documentation links

pandas dataframe numpy
20141019

Add two dataframes

Add the contents of two dataframes, having the same index

a=pd.DataFrame( np.random.randint(1,10,5), index=['a', 'b', 'c', 'd', 'e'], columns=['val'])
b=pd.DataFrame( np.random.randint(1,10,3), index=['b', 'c', 'e'],columns=['val'])

a
   val
a    5
b    7
c    8
d    8
e    1

b
   val
b    9
c    2
e    5

a+b
   val
a  NaN
b   16
c   10
d  NaN
e    6

a.add(b,fill_value=0)
   val
a    5
b   16
c   10
d    8
e    6
pandas csv
20141019

Read/write csv

Read:

pd.read_csv('in.csv')

Write:

<yourdataframe>.to_csv('out.csv',header=False, index=False ) 

Load a csv file

Load the following csv file. Difficulty: the date is spread over 3 fields.

    2014, 8, 5, IBM, BUY, 50,
    2014, 10, 9, IBM, SELL, 20 ,
    2014, 9, 17, PG, BUY, 10,
    2014, 8, 15, PG, SELL, 20 ,

The way I implemented it:

# my way
ls_order_col= [ 'year', 'month', 'day', 'symbol', 'buy_sell', 'number','dummy' ]
df_mo=pd.read_csv(s_filename, sep=',', names=ls_order_col, skipinitialspace=True, index_col=False)

# add column of type datetime 
df_mo['date']=pd.to_datetime(df_mo.year*10000+df_mo.month*100+df_mo.day,format='%Y%m%d')

# drop some columns
df_mo.drop(['dummy','year','month','day'], axis=1, inplace=True)

# order by datetime
df_mo.sort(columns='date',inplace=True )
print df_mo

An alternative way... it's better because the date is converted on reading, and the dataframe is indexed by the date.
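That alternative is not written out above; a sketch of what it could look like, reusing the same (hypothetical) column names and inlining the data to keep it self-contained. pd.to_datetime can build the date directly from the year/month/day columns:

```python
import io
import pandas as pd

# same sample data as above, inlined for the example
csv_data = io.StringIO(
    "2014, 8, 5, IBM, BUY, 50,\n"
    "2014, 10, 9, IBM, SELL, 20 ,\n"
    "2014, 9, 17, PG, BUY, 10,\n"
    "2014, 8, 15, PG, SELL, 20 ,\n")

cols = ['year', 'month', 'day', 'symbol', 'buy_sell', 'number', 'dummy']
df = pd.read_csv(csv_data, names=cols, skipinitialspace=True, index_col=False)

# build the datetime from the three columns, index by it, sort, drop the rest
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df = (df.set_index('date')
        .drop(['dummy', 'year', 'month', 'day'], axis=1)
        .sort_index())
```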

 
Notes by Willem Moors. Generated on momo:/home/willem/sync/20151223_datamungingninja/pythonbook at 2019-07-31 19:22