The Python Book
 
plot function
20151231

Plot a function

eg. you want a plot of function: f(w) = 5-(w-10)² for w in the range 0..19

import matplotlib.pyplot as plt

x=range(20) 
y=[ 5-(w-10)**2 for w in x ]   # a list; in Python 3, map() would return a lazy iterator
plt.plot(x,y) 
plt.show()
zen
20151215

The Zen of Python

From: www.thezenofpython.com

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!
plot zip
20151212

Plot some points

Imagine you have a list of tuples, and you want to plot these points:

l = [(1, 9), (2, 5), (3, 7)]

And the plotting function expects to receive the x and y coordinate as separate lists.

First some fun with zip:

print(l) 
[(1, 9), (2, 5), (3, 7)]

print(*l) 
(1, 9) (2, 5) (3, 7)

print(*zip(*l))
(1, 2, 3) (9, 5, 7)

Got it? Okay, let's plot.

import matplotlib.pyplot as plt

plt.scatter(*zip(*l))
plt.show()
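
The same zip(*...) trick also works without plotting, to split the points into separate coordinate sequences. A minimal sketch (the names points, xs and ys are only illustrative):

```python
# Split a list of (x, y) tuples into separate coordinate sequences.
points = [(1, 9), (2, 5), (3, 7)]

xs, ys = zip(*points)          # "unzip": transposes the list of pairs
print(xs)                      # (1, 2, 3)
print(ys)                      # (9, 5, 7)

# zip() applied again restores the original pairs
restored = list(zip(xs, ys))
print(restored == points)      # True
```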
delta_time pandas
20151207

Add/subtract a delta time

Problem

A number of photo files were tagged as follows, with the date and the time:

20151205_17h48-img_0098.jpg
20151205_18h20-img_0099.jpg
20151205_18h21-img_0100.jpg

..

Turns out that they should all be an hour earlier (reminder: mixing pics from two cameras), so let's create a script to rename these files...

Solution

1. Start

Let's use pandas:

import datetime as dt
import pandas as pd
import re

df0=pd.io.parsers.read_table( '/u01/work/20151205_gran_canaria/fl.txt',sep=",", \
        header=None, names= ["fn"])
df=df0[df0['fn'].apply( lambda a: 'img_0' in a )].copy()  # keep only the img_0* pics; copy() so columns can be added safely

2. Make parseable

Now add a column to the dataframe that only contains the numbers of the date, so it can be parsed:

df['rawdt']=df['fn'].apply( lambda a: re.sub(r'-.*\.jpg$','',a))\
                 .apply( lambda a: re.sub(r'[_h]','',a))

Result:

df.head()
                             fn         rawdt
0   20151202_07h17-img_0001.jpg  201512020717
1   20151202_07h17-img_0002.jpg  201512020717
2   20151202_07h17-img_0003.jpg  201512020717
3   20151202_15h29-img_0004.jpg  201512021529
28  20151202_17h59-img_0005.jpg  201512021759

3. Convert to datetime, and subtract delta time

Convert the raw-date to a real date, and subtract an hour:

df['adjdt']=pd.to_datetime( df['rawdt'], format='%Y%m%d%H%M')-dt.timedelta(hours=1)

Note 20190105: apparently you can drop the 'format' string:

df['adjdt']=pd.to_datetime( df['rawdt'])-dt.timedelta(hours=1) 

Result:

                             fn         rawdt               adjdt
0   20151202_07h17-img_0001.jpg  201512020717 2015-12-02 06:17:00
1   20151202_07h17-img_0002.jpg  201512020717 2015-12-02 06:17:00
2   20151202_07h17-img_0003.jpg  201512020717 2015-12-02 06:17:00
3   20151202_15h29-img_0004.jpg  201512021529 2015-12-02 14:29:00
28  20151202_17h59-img_0005.jpg  201512021759 2015-12-02 16:59:00

4. Convert adjusted date to string

df['adj']=df['adjdt'].apply(lambda a: dt.datetime.strftime(a, "%Y%m%d_%Hh%M") )

We also need the 'stem' of the filename:

df['stem']=df['fn'].apply(lambda a: re.sub('^.*-','',a) )

Result:

df.head()
                             fn         rawdt               adjdt  \
0   20151202_07h17-img_0001.jpg  201512020717 2015-12-02 06:17:00   
1   20151202_07h17-img_0002.jpg  201512020717 2015-12-02 06:17:00   
2   20151202_07h17-img_0003.jpg  201512020717 2015-12-02 06:17:00   
3   20151202_15h29-img_0004.jpg  201512021529 2015-12-02 14:29:00   
28  20151202_17h59-img_0005.jpg  201512021759 2015-12-02 16:59:00   

               adj          stem  
0   20151202_06h17  img_0001.jpg  
1   20151202_06h17  img_0002.jpg  
2   20151202_06h17  img_0003.jpg  
3   20151202_14h29  img_0004.jpg  
28  20151202_16h59  img_0005.jpg  

5. Cleanup

Drop columns that are no longer useful:

df=df.drop(['rawdt','adjdt'], axis=1)

Result:

df.head()
                             fn             adj          stem
0   20151202_07h17-img_0001.jpg  20151202_06h17  img_0001.jpg
1   20151202_07h17-img_0002.jpg  20151202_06h17  img_0002.jpg
2   20151202_07h17-img_0003.jpg  20151202_06h17  img_0003.jpg
3   20151202_15h29-img_0004.jpg  20151202_14h29  img_0004.jpg
28  20151202_17h59-img_0005.jpg  20151202_16h59  img_0005.jpg

6. Generate scripts

Generate the 'rename' script:

sh=df.apply( lambda a: 'mv {} {}-{}'.format( a['fn'],a['adj'],a['stem']), axis=1)
sh.to_csv('rename.sh',header=False, index=False )

Also generate the 'rollback' script (in case we have to rollback the renaming) :

sh=df.apply( lambda a: 'mv {}-{} {}'.format( a['adj'],a['stem'],a['fn']), axis=1)
sh.to_csv('rollback.sh',header=False, index=False )

First lines of the rename script:

mv 20151202_07h17-img_0001.jpg 20151202_06h17-img_0001.jpg
mv 20151202_07h17-img_0002.jpg 20151202_06h17-img_0002.jpg
mv 20151202_07h17-img_0003.jpg 20151202_06h17-img_0003.jpg
mv 20151202_15h29-img_0004.jpg 20151202_14h29-img_0004.jpg
mv 20151202_17h59-img_0005.jpg 20151202_16h59-img_0005.jpg
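
For a handful of files the same shift can be done without pandas. The following is only a sketch with the standard library; shift_name is a hypothetical helper, and it assumes every filename has the form YYYYMMDD_HHhMM-stem as in the listing above:

```python
import datetime as dt

def shift_name(fn, hours=-1):
    """Return fn with its 'YYYYMMDD_HHhMM' prefix shifted by the given hours."""
    stamp, stem = fn.split('-', 1)            # '20151205_17h48', 'img_0098.jpg'
    t = dt.datetime.strptime(stamp, '%Y%m%d_%Hh%M')
    t += dt.timedelta(hours=hours)
    return '{}-{}'.format(t.strftime('%Y%m%d_%Hh%M'), stem)

for fn in ['20151205_17h48-img_0098.jpg', '20151202_00h30-img_0001.jpg']:
    print('mv {} {}'.format(fn, shift_name(fn)))
```

The second filename shows why real datetime arithmetic is used: subtracting an hour from 00h30 also moves the date back a day.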
bisect insert
20151026

Insert an element into an array, keeping the array ordered

Using the bisect_left() function of module bisect, which locates the insertion point.

import bisect

def insert_ordered(ar, val):
    i=bisect.bisect_left(ar,val)   # insertion point; 0 for an empty list
    ar.insert(i,val)

Usage:

ar=[]
insert_ordered( ar, 10 )
insert_ordered( ar, 20 )
insert_ordered( ar, 5 )
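
The bisect module also has insort_left(), which locates the insertion point and inserts in one call. A short sketch of the same usage with it:

```python
import bisect

ar = []
for val in (10, 20, 5, 15):
    bisect.insort_left(ar, val)   # find the insertion point and insert

print(ar)   # [5, 10, 15, 20]
```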
angle point atan
20151022

Angle between 2 points

Calculate the angle in radians between the horizontal through the 1st point and the line through the two points.

import math

def angle(p0,p1):
    dx=float(p1.x)-float(p0.x)
    dy=float(p1.y)-float(p0.y)
    if dx==0:
        if dy==0:
            return 0.0
        elif dy<0:
            return math.atan(float('-inf'))
        else:
            return math.atan(float('inf'))
    return math.atan(dy/dx)
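
An alternative worth knowing is math.atan2(), which takes dy and dx separately and so needs no special case for dx==0. Note that it answers a slightly different question: it returns the direction from the first point to the second in the range (-pi, pi], while atan() above returns the angle of the line in [-pi/2, pi/2]. A sketch, where Point is a hypothetical stand-in for any object with x and y members:

```python
import math
from collections import namedtuple

Point = namedtuple('Point', 'x y')   # hypothetical stand-in for a point type

def angle2(p0, p1):
    """Angle in radians of the direction p0 -> p1, in the range (-pi, pi]."""
    return math.atan2(p1.y - p0.y, p1.x - p0.x)

print(angle2(Point(0, 0), Point(0, 1)))     # pi/2: vertical line, no special case
print(angle2(Point(0, 0), Point(-1, -1)))   # -3*pi/4, where atan(dy/dx) gives pi/4
```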
collinear point
20151022

Generate an array of collinear points plus some random points

  • generate a number of points (integers) that are on the same line
  • randomly intersperse these coordinates with a set of random points
  • watch out: may generate dupes! (the random points, not the collinear points)

Source:

import random

p=[(5,5),(1,10)]    # points that define the line 

# warning: this won't work for vertical line!!!
slope= (float(p[1][1])-float(p[0][1]))/(float(p[1][0])-float(p[0][0]) )
intercept= float(p[0][1])-slope*float(p[0][0])

ar=[]
for x in range(0,25):     
    y=slope*float(x)+intercept

    # only keep the y's that are even integers (use y.is_integer() to keep every integer y)
    if (y%2)==0: 
        ar.append((x,int(y)))
    
    # intersperse with random coordinates
    r=3+random.randrange(0,5)  

    # only add random points when random nr is even
    if r%2==0: 
        ar.extend( [ (random.randrange(0,100),random.randrange(0,100)) for j in range(r) ])  
    
print(ar)

Sample output:

[(1, 10), (97, 46), (94, 12), (33, 10), (9, 71), (9, 0), (28, 34), 
(2, 94), (30, 29), (69, 28), (82, 31), (79, 86), (88, 46), (59, 24), 
(2, 78), (54, 88), (94, 78), (99, 37), (75, 48), (91, 1), (67, 61), 
(12, 11), (55, 55), (58, 82), (95, 99), (56, 27), (12, 18), (99, 25), 
(77, 84), (31, 39), (64, 84), (4, 13), (80, 63), (43, 27), (78, 43), 
(24, 32), (17, -10), (73, 15), (6, 97), (0, 74), (16, 97), (6, 77), 
(60, 77), (19, 83), (19, 82), (19, 40), (58, 63), (64, 62), (14, 53),
(57, 21), (49, 24), (66, 94), (82, 1), (29, 39), (55, 64), (85, 68), 
(39, 24)]
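
To check afterwards which of the generated points really are on the line, a cross-product test can be used. This is only a sketch (collinear is a hypothetical helper); it uses integer arithmetic only, so it also works for a vertical line:

```python
def collinear(p, q, r):
    """True when the three (x, y) points lie on one line (zero cross product)."""
    return (q[0] - p[0]) * (r[1] - p[1]) == (q[1] - p[1]) * (r[0] - p[0])

line = [(5, 5), (1, 10)]                        # the two points that define the line
print(collinear(line[0], line[1], (9, 0)))      # True
print(collinear(line[0], line[1], (17, -10)))   # True
print(collinear(line[0], line[1], (33, 10)))    # False
```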
point class
20151022

Define a point class

  • with an x and y member
  • with methods to 'autoprint'

Definition

import math

class P:
    x=0
    y=0
    def __init__(self,x,y):
        self.x=x
        self.y=y

    # gets called when a print is executed
    def __str__(self):
        return "x:{} y:{}".format(self.x,self.y)

    # gets called eg. when a print is executed on an array of P's
    def __repr__(self):
        return "x:{} y:{}".format(self.x,self.y)


# convert an array of arrays or tuples to array of points
def convert(in_ar) :
    out_ar=[]
    for el in in_ar:
        out_ar.append( P(el[0],el[1]) )
    return out_ar

How to initialize

Eg. create a list of points

# following initialisations lead to the same array of points (note the convert)
p=[P(0,0),P(0,1),P(-0.866025,-0.5),P(0.866025,0.5)]
q=convert( [[0,0],[0,1],[-0.866025,-0.5],[0.866025,0.5]] )  
r=convert( [(0,0),(0,1),(-0.866025,-0.5),(0.866025,0.5)] )

print(type(p), ' | ' , p[2].x, p[2].y,  ' | ', p[2])
print(type(q), ' | ' , q[2].x, q[2].y,  ' | ', q[2])
print(type(r), ' | ' , r[2].x, r[2].y,  ' | ', r[2])

Output:

<class 'list'>  |  -0.866025 -0.5  |  x:-0.866025 y:-0.5
<class 'list'>  |  -0.866025 -0.5  |  x:-0.866025 y:-0.5
<class 'list'>  |  -0.866025 -0.5  |  x:-0.866025 y:-0.5

How to use

eg. Calculate the angle between :

  • the horizontal line through the first point
  • and the line through the two points

Then convert the result from radians to degrees: (watch out: won't work for dx==0)

print(round( math.atan( ( p[3].y - p[0].y ) / ( p[3].x - p[0].x ) ) * 180.0/math.pi, 10))

Output:

30.0000115676
repeat comprehension
20151017

Fill an array with 1 particular value

Via comprehension:

z1=[0 for x in range(20)]
z1
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Via built-in repeat:

z2=[0] * 20 

Equal? yes.

z1==z2
True

Which one is faster?

import timeit

timeit.timeit('z=[0 for x in range(100000)]',number=100)
0.8116392770316452

timeit.timeit('z=[0]*100000',number=100)
0.050275236018933356

The built-in repeat beats comprehension hands down!
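
One caveat with the repeat operator: it copies references, not objects. For a list of numbers that makes no difference, but repeating a list that contains another list gives several references to the same inner list. A small sketch:

```python
rows_shared = [[0] * 3] * 2                  # the inner list is referenced twice
rows_shared[0][0] = 1
print(rows_shared)                           # [[1, 0, 0], [1, 0, 0]]

rows_fresh = [[0] * 3 for _ in range(2)]     # a new inner list per row
rows_fresh[0][0] = 1
print(rows_fresh)                            # [[1, 0, 0], [0, 0, 0]]
```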

shortcut formula
20151015

Combinations of array elements

Suppose you want to know how many unique pairs (combinations of 2 elements) can be formed from the elements of an array.

The scenic tour:

a=list('abcdefghijklmnopqrstuvwxyz')
n=len(a)
stack=[]

for i in range(n):
    for j in range(i+1,n):
        stack.append( (a[i],a[j]) )

print(len(stack))

The shortcut:

import math
print((math.factorial(n)//2)//math.factorial(n-2))   # // keeps the result an integer

From the more general formula:

n! / r! / (n-r)!  with r=2. 

(see itertools.combinations(iterable, r) on docs.python.org/2/library/itertools.html )
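
As a cross-check, the pairs can also be generated with itertools.combinations, and counted directly with math.comb. A short sketch (math.comb is assumed to be available, which requires Python 3.8 or later):

```python
import itertools
import math

a = list('abcdefghijklmnopqrstuvwxyz')
n = len(a)

pairs = list(itertools.combinations(a, 2))   # same pairs as the nested loops
print(len(pairs))                            # 325
print(math.comb(n, 2))                       # 325
```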

Add up from 1 to n

Add up all numbers from 1 to n.

The straight-n-simple solution:

result = 0
for i in range(n):
    result += (i + 1)
print(result)

The fast solution:

result = n * (n+1) // 2
print(result)

Here's the explanation (thanks codility)
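
One common way to see it (not necessarily the explanation referred to): write the sum forwards as 1+2+..+n and backwards as n+..+2+1. Adding the two term by term gives n terms that each equal n+1, and that total is twice the wanted sum. A quick check of both solutions against each other, with illustrative function names:

```python
def sum_loop(n):
    """Add up 1..n one term at a time."""
    return sum(range(1, n + 1))

def sum_formula(n):
    """Closed form: n terms of (n+1), divided by 2."""
    return n * (n + 1) // 2

for n in (1, 10, 100, 12345):
    assert sum_loop(n) == sum_formula(n)

print(sum_formula(100))   # 5050
```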

shorties quickies
20151014

Shorties & quickies

Get an array of 10 random numbers

randrange: choose a random item from range(start, stop), so stop itself is never returned

import random
rand_arr=[ random.randrange(0,10) for i in range(10) ]
sort tuple
20151013

Sort a list of tuples

Say you have markers that are tuples, and you want to have the marker list sorted by the first tuple element.

sorted_marker=sorted( marker, key=lambda x:x[0] ) 

Multi-level sort: custom compare

Use a custom compare function. The compare function receives 2 objects to be compared. (Python 2 only: both cmp() and the cmp argument of sorted() were removed in Python 3.)

def cust_cmp(x,y):
    if (x[1]==y[1]):
        return cmp( x[0],y[0] )    
    return cmp(x[1],y[1])

names= [ ('mahalia', 'jackson'),  ('moon', 'zappa'), ('janet','jackson'), ('lee','albert'), ('latoya','jackson') ]

sorted_names=sorted( names, cmp=cust_cmp)

Output:

('lee', 'albert')
('janet', 'jackson')
('latoya', 'jackson')
('mahalia', 'jackson')
('moon', 'zappa')
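
In Python 3 the same multi-level sort can be written in two ways. The sketch below shows a key function that returns a tuple, and the compare function wrapped with functools.cmp_to_key:

```python
from functools import cmp_to_key

names = [('mahalia', 'jackson'), ('moon', 'zappa'), ('janet', 'jackson'),
         ('lee', 'albert'), ('latoya', 'jackson')]

# Option 1: a key that returns (last name, first name); tuples compare element-wise
by_key = sorted(names, key=lambda x: (x[1], x[0]))

# Option 2: keep a compare function and wrap it with cmp_to_key
def cust_cmp(x, y):
    a, b = (x[1], x[0]), (y[1], y[0])
    return (a > b) - (a < b)        # -1, 0 or 1, like Python 2's cmp()

by_cmp = sorted(names, key=cmp_to_key(cust_cmp))

for name in by_key:
    print(name)
```

The key version is usually the simpler choice; cmp_to_key is mainly handy when an existing compare function has to be reused.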
null none
20151012

'Null' value in Python is 'None'

There's always only one instance of this object, so you can check for equivalence with x is None (identity comparison) instead of x == None

stackoverflow.com/questions/3289601/null-object-in-python

Missing data in Python

Roughly speaking:

  • missing 'object' -> None
  • missing numerical value -> NaN ( np.nan )

Pandas handles both nearly interchangeably, converting if needed.

  • isnull(), notnull(): generate a boolean mask indicating missing (or non-missing) values
  • dropna(): returns a filtered version of the data
  • fillna(): imputes the missing values

More detail: jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
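
The difference between the two kinds of 'missing' can be shown with plain Python, since NaN is an ordinary float value. A small sketch without pandas:

```python
import math

x = None
print(x is None)                 # True: identity test for a missing object

y = float('nan')                 # a float NaN, the same kind of value as np.nan
print(y == y)                    # False: NaN is not equal to anything, itself included
print(math.isnan(y))             # True: the reliable test for a missing number

values = [1.5, float('nan'), 2.5]
clean = [v for v in values if not math.isnan(v)]   # a hand-rolled "dropna"
print(clean)                     # [1.5, 2.5]
```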

reverse string
20151011

Reverse the words in a string

Tactic:

  1. reverse the whole string
  2. reverse every word on its own

Code:

s="The goal is to make a thin circle of dough, with a raised edge."

r=' '.join([ w[::-1] for w in s[::-1].split(' ')  ]) 

The notation s[::-1] is called extended slice syntax, and should be read as [start:end:step] with the start and end left off.

Note: you also have the reversed() function, which returns an iterator:

s='circumference'
''.join(reversed(s))

'ecnerefmucric'
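
The two-step tactic puts the words in reverse order while keeping each word readable. A sketch that compares it with the more direct route of splitting first and reversing the list of words:

```python
s = "The goal is to make a thin circle of dough, with a raised edge."

two_step = ' '.join([w[::-1] for w in s[::-1].split(' ')])
direct = ' '.join(reversed(s.split(' ')))

print(two_step)   # edge. raised a with dough, of circle thin a make to is goal The
print(two_step == direct)   # True
```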
floodfill stack
20151010

Flood fill

  • Turn lines of text into a grid[row][column] (taking care to pad the lines to get a proper rectangle)
  • Central data structure for the flood-fill is a stack
  • If the randomly chosen point is blank, then fill it, and push the coordinates of its 4 neighbours onto the stack
  • Handle the neighbouring points the same way

Src:

#!/usr/bin/python 

import random

lines='''
+++++++        ++++++++++                ++++++++++++++++++++    +++++   
+     +        +        +                +                  +    +   +   
+     +        +        +                +                  +    +++++ 
+     +        +     ++++                +                  + 
+++++++        +                         +                  + 
               +     ++++                +                  + 
               +        +                +                  + 
               +        +                +                  + 
               +        +                +                  + 
               ++++++++++                +                  + 
                                         +                  + 
                                         +                  + 
     ++++++++++++                        +                  + 
     +          +                        +                  + 
     +          +                         +                + 
     +          +                          +              +
     ++++++++++++                           ++++++++++++++ 
'''.split("\n") 


# maximum number of columns and rows
colmax= max( [ len(line) for line in lines ] ) 
rowmax=len(lines) 

padding=' ' * colmax
grid= [ list(lines[row]+padding)[0:colmax]  for row in range(0,rowmax) ] 

for l in grid: print( ''.join(l) )   # print the grid
print('-' * colmax)                  # print a separating line

# create a stack, and put a random coordinate on it
pointstack=[]
pointstack.append( ( random.randrange(0,colmax),      # col (randrange excludes the upper bound)
                     random.randrange(0,rowmax) ) )   # row 

# floodfill
while len(pointstack)>0: 
    (col,row)=pointstack.pop()
    if col>=0 and col<colmax and row>=0 and row<rowmax:
        if grid[row][col]==' ': 
            grid[row][col]='O'
            if col<(colmax-1): pointstack.append( (col+1,row)) 
            if col>0:          pointstack.append( (col-1,row)) 
            if row<(rowmax-1): pointstack.append( (col,row+1) ) 
            if row>0:          pointstack.append( (col,row-1) ) 

for l in grid: print( ''.join(l) ) # print the grid

Output of a few runs:

+++++++        ++++++++++                ++++++++++++++++++++    +++++   
+     +        +        +                +OOOOOOOOOOOOOOOOOO+    +   +   
+     +        +        +                +OOOOOOOOOOOOOOOOOO+    +++++   
+     +        +     ++++                +OOOOOOOOOOOOOOOOOO+            
+++++++        +                         +OOOOOOOOOOOOOOOOOO+            
               +     ++++                +OOOOOOOOOOOOOOOOOO+            
               +        +                +OOOOOOOOOOOOOOOOOO+            
               +        +                +OOOOOOOOOOOOOOOOOO+            
               +        +                +OOOOOOOOOOOOOOOOOO+            
               ++++++++++                +OOOOOOOOOOOOOOOOOO+            
                                         +OOOOOOOOOOOOOOOOOO+            
                                         +OOOOOOOOOOOOOOOOOO+            
     ++++++++++++                        +OOOOOOOOOOOOOOOOOO+            
     +          +                        +OOOOOOOOOOOOOOOOOO+            
     +          +                         +OOOOOOOOOOOOOOOO+             
     +          +                          +OOOOOOOOOOOOOO+              
     ++++++++++++                           ++++++++++++++               

Completely flooded:

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
+++++++OOOOOOOO++++++++++OOOOOOOOOOOOOOOO++++++++++++++++++++OOOO+++++OOO
+     +OOOOOOOO+OOOOOOOO+OOOOOOOOOOOOOOOO+                  +OOOO+   +OOO
+     +OOOOOOOO+OOOOOOOO+OOOOOOOOOOOOOOOO+                  +OOOO+++++OOO
+     +OOOOOOOO+OOOOO++++OOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
+++++++OOOOOOOO+OOOOOOOOOOOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOO+OOOOO++++OOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOO+OOOOOOOO+OOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOO+OOOOOOOO+OOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOO+OOOOOOOO+OOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOO++++++++++OOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOO++++++++++++OOOOOOOOOOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOO+          +OOOOOOOOOOOOOOOOOOOOOOOO+                  +OOOOOOOOOOOO
OOOOO+          +OOOOOOOOOOOOOOOOOOOOOOOOO+                +OOOOOOOOOOOOO
OOOOO+          +OOOOOOOOOOOOOOOOOOOOOOOOOO+              +OOOOOOOOOOOOOO
OOOOO++++++++++++OOOOOOOOOOOOOOOOOOOOOOOOOOO++++++++++++++OOOOOOOOOOOOOOO
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
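
The fill loop can also be packaged as a function with an explicit starting point, which makes it easy to try on a small grid. This is a sketch along the same lines, not a replacement for the script; flood_fill and its parameters are illustrative names:

```python
def flood_fill(grid, col, row, fill='O', blank=' '):
    """Fill the blank region containing (col, row) in place; grid is a list of lists."""
    rowmax, colmax = len(grid), len(grid[0])
    stack = [(col, row)]
    while stack:
        c, r = stack.pop()
        if 0 <= c < colmax and 0 <= r < rowmax and grid[r][c] == blank:
            grid[r][c] = fill
            # push the 4 neighbours; out-of-range ones are rejected when popped
            stack.extend([(c + 1, r), (c - 1, r), (c, r + 1), (c, r - 1)])
    return grid

small = [list(line) for line in ["+++++",
                                 "+   +",
                                 "+++++",
                                 "     "]]
flood_fill(small, 2, 1)          # start inside the box
for line in small:
    print(''.join(line))
```

Only the inside of the box is filled; the blank bottom row is cut off from the start point by the '+' border.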
zip archive
20150913

Archive files into a zipfile

A cronjob that runs every 10 minutes produces these JSON files:

 6752 Sep 10 08:30 orderbook-kraken-20150910-083003.json
 6682 Sep 10 08:40 orderbook-kraken-20150910-084004.json
 6717 Sep 10 08:50 orderbook-kraken-20150910-085004.json
 6717 Sep 10 09:00 orderbook-kraken-20150910-090003.json
 6682 Sep 10 09:10 orderbook-kraken-20150910-091003.json
 6717 Sep 10 09:20 orderbook-kraken-20150910-092002.json
 6717 Sep 10 09:30 orderbook-kraken-20150910-093003.json
 6682 Sep 10 09:40 orderbook-kraken-20150910-094003.json
 6682 Sep 10 09:50 orderbook-kraken-20150910-095003.json
 6752 Sep 10 10:00 orderbook-kraken-20150910-100004.json
 6788 Sep 10 10:10 orderbook-kraken-20150910-101003.json
 6787 Sep 10 10:20 orderbook-kraken-20150910-102004.json
 6823 Sep 10 10:30 orderbook-kraken-20150910-103004.json
 6752 Sep 10 10:40 orderbook-kraken-20150910-104004.json

Another cronjob, run every morning, zips all the files of one day together into a zipfile, turning above files into:

20150910-orderbook-kraken.zip
20150911-orderbook-kraken.zip
20150912-orderbook-kraken.zip
.. 

Here's the code of the archiver (ie the 2nd cronjob) :

import os,re,sys
import zipfile
from datetime import datetime

# get a date from the filename. Assumptions:
#   - format = YYYYMMDD
#   - date is in this millennium (ie starts with 2) 
#   - first number in the filename is the date 
def get_date(fn):
    rv=re.sub(r'(^.[^0-9]*)(2[0-9]{7})(.*)', r'\2', fn)
    return rv


# first of all set the working directory
if len(sys.argv)<2:
    print("Need a working directory")
    sys.exit(1)
wd=sys.argv[1]

os.chdir(wd)

# find the oldest date-pattern in the json files
ds=set()
for filename in os.listdir("."):
    if filename.endswith(".json"):
        ds.add(get_date(filename))
        #print "{}->{}".format(filename,dt) 

# exclude today's pattern (because today may not be complete):
today=datetime.now().strftime("%Y%m%d")
ds.discard(today)    # discard(): no KeyError when there are no files for today

l=sorted(list(ds))
#print l

if (len(l)==0):
    #print "Nothing to do!"
    sys.exit(0)

# datepattern selected
datepattern=l[0]

# decide on which files go into the archive
file_ls=[]
for filename in os.listdir("."):
    if filename.endswith(".json") and filename.find(datepattern)>-1:
        file_ls.append(filename)
        #print "{}->{}".format(filename,dt) 

# filename of archive: get the first file, drop the .json extension, and remove all numbers, add the datepattern
file_ls=sorted(file_ls)

stem=re.sub(r'--*','-', re.sub( r'[0-9]','', re.sub(r'\.json$','',file_ls[0]) ))
zipfilename=re.sub(r'-\.','.', '{}-{}.zip'.format(datepattern,stem))

#print "Zipping up all {} files to {}".format(datepattern,zipfilename)

zfile=zipfile.ZipFile(zipfilename,"w")
for fn in file_ls:
    #print "Adding to zip: {} ".format(fn)
    zfile.write(fn)
    #print "Deleting file: {} ".format(fn)
    os.remove(fn)
zfile.close()

Note: if you have a backlog of multiple days, you have to run the script multiple times!
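
To handle a backlog in a single run, the files could first be grouped per date, after which one archive per group is written. The grouping step might look like the sketch below; group_by_date is a hypothetical helper that makes the same assumptions about the filenames as get_date():

```python
import re
from collections import defaultdict

def group_by_date(filenames, today):
    """Map each date pattern (except today's) to its sorted list of .json files."""
    groups = defaultdict(list)
    for fn in filenames:
        m = re.search(r'2[0-9]{7}', fn)          # first YYYYMMDD-like number
        if fn.endswith('.json') and m and m.group(0) != today:
            groups[m.group(0)].append(fn)
    return {d: sorted(groups[d]) for d in sorted(groups)}

files = ['orderbook-kraken-20150910-083003.json',
         'orderbook-kraken-20150911-090003.json',
         'orderbook-kraken-20150910-084004.json',
         'orderbook-kraken-20150912-101003.json']

for date, names in group_by_date(files, today='20150912').items():
    print(date, len(names))     # one archive per date would be written here
```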

clean file
20150912

Retain recent files

eg. an application produces a backup file every night. You only want to keep the last 5 files.

Pipe the output of the following script (retain_recent.py) into a shell, eg in a cronjob:

/YOURPATH/retain_recent.py  | bash 

Python code that produces 'rm fileX' statements:

#list the files according to a pattern, except for the N most recent files
import os,sys,time

fnList=[]

d='/../confluence_data/backups/'
for f in os.listdir( d ):
    (mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime) = os.stat(d+"/"+f)
    s='%s|%s/%s|%s' % ( mtime, d, f, time.ctime(mtime))
    #print s
    fnList.append(s)

# retain the 5 most recent files:
nd=len(fnList)-5

c=0
for s in sorted(fnList):
    (mt,fn,hd)=s.split('|')
    c+=1
    if (c>nd):     # the last 5 entries of the sorted list are the most recent
        print('#keeping %s (%s)' % ( fn, hd ))
    else:
        print('rm -v %s   #deleting (%s)' % ( fn, hd ))
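
The keep/delete decision is easy to get wrong by one, so it can help to isolate it in a small function that works on (filename, mtime) pairs and can be tried without touching the disk. A sketch, with split_recent as an illustrative name:

```python
def split_recent(files_with_mtime, keep=5):
    """Return (to_delete, to_keep) given (filename, mtime) pairs."""
    ordered = sorted(files_with_mtime, key=lambda pair: pair[1])   # oldest first
    cut = max(len(ordered) - keep, 0)
    return [fn for fn, _ in ordered[:cut]], [fn for fn, _ in ordered[cut:]]

backups = [('backup-%02d.zip' % day, 1000.0 + day) for day in range(1, 9)]
to_delete, to_keep = split_recent(backups)
print(to_delete)      # ['backup-01.zip', 'backup-02.zip', 'backup-03.zip']
print(len(to_keep))   # 5
```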
matrix colsum numpy
20150728

Dot product used for aggregation of an unrolled matrix

Aggregations by column/row on an unrolled matrix, done via dot product. No need to reshape.

Column sums

Suppose this 'flat' array ..

import numpy as np

a=np.array( [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ] )

.. represents an 'unrolled' 3x4 matrix ..

a.reshape(3,4)

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

.. of which you want to make the sums by column ..

a.reshape(3,4).sum(axis=0)
array([15, 18, 21, 24])

This can also be done by the dot product of a tiled eye with the array!

np.tile(np.eye(4),3)

array([[ 1,  0,  0,  0,  1,  0,  0,  0,  1,  0,  0,  0],
       [ 0,  1,  0,  0,  0,  1,  0,  0,  0,  1,  0,  0],
       [ 0,  0,  1,  0,  0,  0,  1,  0,  0,  0,  1,  0],
       [ 0,  0,  0,  1,  0,  0,  0,  1,  0,  0,  0,  1]])

Dot product:

np.tile(np.eye(4),3).dot(a) 
array([ 15.,  18.,  21.,  24.])

Row sums

Similar story :

a.reshape(3,4)
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

Sum by row:

a.reshape(3,4).sum(axis=1)
array([10, 26, 42])

Can be expressed by a Kronecker eye-onesie :

np.kron( np.eye(3), np.ones(4) )

array([[ 1,  1,  1,  1,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  1,  1,  1,  1,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1]])

Dot product:

np.kron( np.eye(3), np.ones(4) ).dot(a) 
array([ 10.,  26.,  42.])

For the np.kron() function see Kronecker product
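
For comparison, on a plain Python list the same aggregations can be written with slices: a strided slice picks out one column of the unrolled matrix, a consecutive chunk picks out one row. A sketch without numpy:

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # unrolled 3x4 matrix
nrows, ncols = 3, 4

# column j = every ncols-th element, starting at j
col_sums = [sum(a[j::ncols]) for j in range(ncols)]
# row i = ncols consecutive elements, starting at i*ncols
row_sums = [sum(a[i * ncols:(i + 1) * ncols]) for i in range(nrows)]

print(col_sums)   # [15, 18, 21, 24]
print(row_sums)   # [10, 26, 42]
```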

matrix outer_product numpy
20150727

The dot product of two matrices (e.g. a matrix and its transpose) equals the sum of the outer products of the columns of the first matrix with the corresponding rows of the second.

import numpy as np

a=np.matrix( "1 2; 3 4; 5 6" )

matrix([[1, 2],
        [3, 4],
        [5, 6]])

Dot product of A and A^T :

np.dot( a, a.T) 

matrix([[ 5, 11, 17],
        [11, 25, 39],
        [17, 39, 61]])

Or as the sum of the outer products of the vectors:

np.outer(a[:,0],a.T[0,:]) 

array([[ 1,  3,  5],
       [ 3,  9, 15],
       [ 5, 15, 25]])

np.outer(a[:,1],a.T[1,:])

array([[ 4,  8, 12],
       [ 8, 16, 24],
       [12, 24, 36]])

.. added up..

np.outer(a[:,0],a.T[0,:]) + np.outer(a[:,1],a.T[1,:]) 

array([[ 5, 11, 17],
       [11, 25, 39],
       [17, 39, 61]])

.. and yes it is the same as the dot product!

Note: for above, because we are forming the dot product of a matrix with its transpose, we can also write it as (not using the transpose) :

np.outer(a[:,0],a[:,0]) + np.outer(a[:,1],a[:,1])
numpy
20150709

Numpy quickies

Create a matrix of 6x2 filled with random integers:

import numpy as np
ra= np.matrix( np.reshape( np.random.randint(1,10,12), (6,2) ) )

matrix([[6, 1],
        [3, 8],
        [3, 9],
        [4, 2],
        [4, 7],
        [3, 9]])
is
20150619

is and '=='

From blog.lerner.co.il/why-you-should-almost-never-use-is-in-python

a="beta"
b=a

id(a)
3072868832L

id(b)
3072868832L

Is the content of a and b the same? Yes.

a==b
True

Are a and b pointing to the same object?

a is b
True

id(a)==id(b)
True

So in general it's safer to use '=='. For comparing to None, however, 'is' is more readable and faster:

if x is None:
    print("x is None!")

This works because the None object is a singleton.
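
The example above only shows the case where both comparisons agree. A small sketch of a case where they differ, using two lists with the same content:

```python
a = [1, 2, 3]
b = list(a)          # a separate list with the same content

print(a == b)        # True: same content
print(a is b)        # False: two different objects

c = a                # a second name for the same object
print(a is c)        # True
```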

max min
20150606

Minimum and maximum int

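  • Max: sys.maxint
  • Min: -sys.maxint-1

Note: sys.maxint exists only in Python 2. Python 3 integers have no maximum; sys.maxsize holds the largest value of a native Py_ssize_t (the maximum size of lists, strings, ..).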
pandas dataframe
20150302

Create an empty dataframe

# create 
df=pd.DataFrame(np.zeros(0,dtype=[
    ('ProductID', 'i4'),
    ('ProductName', 'a50')
    ]))

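# append (DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions)
df = pd.concat([df, pd.DataFrame([{'ProductID':1234, 'ProductName':'Widget'}])], ignore_index=True)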

Other way

columns = ['price', 'item']
df2 = pd.DataFrame(data=np.zeros((0,len(columns))), columns=columns) 
doctest
20150227

Run a doctest on your python src file

.. first include unit tests in your docstrings:

eg. the file 'mydoctest.py'

#!/usr/bin/python3

def fact(n):
    '''
    Factorial.
    >>> fact(6)
    720
    >>> fact(7)
    5040
    '''
    return n*fact(n-1) if n>1 else 1

Run the doctests:

python3 -m doctest mydoctest.py

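Or from within python:

>>> import doctest
>>> import mydoctest
>>> doctest.testmod(mydoctest)

(doctest.testfile() is meant for text files containing examples: run on a .py file it does not import the module, so fact is undefined. testmod() on the imported module does work.)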
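
A common pattern is to let the source file run its own doctests when it is executed directly. A sketch of what mydoctest.py could look like with that addition:

```python
import doctest

def fact(n):
    '''
    Factorial.
    >>> fact(6)
    720
    >>> fact(7)
    5040
    '''
    return n * fact(n - 1) if n > 1 else 1

if __name__ == "__main__":
    # testmod() collects the docstrings of the current module and runs their examples
    results = doctest.testmod()
    print(results)
```

Running the file with python3 then prints a summary of the failed and attempted examples; importing it from another script runs no tests.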

linalg
20150206

Linear Algebra MOOC

Odds of letters in Scrabble:

{'A':9/98, 'B':2/98, 'C':2/98, 'D':4/98, 'E':12/98, 'F':2/98,
'G':3/98, 'H':2/98, 'I':9/98, 'J':1/98, 'K':1/98, 'L':4/98,
'M':2/98, 'N':6/98, 'O':8/98, 'P':2/98, 'Q':1/98, 'R':6/98,
'S':4/98, 'T':6/98, 'U':4/98, 'V':2/98, 'W':2/98, 'X':1/98,
'Y':2/98, 'Z':1/98}

Use // to find the remainder

Remainder using modulo: 2304811 % 47 -> 25

Remainder using // : 2304811 - 47 * (2304811 // 47) -> 25

Infinity

Infinity: float('infinity') : 1/float('infinity') -> 0.0

Set

Test membership with 'in' and 'not in'.

Note the difference between set (curly braces!) and tuple:

sum( {1,2,3,2} )
6

sum( (1,2,3,2) )
8

Union of sets: use the bar operator { 1,2,4 } | { 1,3,5 } -> {1, 2, 3, 4, 5}

Intersection of sets: use the ampersand operator { 1,2,4 } & { 1,3,5 } -> {1}

Empty set: is not { } but set()! While for a list, the empty list is [].

Add / remove elements with .add() and .remove(). Add another set with .update()

s = { 1,2,4 } 
s.update( { 1,3,5 } ) 
s
{1, 2, 3, 4, 5}

Intersect with another set:

s.intersection_update( { 4,5,6,7 } ) 
s
{4, 5}

Bind another variable to the same set: (any changes to s or t are visible to the other)

t=s

Make a complete copy:

u=s.copy()
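
The difference between binding and copying shows up as soon as the set is changed. A small sketch:

```python
s = {4, 5}
t = s            # second name for the same set
u = s.copy()     # independent copy

s.add(6)
print(t)         # {4, 5, 6}: t sees the change
print(u)         # {4, 5}: the copy does not
```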

Set comprehension:

{2*x for x in {1,2,3} }

.. union of 2 sets combined with if (the if clause can be considered a filter) ..

s = { 1, 2,4 }

{ x for x in s|{5,6} if x>=2 }
{2, 4, 5, 6}

Double comprehension : iterate over the Cartesian product of two sets:

{x*y for x in {1,2,3} for y in {2,3,4}}
{2, 3, 4, 6, 8, 9, 12}

Compare to a list, which will return 2 more elements ( the 4 and the 6) :

[x*y for x in {1,2,3} for y in {2,3,4}]
[2, 3, 4, 4, 6, 8, 6, 9, 12]

Or producing tuples:

{ (x,y) for x in {1,2,3} for y in {2,3,4}}
{(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 2), (3, 3), (3, 4)}

The factors of n:

import math

n=64
{ (x,y) for x in range(1,1+int(math.sqrt(n))) for y in range(1,1+n) if (x*y)==n }
{(1, 64), (2, 32), (4, 16), (8, 8)}

Or use it in a loop:

for n in range(40,100): 
    print (n , 
        { (x,y) for x in range(1,1+int(math.sqrt(n))) for y in range(1,1+n) if (x*y)==n })
40 {(4, 10), (1, 40), (2, 20), (5, 8)}
41 {(1, 41)}
42 {(1, 42), (2, 21), (6, 7), (3, 14)}
43 {(1, 43)}
44 {(2, 22), (1, 44), (4, 11)}
45 {(1, 45), (3, 15), (5, 9)}
46 {(2, 23), (1, 46)}
47 {(1, 47)}
48 {(3, 16), (2, 24), (1, 48), (4, 12), (6, 8)}
.. 
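
The double comprehension tries every (x, y) combination. Since y is fixed once x is known, a single loop over x with a divisibility test gives the same pairs with far fewer steps. A sketch (factor_pairs is an illustrative name; math.isqrt is assumed to be available, which requires Python 3.8 or later):

```python
import math

def factor_pairs(n):
    """All (x, y) with x <= y and x*y == n, testing only x up to sqrt(n)."""
    return {(x, n // x) for x in range(1, math.isqrt(n) + 1) if n % x == 0}

print(sorted(factor_pairs(64)))   # [(1, 64), (2, 32), (4, 16), (8, 8)]
print(sorted(factor_pairs(42)))   # [(1, 42), (2, 21), (3, 14), (6, 7)]
```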

Lists

  • a list can contain other sets and lists, but a set cannot contain a list (since lists are mutable).
  • order: respected for lists, but not for sets
  • concatenate lists using the '+' operator

  • provide second argument [] to sum to make this work: sum([ [1,2,3], [4,5,6], [7,8,9] ], []) -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
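
The sum() trick builds a new list at every step, which gets slow for many lists. A sketch of the usual alternative, itertools.chain.from_iterable, next to it:

```python
import itertools

lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

flat_sum = sum(lists, [])                                 # new list at every step
flat_chain = list(itertools.chain.from_iterable(lists))   # single pass

print(flat_chain)               # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(flat_sum == flat_chain)   # True
```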

Skip elements in a slice: use the colon-separated triple a:b:c notation

L = [0,10,20,30,40,50,60,70,80,90]

L[::3]
[0, 30, 60, 90]

List of lists and comprehension:

listoflists = [[1,1],[2,4],[3, 9]]

[y for [x,y] in listoflists]
[1, 4, 9]

Tuples

  • difference with lists: a tuple is immutable. So sets may contain tuples.

Unpacking in a comprehension:

 [y for (x,y) in [(1,'A'),(2,'B'),(3,'C')] ]
 ['A', 'B', 'C']

 [x[1] for x in [(1,'A'),(2,'B'),(3,'C')] ]
 ['A', 'B', 'C']

Converting list/set/tuple

Use constructors: set(), list() or tuple().

Note about range: a range represents a sequence, but it is not a list. Either iterate through the range or use set() or list() to turn it into a set or list.

Note about zip: it does not return a list, but an iterator of tuples.

Dictionary comprehensions

{ k:v for (k,v) in [(1,2),(3,4),(5,6)] }

Iterate over the k,v pairs with items(), producing tuples:

[ item for item in {'a':1, 'b':2, 'c':3}.items()  ]
[('a', 1), ('c', 3), ('b', 2)]

Modules

  • create your own module: name it properly, eg "spacerocket.py"
  • import it in another script

While debugging it may be easier to use 'reload' to reload your module: importlib.reload() in current Python 3 (it used to live in package imp, which was removed in Python 3.12).

Various

The enumerate function.

list(enumerate(['A','B','C']))
[(0, 'A'), (1, 'B'), (2, 'C')]


[ (i+1)*s for (i,s) in enumerate(['A','B','C','D','E'])]
['A', 'BB', 'CCC', 'DDDD', 'EEEEE']
 
Notes by Willem Moors. Generated on momo:/home/willem/sync/20151223_datamungingninja/pythonbook at 2019-07-31 19:22