The Python Book

I admit that I was a huge fan of the Python setuptools library for a long time. There was a lot in there which just resonated with how I thought that software development should work. I still think that the design of setuptools is amazing. Nobody would argue that setuptools was flawless, and it certainly failed in many regards. The biggest problem probably was that it was built on Python's idiotic import system, but there is only so much you can do about that. In general, setuptools took the realistic approach to problem-solving: do the best you can by writing a piece of software that scratches the itch, without involving a committee or requiring language changes. That also somewhat explains the second often-cited problem of setuptools: that it's a monkeypatch on distutils.

Add records to a dataframe in a for loop

But how do you do it if you need to massage the data a bit, or if your input data is not comma-separated?
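
One possible approach, sketched under assumptions: the input is taken to be semicolon-separated text with untidy spacing, and the sample lines and column names below are made up for illustration. Each record is cleaned inside the loop and collected as a dict, and the dataframe is built once at the end, which is usually cheaper than growing a dataframe row by row.

```python
import pandas as pd

# Hypothetical raw input: semicolon-separated, with untidy spacing and casing.
raw_lines = [
    "2014-01-03; aapl ;  100",
    "2014-01-06; ibm  ;  250",
    "2014-01-07; goog ;   75",
]

records = []
for line in raw_lines:
    date_str, symbol, number = [field.strip() for field in line.split(";")]
    # "Massage" each field before storing it.
    records.append({
        "date": pd.to_datetime(date_str),
        "symbol": symbol.upper(),
        "number": int(number),
    })

# Build the dataframe in one go from the collected records.
df = pd.DataFrame(records, columns=["date", "symbol", "number"])
print(df)
```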

Dataframe quickies

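try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class MLStripper(HTMLParser):
    """Collect only the text content of an HTML fragment."""
    def __init__(self):
        # Initialise the base parser (this also calls reset()).
        HTMLParser.__init__(self)
        self.fed = []
    def handle_data(self, d):
        # Called for every run of text between tags.
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()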


txt='''
<span class="mw-headline" id="The_K.C3.B6ln_concert">The Köln concert </span>
<span class="mw-editsection"><span class="mw-editsection-bracket">
[</span><a href="/w/index.php?title=The_K%C3%B6ln_Concert&amp;action=edit&amp;section=1" 
title="Edit section: The Köln concert">edit</a><span class="mw-editsection-bracket">]</span>
</span></h2>
<p>The concert was organized by 17-year-old 
Vera Brandes, then Germany’s youngest concert promoter.<sup id="cite_ref-5" class="reference">
<a href="#cite_note-5"><span>[</span>5<span>]</span></a></sup> At Jarrett's request, Brandes 
had selected a <a href="/wiki/B%C3%B6sendorfer" title="Bösendorfer">Bösendorfer</a> 
290 Imperial concert grand piano for the performance. 
'''

print(strip_tags(txt))

Deduce the year from day_of_week
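
One reading of this heading, and it is only an assumption: a record gives the month, the day and the weekday but not the year, so the year has to be recovered. A minimal sketch with the standard library simply tries every year in a plausible range. The function name `candidate_years` and its parameters are hypothetical.

```python
import datetime

def candidate_years(month, day, weekday, first_year=2000, last_year=2020):
    """Return the years in [first_year, last_year] in which month/day
    falls on the given weekday.

    weekday follows datetime.date.weekday(): Monday is 0, Sunday is 6.
    """
    years = []
    for year in range(first_year, last_year + 1):
        try:
            d = datetime.date(year, month, day)
        except ValueError:
            # The date does not exist in this year (29 February, non-leap year).
            continue
        if d.weekday() == weekday:
            years.append(year)
    return years

# Which years between 1970 and 1980 had a Friday (weekday 4) on 24 January?
print(candidate_years(1, 24, 4, 1970, 1980))
```

The same weekday recurs every few years, so the search range has to be narrow enough to give a single answer.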

Visualizing distributions of data

This notebook demonstrates different approaches to graphically representing distributions of data, focusing specifically on the tools provided by the seaborn package.

Quickies

Dataframe with date-time index

Create a dataframe df with a datetime index and some random values (note: see the 'simpler' dataframe creation further down):
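
A minimal sketch of such a dataframe. The start date, the number of days, the column names and the fixed random seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Assumed for illustration: ten consecutive days starting 2014-01-01.
idx = pd.date_range("2014-01-01", periods=10, freq="D")

# Fixed seed so that the "random" values are reproducible.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(10, 3), index=idx, columns=["a", "b", "c"])

print(df.head())

# A datetime index allows label-based slicing with date strings
# (both end points are included).
print(df.loc["2014-01-03":"2014-01-05"])
```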

Postscriptum: a simpler way of creating the dataframe

import numpy as np

magic3= np.array(
   [[8,   1,   6],
    [3,   5,   7],
    [4,   9,   2] ] )

magic4= np.array(
   [[16,    2,    3,   13],
    [ 5,   11,   10,    8],
    [ 9,    7,    6,   12],
    [ 4,   14,   15,    1]] ) 

magic5= np.array( 
   [[17,   24,    1,    8,   15],
    [23,    5,    7,   14,   16],
    [ 4,    6,   13,   20,   22],
    [10,   12,   19,   21,    3],
    [11,   18,   25,    2,    9]] )

magic6= np.array(
   [[35,    1,    6,   26,   19,   24],
    [ 3,   32,    7,   21,   23,   25],
    [31,    9,    2,   22,   27,   20],
    [ 8,   28,   33,   17,   10,   15],
    [30,    5,   34,   12,   14,   16],
    [ 4,   36,   29,   13,   18,   11]] )

magic7= np.array(
     [ [30,  39,  48,   1,  10,  19,  28],
       [38,  47,   7,   9,  18,  27,  29],
       [46,   6,   8,  17,  26,  35,  37],
       [ 5,  14,  16,  25,  34,  36,  45],
       [13,  15,  24,  33,  42,  44,   4],
       [21,  23,  32,  41,  43,   3,  12],
       [22,  31,  40,  49,   2,  11,  20] ] ) 

# no_more_magic

Okay, a magic matrix is maybe not the best way to show row- and column-wise sums. Consider this:
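
A minimal sketch of what such an example might look like, using a made-up 2x3 array. In a magic square every row and column sum is identical, so a non-square array with distinct sums makes the effect of the `axis` argument easier to see.

```python
import numpy as np

# A small non-square matrix, made up for illustration, so that row and
# column sums differ (unlike in a magic square).
m = np.array(
   [[1,  2,  3],
    [4,  5,  6]])

col_sums = m.sum(axis=0)  # collapse the rows: one sum per column
row_sums = m.sum(axis=1)  # collapse the columns: one sum per row

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```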

Python documentation links

Add two dataframes
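
Reading "add" as element-wise addition (an assumption; it could equally mean concatenation), a minimal sketch with two made-up dataframes whose indexes only partly overlap:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=["x", "y"])
df2 = pd.DataFrame({"a": [100, 200], "b": [1000, 2000]}, index=["y", "z"])

# Plain "+" aligns on index and columns; a label missing on either side
# gives NaN in the result.
print(df1 + df2)

# add() with fill_value treats a missing entry as 0 instead.
total = df1.add(df2, fill_value=0)
print(total)
```

Only row `y` exists in both frames, so with plain `+` the rows `x` and `z` come out as NaN, while `add(..., fill_value=0)` keeps their original values.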

Read/write csv

Load a csv file

import pandas as pd

# my way (s_filename holds the path to the csv file)
ls_order_col= [ 'year', 'month', 'day', 'symbol', 'buy_sell', 'number','dummy' ]
df_mo=pd.read_csv(s_filename, sep=',', names=ls_order_col, skipinitialspace=True, index_col=False)

# add column of type datetime 
df_mo['date']=pd.to_datetime(df_mo.year*10000+df_mo.month*100+df_mo.day,format='%Y%m%d')

# drop some columns
df_mo.drop(['dummy','year','month','day'], axis=1, inplace=True)

# order by datetime (DataFrame.sort() was removed from pandas; use sort_values)
df_mo.sort_values(by='date', inplace=True)
print(df_mo)

An alternative way: it's better because the date is converted on reading, and the dataframe is indexed by the date.
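
A rough sketch in that spirit, assuming the same seven-column layout and using made-up sample rows read from a string instead of a file. Combining several columns into one date through the `parse_dates` argument of `read_csv` is deprecated in recent pandas releases, so this sketch assembles the date with `pd.to_datetime` directly after reading and then indexes by it.

```python
import io
import pandas as pd

# Made-up sample in the assumed layout:
# year, month, day, symbol, buy_sell, number, dummy (empty trailing field)
csv_text = """2011, 1, 14, AAPL, Buy, 1500,
2011, 1, 10, AAPL, Buy, 1500,
2011, 1, 19, AAPL, Sell, 1500,
"""

cols = ["year", "month", "day", "symbol", "buy_sell", "number", "dummy"]
df = pd.read_csv(io.StringIO(csv_text), names=cols, skipinitialspace=True)

# to_datetime can assemble a date from columns named year, month and day.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Drop the helper columns, index by date and sort chronologically.
df = (df.drop(["dummy", "year", "month", "day"], axis=1)
        .set_index("date")
        .sort_index())
print(df)
```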