The Python Book
 
zip archive
20150913

Archive files into a zipfile

A cronjob that runs every 10 minutes produces these JSON files:

 6752 Sep 10 08:30 orderbook-kraken-20150910-083003.json
 6682 Sep 10 08:40 orderbook-kraken-20150910-084004.json
 6717 Sep 10 08:50 orderbook-kraken-20150910-085004.json
 6717 Sep 10 09:00 orderbook-kraken-20150910-090003.json
 6682 Sep 10 09:10 orderbook-kraken-20150910-091003.json
 6717 Sep 10 09:20 orderbook-kraken-20150910-092002.json
 6717 Sep 10 09:30 orderbook-kraken-20150910-093003.json
 6682 Sep 10 09:40 orderbook-kraken-20150910-094003.json
 6682 Sep 10 09:50 orderbook-kraken-20150910-095003.json
 6752 Sep 10 10:00 orderbook-kraken-20150910-100004.json
 6788 Sep 10 10:10 orderbook-kraken-20150910-101003.json
 6787 Sep 10 10:20 orderbook-kraken-20150910-102004.json
 6823 Sep 10 10:30 orderbook-kraken-20150910-103004.json
 6752 Sep 10 10:40 orderbook-kraken-20150910-104004.json

Another cronjob, run every morning, zips all the files of one day together into a single zipfile, turning the files above into:

20150910-orderbook-kraken.zip
20150911-orderbook-kraken.zip
20150912-orderbook-kraken.zip
.. 
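
The naming convention can be sketched as a small function. This is only an illustration: it assumes every JSON file is named stem-YYYYMMDD-HHMMSS.json, as in the listing above, and the name archive_name_for is hypothetical, not part of the archiver.

```python
import re

# Assumed layout, taken from the listing above: <stem>-<YYYYMMDD>-<HHMMSS>.json
_JSON_NAME = re.compile(r'^(?P<stem>.+?)-(?P<date>2[0-9]{7})-[0-9]{6}\.json$')

def archive_name_for(filename):
    """Return the daily zip name for one JSON file, or None if the name does not fit."""
    m = _JSON_NAME.match(filename)
    if m is None:
        return None
    # the date moves to the front, the time of day is dropped
    return '{}-{}.zip'.format(m.group('date'), m.group('stem'))

print(archive_name_for('orderbook-kraken-20150910-083003.json'))
# 20150910-orderbook-kraken.zip
```

Files whose names do not follow the assumed layout yield None, so they can be skipped rather than ending up in an oddly named archive.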

Here's the code of the archiver (i.e. the second cronjob):

import os,re,sys
import zipfile
from datetime import datetime

# get a date from the filename. Assumptions:
#   - format = YYYYMMDD
#   - date is in this millennium (i.e. starts with 2)
#   - first number in the filename is the date 
def get_date(fn):
    rv=re.sub(r'(^.[^0-9]*)(2[0-9]{7})(.*)', r'\2', fn)
    return rv


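# first of all set the working directory
# (check the argument count first: sys.argv[1] raises IndexError when it is missing)
if len(sys.argv)<2 or len(sys.argv[1])==0:
    print("Need a working directory")
    sys.exit(1)
wd=sys.argv[1]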

os.chdir(wd)

# find the oldest date-pattern in the json files
ds=set()
for filename in os.listdir("."):
    if filename.endswith(".json"):
        ds.add(get_date(filename))
        #print "{}->{}".format(filename,dt) 

# exclude today's pattern (because today may not be complete);
# discard() does not raise KeyError when there are no files for today yet
today=datetime.now().strftime("%Y%m%d")
ds.discard(today)

l=sorted(list(ds))
#print l

if (len(l)==0):
    #print "Nothing to do!"
    sys.exit(0)

# datepattern selected
datepattern=l[0]

# decide on which files go into the archive
file_ls=[]
for filename in os.listdir("."):
    if filename.endswith(".json") and filename.find(datepattern)>-1:
        file_ls.append(filename)
        #print "{}->{}".format(filename,dt) 

# filename of archive: get the first file, drop the .json extension, and remove all numbers, add the datepattern
file_ls=sorted(file_ls)

# raw strings, and an escaped dot so that only a literal ".json" suffix is dropped
stem=re.sub(r'--*','-', re.sub( r'[0-9]','', re.sub(r'\.json$','',file_ls[0]) ))
zipfilename=re.sub(r'-\.','.', '{}-{}.zip'.format(datepattern,stem))

#print "Zipping up all {} files to {}".format(datepattern,zipfilename)

zfile=zipfile.ZipFile(zipfilename,"w")
for fn in file_ls:
    #print "Adding to zip: {} ".format(fn)
    zfile.write(fn)
    #print "Deleting file: {} ".format(fn)
    os.remove(fn)
zfile.close()
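
A slightly more defensive variant of the zipping step might look like the sketch below. It is not the script above, just an illustration under a few assumptions: the function name archive_files is hypothetical, compression with ZIP_DEFLATED is an added choice (it needs the zlib module, and the default mode only stores files uncompressed), and the originals are deleted only after the archive has been closed.

```python
import os
import zipfile

def archive_files(zip_path, filenames):
    """Write filenames into zip_path, then delete the originals."""
    # ZIP_DEFLATED compresses the entries; the default (ZIP_STORED) only stores them.
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for fn in filenames:
            # store each entry under its bare name, without directories
            zf.write(fn, arcname=os.path.basename(fn))
    # The with-block has closed the archive; only now remove the originals.
    for fn in filenames:
        os.remove(fn)
```

If writing the archive fails halfway, the exception leaves the with-block before any file is removed, so no JSON file is lost.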

Note: each run archives only the oldest complete day, so if you have a backlog of multiple days, you have to run the script multiple times!
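
One way to clear a backlog in a single run would be to group the files by date first and then archive each group in turn. The sketch below shows only the grouping step. It assumes the first run of eight digits starting with 2 is the date, and the name group_by_date is hypothetical.

```python
import re
from collections import defaultdict

def group_by_date(filenames, today):
    """Map each YYYYMMDD date (except today) to its sorted list of JSON files."""
    groups = defaultdict(list)
    for fn in filenames:
        if not fn.endswith('.json'):
            continue
        # assumption: the first 8-digit run starting with 2 is the date
        m = re.search(r'2[0-9]{7}', fn)
        if m is None or m.group(0) == today:
            # no date found, or today's files (the day may not be complete)
            continue
        groups[m.group(0)].append(fn)
    return {date: sorted(fns) for date, fns in groups.items()}
```

Iterating over the sorted keys of the result gives the dates oldest first, and each list of files can then be zipped the same way the script handles a single day.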

 
Notes by Willem Moors. Generated on momo:/home/willem/sync/20151223_datamungingninja/pythonbook at 2019-07-31 19:22