The Python Book

frequency count
20160418

Use collections.Counter to count how often each word occurs in a text.

``````import collections

ln='''
The electrical and thermal conductivities of metals originate from
the fact that their outer electrons are delocalized. This situation
can be visualized by seeing the atomic structure of a metal as a
collection of atoms embedded in a sea of highly mobile electrons. The
electrical conductivity, as well as the electrons' contribution to
the heat capacity and heat conductivity of metals can be calculated
from the free electron model, which does not take into account the
detailed structure of the ion lattice.
When considering the electronic band structure and binding energy of
a metal, it is necessary to take into account the positive potential
caused by the specific arrangement of the ion cores - which is
periodic in crystals. The most important consequence of the periodic
potential is the formation of a small band gap at the boundary of the
Brillouin zone. Mathematically, the potential of the ion cores can be
treated by various models, the simplest being the nearly free
electron model.'''``````

Split the text into words:

``words=ln.lower().split()``
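One caveat: ``str.split()`` splits on whitespace only, so punctuation stays attached to the neighbouring word, and ``'metals,'`` would be counted separately from ``'metals'``. A minimal sketch of one way around that, assuming a word is a run of letters, digits or apostrophes. The ``tokenize`` helper and the ``sample`` sentence are illustrative names and are not part of the original notes:

```python
import re

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep only runs of
    # letters, digits and apostrophes, so other punctuation is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "Metals conduct heat. In metals, the electrons' sea is mobile."
print(sample.lower().split())   # punctuation stays attached to the words
print(tokenize(sample))         # punctuation removed
```

Whether apostrophes should be kept is a judgment call; removing ``'`` from the character class would turn ``electrons'`` into ``electrons``.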

Create a Counter:

``ctr=collections.Counter(words)``
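A ``Counter`` is a ``dict`` subclass that maps each item to its count, so it can be queried like a dictionary, and a missing key yields 0 instead of raising ``KeyError``. A small sketch on a made-up word list (``demo`` is an illustrative name, not the text above):

```python
import collections

demo = collections.Counter("the sea of the electrons".split())
print(demo["the"])          # 2
print(demo["lattice"])      # 0: missing keys count as zero
demo.update(["sea", "sea"]) # add more observations in place
print(demo["sea"])          # 3
print(sum(demo.values()))   # 7: total number of words counted
```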

Most frequent:

``````ctr.most_common(10)

[('the', 22),
('of', 12),
('a', 5),
('be', 3),
('by', 3),
('ion', 3),
('can', 3),
('and', 3),
('is', 3),
('as', 3)]``````
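``most_common`` returns plain ``(word, count)`` tuples, so turning counts into relative frequencies is a short extra step. A sketch on a toy list, where ``toy``, ``total`` and ``shares`` are illustrative names:

```python
import collections

toy = collections.Counter(["the", "of", "the", "a", "the", "of"])
total = sum(toy.values())   # number of words counted
# Pair each of the two most frequent words with its share of all words.
shares = [(word, count / total) for word, count in toy.most_common(2)]
print(shares)
```

Here ``the`` accounts for half of the six words and ``of`` for a third.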

## Alternative: pandas df['col'].value_counts()

``````import re
import pandas as pd

def removePunctuation(line):
    # Keep only letters, digits and spaces, collapse runs of whitespace, then trim and lowercase.
    return re.sub(r"\s+", " ", re.sub(r"[^a-zA-Z0-9 ]", "", line)).strip().lower()

df=pd.DataFrame( [ removePunctuation(word.lower()) for word in ln.split() ], columns=['word'])
df['word'].value_counts()``````

Result:

``````the             22
of              12
a                5
and              3
by               3
as               3
ion              3
..
..``````

Notes by Willem Moors. Generated on momo:/home/willem/sync/20151223_datamungingninja/pythonbook at 2019-07-31 19:22