An Overview of Python¶

In [1]:
import numpy as np
# define data variables
years = range(1990,2023)
countries = ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovak Republic", "Slovenia", "Spain", "Sweden", "United Kingdom",]
refugees = np.loadtxt("data/refugees-europe_0.txt") # data from World Bank

More Containers¶

So far we have used the built-in container list. Additionally, we used arrays from numpy. Next we will look at some additional built-in containers. Choosing the right container can simplify a problem considerably. And of course you can combine containers to get more complex data structures.

But first, I would like to introduce a very useful shorthand and a caveat with arrays:

List comprehension¶

In [2]:
[x**2 for x in range(10)]  #  if x % 3 == 0
Out[2]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

List comprehension is a one-line for-loop creating a list. It can optionally contain an if condition.
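
As a minimal sketch, the comprehension above with the commented-out condition enabled should be equivalent to an explicit loop. The variable names `squares_loop` and `squares_comp` are chosen for illustration only:

```python
# Explicit for-loop collecting the squares of all multiples of 3 below 10
squares_loop = []
for x in range(10):
    if x % 3 == 0:
        squares_loop.append(x**2)

# The same result written as a list comprehension with an if condition
squares_comp = [x**2 for x in range(10) if x % 3 == 0]

print(squares_loop)
print(squares_comp)
```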

Comparing Arrays¶

Arrays are compared element-wise

In [3]:
a = np.array([1,2,3])
b = np.array([1,2,3])
a == b
Out[3]:
array([ True,  True,  True])

This means the result of a comparison cannot be used directly in a condition:

In [4]:
if a == b:
    print("equal")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 if a == b:
      2     print("equal")

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In [5]:
if (a == b).all():
    print("equal")
equal
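
The following sketch contrasts `.all()` with `.any()` on two arrays that differ in one element, and also shows `np.array_equal`, which compares shape and content in one call. The arrays `a` and `c` are made up for this illustration:

```python
import numpy as np

a = np.array([1, 2, 3])
c = np.array([1, 2, 4])

all_equal = bool((a == c).all())   # True only if every element matches
any_equal = bool((a == c).any())   # True if at least one element matches
same = np.array_equal(a, c)        # compares shape and all elements at once

print(all_equal, any_equal, same)
```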

Tuples¶

Sometimes you want to return several values from a function. This can be done by listing them comma-separated after return. Python will then create a tuple.

In [6]:
def country_stats(data):
    min_ = data.min()
    max_ = data.max()
    mean = data.mean()
    std  = data.std()
    return min_, max_, round(mean, 1), round(std, 1)

ret = country_stats(refugees[10])
type(ret)
Out[6]:
tuple
In [7]:
ret[1]
Out[7]:
2075445.0

Tuples are similar to lists, but once created they cannot be changed. Tuples can be created by listing their elements separated by commas (with or without round brackets):

In [8]:
my_tuple = (1,"foo",1)
print(my_tuple)
(1, 'foo', 1)
In [9]:
my_tuple[1] = 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 my_tuple[1] = 2

TypeError: 'tuple' object does not support item assignment

For a tuple with one element only, a comma is mandatory

In [10]:
one_tuple = 42,
print(one_tuple)
(42,)

Like lists, tuples can be unpacked in one go:

In [11]:
a, b, c = my_tuple
print(b)
foo
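
Unpacking is particularly handy for functions that return several values. The sketch below uses a hypothetical helper `min_max` (not part of the notebook) and also shows that swapping two variables relies on the same mechanism:

```python
def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)

# Unpack the returned tuple directly into two names
lowest, highest = min_max([3, 1, 4, 1, 5])
print(lowest, highest)

# Swapping two variables builds a tuple on the right and unpacks it on the left
lowest, highest = highest, lowest
print(lowest, highest)
```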

When comparing two tuples, Python will compare one position after the other, using the specified comparison operator:

In [12]:
(1,2,3) < (1,2,4)
Out[12]:
True

Note: This will fail if the tuples contain numpy arrays, because comparing two arrays yields an array instead of a single truth value.

In [13]:
(1,2,np.array([1,2])) < (1,2,np.array([1,2]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 (1,2,np.array([1,2])) < (1,2,np.array([1,2]))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Sets¶

While a tuple is an ordered collection and allows repetitions, a set is an unordered collection of unique elements

In [14]:
set([1,1,2,3,5])
Out[14]:
{1, 2, 3, 5}
In [15]:
s = {5,1,2,3,1,5}
print(s)
{1, 2, 3, 5}
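
Sets also support fast membership tests and the usual set operations. A short sketch with made-up example sets (`visited` and `planned` are illustrative names):

```python
visited = {"Austria", "Belgium", "Spain"}
planned = {"Spain", "Sweden"}

print("Spain" in visited)          # membership test
both = visited & planned           # intersection
either = visited | planned         # union
only_visited = visited - planned   # difference

# Removing duplicates from a list (the original order is not guaranteed)
unique = set([1, 1, 2, 3, 5])
print(sorted(unique))
```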

Dictionaries¶

Dictionaries are similar to lists as well, but instead of continuously numbering the entries you can define your own keys.

In [16]:
def country_stats(data):
    res = {}
    res["std"]  = data.std()
    res["min"] = data.min()
    res["max"] = data.max()
    res["mean"] = data.mean()
    return res

country_stats(refugees[10])
Out[16]:
{'std': 377344.61433153594,
 'min': 187545.0,
 'max': 2075445.0,
 'mean': 900264.7878787878}

Note: In older versions of Python (before 3.7) dictionaries do not preserve the order of the elements from assignment.

Accessing an element in a dict works as expected:

In [17]:
def country_stats(data):
    res = {}
    res["min"] = data.min()
    res["max"] = data.max()
    res["mean"] = data.mean()
    res["std"]  = data.std()
    return res

stats = country_stats(refugees[10])
stats["mean"]
Out[17]:
900264.7878787878

The help shows many ways to define a dictionary:

In [18]:
dict?
In [19]:
dict([(1,2), (3,4)])
Out[19]:
{1: 2, 3: 4}
In [20]:
dict(a=1, b=2)
Out[20]:
{'a': 1, 'b': 2}

Dict comprehension exists as well:

In [21]:
{k: v for k, v in enumerate("abc")}
Out[21]:
{0: 'a', 1: 'b', 2: 'c'}

A key in a dictionary can be almost anything (anything that is hashable)

In [22]:
d = {range(5): "this is a range object", print: "the print function"}
In [23]:
d[print]
Out[23]:
'the print function'
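
To illustrate what "hashable" means in practice: immutable objects such as tuples of numbers or strings can serve as keys, while mutable containers such as lists cannot. The dictionary `labels` below is a made-up example:

```python
# Tuples of immutable values are hashable and work as keys
labels = {(0, 1): "row 0, column 1"}
print(labels[(0, 1)])

# Lists are mutable and therefore not hashable
try:
    bad = {[0, 1]: "row 0, column 1"}
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```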

Looping Over Dicts¶

The naive for-loop on a dict iterates over the dict keys:

In [24]:
for stat in stats:
    print(stat)
min
max
mean
std

However, dicts offer different methods to access the keys, the values or both when iterating:

In [25]:
for stat in stats.values():
    print(stat)
187545.0
2075445.0
900264.7878787878
377344.61433153594
In [26]:
for stat_name, stat_value in stats.items():
    print(stat_name, int(stat_value))
min 187545
max 2075445
mean 900264
std 377344

As always, tab-completion is very useful:

In [27]:
#stats.

More on Functions¶

optional- and keyword-arguments¶

We can specify default values for the parameters of a function. This turns them into optional arguments. Optional arguments can be passed either based on their position or on their keyword.

Keyword-arguments:

  • must be defined and passed after the positional arguments
  • their order is irrelevant
  • must use an existing parameter-name
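
The rules above can be tried out with a small sketch. The function `describe` and its parameters are hypothetical and exist only for this illustration:

```python
def describe(name, unit="people", scale=1):
    """Build a short description string from one required and two optional arguments."""
    return f"{name}: scale={scale}, unit={unit}"

print(describe("Austria"))                         # defaults only
print(describe("Austria", "refugees"))             # optional argument passed by position
print(describe("Austria", scale=1000, unit="k"))   # keywords in any order

try:
    describe("Austria", colour="red")              # unknown parameter name
except TypeError as err:
    print("TypeError:", err)
```
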
In [28]:
import numpy as np
import matplotlib.pyplot as plt

def plot_refugees(years, data, country, color="r", ylabel="refugees"):
    # create axes object
    figure = plt.figure()
    axes = figure.subplots()
    # plot the data
    axes.plot(years, data, color=color)
    # add labels and title
    axes.set_xlabel("year")
    axes.set_ylabel(ylabel)
    axes.set_title("Refugees in "+country)


# define data variables
years = range(1990,2023)
countries = ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovak Republic", "Slovenia", "Spain", "Sweden", "United Kingdom",]
refugees = np.loadtxt("data/refugees-europe_1.csv", delimiter=",") # data from World Bank

means = refugees.mean(axis=1)
for row in range(len(countries))[:4]:
    if means[row] > 10000:
        plot_refugees(years, country=countries[row], data=refugees[row], ylabel="number", color="b")

# display the plot
plt.show()

Important: You should not use mutable objects as default arguments. They are created once, when the function is defined, and are not reset for individual calls.

In [29]:
def append_one(list_=[]):
    """ Do not use mutable default values """
    list_.append(1)
    return list_

my_list = []
print(append_one(my_list))
print(append_one(my_list))
my_list = append_one()
print(my_list)
print(append_one(my_list))
my_list = append_one()
print(my_list)
[1]
[1, 1]
[1]
[1, 1]
[1, 1, 1]
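
A common way around this problem is to use None as the default and create the list inside the function. The following is a sketch of that idiom, reusing the name `append_one` from the cell above:

```python
def append_one(list_=None):
    """Append 1 to list_, creating a fresh list when none is given."""
    if list_ is None:
        list_ = []   # a new list is created on every call
    list_.append(1)
    return list_

first = append_one()
second = append_one()
print(first, second)
```

With this version, each call without an argument starts from its own empty list.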

Arbitrary Argument Lists¶

In addition to listing individual parameters, you can also tell Python to accept all arguments and/or keyword arguments:

In [30]:
def i_want_it_all(*args, **kwargs):
    print(args)
    print(kwargs)
    
i_want_it_all(1,2,3,foo="so long", bar="and thanks for all the fish")
(1, 2, 3)
{'foo': 'so long', 'bar': 'and thanks for all the fish'}

As you see, *args creates a tuple of all passed positional arguments, while **kwargs creates a dict. You can use the same syntax to pass tuples and dicts to the function when calling it:

In [31]:
i_want_it_all(*("a", "b"), **{"robot": "marvin"})
('a', 'b')
{'robot': 'marvin'}
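
A typical use of this syntax is a wrapper that passes whatever it receives on to another function. The helper `logged` below is a hypothetical example, not part of the notebook:

```python
def logged(func, *args, **kwargs):
    """Call func with the given arguments and report what was passed."""
    print(f"calling {func.__name__} with {args} and {kwargs}")
    return func(*args, **kwargs)

# The positional argument 3.14159 and the keyword ndigits=2 are forwarded to round
result = logged(round, 3.14159, ndigits=2)
print(result)
```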

Docstrings¶

When defining functions, you should always document them as well. Documentation will help others and your future self.

One very nice feature of Python is the ability to access documentation quickly. A docstring is a string given directly after the function definition. Python will display that string when you ask for help on the function.

In [32]:
def country_stats(data):
    """Calculate min, max, mean and standard deviation from data
    
    Arguments
    ---------
    data : numpy-array
        data to calculate statistics on 
        
    Returns
    -------
    stats : dict
        dictionary containing min, max, mean and standard deviation
    
    """
    res = {}
    res["min"] = data.min()
    res["max"] = data.max()
    res["mean"] = data.mean()
    res["std"]  = data.std()
    return res

country_stats?
In [33]:
np.abs?
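
The trailing question mark is a shortcut provided by IPython and Jupyter. In plain Python, the same docstring can be reached with the built-in help function or the `__doc__` attribute. A small sketch, using a made-up function `double`:

```python
def double(x):
    """Return twice the value of x."""
    return 2 * x

print(double.__doc__)   # the raw docstring
help(double)            # formatted help text, works in any Python shell
```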

More on Loops¶

We looked at the for loop in the first part. We will add a second loop type (the while loop) and look at more tricks available for both loop types.

Break and continue¶

Break and continue are keywords to interrupt the normal execution of loops. They are quite similar, so it is important to understand the difference.

break exits the loop completely and continues execution after the loop.

continue skips the rest of this iteration and moves on to the next round.

In [34]:
for i in range(30):
    if i % 2 == 0:  # % is modulo division -> i is even
        continue
    elif i > 20:
        break
    print(i)

print("We are done.")
1
3
5
7
9
11
13
15
17
19
We are done.

else¶

Once you use break, you might want to execute code only if your loop terminates normally. A typical example is when you search for something in a list:

In [35]:
list_ = ["hay", "hay", "hay", "hay", "hay", "needle", "hay", "hay", "hay"]
#list_ = ["hay", "hay", "hay", "hay", "hay", "hay", "hay", "hay"]
for value in list_:
    if value == "needle":
        print("found")
        break
    
found

You could add an additional variable to track if you found the needle:

In [36]:
list_ = ["hay", "hay", "hay", "hay", "hay", "needle", "hay", "hay", "hay"]
list_ = ["hay", "hay", "hay", "hay", "hay", "hay", "hay", "hay"]
found = False
for value in list_:
    if value == "needle":
        print("found")
        found = True
        break

if not found:
    print("not found")
not found

Using else, no additional variable is necessary. The else block of a loop runs only if the loop finished without hitting break:

In [37]:
list_ = ["hay", "hay", "hay", "hay", "hay", "needle", "hay", "hay", "hay"]
#list_ = ["hay", "hay", "hay", "hay", "hay", "hay", "hay", "hay"]
for value in list_:
    if value == "needle":
        print("found")
        break
else:
    print("not found")
found
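
To see both outcomes without editing the list by hand, the same pattern can be wrapped in a function. The function `search` is a hypothetical helper introduced only for this sketch:

```python
def search(items, target):
    """Return "found" or "not found" using a for loop with an else block."""
    for value in items:
        if value == target:
            result = "found"
            break
    else:
        # only reached when the loop ended without break
        result = "not found"
    return result

print(search(["hay", "needle", "hay"], "needle"))
print(search(["hay", "hay"], "needle"))
print(search([], "needle"))   # an empty list also ends up in the else block
```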

while¶

We can use a for loop to iterate over a given list of items (or with range for a fixed number of iterations). However in many cases we do not know how many iterations we will need, but we know what condition we want to reach. In this case we use a while-loop:

In [38]:
value = 1
while value < 1000:
    print(value)
    value = value * 2
1
2
4
8
16
32
64
128
256
512

Again you can use break, continue and else

In [39]:
tries = 0
while tries < 3:
    pin = input("Enter PIN: ")
    if pin == "":
        continue
    if pin == "123":
        break
    tries = tries + 1
else:
    print("Too many bad tries, terminating!")
    #import sys
    #sys.exit()

print("PIN accepted, welcome back!")    

A note on security¶

The code above is of course a bad example: You should not store passwords in cleartext. Instead you should use a specialised password-hash function to store a salted and hashed value.

In [40]:
from passlib.hash import argon2

tries = 0
while tries < 3:
    pin = input("Enter PIN: ")
    if pin == "":
        continue
    if argon2.verify(pin, '$argon2id$v=19$m=102400,t=2,p=8$s9a6F4Jwbg0BgBCCsLY2Rg$cpl1xrY0GAc8tAxkFmXN/A'):
        break
    tries = tries + 1
else:
    print("Too many bad tries, terminating!")
    #import sys
    #sys.exit()

print("PIN accepted, welcome back!")    
In [41]:
#argon2.hash("123")
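
If passlib is not available, the idea of storing a salted hash can also be sketched with the standard library alone. Note that this swaps argon2 for PBKDF2 (`hashlib.pbkdf2_hmac`), which is a different algorithm. The helper names `hash_pin` and `verify_pin` and the iteration count are assumptions made for this illustration, not a recommendation:

```python
import hashlib
import hmac
import os

def hash_pin(pin, salt=None, iterations=200_000):
    """Return (salt, digest) for a PIN using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)   # a fresh random salt per stored PIN
    digest = hashlib.pbkdf2_hmac("sha256", pin.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_pin(pin, salt, expected_digest, iterations=200_000):
    """Check a PIN against a stored salt and digest."""
    _, digest = hash_pin(pin, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)   # constant-time comparison

# Only the salt and the digest are stored, never the PIN itself
stored_salt, stored_digest = hash_pin("123")
print(verify_pin("123", stored_salt, stored_digest))
print(verify_pin("124", stored_salt, stored_digest))
```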

More Errors¶

Understanding Errors¶

Let's look at some typical error messages in more detail.

In [42]:
import matplotlib.pyplot as plt
plt.plot([1,2,3],[1])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[42], line 2
      1 import matplotlib.pyplot as plt
----> 2 plt.plot([1,2,3],[1])

File /usr/lib/python3/dist-packages/matplotlib/pyplot.py:3590, in plot(scalex, scaley, data, *args, **kwargs)
   3582 @_copy_docstring_and_deprecators(Axes.plot)
   3583 def plot(
   3584     *args: float | ArrayLike | str,
   (...)
   3588     **kwargs,
   3589 ) -> list[Line2D]:
-> 3590     return gca().plot(
   3591         *args,
   3592         scalex=scalex,
   3593         scaley=scaley,
   3594         **({"data": data} if data is not None else {}),
   3595         **kwargs,
   3596     )

File /usr/lib/python3/dist-packages/matplotlib/axes/_axes.py:1724, in Axes.plot(self, scalex, scaley, data, *args, **kwargs)
   1481 """
   1482 Plot y versus x as lines and/or markers.
   1483 
   (...)
   1721 (``'green'``) or hex strings (``'#008000'``).
   1722 """
   1723 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1724 lines = [*self._get_lines(self, *args, data=data, **kwargs)]
   1725 for line in lines:
   1726     self.add_line(line)

File /usr/lib/python3/dist-packages/matplotlib/axes/_base.py:303, in _process_plot_var_args.__call__(self, axes, data, *args, **kwargs)
    301     this += args[0],
    302     args = args[1:]
--> 303 yield from self._plot_args(
    304     axes, this, kwargs, ambiguous_fmt_datakey=ambiguous_fmt_datakey)

File /usr/lib/python3/dist-packages/matplotlib/axes/_base.py:499, in _process_plot_var_args._plot_args(self, axes, tup, kwargs, return_kwargs, ambiguous_fmt_datakey)
    496     axes.yaxis.update_units(y)
    498 if x.shape[0] != y.shape[0]:
--> 499     raise ValueError(f"x and y must have same first dimension, but "
    500                      f"have shapes {x.shape} and {y.shape}")
    501 if x.ndim > 2 or y.ndim > 2:
    502     raise ValueError(f"x and y can be no greater than 2D, but have "
    503                      f"shapes {x.shape} and {y.shape}")

ValueError: x and y must have same first dimension, but have shapes (3,) and (1,)

As explained in the introduction, we can learn a lot from this output:

  • The traceback tells us that the error occurred deep inside the matplotlib module (5 function calls)
  • More precisely the error happened when matplotlib wanted to read the x and y coordinates.
  • But the top of the traceback also tells us which line of our code caused the problem.
  • This is a ValueError, one of the values we passed is probably wrong.
  • The final line tells us that we have a problem with the dimensions of our x and y lists.

Common Errors when Passing Values¶

In addition to the ValueError above, there is also a TypeError, indicating that we passed the wrong type:

In [43]:
int(["1"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[43], line 1
----> 1 int(["1"])

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

If you screw up when accessing an element in a container, you usually get an IndexError or a KeyError:

In [44]:
l = [1,2,3]
l[3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[44], line 2
      1 l = [1,2,3]
----> 2 l[3]

IndexError: list index out of range
In [45]:
d = {"a" : 1}
d["b"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[45], line 2
      1 d = {"a" : 1}
----> 2 d["b"]

KeyError: 'b'
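
Both errors can often be avoided by checking before accessing. The sketch below uses the `in` operator and the dict method `get`, which returns None (or a default of your choice) instead of raising a KeyError:

```python
d = {"a": 1}

print("b" in d)          # check before accessing
print(d.get("b"))        # returns None instead of raising KeyError
print(d.get("b", 0))     # or a default value of your choice

l = [1, 2, 3]
index = 3
if index < len(l):
    print(l[index])
else:
    print("index out of range")
```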

Common Errors in Syntax¶

In [46]:
123a
  Cell In[46], line 1
    123a
      ^
SyntaxError: invalid decimal literal

SyntaxError means that your words do not make any sense...

In [47]:
print("Hallo welt"
a = 123
  Cell In[47], line 1
    print("Hallo welt"
         ^
SyntaxError: '(' was never closed
In [48]:
if True:
print("Better")
    print("Faster")
  Cell In[48], line 2
    print("Better")
    ^
IndentationError: expected an indented block after 'if' statement on line 1

IndentationErrors show up when you forget to indent or mess up the alignment of your code.

Note: Do not mix Tabs and Spaces

In [49]:
if True:
    print("Better")
    print("Faster")
	print("Stronger")
  File <string>:4
    print("Stronger")
                     ^
TabError: inconsistent use of tabs and spaces in indentation
In [51]:
variable_with_a_complicated_name = 1
print(variable_with_complicated_name)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[51], line 2
      1 variable_with_a_complicated_name = 1
----> 2 print(variable_with_complicated_name)

NameError: name 'variable_with_complicated_name' is not defined

A NameError is raised when you use a name that Python does not know about. Often this means you made a typo. (To avoid these, use tab-completion.) Otherwise you may simply have forgotten to define a variable or function.

In [52]:
function_we_never_defined()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[52], line 1
----> 1 function_we_never_defined()

NameError: name 'function_we_never_defined' is not defined

When accessing attributes from an object, the same problem usually triggers an AttributeError

In [53]:
my_dict = {}
my_dict.append(1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[53], line 2
      1 my_dict = {}
----> 2 my_dict.append(1)

AttributeError: 'dict' object has no attribute 'append'

Handling Errors¶

In some cases, you might expect an error and know how to handle it. A typical example is when handling user input:

In [54]:
int(input("Enter an integer: "))
In [55]:
try:
    print(int(input("Enter an integer: ")))
except ValueError:
    print("This is not an integer")

Hints:

  • The easiest way to determine which error to catch is to trigger it.
  • Try to keep the code inside the try-block minimal. (Otherwise the chances for errors in there rise)
  • You can have several except clauses following one try block.
In [56]:
input_ = input("Enter an integer: ")
try:
    print(1/int(input_))
except ValueError:
    print("This is not an integer")
except ZeroDivisionError:
    print("Division by Zero")
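
Because input needs an interactive frontend, the same logic can be tried out by passing strings directly. The function `safe_reciprocal` below is a hypothetical helper written for this sketch:

```python
def safe_reciprocal(text):
    """Return 1/int(text), or a message describing what went wrong."""
    try:
        return 1 / int(text)
    except ValueError:
        return "This is not an integer"
    except ZeroDivisionError:
        return "Division by Zero"

# One valid input, one that is not an integer, and one that divides by zero
for text in ["4", "four", "0"]:
    print(repr(text), "->", safe_reciprocal(text))
```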

If you have a block of code you want to execute even if an error occurred, you can use finally:

In [57]:
input_ = input("Enter an integer: ")
try:
    print(int(input_))
finally:
    print("Goodbye")
    

This is most useful when you need to clean up something (e.g. close an open file or disconnect from a database).
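
The following sketch shows that a finally block runs both when the code succeeds and when an error passes through it. The list `events` and the function `convert` are made up for this illustration and stand in for real clean-up work:

```python
events = []

def convert(text):
    """Convert text to an integer and record a clean-up step in every case."""
    try:
        return int(text)
    finally:
        events.append(f"cleanup after {text!r}")   # runs with or without an error

print(convert("42"))
try:
    convert("forty-two")
except ValueError:
    print("conversion failed")
print(events)
```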

Reading and Writing Files¶

So far we used numpy.loadtxt to read files. This is a good choice for simple data files. In addition we will look at other modules for reading data from files later. Here I want to introduce the most basic file access available in Python directly.

The concept to keep in mind when working with files directly is that a file is a long chain of symbols. When you open it, the operating system puts a pointer onto the first symbol and then advances the pointer as you read the file.

In [58]:
f = open("data/countries-europe.csv")
f.read(1)
Out[58]:
'#'
In [59]:
for _ in range(6):
    print(f.read(1))
p
y
t
h
o
n

If you repeat the last cell it will keep reading new symbols until you reach the end of the file. Then it will keep returning empty strings, as the pointer is now stuck at the end of the file.

If you want to change the position of the pointer you can use seek. A common use case is to reset the pointer to the beginning of a file without opening the file again:

In [60]:
f.seek(0)
for _ in range(7):
    print(f.read(1))
#
p
y
t
h
o
n

In most cases you are probably not interested in individual symbols but in full lines. Python provides two useful functions for this:

  • readline() to read until the next line break.
  • readlines() to read all remaining lines in a file, store as list
In [61]:
f.readline()
Out[61]:
' row index,country\n'
In [62]:
f.readlines()
Out[62]:
['0,Austria\n',
 '1,Belgium\n',
 '2,Bulgaria\n',
 '3,Croatia\n',
 '4,Cyprus\n',
 '5,Czechia\n',
 '6,Denmark\n',
 '7,Estonia\n',
 '8,Finland\n',
 '9,France\n',
 '10,Germany\n',
 '11,Greece\n',
 '12,Hungary\n',
 '13,Ireland\n',
 '14,Italy\n',
 '15,Latvia\n',
 '16,Lithuania\n',
 '17,Luxembourg\n',
 '18,Malta\n',
 '19,Netherlands\n',
 '20,Poland\n',
 '21,Portugal\n',
 '22,Romania\n',
 '23,Slovak Republic\n',
 '24,Slovenia\n',
 '25,Spain\n',
 '26,Sweden\n',
 '27,United Kingdom\n']

readlines is usually the easiest way to work with files. However for very large files this needs a lot of memory. Using readline can avoid this problem.
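
A file object can also be used directly in a for loop, which reads one line at a time and therefore keeps memory use low. The sketch below writes a small temporary file first so that it does not depend on the course data; the file content and the name `countries_read` are assumptions made for this example:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("# row index,country\n0,Austria\n1,Belgium\n")
    path = tmp.name

countries_read = []
f = open(path)
for line in f:                 # reads one line at a time
    if line.startswith("#"):
        continue               # skip the comment line
    countries_read.append(line.strip().split(",")[1])
f.close()
os.remove(path)                # remove the temporary file again

print(countries_read)
```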

Once you are done working with a file you should close it

In [63]:
f.close()

Using a Context Manager¶

As mentioned in the last section, we can wrap our file handling code in try...finally... if we want to make sure the file is closed properly even in case of an error.

However, since this is a very common pattern, there is an even easier solution:

In [64]:
with open("data/countries-europe.csv") as f:
    lines = f.readlines()
print(lines[4])
#f.seek(0)
3,Croatia

with uses a Context Manager to execute a block of code. A context manager defines try...except...finally... blocks and wraps them around a given code-block.
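
To get a feeling for what happens behind the scenes, you can write a small context manager yourself. The sketch below uses `contextlib.contextmanager` from the standard library; the names `announced` and `log` are made up for this illustration:

```python
from contextlib import contextmanager

log = []

@contextmanager
def announced(name):
    """Record when the with-block is entered and left."""
    log.append(f"enter {name}")
    try:
        yield name                   # the with-block runs here
    finally:
        log.append(f"exit {name}")   # executed even if the block raises

try:
    with announced("demo") as label:
        log.append(f"working in {label}")
        raise RuntimeError("something went wrong")
except RuntimeError:
    log.append("error handled")

print(log)
```

The exit entry appears before the error is handled, which is the behaviour that makes with suitable for closing files.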

Writing Files¶

Writing a file is not much more difficult than reading it:

In [65]:
with open("data/tmp.txt", "w") as f:
    for i in range(10):
        f.write("Yes!\n")
with open("data/tmp.txt", "a") as f:
    f.write("NO!\n")
In [66]:
!cat data/tmp.txt
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
NO!

The second argument to open defines the mode how the file should be opened. The default is for reading text, the most common other modes are:

  • w for writing a file (deleting content if the file exists)
  • a for appending to a file

For more details, see the documentation.

Strings in Depth¶

Strings involve several more advanced topics as well.

String functions¶

Strings come with a large number of helpful functions: https://docs.python.org/3/library/stdtypes.html#string-methods

  • you can search and replace in the string
  • you can split the string on given characters
  • you can strip characters from its ends
  • you can change the case
  • ...
In [67]:
"This is an Example String".find("is") #vs " is "; vs `in`
Out[67]:
2
In [68]:
"This is an Example String".replace("Example", "Demo")
Out[68]:
'This is an Demo String'
In [69]:
"This is an Example String".split() # vs. split("a")
Out[69]:
['This', 'is', 'an', 'Example', 'String']
In [70]:
print("    This is an Example String   \n".strip())
This is an Example String
In [71]:
"this and that".strip("t") # vs lstrip, rstrip
Out[71]:
'his and tha'
In [72]:
"This is an Example String".lower() # upper, title
Out[72]:
'this is an example string'
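
These methods are often combined, for example to take apart a line read from a file. A short sketch, using a line in the format of the countries file shown earlier:

```python
line = "23,Slovak Republic\n"

# strip removes the trailing line break, split separates the two columns
index_text, country = line.strip().split(",")
index = int(index_text)

print(index, country)
print(country.upper())
print(country.replace(" ", "_").lower())
print("Republic" in country, country.find("Republic"))
```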

Encoding¶

Internally a computer stores strings as a sequence of numbers. What letters the individual numbers represent is defined by the encoding. Unfortunately the information what encoding was used is often not provided with the string. This can cause a lot of annoying problems.

Python 3 strings are sequences of Unicode characters, which avoids many problems. However, input you receive might still have been stored in some other encoding. A wrong encoding can result in strange characters being displayed or in error messages.

Different operating systems use different default encodings for text files and Python tries the default encoding when opening a file.

In [73]:
with open("data/preamble.txt") as file_: # , encoding="iso-8859-1"
    for l in file_.readlines()[:5]:
        print(l)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[73], line 2
      1 with open("data/preamble.txt") as file_: # , encoding="iso-8859-1"
----> 2     for l in file_.readlines()[:5]:
      3         print(l)

File /usr/lib/python3.12/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 2: invalid continuation byte

When reading data with a different encoding you need to tell Python what encoding your input uses. Python will then convert it to a regular (Unicode) string.

If for some reason you need to output text in some other encoding, you will have to convert again. Luckily this is trivial.

In [74]:
"Grüäzi".encode("iso-8859-1")
Out[74]:
b'Gr\xfc\xe4zi'

str.encode returns a bytes object. This is similar to a string, but not considered text by Python. Instead, bytes objects behave like immutable sequences of integers, but are displayed in an unusual way:

  • integers representing "useful" symbols in ASCII (American Standard Code for Information Interchange) encoding, are displayed as that symbol
  • all other integers are given in HEX, prepended with \x

(If this did not make much sense to you, don't worry: you are not missing much.)

In [75]:
bytes.fromhex('00 01 02 03 41 61 30 2B')
Out[75]:
b'\x00\x01\x02\x03Aa0+'

To turn bytes into strings, you can use decode:

In [76]:
"Grüäzi".encode("iso-8859-1").decode("iso-8859-1")
Out[76]:
'Grüäzi'
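
Decoding with the wrong encoding is what triggers the UnicodeDecodeError seen above. As a sketch, the `errors` argument of decode can be used to replace undecodable bytes instead of raising an error, although the original characters are then lost:

```python
data = "Grüäzi".encode("iso-8859-1")

try:
    data.decode("utf-8")   # wrong encoding for these bytes
except UnicodeDecodeError as err:
    print("UnicodeDecodeError:", err.reason)

# errors="replace" substitutes undecodable bytes with U+FFFD instead of raising
replaced = data.decode("utf-8", errors="replace")
print(replaced)

# The same text needs a different number of bytes in different encodings
print(len("Grüäzi".encode("iso-8859-1")), len("Grüäzi".encode("utf-8")))
```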
In [77]:
## writing files with a different encoding
#with open("data/tmp.txt", "w", encoding="iso-8859-1") as f:
#    f.write("Grüäzi\n")
#    
#with open("data/tmp.txt", "w+b") as f:
#    f.write("Grüäzi\n".encode("iso-8859-1"))

Raw Strings¶

By convention a backslash is used in most string processing to introduce special character combinations.

In [78]:
print("One\n\tTwo")
One
	Two

Of course, it might be that you need such a combination in your string without its special meaning (for example if you want to use LaTeX syntax in your plot-labels). In this case you have two options:

  • you can escape the backslash with an additional backslash
  • you can prefix your string with an r, turning it into a raw-string
In [79]:
print("One\\n\\tTwo")
One\n\tTwo
In [80]:
print(r"One\n\tTwo")
One\n\tTwo
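
As a sketch of the LaTeX use case mentioned above (the label text is made up for illustration), both options produce the same string, but the raw string is easier to read:

```python
plain = "$\\alpha \\times \\beta$"   # every backslash escaped by hand
raw = r"$\alpha \times \beta$"       # raw string: backslashes kept as-is

print(plain == raw)             # True
print(len("\t"), len(r"\t"))    # 1 2
```

The second line shows the difference in length: "\t" is a single tab character, while r"\t" consists of a backslash followed by the letter t.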

String Formatting¶

Sooner rather than later, you will want to construct more complex strings. There are several different options for this. We will only look at the most modern one, f-strings.

We return to our compound interest code and change it such that it prints a nicer message.

In [81]:
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value

balance = 1000
interest = compound_interest(balance, 0.5, 5)
print(f"The interest earned on {balance} CHF over 5 years will be {interest}")
The interest earned on 1000 CHF over 5 years will be 25.25125312812429

An f-string starts with f" (or f' or f""" or ...) and can contain variables enclosed by curly-brackets {}. Python will look for the referenced variables in the current scope:

In [82]:
print(f"At {rate} percent, the interest earned on {balance} CHF over 5 years will be {interest}")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[82], line 1
----> 1 print(f"At {rate} percent, the interest earned on {balance} CHF over 5 years will be {interest}")

NameError: name 'rate' is not defined

Note: rate exists inside the two functions, but not outside of them.

If we insert numbers into a string, we can even specify how they are displayed:

In [83]:
balance = 1000
rate = 0.5
duration = 5
interest = compound_interest(balance, rate, duration)
print(f"At {rate:.2f} percent, the interest earned on \
{balance:.2f} CHF over {duration} years will be {interest:.2f}")
At 0.50 percent, the interest earned on 1000.00 CHF over 5 years will be 25.25

For details on number formatting, have a look at https://docs.python.org/3/library/string.html#formatspec. (str.format works similarly to f-strings, but needs the variables passed explicitly.)

There is a lot more you can do with f-strings, but try to avoid making things too complex.
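
A few more format specifications that tend to be useful, as a short sketch (the variable names and values are made up for illustration):

```python
value = 1234567.891
share = 0.005
count = 42

print(f"{value:,.2f}")   # 1,234,567.89  (thousands separator, 2 decimals)
print(f"{share:.1%}")    # 0.5%          (multiplied by 100, % appended)
print(f"[{count:>6}]")   # [    42]      (right-aligned, width 6)
print(f"[{count:06d}]")  # [000042]      (zero-padded, width 6)

# The same format specification works with str.format
print("{:,.2f}".format(value))  # 1,234,567.89
```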

Command Line Arguments¶

So far we have started all our scripts from the command line with

python3 interest.py

and used input() whenever we wanted to obtain additional input from the user.

Alternatively, we could let the user pass additional input directly on the command line:

python3 interest.py 1000 5 .5

Of course, we need to adapt our program for this to work.

The simplest approach is to use the argument vector (sys.argv) provided by the sys module. The argument vector is a list with the name of the running script at position 0, followed by every other "word" given on the command line.

In [84]:
## this code does not work in a notebook
#import sys
#
#def with_interest(value, rate, years):
#    return value*(1+rate/100)**years
#
#def compound_interest(value, rate, years):
#    return with_interest(value, rate, years) - value
#
# print(sys.argv)
#balance = float(sys.argv[1])
#duration = float(sys.argv[2])
#rate = float(sys.argv[3])
#res = compound_interest(balance, rate, duration)
#print("At {rate} percent, the interest earned on {balance} CHF over {duration} years will be {interest}".format(
#        rate=rate, balance=balance, duration=duration, interest=res))
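
To try this out inside a notebook, one option is to put the logic into a function that takes the argument vector as a parameter. The following is only a sketch; the function name main and the usage message are my own additions:

```python
def with_interest(value, rate, years):
    return value * (1 + rate / 100) ** years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value

def main(argv):
    """Build the output message from an argument vector (hypothetical helper)."""
    if len(argv) != 4:
        # argv[0] is the script name, followed by three values
        return f"usage: {argv[0]} BALANCE YEARS RATE"
    balance, duration, rate = (float(arg) for arg in argv[1:])
    interest = compound_interest(balance, rate, duration)
    return (f"At {rate} percent, the interest earned on {balance} CHF "
            f"over {duration} years will be {interest:.2f}")

# In a script you would call main(sys.argv); here we pass a list by hand.
print(main(["interest.py", "1000", "5", ".5"]))
print(main(["interest.py"]))
```

Note that every entry of the argument vector is a string, which is why the values have to be converted with float first.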

A more elegant solution is to use the argparse module. It lets you create more complex command-line interfaces easily, including type conversion, default values and an automatically generated help message.

In [85]:
#import argparse
#
#def with_interest(value, rate, years):
#    return value*(1+rate/100)**years
#
#def compound_interest(value, rate, years):
#    return with_interest(value, rate, years) - value
#
#parser = argparse.ArgumentParser(description='calculate compound interest')
#parser.add_argument('balance', type=float, help='the starting balance')
#parser.add_argument('duration', type=int, help='the number of years to accumulate interest for')
#parser.add_argument('--rate', type=float, default=0.5, help='the interest rate in percent')
#args = parser.parse_args()
#    
#res = compound_interest(args.balance, args.rate, args.duration)
#print("At {rate} percent, the interest earned on {balance} CHF over {duration} years will be {interest}".format(
#        rate=args.rate, balance=args.balance, duration=args.duration, interest=res))

The documentation of argparse is rather extensive, and it is easy to find examples online.
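
parse_args reads sys.argv by default, but it also accepts an explicit list of arguments. This makes it possible to experiment inside a notebook. A small sketch with the same parser as above:

```python
import argparse

parser = argparse.ArgumentParser(description="calculate compound interest")
parser.add_argument("balance", type=float, help="the starting balance")
parser.add_argument("duration", type=int, help="the number of years")
parser.add_argument("--rate", type=float, default=0.5,
                    help="the interest rate in percent")

# Passing a list instead of relying on sys.argv
args = parser.parse_args(["1000", "5"])
print(args.balance, args.duration, args.rate)   # 1000.0 5 0.5

args = parser.parse_args(["1000", "5", "--rate", "1.5"])
print(args.rate)                                # 1.5

# format_help() returns the text shown for the -h / --help option
print(parser.format_help())
```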

Writing modules¶

Once your code gets bigger, it is a good idea to split it into several smaller files.

The easiest way to do this is to store all files in the same folder. You can then import functions from your other files just as you would from system modules.

In [86]:
!cat interest.py
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value
In [87]:
from interest import compound_interest
In [88]:
compound_interest(1000,.5,10)
Out[88]:
51.140132040789695

It is not much more difficult to organise your files into a folder structure. All you need to do is to create a file __init__.py in each folder of your module. In most cases you can leave __init__.py empty.

In [89]:
from tools.interest import with_interest
In [90]:
with_interest(100,5,1)
Out[90]:
105.0

If a file inside your module folder needs to load another file of the module, it can use the same syntax as you would use outside, giving the absolute module name.

In [91]:
!cat tools/compound.py
from tools.interest import with_interest

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value
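
The folder layout can be reproduced in a self-contained way. The following sketch builds a small package in a temporary directory and imports from it. The package name demo_tools is made up to avoid a clash with the tools folder used above:

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package on the fly; "demo_tools" is a hypothetical name.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_tools")
os.makedirs(package_dir)

files = {
    "__init__.py": "",   # empty, but marks the folder as a package
    "interest.py": (
        "def with_interest(value, rate, years):\n"
        "    return value*(1+rate/100)**years\n"
    ),
    "compound.py": (
        "from demo_tools.interest import with_interest\n"
        "\n"
        "def compound_interest(value, rate, years):\n"
        "    return with_interest(value, rate, years) - value\n"
    ),
}
for name, source in files.items():
    with open(os.path.join(package_dir, name), "w") as f:
        f.write(source)

# Python searches the folders listed in sys.path for modules
sys.path.insert(0, root)
importlib.invalidate_caches()
compound = importlib.import_module("demo_tools.compound")
print(compound.compound_interest(100, 5, 1))
```

In a notebook or script started from the project folder, the sys.path manipulation is usually not needed, because the current folder is searched automatically.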

Importing Scripts¶

You can import from any Python file. However, if you have a script with code in the global scope, that code will be executed on import.

In [92]:
!cat interest_script.py
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value

print("Result:", compound_interest(1000, 0.5, 10))
In [93]:
from interest_script import with_interest
Result: 51.140132040789695

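This is usually not what you want. To avoid it, you can guard the executable part of your script with a special if statement. The variable __name__ is set to "__main__" only when the file is run as a script; on import it holds the module name instead.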

In [94]:
!cat interest_guarded.py
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value

if __name__ == "__main__":
    print("Result:", compound_interest(1000, 0.5, 10))
In [95]:
from interest_guarded import with_interest
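
The effect of the guard can be observed by loading the same file in both ways. In this sketch the file guard_demo.py is made up, and runpy.run_path stands in for starting the file with python3 on the command line:

```python
import contextlib
import importlib
import io
import os
import runpy
import sys
import tempfile

# Write a small guarded script; the file name is hypothetical.
root = tempfile.mkdtemp()
path = os.path.join(root, "guard_demo.py")
with open(path, "w") as f:
    f.write(
        "def double(x):\n"
        "    return 2 * x\n"
        "\n"
        "if __name__ == '__main__':\n"
        "    print('Result:', double(21))\n"
    )

# Importing: __name__ is "guard_demo", so the guarded block is skipped.
sys.path.insert(0, root)
importlib.invalidate_caches()
imported_output = io.StringIO()
with contextlib.redirect_stdout(imported_output):
    guard_demo = importlib.import_module("guard_demo")
print(repr(imported_output.getvalue()))   # ''

# Running as a script: __name__ is "__main__", so the block runs.
script_output = io.StringIO()
with contextlib.redirect_stdout(script_output):
    runpy.run_path(path, run_name="__main__")
print(repr(script_output.getvalue()))     # 'Result: 42\n'
```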

Useful Standard Modules¶

sys¶

We have already seen the sys module. It provides access to some objects used or maintained by the interpreter. One of these (sys.argv) we have seen above. For more details, look at the documentation. (https://docs.python.org/3/library/sys.html)

os¶

Operating system routines for the system you are running on. The os.path functions are especially useful. (https://docs.python.org/3/library/os.html)

In [96]:
import os.path
os.path.join("dir", "filename.py")
Out[96]:
'dir/filename.py'
In [97]:
os.path.basename("dir/filename.py")
Out[97]:
'filename.py'
In [98]:
os.path.dirname("dir/subdir/file.py")
Out[98]:
'dir/subdir'
In [99]:
for path in os.walk("example_tree"):
    print(path)
('example_tree', ['subdir'], ['fileB', 'fileA'])
('example_tree/subdir', [], ['file2', 'file1'])
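
Each item returned by os.walk is a tuple of the current folder, its subfolders and its files. A common pattern is to unpack this tuple and join the parts into full paths. The sketch below builds its own throwaway tree, so all names in it are made up:

```python
import os
import tempfile

# Build a small temporary tree for the example.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "subdir"))
for relative in ["fileA", os.path.join("subdir", "file1.txt")]:
    with open(os.path.join(root, relative), "w") as f:
        f.write("example\n")

found = []
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        full_path = os.path.join(dirpath, filename)
        # store paths relative to root so the output is reproducible
        found.append(os.path.relpath(full_path, root))

print(sorted(found))
print(os.path.splitext("file1.txt"))   # ('file1', '.txt')
```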

You might also want to use glob, which finds all paths matching a shell-style pattern:

In [100]:
import glob

matches = glob.glob("example_tree/**/file*", recursive=True)
print("\n".join(matches))
example_tree/fileB
example_tree/fileA
example_tree/subdir/file2
example_tree/subdir/file1

re¶

Support for regular expressions. Especially useful when you need to parse strings. (https://docs.python.org/3/library/re.html)

In [101]:
import re
# the dot is escaped (\.) so that it only matches a literal decimal point
print(re.findall(r"\d+\.\d{2}", "At 0.50 percent, the interest earned on 1000.00 CHF over 5 years will be 25.25"))
['0.50', '1000.00', '25.25']

Look at the website for many more options and a detailed explanation of the syntax of regular expressions.
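
Two more functions that come up often when parsing strings are re.search with groups and re.sub. A short sketch using the same sentence as above:

```python
import re

text = "At 0.50 percent, the interest earned on 1000.00 CHF over 5 years will be 25.25"

# search() returns the first match; parentheses create groups
match = re.search(r"(\d+\.\d{2}) CHF over (\d+) years", text)
balance = float(match.group(1))
years = int(match.group(2))
print(balance, years)          # 1000.0 5

# sub() replaces every match of the pattern
masked = re.sub(r"\d+\.\d{2}", "X", text)
print(masked)
```

If the pattern does not occur in the text, re.search returns None, so real code should check the result before calling group.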

datetime¶

Support for working with dates and time. (https://docs.python.org/3/library/datetime.html)

  • The most important class is datetime, which represents a date (year, month, day) and a time (hour, minute, second, microsecond)
  • strptime parses a date from a string
  • strftime formats a date as a string
  • Time zone info is attached via subclasses of the abstract base class tzinfo, e.g. those provided by pytz
  • timedelta is the difference between two datetime objects and allows you to do calculations
In [102]:
import datetime as dt

Defining a simple time is easy:

In [103]:
t1 = dt.datetime(2017,9,11,13,15)
print(t1)
print(t1.utctimetuple().tm_hour) # uses time-zone info to calculate UTC time
2017-09-11 13:15:00
13

Time zone info can be added with a timezone object, here built from a timedelta offset:

In [104]:
tz_cest = dt.timezone(dt.timedelta(hours=+2))
t2 = dt.datetime(2017,9,11,13,15,tzinfo=tz_cest)
print(t2.hour)
print(t2.utctimetuple().tm_hour) # uses time-zone info to calculate UTC time
13
11

Create a string from a time according to a given format:

In [105]:
output_string = t1.strftime("%d %b %Y %I:%M:%S %p")
print(output_string)
11 Sep 2017 01:15:00 PM

Extract the date and time info from a string of a given format:

In [106]:
input_string = "6 June 2016 8h45'32''"
t3 = dt.datetime.strptime(input_string,"%d %B %Y %Hh%M'%S''")
print(t3)
2016-06-06 08:45:32
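
strftime and strptime are inverse operations when used with the same format string. The sketch below shows such a round trip, plus the dedicated shortcuts for the ISO 8601 format (the format string is just an example):

```python
import datetime as dt

fmt = "%Y-%m-%d %H:%M"
t = dt.datetime(2017, 9, 11, 13, 15)

as_text = t.strftime(fmt)                     # datetime -> str
parsed = dt.datetime.strptime(as_text, fmt)   # str -> datetime
print(as_text)          # 2017-09-11 13:15
print(parsed == t)      # True

# For the ISO 8601 format there are dedicated shortcuts
iso_text = t.isoformat()
print(iso_text)                                    # 2017-09-11T13:15:00
print(dt.datetime.fromisoformat(iso_text) == t)    # True
```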

pytz provides predefined timezone objects that can be used to create datetime-objects with the right timezone info.

In [107]:
import pytz
tz_zurich = pytz.timezone("Europe/Zurich")
# do not pass tz_zurich to the datetime constructor
# tzinfo can not handle the varying timezone offsets stored by pytz
t_zh = dt.datetime(2017,9,11,13,15,tzinfo=tz_zurich)  
print("bad: ", t_zh)
t_zh = tz_zurich.localize(dt.datetime(2017,9,11,13,15))
print("good:", t_zh)
t_ny = t_zh.astimezone(pytz.timezone("America/New_York"))
print(t_ny)
bad:  2017-09-11 13:15:00+00:34
good: 2017-09-11 13:15:00+02:00
2017-09-11 07:15:00-04:00
In [108]:
tz_kolkata = dt.timezone(dt.timedelta(hours=5, minutes=30))
t_zh.astimezone(tz_kolkata)
Out[108]:
datetime.datetime(2017, 9, 11, 16, 45, tzinfo=datetime.timezone(datetime.timedelta(seconds=19800)))

You can do calculations with datetime objects:

In [109]:
tdelta = t1-t3
print(tdelta)
print(type(tdelta))
462 days, 4:29:28
<class 'datetime.timedelta'>
In [110]:
tdelta +=dt.timedelta(1)  # add a day
print(tdelta)
tdelta+=dt.timedelta(hours=-5) # subtract  5 hours
print(tdelta)
print(tdelta+t2) # add the time delta to a datetime object
463 days, 4:29:28
462 days, 23:29:28
2018-12-18 12:44:28+02:00
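
A timedelta stores days and seconds separately, which is easy to trip over. A brief sketch using the same two points in time as t1 and t3 above:

```python
import datetime as dt

t1 = dt.datetime(2017, 9, 11, 13, 15)
t3 = dt.datetime(2016, 6, 6, 8, 45, 32)
tdelta = t1 - t3

print(tdelta.days)              # 462
print(tdelta.seconds)           # 16168 (only the 4:29:28 part, in seconds)
print(tdelta.total_seconds())   # 39932968.0 (the whole duration)

# Dividing by another timedelta gives the length in that unit as a float
hours = tdelta / dt.timedelta(hours=1)
print(hours)
```

If you need the whole duration as a number, total_seconds() is usually the right choice; the seconds attribute only holds the part that does not fit into whole days.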

CSV¶

If you need to read CSV data, Python provides the csv module for this. (https://docs.python.org/3/library/csv.html)

In [111]:
import csv
In [112]:
data = list()
with open("data/countries-europe.csv", newline="") as csv_f:  # newline="" is recommended by the csv docs
    csv_reader = csv.reader(csv_f)
    for row in csv_reader:
        data.append(row)
        print(row)
['#python row index', 'country']
['0', 'Austria']
['1', 'Belgium']
['2', 'Bulgaria']
['3', 'Croatia']
['4', 'Cyprus']
['5', 'Czechia']
['6', 'Denmark']
['7', 'Estonia']
['8', 'Finland']
['9', 'France']
['10', 'Germany']
['11', 'Greece']
['12', 'Hungary']
['13', 'Ireland']
['14', 'Italy']
['15', 'Latvia']
['16', 'Lithuania']
['17', 'Luxembourg']
['18', 'Malta']
['19', 'Netherlands']
['20', 'Poland']
['21', 'Portugal']
['22', 'Romania']
['23', 'Slovak Republic']
['24', 'Slovenia']
['25', 'Spain']
['26', 'Sweden']
['27', 'United Kingdom']
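
When the first row contains column names, csv.DictReader can be more convenient than csv.reader. The sketch below keeps a few rows in the same layout in memory instead of reading the file. The "note" column in the writer part is made up for illustration:

```python
import csv
import io

# A few rows in the same layout as the file above, held in memory
raw = "#python row index,country\n0,Austria\n1,Belgium\n2,Bulgaria\n"

# DictReader uses the first row as keys for every following row
reader = csv.DictReader(io.StringIO(raw))
index_by_country = {row["country"]: int(row["#python row index"])
                    for row in reader}
print(index_by_country)   # {'Austria': 0, 'Belgium': 1, 'Bulgaria': 2}

# csv.writer takes care of quoting fields that contain commas
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["country", "note"])
writer.writerow(["Austria", "landlocked, alpine"])
print(out.getvalue())
```

Note that all values read from a CSV file are strings, so numbers have to be converted explicitly, as done with int above.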

Other Useful Modules¶

There are many other useful modules. Based on your answers to my questions, I think the following modules might be of interest to many of you.

Data Analysis¶

  • scipy - the main module for data-analysis functionality
  • statsmodels - statistical models, statistical tests, data exploration
  • pandas - powerful data analysis and manipulation library

Web Scraping¶

  • requests - to access web-content
  • bs4 - to parse html or xml
  • selenium - to control your browser

Web Development¶

  • django - web development framework
  • flask - micro web framework

Machine Learning¶

  • sklearn - the main machine learning module

Databases¶

  • sqlite3 - simple file based database

(I do not have enough experience with other database modules to recommend one.)

Other¶

  • subprocess - Start other processes and access their input and output
  • asyncio - to run io-bound code concurrently
  • multiprocessing - to parallelize cpu-bound code