An Overview of Python¶
import numpy as np
# define data variables
years = range(1990,2023)
countries = ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovak Republic", "Slovenia", "Spain", "Sweden", "United Kingdom",]
refugees = np.loadtxt("data/refugees-europe_0.txt") # data from World Bank
More Containers¶
So far we have used the builtin container list. Additionally we used arrays from numpy.
Next we will look at some additional builtin containers. Choosing the right container can simplify a problem considerably. And of course you can combine containers to get more complex data structures.
But first, I would like to introduce a very useful shorthand and a caveat with arrays:
List comprehension¶
[x**2 for x in range(10)] # if x % 3 == 0
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
List comprehension is a one-line for-loop creating a list. It can optionally contain an if condition.
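As a minimal sketch of the optional condition (the variable names are illustrative and not part of the original notebook), the comprehension below keeps only the squares of numbers divisible by 3. The explicit loop after it should produce the same list:

```python
# list comprehension with a filter condition
squares = [x**2 for x in range(10) if x % 3 == 0]

# the same result written as an explicit for-loop
squares_loop = []
for x in range(10):
    if x % 3 == 0:
        squares_loop.append(x**2)

print(squares)
print(squares == squares_loop)
```

The comprehension is shorter, but once the condition or the expression gets complicated, the explicit loop is usually easier to read.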
Comparing Arrays¶
Arrays are compared element-wise
a = np.array([1,2,3])
b = np.array([1,2,3])
a == b
array([ True, True, True])
This means the result of a comparison cannot be used directly in a condition:
if a == b:
    print("equal")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 if a == b:
      2     print("equal")

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
if (a == b).all():
    print("equal")
equal
Tuples¶
Sometimes you want to return several values from a function.
This can be done by listing them comma-separated after return. Python will then create a tuple.
def country_stats(data):
    min_ = data.min()
    max_ = data.max()
    mean = data.mean()
    std = data.std()
    return min_, max_, round(mean, 1), round(std, 1)
ret = country_stats(refugees[10])
type(ret)
tuple
ret[1]
2075445.0
Tuples are similar to lists, but once created they cannot be changed. Tuples are created by listing their elements separated by commas (with or without round brackets):
my_tuple = (1,"foo",1)
print(my_tuple)
(1, 'foo', 1)
my_tuple[1] = 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 my_tuple[1] = 2

TypeError: 'tuple' object does not support item assignment
For a tuple with one element only, a comma is mandatory:
one_tuple = 42,
print(one_tuple)
(42,)
Like lists, tuples can be unpacked in one go:
a, b, c = my_tuple
print(b)
foo
When comparing two tuples, Python compares them position by position. For an ordering operator such as <, the first position where the elements differ decides the result:
(1,2,3) < (1,2,4)
True
Note: This will fail if the tuples contain numpy arrays, because comparing two arrays does not yield a single truth value.
(1,2,np.array([1,2])) < (1,2,np.array([1,2]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 (1,2,np.array([1,2])) < (1,2,np.array([1,2]))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
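One practical consequence of position-by-position comparison is that a list of tuples sorts by the first element and uses the later elements only to break ties. A small sketch, using made-up (country, year) records that are not taken from the data set:

```python
# hypothetical (country, year) records, for illustration only
records = [("Malta", 2001), ("Austria", 2010), ("Austria", 1995)]

# sorted() compares the tuples position by position:
# the country decides first, the year only breaks ties
ordered = sorted(records)
print(ordered)

# a tuple that is a prefix of a longer tuple counts as smaller
prefix_smaller = (1, 2) < (1, 2, 0)
print(prefix_smaller)
```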
Sets¶
While a tuple is an ordered collection and allows repetitions, a set is an unordered collection of unique elements:
set([1,1,2,3,5])
{1, 2, 3, 5}
s = {5,1,2,3,1,5}
print(s)
{1, 2, 3, 5}
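Sets are most useful for membership tests and for combining collections. The following sketch uses two small hand-picked groups of country names (illustrative samples only, not complete lists of anything) to show intersection, union and difference:

```python
# two small sample groups, chosen only for illustration
group_a = {"Denmark", "Finland", "Sweden"}
group_b = {"Finland", "Germany", "France"}

both = group_a & group_b      # intersection: elements in both sets
either = group_a | group_b    # union: elements in at least one set
only_a = group_a - group_b    # difference: elements in group_a only

print(both)
print(len(either))
print("Sweden" in group_a)    # membership test
```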
Dictionaries¶
Dictionaries are similar to lists as well, but instead of continuously numbering the entries you can define your own keys.
def country_stats(data):
    res = {}
    res["std"] = data.std()
    res["min"] = data.min()
    res["max"] = data.max()
    res["mean"] = data.mean()
    return res
country_stats(refugees[10])
{'std': 377344.61433153594,
'min': 187545.0,
'max': 2075445.0,
'mean': 900264.7878787878}
Note: In older versions of Python (before 3.7) dictionaries do not preserve the order of the elements from assignment.
Accessing an element in a dict works as expected:
def country_stats(data):
    res = {}
    res["min"] = data.min()
    res["max"] = data.max()
    res["mean"] = data.mean()
    res["std"] = data.std()
    return res
stats = country_stats(refugees[10])
stats["mean"]
900264.7878787878
The help shows many ways to define a dictionary:
dict?
dict([(1,2), (3,4)])
{1: 2, 3: 4}
dict(a=1, b=2)
{'a': 1, 'b': 2}
Dict comprehension exists as well:
{k: v for k, v in enumerate("abc")}
{0: 'a', 1: 'b', 2: 'c'}
A key in a dictionary can be almost anything (anything that is hashable)
d = {range(5): "this is a range object", print: "the print function"}
d[print]
'the print function'
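To illustrate what "hashable" means in practice, here is a small sketch with made-up coordinate values: an immutable tuple works as a key, while a mutable list is rejected with a TypeError.

```python
# a tuple is immutable and hashable, so it can be used as a key
# (the coordinates are illustrative values)
places = {(47.4, 8.5): "Zurich"}
print(places[(47.4, 8.5)])

# a list is mutable and therefore not hashable
try:
    bad = {[47.4, 8.5]: "Zurich"}
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```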
Looping Over Dicts¶
The naive for-loop on a dict iterates over the dict keys:
for stat in stats:
    print(stat)
min
max
mean
std
However dicts offer different functions to access the keys, the values or both when iterating:
for stat in stats.values():
    print(stat)
187545.0
2075445.0
900264.7878787878
377344.61433153594
for stat_name, stat_value in stats.items():
    print(stat_name, int(stat_value))
min 187545
max 2075445
mean 900264
std 377344
As always, tab-completion is very useful:
#stats.
We can specify default values for the parameters of a function. This turns them into optional arguments. Optional arguments can be passed either by position or by keyword.
Keyword-arguments:
- must be defined and passed after the positional arguments
- their order is irrelevant
- must use an existing parameter-name
import numpy as np
import matplotlib.pyplot as plt
def plot_refugees(years, data, country, color="r", ylabel="refugees"):
    # create axes object
    figure = plt.figure()
    axes = figure.subplots()
    # plot the data
    axes.plot(years, data, color=color)
    # add labels and title
    axes.set_xlabel("year")
    axes.set_ylabel(ylabel)
    axes.set_title("Refugees in "+country)
# define data variables
years = range(1990,2023)
countries = ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovak Republic", "Slovenia", "Spain", "Sweden", "United Kingdom",]
refugees = np.loadtxt("data/refugees-europe_1.csv", delimiter=",") # data from World Bank
means = refugees.mean(axis=1)
for row in range(len(countries))[:4]:
    if means[row] > 10000:
        plot_refugees(years, country=countries[row], data=refugees[row], ylabel="number", color="b")
# display the plot
plt.show()
Important: You should not use mutable objects as default arguments. They are created only once, when the function is defined, and are not reset for individual calls.
def append_one(list_=[]):
    """ Do not use mutable default values """
    list_.append(1)
    return list_
my_list = []
print(append_one(my_list))
print(append_one(my_list))
my_list = append_one()
print(my_list)
print(append_one(my_list))
my_list = append_one()
print(my_list)
[1]
[1, 1]
[1]
[1, 1]
[1, 1, 1]
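A common way around this problem is to use None as the default and create the list inside the function. The sketch below rewrites append_one along those lines; this is the usual idiom, not code from the original notebook:

```python
def append_one(list_=None):
    """Append 1 to list_, creating a fresh list if none is passed."""
    if list_ is None:
        # a new list is created on every call that omits the argument
        list_ = []
    list_.append(1)
    return list_

first = append_one()
second = append_one()
print(first, second)
```

Each call without an argument now returns its own independent list.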
Arbitrary Argument Lists¶
In addition to listing individual parameters, you can also tell Python to accept all positional arguments and/or keyword arguments:
def i_want_it_all(*args, **kwargs):
    print(args)
    print(kwargs)
i_want_it_all(1,2,3,foo="so long", bar="and thanks for all the fish")
(1, 2, 3)
{'foo': 'so long', 'bar': 'and thanks for all the fish'}
As you see, *args creates a tuple of all passed positional arguments, while **kwargs creates a dict. You can use the same syntax to pass tuples and dicts to the function when calling it:
i_want_it_all(*("a", "b"), **{"robot": "marvin"})
('a', 'b')
{'robot': 'marvin'}
Docstrings¶
When defining functions, you should always document them as well. Documentation will help others and your future self.
One very nice feature of Python is the ability to access documentation quickly. A docstring is a string given directly after the function definition. Python will display that string when you ask for help on the function.
def country_stats(data):
    """Calculate min, max, mean and standard deviation from data

    Arguments
    ---------
    data : numpy-array
        data to calculate statistics on

    Returns
    -------
    stats : dict
        dictionary containing min, max, mean and standard deviation
    """
    res = {}
    res["min"] = data.min()
    res["max"] = data.max()
    res["mean"] = data.mean()
    res["std"] = data.std()
    return res
country_stats?
np.abs?
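The question-mark syntax only works in IPython and Jupyter. In a plain Python script the same information should be available through the builtin help function or the __doc__ attribute. A minimal sketch, using a hypothetical function double that is not part of the original notebook:

```python
def double(value):
    """Return twice the given value."""
    return 2 * value

# the docstring is stored on the function object itself
doc = double.__doc__
print(doc)

# help() prints the signature together with the docstring
help(double)
```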
More on Loops¶
We looked at the for loop in the first part. We will add a second loop type (the while loop) and look at more tricks available for both loop types.
Break and continue¶
Break and continue are keywords to interrupt the normal execution of loops. They are quite similar, so it is important to understand the difference.
break exits the loop completely and continues execution after the loop.
continue skips the rest of this iteration and moves on to the next round.
for i in range(30):
    if i % 2 == 0: # % is modulo division -> i is even
        continue
    elif i > 20:
        break
    print(i)
print("We are done.")
1
3
5
7
9
11
13
15
17
19
We are done.
else¶
Once you use break, you might want to execute code only if your loop terminates normally. A typical example is when you search for something in a list:
list_ = ["hay", "hay", "hay", "hay", "hay", "needle", "hay", "hay", "hay"]
#list_ = ["hay", "hay", "hay", "hay", "hay", "hay", "hay", "hay"]
for value in list_:
    if value == "needle":
        print("found")
        break
found
You could add an additional variable to track if you found the needle:
list_ = ["hay", "hay", "hay", "hay", "hay", "needle", "hay", "hay", "hay"]
list_ = ["hay", "hay", "hay", "hay", "hay", "hay", "hay", "hay"]
found = False
for value in list_:
    if value == "needle":
        print("found")
        found = True
        break
if not found:
    print("not found")
not found
Using else, no additional variable is necessary. The else block of a loop runs only if the loop was not left via break:
list_ = ["hay", "hay", "hay", "hay", "hay", "needle", "hay", "hay", "hay"]
#list_ = ["hay", "hay", "hay", "hay", "hay", "hay", "hay", "hay"]
for value in list_:
    if value == "needle":
        print("found")
        break
else:
    print("not found")
found
while¶
We can use a for loop to iterate over a given list of items (or with range for a fixed number of iterations).
However in many cases we do not know how many iterations we will need, but we know what condition we want to reach.
In this case we use a while-loop:
value = 1
while value < 1000:
    print(value)
    value = value * 2
1
2
4
8
16
32
64
128
256
512
Again you can use break, continue and else:
tries = 0
while tries < 3:
    pin = input("Enter PIN: ")
    if pin == "":
        continue
    if pin == "123":
        break
    tries = tries + 1
else:
    print("Too many bad tries, terminating!")
    #import sys
    #sys.exit()
print("PIN accepted, welcome back!")
A note on security¶
The code above is of course a bad example: You should not store passwords in cleartext. Instead you should use a specialised password-hash function to store a salted and hashed value.
from passlib.hash import argon2
tries = 0
while tries < 3:
    pin = input("Enter PIN: ")
    if pin == "":
        continue
    if argon2.verify(pin, '$argon2id$v=19$m=102400,t=2,p=8$s9a6F4Jwbg0BgBCCsLY2Rg$cpl1xrY0GAc8tAxkFmXN/A'):
        break
    tries = tries + 1
else:
    print("Too many bad tries, terminating!")
    #import sys
    #sys.exit()
print("PIN accepted, welcome back!")
#argon2.hash("123")
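If passlib is not installed, the idea of a salted hash can also be sketched with the standard library alone. The example below uses PBKDF2 from hashlib as a stand-in for argon2. The helper names hash_pin and verify_pin and the iteration count are assumptions made for this illustration, not a recommendation from the original text:

```python
import hashlib
import hmac
import os

def hash_pin(pin, salt=None, iterations=200_000):
    """Return (salt, digest) for a PIN; a random salt is created if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", pin.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_pin(pin, salt, digest, iterations=200_000):
    """Hash the candidate PIN with the stored salt and compare the digests."""
    _, candidate = hash_pin(pin, salt, iterations)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(candidate, digest)

# only the salt and the digest are stored, never the PIN itself
salt, stored = hash_pin("123")
print(verify_pin("123", salt, stored))
print(verify_pin("124", salt, stored))
```

Because the salt is random, hashing the same PIN twice gives two different digests, which makes precomputed lookup tables useless.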
import matplotlib.pyplot as plt
plt.plot([1,2,3],[1])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[42], line 2
      1 import matplotlib.pyplot as plt
----> 2 plt.plot([1,2,3],[1])

File /usr/lib/python3/dist-packages/matplotlib/pyplot.py:3590, in plot(scalex, scaley, data, *args, **kwargs)
   3582 @_copy_docstring_and_deprecators(Axes.plot)
   3583 def plot(
   3584     *args: float | ArrayLike | str,
   (...)
   3588     **kwargs,
   3589 ) -> list[Line2D]:
-> 3590     return gca().plot(
   3591         *args,
   3592         scalex=scalex,
   3593         scaley=scaley,
   3594         **({"data": data} if data is not None else {}),
   3595         **kwargs,
   3596     )

File /usr/lib/python3/dist-packages/matplotlib/axes/_axes.py:1724, in Axes.plot(self, scalex, scaley, data, *args, **kwargs)
   1481 """
   1482 Plot y versus x as lines and/or markers.
   1483
   (...)
   1721 (``'green'``) or hex strings (``'#008000'``).
   1722 """
   1723 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1724 lines = [*self._get_lines(self, *args, data=data, **kwargs)]
   1725 for line in lines:
   1726     self.add_line(line)

File /usr/lib/python3/dist-packages/matplotlib/axes/_base.py:303, in _process_plot_var_args.__call__(self, axes, data, *args, **kwargs)
    301     this += args[0],
    302     args = args[1:]
--> 303 yield from self._plot_args(
    304     axes, this, kwargs, ambiguous_fmt_datakey=ambiguous_fmt_datakey)

File /usr/lib/python3/dist-packages/matplotlib/axes/_base.py:499, in _process_plot_var_args._plot_args(self, axes, tup, kwargs, return_kwargs, ambiguous_fmt_datakey)
    496     axes.yaxis.update_units(y)
    498 if x.shape[0] != y.shape[0]:
--> 499     raise ValueError(f"x and y must have same first dimension, but "
    500                      f"have shapes {x.shape} and {y.shape}")
    501 if x.ndim > 2 or y.ndim > 2:
    502     raise ValueError(f"x and y can be no greater than 2D, but have "
    503                      f"shapes {x.shape} and {y.shape}")

ValueError: x and y must have same first dimension, but have shapes (3,) and (1,)
As explained in the introduction, we can learn a lot from this output:
- The traceback tells us that the error occurred deep inside the matplotlib module (5 function calls)
- More precisely the error happened when matplotlib wanted to read the x and y coordinates.
- But the top of the traceback also tells us which line of our code caused the problem.
- This is a ValueError, so one of the values we passed is probably wrong.
- The final line tells us that we have a problem with the dimensions of our x and y lists.
Common Errors when Passing Values¶
In addition to the ValueError above, there is also a TypeError, indicating that we passed the wrong type:
int(["1"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[43], line 1
----> 1 int(["1"])

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'
If you screw up when accessing an element in a container, you usually get an IndexError or a KeyError:
l = [1,2,3]
l[3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[44], line 2
      1 l = [1,2,3]
----> 2 l[3]

IndexError: list index out of range
d = {"a" : 1}
d["b"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[45], line 2
      1 d = {"a" : 1}
----> 2 d["b"]

KeyError: 'b'
Common Errors in Syntax¶
123a
Cell In[46], line 1 123a ^ SyntaxError: invalid decimal literal
SyntaxError means that your words do not make any sense...
print("Hallo welt"
a = 123
Cell In[47], line 1 print("Hallo welt" ^ SyntaxError: '(' was never closed
if True:
print("Better")
print("Faster")
Cell In[48], line 2 print("Better") ^ IndentationError: expected an indented block after 'if' statement on line 1
IndentationErrors show up when you forget to indent or mess up the alignment of your code.
Note: Do not mix Tabs and Spaces
if True:
print("Better")
print("Faster")
print("Stronger")
File <string>:4 print("Stronger") ^ TabError: inconsistent use of tabs and spaces in indentation
variable_with_a_complicated_name = 1
print(variable_with_complicated_name)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[51], line 2
      1 variable_with_a_complicated_name = 1
----> 2 print(variable_with_complicated_name)

NameError: name 'variable_with_complicated_name' is not defined
A NameError is raised when you use a name that Python does not know about. Often this means you made a typo (to avoid these, use tab-completion), or you simply forgot to define a variable or function.
function_we_never_defined()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[52], line 1
----> 1 function_we_never_defined()

NameError: name 'function_we_never_defined' is not defined
When accessing attributes from an object, the same problem usually triggers an AttributeError
my_dict = {}
my_dict.append(1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[53], line 2
      1 my_dict = {}
----> 2 my_dict.append(1)

AttributeError: 'dict' object has no attribute 'append'
Handling Errors¶
In some cases, you might expect an error and know how to handle it. A typical example is when handling user input:
int(input("Enter an integer: "))
try:
    print(int(input("Enter an integer: ")))
except ValueError:
    print("This is not an integer")
Hints:
- The easiest way to determine which error to except is to trigger it.
- Try to keep the code inside the try-block minimal. (Otherwise the chances for unexpected errors in there rise.)
- You can have several except clauses following one try block.
input_ = input("Enter an integer: ")
try:
    print(1/int(input_))
except ValueError:
    print("This is not an integer")
except ZeroDivisionError:
    print("Division by Zero")
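Since input() needs an interactive frontend, the same pattern can be tried out with fixed strings instead. The sketch below wraps the try block in a hypothetical helper function reciprocal (a name introduced only for this example) and feeds it one valid value and two problematic ones:

```python
def reciprocal(text):
    """Return 1/int(text), or a message describing what went wrong."""
    try:
        return 1 / int(text)
    except ValueError:
        # int() could not parse the text
        return "This is not an integer"
    except ZeroDivisionError:
        # the text was a valid integer, but it was 0
        return "Division by Zero"

results = [reciprocal(text) for text in ["4", "abc", "0"]]
print(results)
```

Only the except clause matching the raised error is executed; the other one is skipped.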
If you have a block of code you want to execute even if an error occurred, you can use finally:
input_ = input("Enter an integer: ")
try:
    print(int(input_))
finally:
    print("Goodbye")
This is most useful when you need to clean up something (e.g. close an open file or disconnect from a database).
Reading and Writing Files¶
So far we used numpy.loadtxt to read files. This is a good choice for simple data files. In addition we will look at other modules for reading data from files later. Here I want to introduce the most basic file access available in Python directly.
The concept to keep in mind when working with files directly is that a file is a long chain of symbols. When you open a file, the operating system puts a pointer onto the first symbol and then advances the pointer as you read the file.
f = open("data/countries-europe.csv")
f.read(1)
'#'
for _ in range(6):
    print(f.read(1))
p
y
t
h
o
n
If you repeat the last cell, it will keep reading new symbols until you reach the end of the file. Then it will keep returning empty strings, as the pointer is now stuck at the end of the file.
If you want to change the position of the pointer you can use seek. A common use case is to reset the pointer to the beginning of a file without opening the file again:
f.seek(0)
for _ in range(7):
    print(f.read(1))
#
p
y
t
h
o
n
In most cases you are probably not interested in individual symbols but in full lines. Python provides two useful functions for this:
- readline() to read until the next line break
- readlines() to read all remaining lines in a file and store them as a list
f.readline()
' row index,country\n'
f.readlines()
['0,Austria\n', '1,Belgium\n', '2,Bulgaria\n', '3,Croatia\n', '4,Cyprus\n', '5,Czechia\n', '6,Denmark\n', '7,Estonia\n', '8,Finland\n', '9,France\n', '10,Germany\n', '11,Greece\n', '12,Hungary\n', '13,Ireland\n', '14,Italy\n', '15,Latvia\n', '16,Lithuania\n', '17,Luxembourg\n', '18,Malta\n', '19,Netherlands\n', '20,Poland\n', '21,Portugal\n', '22,Romania\n', '23,Slovak Republic\n', '24,Slovenia\n', '25,Spain\n', '26,Sweden\n', '27,United Kingdom\n']
readlines is usually the easiest way to work with files. However for very large files this needs a lot of memory. Using readline can avoid this problem.
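A file object can also be used directly in a for-loop, which reads one line at a time much like repeated calls to readline. The sketch below first writes a small temporary file (the file name and content are made up for this example, and the tempfile module is used only to avoid touching real data) and then counts its lines without loading them all at once:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # create a small sample file to read from
    path = os.path.join(folder, "countries.csv")
    f = open(path, "w")
    f.write("0,Austria\n1,Belgium\n2,Bulgaria\n")
    f.close()

    # iterate over the file: only one line is held in memory at a time
    f = open(path)
    count = 0
    for line in f:
        count += 1
        last = line.strip()
    f.close()

print(count, last)
```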
Once you are done working with a file, you should close it:
f.close()
Using a Context Manager¶
As mentioned in the last section, we can wrap our file handling code in try...finally... if we want to make sure the file is closed properly even in case of an error.
However, since this is a very common pattern, there is an even easier solution:
with open("data/countries-europe.csv") as f:
    lines = f.readlines()
print(lines[4])
#f.seek(0)
3,Croatia

with uses a context manager to execute a block of code. A context manager defines try...except...finally... blocks and wraps them around a given code block.
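To make the connection concrete, the following sketch handles the same temporary file twice: once with a with-block and once with an explicit try...finally. The file name is made up, and the comparison is a simplification of what the context manager of a file object does, not its actual implementation:

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo.txt")

# variant 1: the context manager closes the file for us
with open(path, "w") as f:
    f.write("hello\n")
closed_after_with = f.closed

# variant 2: roughly the same guarantee, written out by hand
f = open(path)
try:
    content = f.read()
finally:
    f.close()
closed_after_finally = f.closed

print(closed_after_with, closed_after_finally, repr(content))

# clean up the temporary file and folder
os.remove(path)
os.rmdir(folder)
```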
Writing Files¶
Writing a file is not much more difficult than reading it:
with open("data/tmp.txt", "w") as f:
    for i in range(10):
        f.write("Yes!\n")
with open("data/tmp.txt", "a") as f:
    f.write("NO!\n")
!cat data/tmp.txt
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
Yes!
NO!
The second argument to open defines the mode in which the file is opened. The default is reading text; the most common other modes are:
- w for writing a file (deleting content if the file exists)
- a for appending to a file
For more details, see the documentation.
Strings in Depth¶
Strings have several more complex topics as well.
String functions¶
Strings come with a large number of helpful functions: https://docs.python.org/3/library/stdtypes.html#string-methods
- you can search and replace in the string
- you can split the string on given characters
- you can strip characters from its ends
- you can change the case
- ...
"This is an Example String".find("is") #vs " is "; vs `in`
2
"This is an Example String".replace("Example", "Demo")
'This is an Demo String'
"This is an Example String".split() # vs. split("a")
['This', 'is', 'an', 'Example', 'String']
print(" This is an Example String \n".strip())
This is an Example String
"this and that".strip("t") # vs lstrip, rstrip
'his and tha'
"This is an Example String".lower() # upper, title
'this is an example string'
Encoding¶
Internally a computer stores strings as a sequence of numbers. What letters the individual numbers represent is defined by the encoding. Unfortunately the information about which encoding was used is often not provided with the string. This can cause a lot of annoying problems.
Python 3 strings are always sequences of Unicode characters, which avoids many problems. However, input you receive might still have been stored in some other encoding. A wrong encoding can result in strange characters being displayed or in error messages.
Different operating systems use different default encodings for text files, and Python tries the default encoding when opening a file.
with open("data/preamble.txt") as file_: # , encoding="iso-8859-1"
    for l in file_.readlines()[:5]:
        print(l)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[73], line 2
      1 with open("data/preamble.txt") as file_: # , encoding="iso-8859-1"
----> 2     for l in file_.readlines()[:5]:
      3         print(l)

File /usr/lib/python3.12/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 2: invalid continuation byte
When reading data with a different encoding, you need to tell Python which encoding your input uses. Python will then convert it into a regular (Unicode) string.
If for some reason you need to output text in some other encoding, you will have to convert again. Luckily this is trivial.
"Grüäzi".encode("iso-8859-1")
b'Gr\xfc\xe4zi'
str.encode returns a bytes object. This is similar to a string, but not considered text by Python. Instead, bytes objects behave like immutable sequences of integers but are displayed in a strange way:
- integers representing "useful" symbols in ASCII (American Standard Code for Information Interchange) encoding are displayed as that symbol
- all other integers are given in hexadecimal, prefixed with \x
(If you did not understand a lot, you don't miss much)
bytes.fromhex('00 01 02 03 41 61 30 2B')
b'\x00\x01\x02\x03Aa0+'
To turn bytes into strings, you can use decode:
"Grüäzi".encode("iso-8859-1").decode("iso-8859-1")
'Grüäzi'
## writing files with a different encoding
#with open("data/tmp.txt", "w", encoding="iso-8859-1") as f:
# f.write("Grüäzi\n")
#
#with open("data/tmp.txt", "w+b") as f:
# f.write("Grüäzi\n".encode("iso-8859-1"))
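The commented code above can be tried safely on a temporary file. The sketch below (the file name is made up) writes a string as iso-8859-1, reads it back with the matching encoding, and then reads it again as utf-8 with errors="replace" to show what a wrong encoding looks like without raising an exception:

```python
import os
import tempfile

text = "Grüäzi\n"
folder = tempfile.mkdtemp()
path = os.path.join(folder, "greeting.txt")

# write the text using a non-default encoding
with open(path, "w", encoding="iso-8859-1") as f:
    f.write(text)

# reading with the same encoding restores the original string
with open(path, encoding="iso-8859-1") as f:
    right = f.read()

# reading with the wrong encoding: undecodable bytes become U+FFFD
with open(path, encoding="utf-8", errors="replace") as f:
    wrong = f.read()

print(repr(right))
print(repr(wrong))

# clean up the temporary file and folder
os.remove(path)
os.rmdir(folder)
```

Replacing undecodable bytes hides the problem rather than solving it, so this is mainly useful for inspecting a file whose encoding is unknown.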
Raw Strings¶
By convention a backslash is used in most string processing to introduce special character combinations.
print("One\n\tTwo")
One Two
Of course, it might be that you need such a combination in your string without its special meaning (for example if you want to use LaTeX syntax in your plot labels). In this case you have two options:
- you can escape the backslash with an additional backslash
- you can prefix your string with an r, turning it into a raw string
print("One\\n\\tTwo")
One\n\tTwo
print(r"One\n\tTwo")
One\n\tTwo
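A quick way to convince yourself that both options produce the same string is to compare them and to look at their lengths. The label at the end is an illustrative LaTeX-style string and is not taken from the plots above:

```python
escaped = "One\\n\\tTwo"
raw = r"One\n\tTwo"

# both spellings describe exactly the same characters
same = escaped == raw
print(same)

# "\n" is a single newline character, r"\n" is a backslash followed by n
length_newline = len("\n")
length_raw = len(r"\n")
print(length_newline, length_raw)

# raw strings keep LaTeX commands readable (illustrative label)
label = r"$\alpha + \beta$"
print(label)
```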
String Formatting¶
Rather sooner than later, you will want to construct more complex strings. There are several different options for this. We will look at the most modern only.
We return to our compound interest code and change it such that it prints a nicer message.
def with_interest(value, rate, years):
    return value*(1+rate/100)**years
def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value
balance = 1000
interest = compound_interest(balance, 0.5, 5)
print(f"The interest earned on {balance} CHF over 5 years will be {interest}")
The interest earned on 1000 CHF over 5 years will be 25.25125312812429
An f-string starts with f" (or f' or f""" or ...) and can contain variables enclosed by curly-brackets {}.
Python will look for the referenced variables in the current scope:
print(f"At {rate} percent, the interest earned on {balance} CHF over 5 years will be {interest}")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[82], line 1
----> 1 print(f"At {rate} percent, the interest earned on {balance} CHF over 5 years will be {interest}")

NameError: name 'rate' is not defined
Note: rate exists inside the two functions, but not outside of them.
If we insert numbers into a string, we can even specify how they are displayed:
balance = 1000
rate = 0.5
duration = 5
interest = compound_interest(balance, rate, duration)
print(f"At {rate:.2f} percent, the interest earned on \
{balance:.2f} CHF over {duration} years will be {interest:.2f}")
At 0.50 percent, the interest earned on 1000.00 CHF over 5 years will be 25.25
For details on number formatting, have a look at https://docs.python.org/3/library/string.html#formatspec. (.format works similarly to f-strings, but needs variables passed explicitly.)
There are a lot more things you can do with f-strings, but try to avoid making things too complex.
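A few more format specifications that tend to be useful in practice are sketched below. The values are made up for illustration; the part after the colon follows the format specification linked above:

```python
balance = 1234567.891
share = 0.125
currency = "CHF"

thousands = f"{balance:,.2f}"    # thousands separator, two decimals
percent = f"{share:.1%}"         # multiply by 100 and append a percent sign
padded = f"{42:>6}"              # right-align in a field of width 6
zeros = f"{42:06d}"              # pad an integer with leading zeros
left = f"{currency:<5}|"         # left-align text in a field of width 5

print(thousands, percent, padded, zeros, left)
```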
Command Line Arguments¶
So far we started all our scripts from the command line with
python3 interest.py
and used input() if we wanted to obtain additional input from the user.
We could let the user pass additional input directly on the command line
python3 interest.py 1000 5 .5
Of course we need to adapt our program for this to work.
The simplest approach is to use the argument vector provided by the sys module.
The argument vector is a list with the name of the running script at position 0.
This is followed by any other "word" given on the command line.
## this code does not work in a notebook
#import sys
#
#def with_interest(value, rate, years):
# return value*(1+rate/100)**years
#
#def compound_interest(value, rate, years):
# return with_interest(value, rate, years) - value
#
# print(sys.argv)
#balance = float(sys.argv[1])
#duration = float(sys.argv[2])
#rate = float(sys.argv[3])
#res = compound_interest(balance, rate, duration)
#print("At {rate} percent, the interest earned on {balance} CHF over {duration} years will be {interest}".format(
# rate=rate, balance=balance, duration=duration, interest=res))
A more elegant solution is to use the argparse module. This allows you to create more complex command-line interfaces easily.
#import argparse
#
#def with_interest(value, rate, years):
# return value*(1+rate/100)**years
#
#def compound_interest(value, rate, years):
# return with_interest(value, rate, years) - value
#
#parser = argparse.ArgumentParser(description='calculate compound interest')
#parser.add_argument('balance', type=float, help='the starting balance')
#parser.add_argument('duration', type=int, help='the number of years to accumulate interest for')
#parser.add_argument('--rate', type=float, default=0.5, help='the interest rate in percent')
#args = parser.parse_args()
#
#res = compound_interest(args.balance, args.rate, args.duration)
#print("At {rate} percent, the interest earned on {balance} CHF over {duration} years will be {interest}".format(
# rate=args.rate, balance=args.balance, duration=args.duration, interest=res))
The documentation of argparse is rather extensive, and it is easy to find examples online.
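By default parse_args() reads the arguments from sys.argv, which is why the script version cannot be run in a notebook. It also accepts an explicit list of strings, so the parser can be tried out anywhere. The following is a minimal sketch of that idea; wrapping the parser in a function called build_parser is a choice made for this example.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="calculate compound interest")
    parser.add_argument("balance", type=float, help="the starting balance")
    parser.add_argument("duration", type=int,
                        help="the number of years to accumulate interest for")
    parser.add_argument("--rate", type=float, default=0.5,
                        help="the interest rate in percent")
    return parser

parser = build_parser()

# only the positional arguments: --rate falls back to its default
args = parser.parse_args(["1000", "5"])
print(args.balance, args.duration, args.rate)

# optional arguments are given by name
args_with_rate = parser.parse_args(["1000", "5", "--rate", "1.5"])
print(args_with_rate.balance, args_with_rate.duration, args_with_rate.rate)
```

Note that argparse already converts the values to the requested types and prints a usage message if the input does not fit.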
Writing modules¶
Once your code gets bigger, it is a good idea to split it into several smaller files.
The easiest way to do this is to store all files in the same folder. In Python you can then import functions from other files just as you do with system modules.
!cat interest.py
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value
from interest import compound_interest
compound_interest(1000,.5,10)
51.140132040789695
It is not much more difficult to organise your files into a folder structure. All you need to do is to create a file __init__.py in each folder of your module. In most cases you can leave __init__.py empty.
from tools.interest import with_interest
with_interest(100,5,1)
105.0
If inside your module folder you want to load another file, you can use the same syntax as you would use outside, giving the absolute module name.
!cat tools/compound.py
from tools.interest import with_interest

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value
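To see all the pieces together, the following sketch builds a small package in a temporary folder and imports from it. The package name demo_tools is made up for this example. Adding the folder to sys.path is only needed here because the package is not stored next to the running script.

```python
import importlib
import os
import sys
import tempfile

# build a throwaway package; "demo_tools" is a made-up name for this sketch
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_tools")
os.mkdir(package_dir)

# an empty __init__.py marks the folder as a package
open(os.path.join(package_dir, "__init__.py"), "w").close()

with open(os.path.join(package_dir, "interest.py"), "w") as f:
    f.write("def with_interest(value, rate, years):\n"
            "    return value*(1+rate/100)**years\n")

# inside the package we use the absolute module name as well
with open(os.path.join(package_dir, "compound.py"), "w") as f:
    f.write("from demo_tools.interest import with_interest\n"
            "def compound_interest(value, rate, years):\n"
            "    return with_interest(value, rate, years) - value\n")

# the folder that contains the package must be on the module search path
sys.path.insert(0, root)
importlib.invalidate_caches()

from demo_tools.compound import compound_interest
result = compound_interest(1000, .5, 10)
print(result)
```

When you start a script, Python puts the folder of that script on the search path automatically, which is why the imports above work without any extra setup.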
Importing Scripts¶
You can import from any Python file. However, if you have a script with code in the global scope, that code will be executed on import.
!cat interest_script.py
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value

print("Result:", compound_interest(1000, 0.5, 10))
from interest_script import with_interest
Result: 51.140132040789695
This is usually not what you want. To avoid this you can guard the executable part of your script with a special if statement.
!cat interest_guarded.py
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

def compound_interest(value, rate, years):
    return with_interest(value, rate, years) - value

if __name__ == "__main__":
    print("Result:", compound_interest(1000, 0.5, 10))
from interest_guarded import with_interest
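The guard works because Python sets the variable __name__ differently depending on how a file is used. The sketch below writes a small file (the name guard_demo.py is made up), imports it once, and then runs it the way the python command would, using runpy from the standard library.

```python
import importlib
import os
import runpy
import sys
import tempfile

source = '''
def with_interest(value, rate, years):
    return value*(1+rate/100)**years

print("__name__ is", __name__)

if __name__ == "__main__":
    print("running as a script")
'''

folder = tempfile.mkdtemp()
path = os.path.join(folder, "guard_demo.py")  # hypothetical file name
with open(path, "w") as f:
    f.write(source)

# importing: __name__ is the module name, the guarded block is skipped
sys.path.insert(0, folder)
importlib.invalidate_caches()
import guard_demo
imported_name = guard_demo.__name__

# running as a script: __name__ is "__main__", the guarded block is executed
script_globals = runpy.run_path(path, run_name="__main__")
script_name = script_globals["__name__"]
```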
Useful Standard Modules¶
sys¶
We have already seen the sys module. It provides access to some objects used or maintained by the interpreter. One of these (sys.argv) we have seen above. For more details, look at the documentation. (https://docs.python.org/3/library/sys.html)
os¶
OS routines for the system you are running on. Especially useful are the os.path functions. (https://docs.python.org/3/library/os.html)
import os.path
os.path.join("dir", "filename.py")
'dir/filename.py'
os.path.basename("dir/filename.py")
'filename.py'
os.path.dirname("dir/subdir/file.py")
'dir/subdir'
for path in os.walk("example_tree"):
    print(path)
('example_tree', ['subdir'], ['fileB', 'fileA'])
('example_tree/subdir', [], ['file2', 'file1'])
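Each item produced by os.walk is a tuple of the current folder, its subfolders and its files. A common pattern is to unpack this tuple directly in the for statement and combine it with os.path.join. The following sketch builds its own small tree in a temporary folder so that it does not depend on example_tree.

```python
import os
import tempfile

# build a small throwaway tree so the example is self-contained
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "subdir"))
for relative in ["fileA", "fileB", os.path.join("subdir", "file1")]:
    open(os.path.join(root, relative), "w").close()

all_files = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        # dirpath is the folder currently being visited
        all_files.append(os.path.join(dirpath, name))

# show the paths relative to the starting folder, in a stable order
relative_paths = sorted(os.path.relpath(p, root) for p in all_files)
print(relative_paths)
```

The order in which os.walk lists the files depends on the file system, hence the call to sorted.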
You might also want to use glob
import glob
matches = glob.glob("example_tree/**/file*", recursive=True)
print("\n".join(matches))
example_tree/fileB
example_tree/fileA
example_tree/subdir/file2
example_tree/subdir/file1
re¶
Support for regular expressions. Especially useful when you need to parse strings. (https://docs.python.org/3/library/re.html)
import re
print(re.findall(r"\d+\.\d{2}", "At 0.50 percent, the interest earned on 1000.00 CHF over 5 years will be 25.25")) # \. matches a literal dot; a bare . would match any character
['0.50', '1000.00', '25.25']
Look at the website for many more options and a detailed explanation of the syntax of regular expressions.
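Two features that are often useful when parsing strings are compiled patterns and named groups. The sketch below applies both to the sentence from above; the group names amount, currency and years are chosen for this example.

```python
import re

text = ("At 0.50 percent, the interest earned on 1000.00 CHF "
        "over 5 years will be 25.25")

# compile once if the pattern is reused; \. matches a literal dot
number = re.compile(r"\d+\.\d{2}")
values = [float(found) for found in number.findall(text)]
print(values)

# named groups give the captured pieces readable names
pattern = re.compile(
    r"on (?P<amount>\d+\.\d{2}) (?P<currency>[A-Z]{3}) over (?P<years>\d+) years")
match = pattern.search(text)
if match:  # search returns None if nothing matches
    print(match.group("amount"), match.group("currency"), match.group("years"))
```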
datetime¶
Support for working with dates and time. (https://docs.python.org/3/library/datetime.html)
- The most important class is datetime, which represents a date (year, month, day) and a time (hour, minute, second, microsecond)
- strptime to load dates from a string
- strftime to print dates to a string
- Timezone info can be encoded via subclasses of the abstract base class tzinfo, e.g. pytz
- timedelta is the difference between datetime objects and allows you to do calculations
import datetime as dt
Defining a simple time is easy:
t1 = dt.datetime(2017,9,11,13,15)
print(t1)
print(t1.utctimetuple().tm_hour) # t1 has no time-zone info, so the hour is returned unchanged
2017-09-11 13:15:00
13
Time zone info can be added using timezone together with timedelta:
tz_cest = dt.timezone(dt.timedelta(hours=+2))
t2 = dt.datetime(2017,9,11,13,15,tzinfo=tz_cest)
print(t2.hour)
print(t2.utctimetuple().tm_hour) # uses time-zone info to calculate UTC time
13
11
Create a string from a time according to a given format:
output_string = t1.strftime("%d %b %Y %I:%M:%S %p")
print(output_string)
11 Sep 2017 01:15:00 PM
Extract the date and time info from a string of a given format:
input_string = "6 June 2016 8h45'32''"
t3 = dt.datetime.strptime(input_string,"%d %B %Y %Hh%M'%S''")
print(t3)
2016-06-06 08:45:32
pytz provides predefined timezone objects that can be used to create datetime-objects with the right timezone info.
import pytz
tz_zurich = pytz.timezone("Europe/Zurich")
# do not pass tz_zurich to the datetime constructor
# tzinfo can not handle the varying timezone offsets stored by pytz
t_zh = dt.datetime(2017,9,11,13,15,tzinfo=tz_zurich)
print("bad: ", t_zh)
t_zh = tz_zurich.localize(dt.datetime(2017,9,11,13,15))
print("good:", t_zh)
t_ny = t_zh.astimezone(pytz.timezone("America/New_York"))
print(t_ny)
bad:  2017-09-11 13:15:00+00:34
good: 2017-09-11 13:15:00+02:00
2017-09-11 07:15:00-04:00
tz_kolkata = dt.timezone(dt.timedelta(hours=5, minutes=30))
t_zh.astimezone(tz_kolkata)
datetime.datetime(2017, 9, 11, 16, 45, tzinfo=datetime.timezone(datetime.timedelta(seconds=19800)))
You can do calculations with datetime objects:
tdelta = t1-t3
print(tdelta)
print(type(tdelta))
462 days, 4:29:28
<class 'datetime.timedelta'>
tdelta +=dt.timedelta(1) # add a day
print(tdelta)
tdelta+=dt.timedelta(hours=-5) # subtract 5 hours
print(tdelta)
print(tdelta+t2) # add the time delta to a datetime object
463 days, 4:29:28
462 days, 23:29:28
2018-12-18 12:44:28+02:00
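A timedelta can also be turned into plain numbers, which is handy for further calculations. The following sketch recreates t1 and t3 from above so that it runs on its own.

```python
import datetime as dt

t1 = dt.datetime(2017, 9, 11, 13, 15)
t3 = dt.datetime(2016, 6, 6, 8, 45, 32)
tdelta = t1 - t3

# a timedelta stores only days, seconds and microseconds
print(tdelta.days, tdelta.seconds)

# total_seconds() gives the whole duration as one number
total = tdelta.total_seconds()
print(total)

# dividing two timedeltas gives a plain float, here the duration in hours
hours = tdelta / dt.timedelta(hours=1)
print(hours)

# timedeltas can be compared with each other
longer_than_a_year = tdelta > dt.timedelta(days=365)
print(longer_than_a_year)
```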
CSV¶
If you need to read CSV data, Python provides the csv module. (https://docs.python.org/3/library/csv.html)
import csv
data = list()
with open("data/countries-europe.csv") as csv_f:
    csv_reader = csv.reader(csv_f)
    for row in csv_reader:
        data.append(row)
        print(row)
['#python row index', 'country']
['0', 'Austria']
['1', 'Belgium']
['2', 'Bulgaria']
['3', 'Croatia']
['4', 'Cyprus']
['5', 'Czechia']
['6', 'Denmark']
['7', 'Estonia']
['8', 'Finland']
['9', 'France']
['10', 'Germany']
['11', 'Greece']
['12', 'Hungary']
['13', 'Ireland']
['14', 'Italy']
['15', 'Latvia']
['16', 'Lithuania']
['17', 'Luxembourg']
['18', 'Malta']
['19', 'Netherlands']
['20', 'Poland']
['21', 'Portugal']
['22', 'Romania']
['23', 'Slovak Republic']
['24', 'Slovenia']
['25', 'Spain']
['26', 'Sweden']
['27', 'United Kingdom']
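If the file has a header line, csv.DictReader can be more convenient than csv.reader, because each row becomes a dictionary keyed by the column names. The sketch below uses a few lines in the same layout as the file above, held in memory with io.StringIO so that it does not depend on the data folder.

```python
import csv
import io

# a few lines in the same layout as the file above, held in memory
sample = "#python row index,country\n0,Austria\n1,Belgium\n2,Bulgaria\n"

reader = csv.DictReader(io.StringIO(sample))
index_to_country = {}
for row in reader:
    # the keys are taken from the header line; all values are strings
    index_to_country[int(row["#python row index"])] = row["country"]

print(index_to_country)
```

When reading from a real file, the csv documentation recommends opening it with newline="" so that line breaks inside quoted fields are handled correctly.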
Other Useful Modules¶
There are many other useful modules. Based on your answers to my questions, I think the following modules might be of interest to many of you.
Data Analysis¶
- scipy - the main module for data-analysis functionality
- statsmodels - statistical models, statistical tests, data exploration
- pandas - powerful data analysis and manipulation library
Web Scraping¶
- requests - to access web content
- bs4 - to parse HTML or XML
- selenium - to control your browser
Web Development¶
- django - web development framework
- flask - micro web framework
Machine Learning¶
- sklearn - the main machine learning module
Databases¶
- sqlite3 - simple file-based database
(I do not have enough experience with other database modules to recommend one.)
Other¶
- subprocess - to start other processes and access their input and output
- asyncio - to parallelize IO-bound code
- multiprocessing - to parallelize CPU-bound code