Chapter 4. Data Types and Structures
Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
— Linus Torvalds
This chapter introduces the basic data types and data structures of Python. Although the Python interpreter itself already brings a rich variety of data structures with it, NumPy and other libraries add to these in a valuable fashion.
The chapter is organized as follows:
- Basic data types: The first section introduces basic data types such as int, float, and string.
- Basic data structures: The next section introduces the fundamental data structures of Python (e.g., list objects) and illustrates control structures, functional programming paradigms, and anonymous functions.
- NumPy data structures: The following section is devoted to the characteristics and capabilities of the NumPy ndarray class and illustrates some of the benefits of this class for scientific and financial applications.
- Vectorization of code: As the final section illustrates, thanks to NumPy's array class, vectorized code is easily implemented, leading to more compact and also better-performing code.
The spirit of this chapter is to provide a general introduction to Python specifics when it comes to data types and structures. If you are equipped with a background from another programming language, say C or Matlab, you should be able to easily grasp the differences that Python usage might bring along. The topics introduced here are all important and fundamental for the chapters to come.
Basic Data Types
Python is a dynamically typed language, which means that the Python interpreter infers the type of an object at runtime. In comparison, compiled languages like C are generally statically typed. In these cases, the type of an object has to be attached to the object before compile time.[18]
Integers
One of the most fundamental data types is the integer, or int:
In [1]: a = 10
        type(a)
Out[1]: int
The built-in function type provides type information for all objects with standard and built-in types as well as for newly created classes and objects. In the latter case, the information provided depends on the description the programmer has stored with the class. There is a saying that "everything in Python is an object." This means, for example, that even simple objects like the int object we just defined have built-in methods. For example, you can get the number of bits needed to represent the int object in memory by calling the method bit_length:
In [2]: a.bit_length()
Out[2]: 4
The number of bits needed increases with the integer value that we assign to the object:
In [3]: a = 100000
        a.bit_length()
Out[3]: 17
In general, there are so many different methods that it is hard to memorize all methods of all classes and objects. Advanced Python environments, like IPython, provide tab completion capabilities that show all methods attached to an object. You simply type the object name followed by a dot (e.g., a.) and then press the Tab key. This then provides a collection of methods you can call on the object. Alternatively, the Python built-in function dir gives a complete list of attributes and methods of any object.
A specialty of Python is that integers can be arbitrarily large. Consider, for example, the googol number $10^{100}$. Python has no problem with such large numbers, which are technically long objects:
In [4]: googol = 10 ** 100
        googol
Out[4]: 100000000000000000000000000000000000000000000000000000000000000000000000 00000000000000000000000000000L
In [5]: googol.bit_length()
Out[5]: 333
Large Integers
Python integers can be arbitrarily large. The interpreter simply uses as many bits/bytes as needed to represent the numbers.
It is important to note that mathematical operations on int objects return int objects. This can sometimes lead to confusion and/or hard-to-detect errors in mathematical routines. The following expression yields the expected result:
In [6]: 1 + 4
Out[6]: 5
However, the next case may return a somewhat surprising result:
In [7]: 1 / 4
Out[7]: 0
In [8]: type(1 / 4)
Out[8]: int
Floats
For the last expression to return the generally desired result of 0.25, we must operate on float objects, which brings us naturally to the next basic data type. Adding a dot to an integer value, like in 1. or 1.0, causes Python to interpret the object as a float. Expressions involving a float also return a float object in general:[19]
In [9]: 1. / 4
Out[9]: 0.25
In [10]: type(1. / 4)
Out[10]: float
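The integer division result shown above is specific to Python 2, which the examples in this chapter use. As a hedged aside, and assuming a Python 3 interpreter (an assumption, not something the chapter's sessions run on), the `/` operator always performs true division, while `//` performs floor division:

```python
# Sketch assuming Python 3 semantics (the chapter's own sessions use Python 2).
floor_result = 1 // 4   # floor division: the result stays an int
true_result = 1 / 4     # true division: the result is always a float

print(floor_result, type(floor_result).__name__)
print(true_result, type(true_result).__name__)
```

Writing `//` whenever floor division is intended keeps the meaning of a routine explicit on either interpreter version.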
A float is a bit more involved in that the computerized representation of rational or real numbers is in general not exact and depends on the specific technical approach taken. To illustrate what this implies, let us define another float object:
In [11]: b = 0.35
         type(b)
Out[11]: float
float objects like this one are always represented internally up to a certain degree of accuracy only. This becomes evident when adding 0.1 to b:
In [12]: b + 0.1
Out[12]: 0.44999999999999996
The reason for this is that floats are internally represented in binary format; that is, a decimal number $0 < n < 1$ is represented by a series of the form $n = \frac{x}{2} + \frac{y}{4} + \frac{z}{8} + \dots$, where each coefficient $x, y, z, \dots$ is either 0 or 1. For certain floating-point numbers the binary representation might involve a large number of elements or might even be an infinite series. However, given a fixed number of bits used to represent such a number (i.e., a fixed number of terms in the representation series), inaccuracies are the consequence. Other numbers can be represented perfectly and are therefore stored exactly even with a finite number of bits available. Consider the following example:
In [13]: c = 0.5
         c.as_integer_ratio()
Out[13]: (1, 2)
One half, i.e., 0.5, is stored exactly because it has an exact (finite) binary representation as $0.5 = \frac{1}{2}$. However, for b = 0.35 we get something different than the expected rational number $0.35 = \frac{7}{20}$:
In [14]: b.as_integer_ratio()
Out[14]: (3152519739159347, 9007199254740992)
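To make the gap between the stored value and the intended decimal value visible, the following minimal sketch uses the standard library's `fractions` module. The variable names `stored` and `intended` are illustrative and not part of the original session:

```python
from fractions import Fraction

b = 0.35
stored = Fraction(b)          # exact value of the binary float held in memory
intended = Fraction("0.35")   # exact decimal value, i.e., 7/20

print(stored)
print(intended)
print(float(stored - intended))  # tiny, nonzero representation error
```

Constructing a `Fraction` from a string avoids the detour through a binary float, which is why the second object equals 7/20 exactly.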
The precision is dependent on the number of bits used to represent the number. In general, all platforms that Python runs on use the IEEE 754 double-precision standard (i.e., 64 bits) for internal representation.[20] This translates into a 15-digit relative accuracy.
Since this topic is of high importance for several application areas in finance, it is sometimes necessary to ensure the exact, or at least best possible, representation of numbers. For example, the issue can be of importance when summing over a large set of numbers. In such a situation, a certain kind and/or magnitude of representation error might, in aggregate, lead to significant deviations from a benchmark value.
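The summation issue can be illustrated with a small sketch. It compares the built-in `sum` with `math.fsum` from the standard library, which tracks intermediate partial sums to avoid loss of precision. The list `values` is an illustrative input, not data from the chapter:

```python
import math

values = [0.1] * 10

naive = sum(values)        # accumulates a rounding error at every step
exact = math.fsum(values)  # tracks partial sums to avoid loss of precision

print(naive == 1.0, exact == 1.0)
print(naive, exact)
```

For ten terms the deviation is negligible, but the same effect can add up when aggregating over large data sets.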
The module decimal provides an arbitrary-precision object for floating-point numbers and several options to address precision issues when working with such numbers:
In [15]: import decimal
         from decimal import Decimal
In [16]: decimal.getcontext()
Out[16]: Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, flags=[], traps=[Overflow, InvalidOperation, DivisionByZero])
In [17]: d = Decimal(1) / Decimal(11)
         d
Out[17]: Decimal('0.09090909090909090909090909091')
You can change the precision of the representation by changing the respective attribute value of the Context object:
In [18]: decimal.getcontext().prec = 4  # lower precision than default
In [19]: e = Decimal(1) / Decimal(11)
         e
Out[19]: Decimal('0.09091')
In [20]: decimal.getcontext().prec = 50  # higher precision than default
In [21]: f = Decimal(1) / Decimal(11)
         f
Out[21]: Decimal('0.090909090909090909090909090909090909090909090909091')
If needed, the precision can in this way be adjusted to the exact problem at hand and one can operate with floating-point objects that exhibit different degrees of accuracy:
In [22]: g = d + e + f
         g
Out[22]: Decimal('0.27272818181818181818181818181909090909090909090909')
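Changing `prec` on the global context affects every later Decimal operation. As a hedged alternative, the following sketch uses `decimal.localcontext`, a standard-library context manager, to limit a precision change to one block of code. The names `before`, `after`, and `low` are illustrative:

```python
import decimal
from decimal import Decimal

before = decimal.getcontext().prec

with decimal.localcontext() as ctx:
    ctx.prec = 4                      # applies only inside this block
    low = Decimal(1) / Decimal(11)

after = decimal.getcontext().prec     # the global precision is unchanged
print(low, before, after)
```

This keeps low-precision and high-precision computations from interfering with each other in longer routines.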
Strings
Now that we can represent natural and floating-point numbers, we turn to text. The basic data type to represent text in Python is the string. The string object has a number of really helpful built-in methods. In fact, Python is generally considered to be a good choice when it comes to working with text files of any kind and any size. A string object is generally defined by single or double quotation marks or by converting another object using the str function (i.e., using the object's standard or user-defined string representation):
In [23]: t = 'this is a string object'
With regard to the built-in methods, you can, for example, capitalize the first word in this object:
In [24]: t.capitalize()
Out[24]: 'This is a string object'
Or you can split it into its single-word components to get a list object of all the words (more on list objects later):
In [25]: t.split()
Out[25]: ['this', 'is', 'a', 'string', 'object']
You can also search for a word and get the position (i.e., index value) of the first letter of the word back in a successful case:
In [26]: t.find('string')
Out[26]: 10
If the word is not in the string object, the method returns -1:
In [27]: t.find('Python')
Out[27]: -1
Replacing characters in a string is a typical task that is easily accomplished with the replace method:
In [28]: t.replace(' ', '|')
Out[28]: 'this|is|a|string|object'
The stripping of strings (i.e., deletion of certain leading/trailing characters) is also often necessary:
In [29]: 'http://www.python.org'.strip('htp:/')
Out[29]: 'www.python.org'
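Note that the argument of strip is treated as a set of characters, not as a prefix. The following sketch, using a made-up URL as an illustrative assumption, shows how this can remove more than intended and how an exact prefix can be removed instead:

```python
url = 'http://photos.org'  # hypothetical URL, for illustration only

# strip removes any leading/trailing characters contained in 'htp:/',
# so the 'p' and 'h' of 'photos' are removed as well
stripped = url.strip('htp:/')

# removing an exact prefix instead
prefix = 'http://'
without_prefix = url[len(prefix):] if url.startswith(prefix) else url

print(stripped)
print(without_prefix)
```

The example in the text works because the host name there starts with 'w', which is not in the character set passed to strip.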
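Table 4-1 lists a number of helpful methods of the string object.
Method | Arguments | Returns/result
capitalize | () | Copy of the string with first letter capitalized
count | (sub[, start[, end]]) | Count of the number of occurrences of substring
decode | ([encoding[, errors]]) | Decoded version of the string, using encoding (e.g., UTF-8)
encode | ([encoding[, errors]]) | Encoded version of the string
find | (sub[, start[, end]]) | (Lowest) index where substring is found
join | (seq) | Concatenation of strings in sequence seq
replace | (old, new[, count]) | Replaces old by new the first count times
split | ([sep[, maxsplit]]) | List of words in string with sep as separator
splitlines | ([keepends]) | Separated lines with line ends/breaks if keepends is True
strip | (chars) | Copy of string with leading/trailing characters in chars removed
upper | () | Copy with all letters capitalized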
A powerful tool when working with string objects is regular expressions. Python provides such functionality in the module re:
In [30]: import re
Suppose you are faced with a large text file, such as a comma-separated value (CSV) file, which contains certain time series and respective date-time information. More often than not, the date-time information is delivered in a format that Python cannot interpret directly. However, the date-time information can generally be described by a regular expression. Consider the following string object, containing three date-time elements, three integers, and three strings. Note that triple quotation marks allow the definition of strings over multiple rows:
In [31]: series = """
         '01/18/2014 13:00:00', 100, '1st';
         '01/18/2014 13:30:00', 110, '2nd';
         '01/18/2014 14:00:00', 120, '3rd'
         """
The following regular expression describes the format of the date-time information provided in the string object:[21]
In [32]: dt = re.compile("'[0-9/:\s]+'")  # datetime
Equipped with this regular expression, we can go on and find all the date-time elements. In general, applying regular expressions to string objects also leads to performance improvements for typical parsing tasks:
In [33]: result = dt.findall(series)
         result
Out[33]: ["'01/18/2014 13:00:00'", "'01/18/2014 13:30:00'", "'01/18/2014 14:00:00'"]
Regular Expressions
When parsing string objects, consider using regular expressions, which can bring both convenience and performance to such operations.
The resulting string objects can then be parsed to generate Python datetime objects (cf. Appendix C for an overview of handling date and time data with Python). To parse the string objects containing the date-time information, we need to provide information on how to parse them, again as a string object:
In [34]: from datetime import datetime
         pydt = datetime.strptime(result[0].replace("'", ""),
                                  '%m/%d/%Y %H:%M:%S')
         pydt
Out[34]: datetime.datetime(2014, 1, 18, 13, 0)
In [35]: print pydt
Out[35]: 2014-01-18 13:00:00
In [36]: print type(pydt)
Out[36]: <type 'datetime.datetime'>
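The session above converts only the first match. As a minimal sketch, and assuming a Python 3 interpreter, all matches can be converted in one pass with a list comprehension. The name `parsed` is illustrative and not part of the original session:

```python
import re
from datetime import datetime

series = """
'01/18/2014 13:00:00', 100, '1st';
'01/18/2014 13:30:00', 110, '2nd';
'01/18/2014 14:00:00', 120, '3rd'
"""

dt = re.compile(r"'[0-9/:\s]+'")  # raw string keeps the backslash in \s literal
result = dt.findall(series)

# convert every match, not just the first one
parsed = [datetime.strptime(r.replace("'", ""), '%m/%d/%Y %H:%M:%S')
          for r in result]
print(parsed)
```

The quoted labels such as '1st' are not matched because they contain letters, which the character class of the pattern does not allow.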
Later chapters provide more information on date-time data, the handling of such data, and datetime objects and their methods. This is just meant to be a teaser for this important topic in finance.
Basic Data Structures
As a general rule, data structures are objects that contain a possibly large number of other objects. Among those that Python provides as built-in structures are:
- tuple: A collection of arbitrary objects; only a few methods available
- list: A collection of arbitrary objects; many methods available
- dict: A key-value store object
- set: An unordered collection object for other unique objects
Tuples
A tuple is an advanced data structure, yet it's still quite simple and limited in its applications. It is defined by providing objects in parentheses:
In [37]: t = (1, 2.5, 'data')
         type(t)
Out[37]: tuple
You can even drop the parentheses and provide multiple objects separated by commas:
In [38]: t = 1, 2.5, 'data'
         type(t)
Out[38]: tuple
Like almost all data structures in Python, the tuple has a built-in index, with the help of which you can retrieve single or multiple elements of the tuple. It is important to remember that Python uses zero-based numbering, such that the third element of a tuple is at index position 2:
In [39]: t[2]
Out[39]: 'data'
In [40]: type(t[2])
Out[40]: str
Zero-Based Numbering
In contrast to some other programming languages like Matlab, Python uses zero-based numbering schemes. For example, the first element of a tuple object has index value 0.
There are only two special methods that this object type provides: count and index. The first counts the number of occurrences of a certain object and the second gives the index value of the first appearance of it:
In [41]: t.count('data')
Out[41]: 1
In [42]: t.index(1)
Out[42]: 0
tuple objects are not very flexible since, once defined, they cannot be changed easily.
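The following minimal sketch illustrates what this limitation means in practice: item assignment on a tuple raises a TypeError, so a modified version has to be built as a new object. The names `changed` and `t_new` are illustrative:

```python
t = (1, 2.5, 'data')

try:
    t[0] = 10               # item assignment is not supported for tuples
    changed = True
except TypeError:
    changed = False

t_new = (10,) + t[1:]       # build a new tuple instead of changing the old one
print(changed, t, t_new)
```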
Lists
Objects of type list are much more flexible and powerful in comparison to tuple objects. From a finance point of view, you can achieve a lot working only with list objects, such as storing stock price quotes and appending new data. A list object is defined through brackets and the basic capabilities and behavior are similar to those of tuple objects:
In [43]: l = [1, 2.5, 'data']
         l[2]
Out[43]: 'data'
list objects can also be defined or converted by using the function list. The following code generates a new list object by converting the tuple object from the previous example:
In [44]: l = list(t)
         l
Out[44]: [1, 2.5, 'data']
In [45]: type(l)
Out[45]: list
In addition to the characteristics of tuple objects, list objects are also expandable and reducible via different methods. In other words, whereas string and tuple objects are immutable sequence objects (with indexes) that cannot be changed once created, list objects are mutable and can be changed via different operations. You can append list objects to an existing list object, and more:
In [46]: l.append([4, 3])  # append list at the end
         l
Out[46]: [1, 2.5, 'data', [4, 3]]
In [47]: l.extend([1.0, 1.5, 2.0])  # append elements of list
         l
Out[47]: [1, 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]
In [48]: l.insert(1, 'insert')  # insert object before index position
         l
Out[48]: [1, 'insert', 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]
In [49]: l.remove('data')  # remove first occurrence of object
         l
Out[49]: [1, 'insert', 2.5, [4, 3], 1.0, 1.5, 2.0]
In [50]: p = l.pop(3)  # removes and returns object at index
         print l, p
Out[50]: [1, 'insert', 2.5, 1.0, 1.5, 2.0] [4, 3]
Slicing is also easily accomplished. Here, slicing refers to an operation that breaks down a data set into smaller parts (of interest):
In [51]: l[2:5]  # 3rd to 5th elements
Out[51]: [2.5, 1.0, 1.5]
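Table 4-2 provides a summary of selected operations and methods of the list object.
Method | Arguments | Returns/result
l[i] = x | [i] | Replaces ith element by x
l[i:j:k] = s | [i:j:k] | Replaces every kth element from i to j - 1 by s
append | (x) | Appends x to object
count | (x) | Number of occurrences of object x
del l[i:j:k] | [i:j:k] | Deletes elements with index values i to j - 1
extend | (s) | Appends all elements of s to object
index | (x[, i[, j]]) | First index of x between elements i and j - 1
insert | (i, x) | Inserts x at/before index i
remove | (x) | Removes first element with value x
pop | (i) | Removes element with index i and returns it
reverse | () | Reverses all items in place
sort | ([cmp[, key[, reverse]]]) | Sorts all items in place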
Excursion: Control Structures
Although a topic in itself, control structures like for loops are maybe best introduced in Python based on list objects. This is due to the fact that looping in general takes place over list objects, which is quite different from what is often the standard in other languages. Take the following example. The for loop loops over the elements of the list object l with index values 2 to 4 and prints the square of the respective elements. Note the importance of the indentation (whitespace) in the second line:
In [52]: for element in l[2:5]:
             print element ** 2
Out[52]: 6.25 1.0 2.25
This provides a really high degree of flexibility in comparison to the typical counter-based looping. Counter-based looping is also an option with Python, but is accomplished based on the (standard) list object range:
In [53]: r = range(0, 8, 1)  # start, end, step width
         r
Out[53]: [0, 1, 2, 3, 4, 5, 6, 7]
In [54]: type(r)
Out[54]: list
For comparison, the same loop is implemented using range as follows:
In [55]: for i in range(2, 5):
             print l[i] ** 2
Out[55]: 6.25 1.0 2.25
Looping over Lists
In Python you can loop over arbitrary list objects, no matter what the content of the object is. This often avoids the introduction of a counter.
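When both the element and its position are needed, the built-in function enumerate offers a middle ground between the two looping styles. The following is a minimal sketch; the contents of `l` mirror the state of the list in the session above, and the name `squares` is illustrative:

```python
l = [1, 'insert', 2.5, 1.0, 1.5, 2.0]

squares = []
for position, element in enumerate(l[2:5], start=2):
    # position is the index in l, element the object stored there
    squares.append((position, element ** 2))

print(squares)
```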
Python also provides the typical (conditional) control elements if, elif, and else. Their use is comparable to that in other languages:
In [56]: for i in range(1, 10):
             if i % 2 == 0:  # % is for modulo
                 print "%d is even" % i
             elif i % 3 == 0:
                 print "%d is multiple of 3" % i
             else:
                 print "%d is odd" % i
Out[56]: 1 is odd
         2 is even
         3 is multiple of 3
         4 is even
         5 is odd
         6 is even
         7 is odd
         8 is even
         9 is multiple of 3
Similarly, while provides another means to control the flow:
In [57]: total = 0
         while total < 100:
             total += 1
         total
Out[57]: 100
A specialty of Python is so-called list comprehensions. Instead of looping over existing list objects, this approach generates list objects via loops in a rather compact fashion:
In [58]: m = [i ** 2 for i in range(5)]
         m
Out[58]: [0, 1, 4, 9, 16]
In a certain sense, this already provides a first means to generate “something like” vectorized code in that loops are rather more implicit than explicit (vectorization of code is discussed in more detail later in this chapter).
Excursion: Functional Programming
Python provides a number of tools for functional programming support as well (i.e., the application of a function to a whole set of inputs, in our case list objects). Among these tools are filter, map, and reduce. However, we need a function definition first. To start with something really simple, consider a function f that returns the square of the input x:
In [59]: def f(x):
             return x ** 2
         f(2)
Out[59]: 4
Of course, functions can be arbitrarily complex, with multiple input/parameter objects and even multiple outputs (return objects). However, consider the following function:
In [60]: def even(x):
             return x % 2 == 0
         even(3)
Out[60]: False
The return object is a Boolean. Such a function can be applied to a whole list object by using map:
In [61]: map(even, range(10))
Out[61]: [True, False, True, False, True, False, True, False, True, False]
To this end, we can also provide a function definition directly as an argument to map, by using lambda or anonymous functions:
In [62]: map(lambda x: x ** 2, range(10))
Out[62]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Functions can also be used to filter a list object. In the following example, the filter returns elements of a list object that match the Boolean condition as defined by the even function:
In [63]: filter(even, range(15))
Out[63]: [0, 2, 4, 6, 8, 10, 12, 14]
Finally, reduce helps when we want to apply a function to all elements of a list object that returns a single value only. An example is the cumulative sum of all elements in a list object (assuming that summation is defined for the objects contained in the list):
In [64]: reduce(lambda x, y: x + y, range(10))
Out[64]: 45
An alternative, nonfunctional implementation could look like the following:
In [65]: def cumsum(l):
             total = 0
             for elem in l:
                 total += elem
             return total
         cumsum(range(10))
Out[65]: 45
List Comprehensions, Functional Programming, Anonymous Functions
It can be considered good practice to avoid loops on the Python level as far as possible. list comprehensions and functional programming tools like map, filter, and reduce provide means to write code without loops that is both compact and in general more readable. lambda or anonymous functions are also powerful tools in this context.
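The sessions in this section rely on Python 2 behavior, where map and filter return list objects and reduce is a built-in. As a hedged sketch, and assuming a Python 3 interpreter, the same results can be obtained as follows. The variable names are illustrative:

```python
from functools import reduce  # in Python 3, reduce lives in functools

def even(x):
    return x % 2 == 0

squares = list(map(lambda x: x ** 2, range(10)))  # map returns an iterator
evens = list(filter(even, range(15)))             # so does filter
total = reduce(lambda x, y: x + y, range(10))

# the same results with list comprehensions and the built-in sum
squares_lc = [x ** 2 for x in range(10)]
evens_lc = [x for x in range(15) if even(x)]
total_sum = sum(range(10))

print(squares, evens, total)
```

The comprehension variants avoid the explicit conversion to list and are often considered the more readable choice.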
Dicts
dict objects are dictionaries, and also mutable mapping objects, that allow data retrieval by keys that can, for example, be string objects. They are so-called key-value stores. While list objects are ordered and sortable, dict objects are unordered and unsortable. An example best illustrates further differences to list objects. Curly brackets are what define dict objects:
In [66]: d = {
              'Name' : 'Angela Merkel',
              'Country' : 'Germany',
              'Profession' : 'Chancelor',
              'Age' : 60
              }
         type(d)
Out[66]: dict
In [67]: print d['Name'], d['Age']
Out[67]: Angela Merkel 60
Again, this class of objects has a number of built-in methods:
In [68]: d.keys()
Out[68]: ['Country', 'Age', 'Profession', 'Name']
In [69]: d.values()
Out[69]: ['Germany', 60, 'Chancelor', 'Angela Merkel']
In [70]: d.items()
Out[70]: [('Country', 'Germany'), ('Age', 60), ('Profession', 'Chancelor'), ('Name', 'Angela Merkel')]
In [71]: birthday = True
         if birthday is True:
             d['Age'] += 1
         d['Age']
Out[71]: 61
There are several methods to get iterator objects from the dict object. The objects behave like list objects when iterated over:
In [72]: for item in d.iteritems():
             print item
Out[72]: ('Country', 'Germany') ('Age', 61) ('Profession', 'Chancelor') ('Name', 'Angela Merkel')
In [73]: for value in d.itervalues():
             print type(value)
Out[73]: <type 'str'> <type 'int'> <type 'str'> <type 'str'>
Table 4-3 provides a summary of selected operations and methods of the dict object.
Method | Arguments | Returns/result
d[k] | [k] | Item of d with key k
d[k] = x | [k] | Sets item key k to x
del d[k] | [k] | Deletes item with key k
clear | () | Removes all items
copy | () | Makes a copy
items | () | Copy of all key-value pairs
iteritems | () | Iterator over all items
iterkeys | () | Iterator over all keys
itervalues | () | Iterator over all values
keys | () | Copy of all keys
pop | (k) | Returns and removes item with key k
update | ([e]) | Updates items with items from e
values | () | Copy of all values
Sets
The last data structure we will consider is the set object. Although set theory is a cornerstone of mathematics and also finance theory, there are not too many practical applications for set objects. The objects are unordered collections of other objects, containing every element only once:
In [74]: s = set(['u', 'd', 'ud', 'du', 'd', 'du'])
         s
Out[74]: {'d', 'du', 'u', 'ud'}
In [75]: t = set(['d', 'dd', 'uu', 'u'])
With set objects, you can implement operations as you are used to in mathematical set theory. For example, you can generate unions, intersections, and differences:
In [76]: s.union(t)  # all of s and t
Out[76]: {'d', 'dd', 'du', 'u', 'ud', 'uu'}
In [77]: s.intersection(t)  # both in s and t
Out[77]: {'d', 'u'}
In [78]: s.difference(t)  # in s but not t
Out[78]: {'du', 'ud'}
In [79]: t.difference(s)  # in t but not s
Out[79]: {'dd', 'uu'}
In [80]: s.symmetric_difference(t)  # in either one but not both
Out[80]: {'dd', 'du', 'ud', 'uu'}
One application of set objects is to get rid of duplicates in a list object. For example:
In [81]: from random import randint
         l = [randint(0, 10) for i in range(1000)]
             # 1,000 random integers between 0 and 10
         len(l)  # number of elements in l
Out[81]: 1000
In [82]: l[:20]
Out[82]: [8, 3, 4, 9, 1, 7, 5, 5, 6, 7, 4, 4, 7, 1, 8, 5, 0, 7, 1, 9]
In [83]: s = set(l)
         s
Out[83]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
NumPy Data Structures
The previous section shows that Python provides some quite useful and flexible general data structures. In particular, list objects can be considered a real workhorse with many convenient characteristics and application areas. However, scientific and financial applications generally have a need for high-performing operations on special data structures. One of the most important data structures in this regard is the array. Arrays generally structure other (fundamental) objects in rows and columns.
Assume for the moment that we work with numbers only, although the concept generalizes to other types of data as well. In the simplest case, a one-dimensional array then represents, mathematically speaking, a vector of, in general, real numbers, internally represented by float objects. It then consists of a single row or column of elements only. In a more common case, an array represents an i × j matrix of elements. This concept generalizes to i × j × k cubes of elements in three dimensions as well as to general n-dimensional arrays of shape i × j × k × l × … .
Mathematical disciplines like linear algebra and vector space theory illustrate that such mathematical structures are of high importance in a number of disciplines and fields. It can therefore prove fruitful to have available a specialized class of data structures explicitly designed to handle arrays conveniently and efficiently. This is where the Python library NumPy comes into play, with its ndarray class.
Arrays with Python Lists
Before we turn to NumPy, let us first construct arrays with the built-in data structures presented in the previous section. list objects are particularly suited to accomplishing this task. A simple list can already be considered a one-dimensional array:
In [84]: v = [0.5, 0.75, 1.0, 1.5, 2.0]  # vector of numbers
Since list objects can contain arbitrary other objects, they can also contain other list objects. In that way, two- and higher-dimensional arrays are easily constructed by nested list objects:
In [85]: m = [v, v, v]  # matrix of numbers
         m
Out[85]: [[0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0]]
We can also easily select rows via simple indexing or single elements via double indexing (whole columns, however, are not so easy to select):
In [86]: m[1]
Out[86]: [0.5, 0.75, 1.0, 1.5, 2.0]
In [87]: m[1][0]
Out[87]: 0.5
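To illustrate why column selection is less direct with nested list objects, the following minimal sketch extracts columns with a list comprehension and with zip. The matrix `m` here is an illustrative 3 × 5 example with distinct rows, not the one from the session:

```python
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, [2 * x for x in v], [3 * x for x in v]]  # illustrative 3 x 5 matrix

first_column = [row[0] for row in m]      # pick element 0 from every row
columns = [list(col) for col in zip(*m)]  # transpose: all columns at once

print(first_column)
print(columns[1])
```

Both variants have to visit every row explicitly, in contrast to row selection, which is a single indexing operation.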
Nesting can be pushed further for even more general structures:
In [88]: v1 = [0.5, 1.5]
         v2 = [1, 2]
         m = [v1, v2]
         c = [m, m]  # cube of numbers
         c
Out[88]: [[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]
In [89]: c[1][1][0]
Out[89]: 1
Note that combining objects in the way just presented generally works with reference pointers to the original objects. What does that mean in practice? Let us have a look at the following operations:
In [90]: v = [0.5, 0.75, 1.0, 1.5, 2.0]
         m = [v, v, v]
         m
Out[90]: [[0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0]]
Now change the value of the first element of the v object and see what happens to the m object:
In [91]: v[0] = 'Python'
         m
Out[91]: [['Python', 0.75, 1.0, 1.5, 2.0], ['Python', 0.75, 1.0, 1.5, 2.0], ['Python', 0.75, 1.0, 1.5, 2.0]]
This can be avoided by using the deepcopy function of the copy module:
In [92]: from copy import deepcopy
         v = [0.5, 0.75, 1.0, 1.5, 2.0]
         m = 3 * [deepcopy(v), ]
         m
Out[92]: [[0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0]]
In [93]: v[0] = 'Python'
         m
Out[93]: [[0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0]]
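One caveat is worth a hedged aside: multiplying a list that holds a single copy repeats the reference to that one copy, so the rows are independent of v but not of each other. The following sketch contrasts this with creating one copy per row. The names `shared` and `independent` are illustrative:

```python
from copy import deepcopy

v = [0.5, 0.75, 1.0, 1.5, 2.0]

shared = 3 * [deepcopy(v)]                     # three references to ONE copy
independent = [deepcopy(v) for _ in range(3)]  # three separate copies

shared[0][0] = 'Python'       # shows up in every row of shared
independent[0][0] = 'Python'  # affects the first row only

print(shared)
print(independent)
```

Whether shared rows are a problem depends on the use case, but a comprehension is the safer default when rows are modified individually later on.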
Regular NumPy Arrays
Obviously, composing array structures with list objects works, somewhat. But it is not really convenient, and the list class has not been built with this specific goal in mind. It has rather been built with a much broader and more general scope. From this point of view, some kind of specialized class could therefore be really beneficial to handle array-type structures.
Such a specialized class is numpy.ndarray, which has been built with the specific goal of handling n-dimensional arrays both conveniently and efficiently (i.e., in a highly performing manner). The basic handling of instances of this class is again best illustrated by examples:
In [94]: import numpy as np
In
[
95
]:
a
=
np
.
array
([
0
,
0.5
,
1.0
,
1.5
,
2.0
])
type
(
a
)
Out[95]: numpy.ndarray
In
[
96
]:
a
[:
2
]
# indexing as with list objects in 1 dimension
Out[96]: array([ 0. , 0.5])
A major feature of the numpy.ndarray class is the multitude of built-in methods. For instance:
In [97]: a.sum()  # sum of all elements
Out[97]: 5.0
In [98]: a.std()  # standard deviation
Out[98]: 0.70710678118654757
In [99]: a.cumsum()  # running cumulative sum
Out[99]: array([ 0. , 0.5, 1.5, 3. , 5. ])
Another major feature is the (vectorized) mathematical operations defined on ndarray objects:
In [100]: a * 2
Out[100]: array([ 0., 1., 2., 3., 4.])
In [101]: a ** 2
Out[101]: array([ 0. , 0.25, 1. , 2.25, 4. ])
In [102]: np.sqrt(a)
Out[102]: array([ 0. , 0.70710678, 1. , 1.22474487, 1.41421356])
The transition to more than one dimension is seamless, and all features presented so far carry over to the more general cases. In particular, the indexing system is made consistent across all dimensions:
In [103]: b = np.array([a, a * 2])
          b
Out[103]: array([[ 0. , 0.5, 1. , 1.5, 2. ],
                 [ 0. , 1. , 2. , 3. , 4. ]])
In [104]: b[0]  # first row
Out[104]: array([ 0. , 0.5, 1. , 1.5, 2. ])
In [105]: b[0, 2]  # third element of first row
Out[105]: 1.0
In [106]: b.sum()
Out[106]: 15.0
In contrast to our list object-based approach to constructing arrays, the numpy.ndarray class knows axes explicitly. Selecting either rows or columns from a matrix is essentially the same:
In [107]: b.sum(axis=0)
            # sum along axis 0, i.e. column-wise sum
Out[107]: array([ 0. , 1.5, 3. , 4.5, 6. ])
In [108]: b.sum(axis=1)
            # sum along axis 1, i.e. row-wise sum
Out[108]: array([ 5., 10.])
There are a number of ways to initialize (instantiate) a numpy.ndarray object. One is as presented before, via np.array. However, this assumes that all elements of the array are already available. In contrast, one would maybe like to have the numpy.ndarray objects instantiated first to populate them later with results generated during the execution of code. To this end, we can use the following functions:
In [109]: c = np.zeros((2, 3, 4), dtype='i', order='C')  # also: np.ones()
          c
Out[109]: array([[[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]],

                 [[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]], dtype=int32)
In [110]: d = np.ones_like(c, dtype='f16', order='C')  # also: np.zeros_like()
          d
Out[110]: array([[[ 1.0, 1.0, 1.0, 1.0],
                  [ 1.0, 1.0, 1.0, 1.0],
                  [ 1.0, 1.0, 1.0, 1.0]],

                 [[ 1.0, 1.0, 1.0, 1.0],
                  [ 1.0, 1.0, 1.0, 1.0],
                  [ 1.0, 1.0, 1.0, 1.0]]], dtype=float128)
With all these functions we provide the following information:
- shape: Either an int, a sequence of ints, or a reference to another numpy.ndarray
- dtype (optional): A numpy.dtype; these are NumPy-specific data types for numpy.ndarray objects
- order (optional): The order in which to store elements in memory: C for C-like (i.e., row-wise) or F for Fortran-like (i.e., column-wise)
Here, it becomes obvious how NumPy specializes the construction of arrays with the numpy.ndarray class, in comparison to the list-based approach:
- The shape/length/size of the array is homogeneous across any given dimension.
- It only allows for a single data type (numpy.dtype) for the whole array.
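Both restrictions can be observed directly. The following is a small sketch; the variable names and the concrete dtype strings 'i4' and 'f8' are chosen here for illustration only.

```python
import numpy as np

mixed = [1, 2.5, 3]      # a list may hold objects of different types
arr = np.array(mixed)    # the array converts everything to one common dtype
print(arr.dtype)         # float64

c = np.zeros((2, 3, 4), dtype='i4', order='C')
d = np.ones_like(c, dtype='f8')  # same shape as c, different dtype
print(c.shape, c.dtype)  # (2, 3, 4) int32
print(d.shape, d.dtype)  # (2, 3, 4) float64

c[0, 0, 0] = 7.9         # a float stored in an integer array is truncated
print(c[0, 0, 0])        # 7
```

The last two lines show the practical consequence of the single-dtype rule: values are silently converted to the array's dtype on assignment.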
The role of the order parameter is discussed later in the chapter. Table 4-4 provides an overview of numpy.dtype objects (i.e., the basic data types NumPy allows).
dtype | Description | Example |
| Bit field | |
| Boolean | |
| Integer | |
| Unsigned integer | |
| Floating point | |
| Complex floating point | |
| Object | |
| String | |
| Unicode | |
| Other | |
NumPy provides a generalization of regular arrays that loosens at least the dtype restriction, but let us stick with regular arrays for a moment and see what the specialization brings in terms of performance.
As a simple exercise, suppose we want to generate a matrix/array of shape 5,000 × 5,000 elements, populated with (pseudo)random, standard normally distributed numbers. We then want to calculate the sum of all elements. First, the pure Python approach, where we make heavy use of list comprehensions and functional programming methods as well as lambda functions:
In [111]: import random
          I = 5000
In [112]: %time mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]
            # a nested list comprehension
Out[112]: CPU times: user 36.5 s, sys: 408 ms, total: 36.9 s Wall time: 36.4 s
In [113]: %time reduce(lambda x, y: x + y, \
                       [reduce(lambda x, y: x + y, row) \
                        for row in mat])
Out[113]: CPU times: user 4.3 s, sys: 52 ms, total: 4.35 s Wall time: 4.07 s 678.5908519876674
Let us now turn to NumPy and see how the same problem is solved there. For convenience, the NumPy sublibrary random offers a multitude of functions to initialize a numpy.ndarray object and populate it at the same time with (pseudo)random numbers:
In [114]: %time mat = np.random.standard_normal((I, I))
Out[114]: CPU times: user 1.83 s, sys: 40 ms, total: 1.87 s Wall time: 1.87 s
In [115]: %time mat.sum()
Out[115]: CPU times: user 36 ms, sys: 0 ns, total: 36 ms Wall time: 34.6 ms 349.49777911439384
We observe the following:
- Syntax
  Although we use several approaches to compact the pure Python code, the NumPy version is even more compact and readable.
- Performance
  The generation of the numpy.ndarray object is roughly 20 times faster and the calculation of the sum is roughly 100 times faster than the respective operations in pure Python.
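The comparison can be repeated outside an interactive session with the standard timeit module. The sketch below is an illustration under stated assumptions: the matrix is much smaller (I = 200) so that it runs quickly, the helper names pure_python_sum and numpy_sum are hypothetical, and reduce is imported from functools so that the code also runs on Python 3. Absolute timings depend on the machine and are therefore not shown.

```python
import random
import timeit
from functools import reduce  # a built-in on Python 2, in functools on Python 3

import numpy as np

I = 200  # assumed smaller size so that the sketch runs quickly


def pure_python_sum(n):
    # nested list comprehension followed by a nested reduce, as in the text
    mat = [[random.gauss(0, 1) for j in range(n)] for i in range(n)]
    return reduce(lambda x, y: x + y,
                  [reduce(lambda x, y: x + y, row) for row in mat])


def numpy_sum(n):
    # generation and summation both happen inside NumPy
    return np.random.standard_normal((n, n)).sum()


t_py = timeit.timeit(lambda: pure_python_sum(I), number=3)
t_np = timeit.timeit(lambda: numpy_sum(I), number=3)
print('pure Python: %.4f s, NumPy: %.4f s' % (t_py, t_np))
```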
Structured Arrays
The specialization of the numpy.ndarray class obviously brings a number of really valuable benefits with it. However, too narrow a specialization might turn out to be too large a burden for the majority of array-based algorithms and applications. Therefore, NumPy provides structured arrays that allow us to have different NumPy data types per column, at least. What does “per column” mean? Consider the following initialization of a structured array object:
In [116]: dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
                         ('Height', 'f'), ('Children/Pets', 'i4', 2)])
          s = np.array([('Smith', 45, 1.83, (0, 1)),
                        ('Jones', 53, 1.72, (2, 2))], dtype=dt)
          s
Out[116]: array([('Smith', 45, 1.8300000429153442, [0, 1]), ('Jones', 53, 1.7200000286102295, [2, 2])], dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'), ('Children/Pets', '<i4', (2,))])
In a sense, this construction comes quite close to the operation for initializing tables in a SQL database. We have column names and column data types, with maybe some additional information (e.g., maximum number of characters per string object). The single columns can now be easily accessed by their names:
In [117]: s['Name']
Out[117]: array(['Smith', 'Jones'], dtype='|S10')
In [118]: s['Height'].mean()
Out[118]: 1.7750001
Having selected a specific row (i.e., a record), the resulting object mainly behaves like a dict object, where one can retrieve values via keys:
In [119]: s[1]['Age']
Out[119]: 53
In summary, structured arrays are a generalization of the regular numpy.ndarray object types in that the data type only has to be the same per column, as one is used to in the context of tables in SQL databases. One advantage of structured arrays is that a single element of a column can be another multidimensional object and does not have to conform to the basic NumPy data types.
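As a compact recap, the following sketch builds a similar structured array and exercises the three access patterns discussed: by column, by record, and by sub-array field. Two details are assumptions made for illustration: the name people, and the 'U10' (unicode) type for the Name field, which replaces the byte-string 'S10' so that names compare equal to ordinary str objects on Python 3.

```python
import numpy as np

# 'U10' instead of 'S10' is an assumption for Python 3 friendliness
dt = np.dtype([('Name', 'U10'), ('Age', 'i4'),
               ('Height', 'f4'), ('Children/Pets', 'i4', (2,))])
people = np.array([('Smith', 45, 1.83, (0, 1)),
                   ('Jones', 53, 1.72, (2, 2))], dtype=dt)

print(people['Age'].mean())           # column access by field name: 49.0
print(people[1]['Name'])              # record first, then field: Jones
print(people['Children/Pets'].shape)  # sub-array field: (2, 2)

# a column can drive a boolean selection, much like a WHERE clause
print(people[people['Age'] > 50]['Name'])
```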
Structured Arrays
NumPy provides, in addition to regular arrays, structured arrays that allow the description and handling of rather complex array-oriented data structures with a variety of different data types and even structures per (named) column. They bring SQL table-like data structures to Python, with all the benefits of regular numpy.ndarray objects (syntax, methods, performance).
Vectorization of Code
Vectorization of code is a strategy to get more compact code that is possibly executed faster. The fundamental idea is to conduct an operation on or to apply a function to a complex object “at once” and not by iterating over the single elements of the object. In Python, the functional programming tools map, filter, and reduce provide means for vectorization. In a sense, NumPy has vectorization built in deep down in its core.
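A minimal side-by-side sketch of the two styles follows. The data and all variable names are made up for illustration, and reduce is imported from functools so that the code runs on Python 3 as well.

```python
from functools import reduce  # needed on Python 3

import numpy as np

values = [1, 2, 3, 4, 5, 6]

# Functional tools: no explicit loop appears in the code
squares = list(map(lambda x: x ** 2, values))
evens = list(filter(lambda x: x % 2 == 0, values))
total = reduce(lambda x, y: x + y, values)

# NumPy counterparts operate on the whole array at once
arr = np.array(values)
np_squares = arr ** 2
np_evens = arr[arr % 2 == 0]
np_total = arr.sum()

print(squares, evens, total)
print(np_squares, np_evens, np_total)
```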
Basic Vectorization
As we learned in the previous section, simple mathematical operations can be implemented on numpy.ndarray objects directly. For example, we can add two NumPy arrays element-wise as follows:
In [120]: r = np.random.standard_normal((4, 3))
          s = np.random.standard_normal((4, 3))
In [121]: r + s
Out[121]: array([[-1.94801686, -0.6855251 , 2.28954806], [ 0.33847593, -1.97109602, 1.30071653], [-1.12066585, 0.22234207, -2.73940339], [ 0.43787363, 0.52938941, -1.38467623]])
NumPy also supports what is called broadcasting. This allows us to combine objects of different shape within a single operation. We have already made use of this before. Consider the following example:
In [122]: 2 * r + 3
Out[122]: array([[ 2.54691692, 1.65823523, 8.14636725], [ 4.94758114, 0.25648128, 1.89566919], [ 0.41775907, 0.58038395, 2.06567484], [ 0.67600205, 3.41004636, 1.07282384]])
In this case, the r object is multiplied by 2 element-wise and then 3 is added element-wise; the 3 is broadcast, or stretched, to the shape of the r object. It works with differently shaped arrays as well, up to a certain point:
In [123]: s = np.random.standard_normal(3)
          r + s
Out[123]: array([[ 0.23324118, -1.09764268, 1.90412565], [ 1.43357329, -1.79851966, -1.22122338], [-0.83133775, -1.63656832, -1.13622055], [-0.70221625, -0.22173711, -1.63264605]])
This broadcasts the one-dimensional array of size 3 to a shape of (4, 3). The same does not work, for example, with a one-dimensional array of size 4:
In [124]: s = np.random.standard_normal(4)
          r + s
Out[124]: ValueError operands could not be broadcast together with shapes (4,3) (4,)
However, transposing the r object makes the operation work again. In the following code, the transpose method transforms the ndarray object with shape (4, 3) into an object of the same type with shape (3, 4):
In [125]: r.transpose() + s
Out[125]: array([[-0.63380522, 0.5964174 , 0.88641996, -0.86931849], [-1.07814606, -1.74913253, 0.9677324 , 0.49770367], [ 2.16591995, -0.92953858, 1.71037785, -0.67090759]])
In [126]: np.shape(r.T)
Out[126]: (3, 4)
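The rule behind these examples is that shapes are compared from the last axis backward, and two axes are compatible if they are equal or if one of them has length 1. The sketch below illustrates this with simple, deterministic arrays; the names row and col are chosen for illustration. It also shows an alternative to transposing, namely turning the length-4 vector into a column of shape (4, 1).

```python
import numpy as np

r = np.ones((4, 3))
row = np.arange(3)  # shape (3,): matches the last axis of r
col = np.arange(4)  # shape (4,): does not match the last axis of r

print((r + row).shape)  # (4, 3)

try:
    r + col
except ValueError as err:
    print('not broadcastable:', err)

# Two ways to make the length-4 vector usable
print((r.T + col).shape)               # (3, 4), transposing as in the text
print((r + col[:, np.newaxis]).shape)  # (4, 3), col reshaped to (4, 1)
```

The second variant keeps the original orientation of r, which is often preferable when the result has to line up with other (4, 3) arrays.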
As a general rule, custom-defined Python functions work with numpy.ndarray objects as well. If the implementation allows, arrays can be used with functions just as int or float objects can. Consider the following function:
In [127]: def f(x):
              return 3 * x + 5
We can pass standard Python objects as well as numpy.ndarray objects (for which the operations in the function have to be defined, of course):
In [128]: f(0.5)  # float object
Out[128]: 6.5
In [129]: f(r)  # NumPy array
Out[129]: array([[ 4.32037538, 2.98735285, 12.71955087], [ 7.9213717 , 0.88472192, 3.34350378], [ 1.1266386 , 1.37057593, 3.59851226], [ 1.51400308, 5.61506954, 2.10923576]])
What NumPy does here is apply the operations inside the function f to the object element-wise. In that sense, by using this kind of operation we do not avoid loops; we only avoid them on the Python level and delegate the looping to NumPy. On the NumPy level, looping over the numpy.ndarray object is taken care of by highly optimized code, most of it written in C and therefore generally much faster than pure Python. This explains the “secret” behind the performance benefits of using NumPy for array-based use cases.
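To see what is being delegated, the following sketch writes out the loops that a single call f(r) replaces and checks that both routes give the same result. The names looped and vectorized are chosen for illustration.

```python
import numpy as np


def f(x):
    return 3 * x + 5


r = np.random.standard_normal((4, 3))

# Explicit looping on the Python level
looped = np.empty_like(r)
for i in range(r.shape[0]):
    for j in range(r.shape[1]):
        looped[i, j] = f(r[i, j])

# One call; the element-wise looping happens inside NumPy
vectorized = f(r)

print(np.allclose(looped, vectorized))  # True
```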
When working with arrays, one has to take care to call the right functions on the respective objects. For example, the sin function from the standard math module of Python does not work with NumPy arrays:
In [130]: import math
          math.sin(r)
Out[130]: TypeError only length-1 arrays can be converted to Python scalars
The function is designed to handle, for example, float objects (i.e., single numbers, not arrays). NumPy provides the respective counterparts as so-called ufuncs, or universal functions:
In [131]: np.sin(r)  # array as input
# array as input
Out[131]: array([[-0.22460878, -0.62167738, 0.53829193], [ 0.82702259, -0.98025745, -0.52453206], [-0.96114497, -0.93554821, -0.45035471], [-0.91759955, 0.20358986, -0.82124413]])
In [132]: np.sin(np.pi)  # float as input
Out[132]: 1.2246467991473532e-16
NumPy provides a large number of such ufuncs that generalize typical mathematical functions to numpy.ndarray objects.[22]
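The difference between the two sin functions can be checked in a few lines. This is a small sketch with a made-up input array named angles; both libraries are imported by name so that the two functions stay distinguishable.

```python
import math

import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])

try:
    math.sin(angles)  # expects a single number
except TypeError as err:
    print('math.sin failed:', err)

print(np.sin(angles))  # element-wise on the whole array
print(np.isclose(np.sin(0.5), math.sin(0.5)))  # both agree on a scalar: True
```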
Universal Functions
Be careful when using the from library import * approach to importing. Such an approach can cause the NumPy reference to the ufunc numpy.sin to be replaced by the reference to the math function math.sin. You should, as a rule, import both libraries by name to avoid confusion: import numpy as np; import math. Then you can use math.sin alongside np.sin.
Memory Layout
When we first initialized numpy.ndarray objects by using numpy.zeros, we provided an optional argument for the memory layout. This argument specifies, roughly speaking, which elements of an array get stored in memory next to each other. When working with small arrays, this has hardly any measurable impact on the performance of array operations. However, when arrays get large the story is somewhat different, depending on the operations to be implemented on the arrays.
To illustrate this important point for memory-wise handling of arrays in science and finance, consider the following construction of multidimensional numpy.ndarray objects:
In [133]: x = np.random.standard_normal((5, 10000000))
          y = 2 * x + 3  # linear equation y = a * x + b
          C = np.array((x, y), order='C')
          F = np.array((x, y), order='F')
          x = 0.0; y = 0.0  # memory cleanup
In [134]: C[:2].round(2)
Out[134]: array([[[-0.51, -1.14, -1.07, ..., 0.2 , -0.18, 0.1 ], [-1.22, 0.68, 1.83, ..., 1.23, -0.27, -0.16], [ 0.45, 0.15, 0.01, ..., -0.75, 0.91, -1.12], [-0.16, 1.4 , -0.79, ..., -0.33, 0.54, 1.81], [ 1.07, -1.07, -0.37, ..., -0.76, 0.71, 0.34]], [[ 1.98, 0.72, 0.86, ..., 3.4 , 2.64, 3.21], [ 0.55, 4.37, 6.66, ..., 5.47, 2.47, 2.68], [ 3.9 , 3.29, 3.03, ..., 1.5 , 4.82, 0.76], [ 2.67, 5.8 , 1.42, ..., 2.34, 4.09, 6.63], [ 5.14, 0.87, 2.27, ..., 1.48, 4.43, 3.67]]])
Let’s look at some really fundamental examples and use cases for both types of ndarray objects:
In [135]: %timeit C.sum()
Out[135]: 10 loops, best of 3: 123 ms per loop
In [136]: %timeit F.sum()
Out[136]: 10 loops, best of 3: 123 ms per loop
When summing up all elements of the arrays, there is no performance difference between the two memory layouts. However, consider the following example with the C-like memory layout:
In [137]: %timeit C[0].sum(axis=0)
Out[137]: 10 loops, best of 3: 102 ms per loop
In [138]: %timeit C[0].sum(axis=1)
Out[138]: 10 loops, best of 3: 61.9 ms per loop
Summing across the five rows, which returns one large results vector of 10,000,000 elements (axis=0), is slower in this case than summing along each of the five rows, which returns only five results (axis=1). This is due to the fact that, with the C-like layout, the elements of each row are stored next to each other in memory. With the Fortran-like memory layout, the relative performance changes considerably:
In [139]: %timeit F.sum(axis=0)
Out[139]: 1 loops, best of 3: 801 ms per loop
In [140]: %timeit F.sum(axis=1)
Out[140]: 1 loops, best of 3: 2.23 s per loop
In [141]: F = 0.0; C = 0.0  # memory cleanup
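In this case, summing along axis 0 performs better than summing along axis 1. With the Fortran-like layout, the elements that are adjacent along the first axis are stored next to each other in memory, which explains the relative performance advantage. However, overall the operations are much slower in absolute terms when compared to the C-like variant, although the timings are not strictly comparable because the two statements here operate on the complete F object rather than on a single slice such as C[0].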
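What "stored next to each other" means can be inspected on a small array through the flags and strides attributes. The sketch below uses a tiny 2 × 3 array of 8-byte floats chosen for illustration; the strides give the number of bytes to step in memory when moving by one position along each axis.

```python
import numpy as np

x = np.arange(6, dtype='f8').reshape(2, 3)
C = np.array(x, order='C')
F = np.array(x, order='F')

print(C.flags['C_CONTIGUOUS'], F.flags['F_CONTIGUOUS'])  # True True
print(C.strides)  # (24, 8): neighbors within a row are 8 bytes apart
print(F.strides)  # (8, 16): neighbors within a column are 8 bytes apart
print(np.array_equal(C, F))  # True: same values, different memory layout
```

An operation that walks along the axis with the smallest stride reads memory sequentially, which is the effect behind the timing differences discussed above.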
Conclusions
Python provides, in combination with NumPy, a rich set of flexible data structures. From a finance point of view, the following can be considered the most important ones:
- Basic data types
  In finance, the classes int, float, and string provide the atomic data types.
- Standard data structures
  The classes tuple, list, dict, and set have many application areas in finance, with list being the most flexible workhorse in general.
- Arrays
  A large class of finance-related problems and algorithms can be cast to an array setting; NumPy provides the specialized class numpy.ndarray, which provides both convenience and compactness of code as well as high performance.
This chapter shows that both the basic data structures and the NumPy ones allow for highly vectorized implementation of algorithms. Depending on the specific shape of the data structures, care should be taken with regard to the memory layout of arrays. Choosing the right approach here can speed up code execution by a factor of two or more.
Further Reading
This chapter focuses on those issues that might be of particular importance for finance algorithms and applications. However, it can only represent a starting point for the exploration of data structures and data modeling in Python. There are a number of valuable resources available to go deeper from here.
Here are some Internet resources to consult:
- The Python documentation is always a good starting point: http://www.python.org/doc/.
- For details on NumPy arrays as well as related methods and functions, see http://docs.scipy.org/doc/.
- The SciPy lecture notes are also a good source to get started: http://scipy-lectures.github.io/.
Good references in book form are:
- Goodrich, Michael et al. (2013): Data Structures and Algorithms in Python. John Wiley & Sons, Hoboken, NJ.
- Langtangen, Hans Petter (2009): A Primer on Scientific Programming with Python. Springer Verlag, Berlin, Heidelberg.
[18] The Cython library brings static typing and compiling features to Python that are comparable to those in C. In fact, Cython is a hybrid language of Python and C.
[19] Here and in the following discussion, terms like float, float object, etc. are used interchangeably, acknowledging that every float is also an object. The same holds true for other object types.
[21] It is not possible to go into details here, but there is a wealth of information available on the Internet about regular expressions in general and for Python in particular. For an introduction to this topic, refer to Fitzgerald, Michael (2012): Introducing Regular Expressions. O’Reilly, Sebastopol, CA.
[22] Cf. http://docs.scipy.org/doc/numpy/reference/ufuncs.html for an overview.