Chapter 1. Artificial Intelligence
This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
David Silver et al. (2016)
This chapter introduces general notions, ideas, and definitions from the field of artificial intelligence (AI) for the purposes of this book. It also provides worked-out examples of the major types of learning algorithms. In particular, “Algorithms” takes a broad perspective and categorizes types of data, types of learning, and types of problems typically encountered in an AI context. This chapter also presents examples for unsupervised and reinforcement learning. “Neural Networks” jumps right into the world of neural networks, which not only are central to what follows in later chapters of the book but also have proven to be among the most powerful algorithms AI has to offer nowadays. “Importance of Data” discusses the importance of data volume and variety in the context of AI.
Algorithms
This section introduces basic notions from the field of AI relevant to this book. It discusses the different types of data, learning, problems, and approaches that can be subsumed under the general term AI. Alpaydin (2016) provides an informal introduction to and overview of many of the topics covered only briefly in this section, along with many examples.
Types of Data
Data in general has two major components:
- Features: Features data (or input data) is data that is given as input to an algorithm. In a financial context, this might be, for example, the income and the savings of a potential debtor.
- Labels: Labels data (or output data) is data that is given as the relevant output to be learned, for example, by a supervised learning algorithm. In a financial context, this might be the creditworthiness of a potential debtor.
Types of Learning
There are three major types of learning algorithms:
- Supervised learning (SL): These are algorithms that learn from a given sample data set of features (input) and labels (output) values. The next section presents examples of such algorithms, like ordinary least-squares (OLS) regression and neural networks. The purpose of supervised learning is to learn the relationship between the input and output values. In finance, such algorithms might be trained to predict whether a potential debtor is creditworthy or not. For the purposes of this book, these are the most important types of algorithms.
- Unsupervised learning (UL): These are algorithms that learn from a given sample data set of features (input) values only, often with the goal of finding structure in the data. They are supposed to learn about the input data set, given, for example, some guiding parameters. Clustering algorithms fall into that category. In a financial context, such algorithms might cluster stocks into certain groups.
- Reinforcement learning (RL): These are algorithms that learn from trial and error by receiving a reward for taking an action. They update an optimal action policy according to what rewards and punishments they receive. Such algorithms are used, for example, in environments where actions need to be taken continuously and rewards are received immediately, such as in a computer game.
Because supervised learning is addressed in the subsequent section in some detail, brief examples will illustrate unsupervised learning and reinforcement learning.
Unsupervised Learning
Simply speaking, a k-means clustering algorithm sorts observations into clusters: each observation belongs to the cluster to which its mean (center) is nearest.
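To make the assignment step concrete before turning to the library, here is a minimal sketch with made-up centers and observations (a toy illustration, not the scikit-learn implementation):

import numpy as np

# Two fixed cluster centers and three observations in two dimensions.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
obs = np.array([[0.5, 1.0], [4.0, 5.5], [1.0, 0.0]])

# Euclidean distance of every observation to every center.
dist = np.linalg.norm(obs[:, None, :] - centers[None, :, :], axis=2)

# Each observation is assigned to the cluster whose center is nearest.
labels = dist.argmin(axis=1)
print(labels)  # [0 1 0]

The full k-means algorithm alternates such assignment steps with recomputing the centers as the means of the assigned observations; the scikit-learn code below takes care of this.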
The following Python code generates sample data for which the features data is clustered. Figure 1-1 visualizes the clustered sample data and also shows that the scikit-learn KMeans algorithm used here has identified the clusters perfectly. The coloring of the dots is based on what the algorithm has learned.1
In [1]: import numpy as np
        import pandas as pd
        from pylab import plt, mpl
        plt.style.use('seaborn')
        mpl.rcParams['savefig.dpi'] = 300
        mpl.rcParams['font.family'] = 'serif'
        np.set_printoptions(precision=4, suppress=True)

In [2]: from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

In [3]: x, y = make_blobs(n_samples=100, centers=4,
                          random_state=500, cluster_std=1.25)

In [4]: model = KMeans(n_clusters=4, random_state=0)

In [5]: model.fit(x)

Out[5]: KMeans(n_clusters=4, random_state=0)

In [6]: y_ = model.predict(x)

In [7]: y_

Out[7]: array([3, 3, 1, 2, 1, 1, 3, 2, 1, 2, 2, 3, 2, 0, 0, 3, 2, 0, 2, 0, 0, 3,
               1, 2, 1, 1, 0, 0, 1, 3, 2, 1, 1, 0, 1, 3, 1, 3, 2, 2, 2, 1, 0, 0,
               3, 1, 2, 0, 2, 0, 3, 0, 1, 0, 1, 3, 1, 2, 0, 3, 1, 0, 3, 2, 3, 0,
               1, 1, 1, 2, 3, 1, 2, 0, 2, 3, 2, 0, 2, 2, 1, 3, 1, 3, 2, 2, 3, 2,
               0, 0, 0, 3, 3, 3, 3, 0, 3, 1, 0, 0], dtype=int32)

In [8]: plt.figure(figsize=(10, 6))
        plt.scatter(x[:, 0], x[:, 1], c=y_, cmap='coolwarm');
A sample data set is created with clustered features data.
A KMeans model object is instantiated, fixing the number of clusters.
The model is fitted to the features data.
The predictions are generated given the fitted model.
The predictions are numbers from 0 to 3, each representing one cluster.
Once an algorithm such as KMeans is trained, it can, for instance, predict the cluster for a new (not yet seen) combination of features values. Assume that such an algorithm is trained on features data that describes potential and real debtors of a bank. It might learn about the creditworthiness of potential debtors by generating two clusters. New potential debtors can then be sorted into a certain cluster: “creditworthy” versus “not creditworthy.”
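As a minimal sketch of this idea, reusing the fitted model object from above (the new feature values are made up; which cluster number comes back depends on the fit):

# Two hypothetical, not yet seen combinations of the two features.
new_points = np.array([[1.0, 2.0], [-4.0, -4.0]])
model.predict(new_points)  # assigns each new point to one of the four clusters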
Reinforcement Learning
The following example is based on a coin tossing game that is played with a coin that lands on heads 80% of the time and on tails 20% of the time. The coin tossing game is heavily biased to emphasize the benefits of learning as compared to an uninformed baseline algorithm. The baseline algorithm, which randomly bets on heads and tails with equal probability, achieves a total reward of around 50, on average, per epoch of 100 bets played:
In [9]: ssp = [1, 1, 1, 1, 0]

In [10]: asp = [1, 0]

In [11]: def epoch():
             tr = 0
             for _ in range(100):
                 a = np.random.choice(asp)
                 s = np.random.choice(ssp)
                 if a == s:
                     tr += 1
             return tr

In [12]: rl = np.array([epoch() for _ in range(15)])
         rl

Out[12]: array([53, 55, 50, 48, 46, 41, 51, 49, 50, 52, 46, 47, 43, 51, 52])

In [13]: rl.mean()

Out[13]: 48.93333333333333
The state space (1 = heads, 0 = tails).
The action space (1 = bet on heads, 0 = bet on tails).
An action is randomly chosen from the action space.
A state is randomly chosen from the state space.
The total reward tr is increased by one if the bet is correct.
The game is played for a number of epochs; each epoch is 100 bets.
The average total reward of the epochs played is calculated.
Reinforcement learning tries to learn from what is observed after an action is taken, usually based on a reward. To keep things simple, the following learning algorithm only keeps track of the states that are observed in each round insofar as they are appended to the action space list object. In this way, the algorithm learns the bias in the game, though maybe not perfectly. By randomly sampling from the updated action space, the bias is reflected because naturally the bet will more often be heads. Over time, heads is chosen, on average, around 80% of the time. The average total reward of around 65 reflects the improvement of the learning algorithm as compared to the uninformed baseline algorithm:
In [14]: ssp = [1, 1, 1, 1, 0]

In [15]: def epoch():
             tr = 0
             asp = [0, 1]
             for _ in range(100):
                 a = np.random.choice(asp)
                 s = np.random.choice(ssp)
                 if a == s:
                     tr += 1
                 asp.append(s)
             return tr

In [16]: rl = np.array([epoch() for _ in range(15)])
         rl

Out[16]: array([64, 65, 77, 65, 54, 64, 71, 64, 57, 62, 69, 63, 61, 66, 75])

In [17]: rl.mean()

Out[17]: 65.13333333333334
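That the updated action space really mirrors the 80/20 bias of the game can be checked by counting how often the sampled action is heads; a small sketch along the lines of the epoch function above (the number of rounds is chosen arbitrarily):

# Count how often the bet is on heads (= 1) when the action space
# is updated with every observed state.
np.random.seed(0)
asp = [0, 1]
heads = 0
for _ in range(10000):
    a = np.random.choice(asp)
    s = np.random.choice(ssp)
    heads += a
    asp.append(s)
print(heads / 10000)  # approaches 0.8 as the action space fills up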
Types of Tasks
Depending on the type of labels data and the problem at hand, two types of tasks to be learned are important:
- Estimation: Estimation (or approximation, regression) refers to the cases in which the labels data is real-valued (continuous); that is, it is technically represented by floating point numbers.
- Classification: Classification refers to the cases in which the labels data consists of a finite number of classes or categories that are typically represented by discrete values (positive natural numbers), which in turn are represented technically as integers.
The following section provides examples for both types of tasks.
Types of Approaches
Some more definitions might be in order before finishing this section. This book follows the common differentiation between the following three major terms:
- Artificial intelligence (AI): AI encompasses all types of learning (algorithms), as defined before, and some more (for example, expert systems).
- Machine learning (ML): ML is the discipline of learning relationships and other information about given data sets based on an algorithm and a measure of success; such a measure might, for example, be the mean-squared error (MSE) given the labels values to be estimated and the predicted values from the algorithm (see the short sketch after this list). ML is a subset of AI.
- Deep learning (DL): DL encompasses all algorithms based on neural networks. The term deep is usually only used when the neural network has more than one hidden layer. DL is a subset of machine learning and therefore also a subset of AI.
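As a brief illustration of such a measure of success, the MSE mentioned in the ML definition can be written as

$$\text{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2$$

and coded in a single line (a generic sketch, independent of any particular algorithm):

def mse(y, y_pred):
    # Mean-squared error between the labels y and the predictions y_pred.
    return ((y - y_pred) ** 2).mean()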
DL has proven useful for a number of broad problem areas. It is suited for estimation and classification tasks, as well as for RL. In many cases, DL-based approaches perform better than alternative algorithms, such as logistic regression or kernel-based ones, like support vector machines.2 That is why this book mainly focuses on DL. DL approaches used include dense neural networks (DNNs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs). More details appear in later chapters, particularly in Part III.
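For orientation only, these three network types are instantiated along the following lines with Keras (a schematic sketch; the layer sizes and input shapes are arbitrary placeholders):

from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Conv1D

dnn = Sequential([Dense(32, activation='relu', input_dim=10)])  # dense layer
rnn = Sequential([SimpleRNN(32, input_shape=(5, 10))])  # recurrent layer
cnn = Sequential([Conv1D(32, 3, input_shape=(5, 10))])  # 1D convolutional layer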
Neural Networks
The previous sections provide a broad overview of algorithms in AI. This section shows how neural networks fit in. A simple example will illustrate what characterizes neural networks in comparison to traditional statistical methods, such as ordinary least-squares (OLS) regression. The example starts with mathematics and then uses linear regression for estimation (or function approximation) and finally applies neural networks to accomplish the estimation. The approach taken here is a supervised learning approach where the task is to estimate labels data based on features data. This section also illustrates the use of neural networks in the context of classification problems.
OLS Regression
Assume that a mathematical function is given as follows:
Such a function transforms an input value to an output value . Or it transforms a series of input values into a series of output values . The following Python code implements the mathematical function as a Python function and creates a number of input and output values. Figure 1-2 plots the output values against the input values:
In [18]: def f(x):
             return 2 * x ** 2 - x ** 3 / 3

In [19]: x = np.linspace(-2, 4, 25)
         x

Out[19]: array([-2.  , -1.75, -1.5 , -1.25, -1.  , -0.75, -0.5 , -0.25,  0.  ,
                 0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ,  1.75,  2.  ,  2.25,
                 2.5 ,  2.75,  3.  ,  3.25,  3.5 ,  3.75,  4.  ])

In [20]: y = f(x)
         y

Out[20]: array([10.6667,  7.9115,  5.625 ,  3.776 ,  2.3333,  1.2656,  0.5417,
                 0.1302,  0.    ,  0.1198,  0.4583,  0.9844,  1.6667,  2.474 ,
                 3.375 ,  4.3385,  5.3333,  6.3281,  7.2917,  8.1927,  9.    ,
                 9.6823, 10.2083, 10.5469, 10.6667])

In [21]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro');
Whereas in the mathematical example the function comes first, the input data second, and the output data third, the sequence is different in statistical learning. Assume that the previous input values $x$ and output values $y$ are given. They represent the sample (data). The problem in statistical regression is to find a function $\hat{f}(x)$ that approximates the functional relationship between the input values (also called the independent values) and the output values (also called the dependent values) as well as possible.

Assume simple OLS linear regression. In this case, the functional relationship between the input and output values is assumed to be linear, and the problem is to find optimal parameters $\alpha$ and $\beta$ for the following linear equation:

$$\hat{f}(x) = \alpha + \beta x$$

For given input values $x_1, x_2, \ldots, x_N$ and output values $y_1, y_2, \ldots, y_N$, optimal in this case means that the parameters minimize the mean squared error (MSE) between the real output values and the approximated output values:

$$\min_{\alpha, \beta} \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{f}(x_n) \right)^2$$

For the case of simple linear regression, the solution is known in closed form, as shown in the following equations. Bars on the variables indicate sample mean values:

$$\beta = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}, \qquad \alpha = \bar{y} - \beta \bar{x}$$
The following Python code calculates the optimal parameter values, linearly estimates (approximates) the output values, and plots the linear regression line alongside the sample data (see Figure 1-3). The linear regression approach does not work too well here in approximating the functional relationship. This is confirmed by the relatively high MSE value:
In [22]: beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
         beta

Out[22]: 1.0541666666666667

In [23]: alpha = y.mean() - beta * x.mean()
         alpha

Out[23]: 3.8625000000000003

In [24]: y_ = alpha + beta * x

In [25]: MSE = ((y - y_) ** 2).mean()
         MSE

Out[25]: 10.721953125

In [26]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         plt.plot(x, y_, lw=3.0, label='linear regression')
         plt.legend();
Calculation of optimal $\beta$.
Calculation of optimal $\alpha$.
Calculation of the estimated output values $\hat{y}$.
Calculation of the MSE given the approximation.
How can the MSE value be improved (decreased)—maybe even to 0, that is, to a “perfect estimation?” Of course, OLS regression is not constrained to a simple linear relationship. In addition to the constant and linear terms, higher order monomials, for instance, can be easily added as basis functions. To this end, compare the regression results shown in Figure 1-4 and the following code that creates the figure. The improvements that come from using quadratic and cubic monomials as basis functions are obvious and also are numerically confirmed by the calculated MSE values. For basis functions up to and including the cubic monomial, the estimation is perfect, and the functional relationship is perfectly recovered:
In [27]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for deg in [1, 2, 3]:
             reg = np.polyfit(x, y, deg=deg)
             y_ = np.polyval(reg, x)
             MSE = ((y - y_) ** 2).mean()
             print(f'deg={deg} | MSE={MSE:.5f}')
             plt.plot(x, np.polyval(reg, x), label=f'deg={deg}')
         plt.legend();

         deg=1 | MSE=10.72195
         deg=2 | MSE=2.31258
         deg=3 | MSE=0.00000

In [28]: reg

Out[28]: array([-0.3333,  2.    ,  0.    , -0.    ])
Exploiting the knowledge of the form of the mathematical function to be approximated and accordingly adding more basis functions to the regression leads to a “perfect approximation.” That is, the OLS regression recovers the exact factors of the quadratic and cubic part, respectively, of the original function.
Estimation with Neural Networks
However, not all relationships are of this kind. This is where, for instance, neural networks can help. Without going into the details, neural networks can approximate a wide range of functional relationships. Knowledge of the form of the relationship is generally not required.
Scikit-learn
The following Python code uses the MLPRegressor class of scikit-learn, which implements a DNN for estimation. DNNs are sometimes also called multi-layer perceptrons (MLPs).3 The results are not perfect, as Figure 1-5 and the MSE illustrate. However, they are quite good already for the simple configuration used:
In [29]: from sklearn.neural_network import MLPRegressor

In [30]: model = MLPRegressor(hidden_layer_sizes=3 * [256],
                              learning_rate_init=0.03,
                              max_iter=5000)

In [31]: model.fit(x.reshape(-1, 1), y)

Out[31]: MLPRegressor(hidden_layer_sizes=[256, 256, 256],
                      learning_rate_init=0.03, max_iter=5000)

In [32]: y_ = model.predict(x.reshape(-1, 1))

In [33]: MSE = ((y - y_) ** 2).mean()
         MSE

Out[33]: 0.021662355744355866

In [34]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         plt.plot(x, y_, lw=3.0, label='dnn estimation')
         plt.legend();
Instantiates the MLPRegressor object.
Implements the fitting or learning step.
Implements the prediction step.
Just having a look at the results in Figure 1-4 and Figure 1-5, one might assume that the methods and approaches are not too dissimilar after all. However, there is a fundamental difference worth highlighting. Although the OLS regression approach, as shown explicitly for the simple linear regression, is based on the calculation of certain well-specified quantities and parameters, the neural network approach relies on incremental learning. This in turn means that a set of parameters, the weights within the neural network, are first initialized randomly and then adjusted gradually given the differences between the neural network output and the sample output values. This approach lets you retrain (update) a neural network incrementally.
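A toy example may make this incremental logic concrete. The following sketch fits a single weight and bias by plain gradient descent on the MSE; it illustrates the principle only and is not what Keras or scikit-learn do internally in detail:

np.random.seed(1)
w, b = np.random.randn(), np.random.randn()  # random initialization
eta = 0.01  # learning rate

x_t = np.linspace(-1, 1, 20)  # toy features data
y_t = 2 * x_t  # toy labels data for the relationship y = 2x

for _ in range(2000):
    y_hat = w * x_t + b  # current output given the parameters
    grad_w = (2 * (y_hat - y_t) * x_t).mean()  # MSE gradient for w
    grad_b = (2 * (y_hat - y_t)).mean()  # MSE gradient for b
    w -= eta * grad_w  # incremental weight update
    b -= eta * grad_b  # incremental bias update
print(round(w, 4), round(b, 4))  # close to 2.0 and 0.0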
Keras
The next example uses a sequential model with the Keras deep learning package.4 The model is fitted, or trained, for 100 epochs. The procedure is repeated for five rounds. After every such round, the approximation by the neural network is updated and plotted. Figure 1-6 shows how the approximation gradually improves with every round. This is also reflected in the decreasing MSE values. The end result is not perfect, but again, it is quite good given the simplicity of the model:
In [35]: import tensorflow as tf
         tf.random.set_seed(100)

In [36]: from keras.layers import Dense
         from keras.models import Sequential

         Using TensorFlow backend.

In [37]: model = Sequential()
         model.add(Dense(256, activation='relu', input_dim=1))
         model.add(Dense(1, activation='linear'))
         model.compile(loss='mse', optimizer='rmsprop')

In [38]: ((y - y_) ** 2).mean()

Out[38]: 0.021662355744355866

In [39]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for _ in range(1, 6):
             model.fit(x, y, epochs=100, verbose=False)
             y_ = model.predict(x)
             MSE = ((y - y_.flatten()) ** 2).mean()
             print(f'round={_} | MSE={MSE:.5f}')
             plt.plot(x, y_, '--', label=f'round={_}')
         plt.legend();

         round=1 | MSE=3.09714
         round=2 | MSE=0.75603
         round=3 | MSE=0.22814
         round=4 | MSE=0.11861
         round=5 | MSE=0.09029
Instantiates the Sequential model object.
Adds a densely connected hidden layer with rectified linear unit (ReLU) activation.5
Adds the output layer with linear activation.
Compiles the model for usage.
Trains the neural network for a fixed number of epochs.
Implements the approximation step.
Calculates the current MSE.
Plots the current approximation results.
Roughly speaking, one can say that the neural network does almost as well in the estimation as the OLS regression, which delivers a perfect result. So why use neural networks at all? A more comprehensive answer might need to come later in this book, but a somewhat different example might give some hint.
Consider, instead of the previous sample data set generated from a well-defined mathematical function, a random sample data set for which both features and labels are randomly chosen. Of course, this example is for illustration and does not allow for a deep interpretation.
The following code generates the random sample data set and creates the OLS regression estimation based on a varying number of monomial basis functions. Figure 1-7 visualizes the results. Even for the highest number of monomials in the example, the estimation results are still not too good. The MSE value is accordingly relatively high:
In [40]: np.random.seed(0)
         x = np.linspace(-1, 1)
         y = np.random.random(len(x)) * 2 - 1

In [41]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for deg in [1, 5, 9, 11, 13, 15]:
             reg = np.polyfit(x, y, deg=deg)
             y_ = np.polyval(reg, x)
             MSE = ((y - y_) ** 2).mean()
             print(f'deg={deg:2d} | MSE={MSE:.5f}')
             plt.plot(x, np.polyval(reg, x), label=f'deg={deg}')
         plt.legend();

         deg= 1 | MSE=0.28153
         deg= 5 | MSE=0.27331
         deg= 9 | MSE=0.25442
         deg=11 | MSE=0.23458
         deg=13 | MSE=0.22989
         deg=15 | MSE=0.21672
The results for the OLS regression are not too surprising. OLS regression in this case assumes that the approximation can be achieved through an appropriate combination of a finite number of basis functions. Since the sample data set has been generated randomly, the OLS regression does not perform well in this case.
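Formally, OLS regression with monomials assumes that the approximation can be written as a linear combination of basis functions

$$\hat{f}(x) = \sum_{k=0}^{D} \alpha_k x^k$$

for some maximum degree $D$, with the coefficients $\alpha_k$ chosen to minimize the MSE. For purely random labels, as the output above shows, adding more such basis functions reduces the MSE only marginally.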
What about neural networks? The application is as straightforward as before and yields estimations as shown in Figure 1-8. While the end result is not perfect, it is obvious that the neural network performs better than the OLS regression in estimating the random label values from the random features values. Given its architecture, however, the neural network has almost 200,000 trainable parameters (weights), which offers relatively high flexibility, particularly when compared to the OLS regression, for which a maximum of 15 + 1 parameters are used:
In [42]: model = Sequential()
         model.add(Dense(256, activation='relu', input_dim=1))
         for _ in range(3):
             model.add(Dense(256, activation='relu'))
         model.add(Dense(1, activation='linear'))
         model.compile(loss='mse', optimizer='rmsprop')

In [43]: model.summary()

         Model: "sequential_2"
         _________________________________________________________________
         Layer (type)                 Output Shape              Param #
         =================================================================
         dense_3 (Dense)              (None, 256)               512
         _________________________________________________________________
         dense_4 (Dense)              (None, 256)               65792
         _________________________________________________________________
         dense_5 (Dense)              (None, 256)               65792
         _________________________________________________________________
         dense_6 (Dense)              (None, 256)               65792
         _________________________________________________________________
         dense_7 (Dense)              (None, 1)                 257
         =================================================================
         Total params: 198,145
         Trainable params: 198,145
         Non-trainable params: 0
         _________________________________________________________________

In [44]: %%time
         plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for _ in range(1, 8):
             model.fit(x, y, epochs=500, verbose=False)
             y_ = model.predict(x)
             MSE = ((y - y_.flatten()) ** 2).mean()
             print(f'round={_} | MSE={MSE:.5f}')
             plt.plot(x, y_, '--', label=f'round={_}')
         plt.legend();

         round=1 | MSE=0.13560
         round=2 | MSE=0.08337
         round=3 | MSE=0.06281
         round=4 | MSE=0.04419
         round=5 | MSE=0.03329
         round=6 | MSE=0.07676
         round=7 | MSE=0.00431
         CPU times: user 30.4 s, sys: 4.7 s, total: 35.1 s
         Wall time: 13.6 s
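The parameter counts in the summary follow from simple arithmetic for densely connected layers, namely inputs times units plus one bias per unit; a quick check:

first = 1 * 256 + 256  # input to first hidden layer: 512
hidden = 256 * 256 + 256  # each further hidden layer: 65,792
output = 256 * 1 + 1  # output layer: 257
print(first + 3 * hidden + output)  # 198,145 trainable parameters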
Classification with Neural Networks
Another benefit of neural networks is that they can be easily used for classification tasks as well. Consider the following Python code that implements a classification using a neural network based on Keras. The binary features data and labels data are generated randomly. The major adjustment to be made modeling-wise is to change the activation function of the output layer from linear to sigmoid. More details on this appear in later chapters. The classification is not perfect. However, it reaches a high level of accuracy. How the accuracy, expressed as the ratio of correct results to all label values, changes with the number of training epochs is shown in Figure 1-9. The accuracy starts out low and then improves step-wise, though not necessarily with every step:
In [45]: f = 5
         n = 10

In [46]: np.random.seed(100)

In [47]: x = np.random.randint(0, 2, (n, f))
         x

Out[47]: array([[0, 0, 1, 1, 1],
                [1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [0, 1, 0, 0, 0],
                [1, 1, 1, 0, 0],
                [1, 0, 0, 1, 1],
                [1, 1, 1, 0, 0],
                [1, 1, 1, 1, 1],
                [1, 1, 1, 0, 1]])

In [48]: y = np.random.randint(0, 2, n)
         y

Out[48]: array([1, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [49]: model = Sequential()
         model.add(Dense(256, activation='relu', input_dim=f))
         model.add(Dense(1, activation='sigmoid'))
         model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                       metrics=['acc'])

In [50]: h = model.fit(x, y, epochs=50, verbose=False)

Out[50]: <keras.callbacks.callbacks.History at 0x7fde09dd1cd0>

In [51]: y_ = np.where(model.predict(x).flatten() > 0.5, 1, 0)
         y_

Out[51]: array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1], dtype=int32)

In [52]: y == y_

Out[52]: array([ True,  True,  True,  True, False,  True,  True,  True,  True,
                 True])

In [53]: res = pd.DataFrame(h.history)

In [54]: res.plot(figsize=(10, 6));
Creates random features data.
Creates random labels data.
Defines the activation function for the output layer as sigmoid.
Defines the loss function to be binary_crossentropy.6
Compares the predicted values with the labels data.
Plots the loss function and accuracy values for every training step.
The examples in this section illustrate some fundamental characteristics of neural networks as compared to OLS regression:
- Problem-agnostic: The neural network approach is agnostic when it comes to estimating and classifying label values, given a set of feature values. Statistical methods, such as OLS regression, might perform well for a smaller set of problems, but not too well or not at all for others.
- Incremental learning: The optimal weights within a neural network, given a target measure of success, are learned incrementally based on a random initialization and incremental improvements. These incremental improvements are achieved by considering the differences between the predicted values and the sample label values and backpropagating weight updates through the neural network.
- Universal approximation: There are strong mathematical theorems showing that neural networks (even with one hidden layer only) can approximate almost any function.7
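One classical formulation of such a theorem, stated informally here for orientation (the precise assumptions vary across the literature): for every continuous function $f$ on a compact set $K \subset \mathbb{R}^d$, every nonconstant, bounded, continuous activation function $\sigma$, and every $\epsilon > 0$, there are a number $N$ and parameters $v_i, b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^d$ such that

$$\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\left(w_i^\top x + b_i\right) \right| < \epsilon$$

In other words, a single hidden layer with enough units can approximate $f$ uniformly to any desired accuracy.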
These characteristics might justify why this book puts neural networks at the core with regard to the algorithms used. Chapter 2 discusses more good reasons.
Importance of Data
The example at the end of the previous section shows that neural networks are capable of solving classification problems quite well. The neural network with one hidden layer reaches a high degree of accuracy on the given data set, or in-sample. However, what about the predictive power of a neural network? This hinges significantly on the volume and variety of the data available to train the neural network. Another numerical example, based on larger data sets, will illustrate this point.
Small Data Set
Consider a random sample data set similar to the one used before in the classification example, but with more features and more samples. Most algorithms used in AI are about pattern recognition. In the following Python code, the number of binary features defines the number of possible patterns about which the algorithm can learn something. Given that the labels data is also binary, the algorithm tries to learn whether a 0 or 1 is more likely given a certain pattern, say [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]. Because all numbers are randomly chosen with equal probability, there is not that much to learn beyond the fact that the labels 0 and 1 are equally likely no matter what (random) pattern is observed. Therefore, a baseline prediction algorithm should be accurate about 50% of the time, no matter what (random) pattern it is presented with:
In [55]: f = 10
         n = 250

In [56]: np.random.seed(100)

In [57]: x = np.random.randint(0, 2, (n, f))
         x[:4]

Out[57]: array([[0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
                [0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
                [1, 0, 0, 1, 1, 1, 1, 1, 0, 0]])

In [58]: y = np.random.randint(0, 2, n)
         y[:4]

Out[58]: array([0, 1, 0, 0])

In [59]: 2 ** f

Out[59]: 1024
In order to proceed, the raw data is put into a pandas DataFrame object, which simplifies certain operations and analyses:
In [60]: fcols = [f'f{_}' for _ in range(f)]
         fcols

Out[60]: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9']

In [61]: data = pd.DataFrame(x, columns=fcols)
         data['l'] = y

In [62]: data.info()

         <class 'pandas.core.frame.DataFrame'>
         RangeIndex: 250 entries, 0 to 249
         Data columns (total 11 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   f0      250 non-null    int64
          1   f1      250 non-null    int64
          2   f2      250 non-null    int64
          3   f3      250 non-null    int64
          4   f4      250 non-null    int64
          5   f5      250 non-null    int64
          6   f6      250 non-null    int64
          7   f7      250 non-null    int64
          8   f8      250 non-null    int64
          9   f9      250 non-null    int64
          10  l       250 non-null    int64
         dtypes: int64(11)
         memory usage: 21.6 KB
Defines column names for the features data.
Puts the features data into a DataFrame object.
Puts the labels data into the same DataFrame object.
Shows the meta information for the data set.
Two major problems can be identified given the results from executing the following Python code. First, not all patterns are in the sample data set. Second, the sample size is much too small per observed pattern. Even without digging deeper, it is clear that no classification algorithm can really learn about all the possible patterns in a meaningful way:
In [63]: grouped = data.groupby(list(data.columns))

In [64]: freq = grouped['l'].size().unstack(fill_value=0)

In [65]: freq['sum'] = freq[0] + freq[1]

In [66]: freq.head(10)

Out[66]: l                               0  1  sum
         f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
         0  0  0  0  0  0  0  1  1  1   0  1    1
                           1  0  1  0   1  1    2
                                    1   0  1    1
         ...

In [67]: freq['sum'].describe().astype(int)

Out[67]: count    227
         mean       1
         std        0
         min        1
         25%        1
         50%        1
         75%        1
         max        2
         Name: sum, dtype: int64
Groups the data along all columns.
Unstacks the grouped data for the labels column.
Adds up the frequencies for a 0 and a 1.
Shows the frequencies for a 0 and a 1 given a certain pattern.
Provides statistics for the sum of the frequencies.
The following Python code uses the MLPClassifier model from scikit-learn.8 The model is trained on the whole data set. What about the ability of a neural network to learn about the relationships within a given data set? The ability is pretty high, as the in-sample accuracy score shows. It is in fact above 95%, a result driven to a large extent by the relatively high neural network capacity given the relatively small data set:
In [68]: from sklearn.neural_network import MLPClassifier
         from sklearn.metrics import accuracy_score

In [69]: model = MLPClassifier(hidden_layer_sizes=[128, 128, 128],
                               max_iter=1000, random_state=100)

In [70]: model.fit(data[fcols], data['l'])

Out[70]: MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
                       random_state=100)

In [71]: accuracy_score(data['l'], model.predict(data[fcols]))

Out[71]: 0.952
But what about the predictive power of a trained neural network? To this end, the given data set can be split into a training and a test data sub-set. The model is trained on the training data sub-set only and then tested with regard to its predictive power on the test data set. As before, the accuracy of the trained neural network is pretty high in-sample (that is, on the training data set). However, it is more than 10 percentage points worse than an uninformed baseline algorithm on the test data set:
In [72]: split = int(len(data) * 0.7)

In [73]: train = data[:split]
         test = data[split:]

In [74]: model.fit(train[fcols], train['l'])

Out[74]: MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
                       random_state=100)

In [75]: accuracy_score(train['l'], model.predict(train[fcols]))

Out[75]: 0.9714285714285714

In [76]: accuracy_score(test['l'], model.predict(test[fcols]))

Out[76]: 0.38666666666666666
Splits the data into train and test data subsets.
Trains the model on the training data set only.
Reports the accuracy in-sample (training data set).
Reports the accuracy out-of-sample (test data set).
Roughly speaking, the neural network, trained on a small data set only, learns wrong relationships due to the two major problem areas identified before. The problems are not really relevant in the context of learning relationships in-sample. On the contrary, the smaller a data set is, the more easily in-sample relationships can be learned in general. However, the problem areas are highly relevant when using the trained neural network to generate predictions out-of-sample.
Larger Data Set
Fortunately, there is often a clear way out of this problematic situation: a larger data set. In the face of real-world problems, this theoretical insight is just as valid. From a practical point of view, though, such larger data sets are not always available, nor can they often be generated so easily. However, in the context of the example of this section, a larger data set is indeed easily created.
The following Python code increases the number of samples in the initial sample data set significantly. The result is that the prediction accuracy of the trained neural network increases by more than 10 percentage points, to a level of about 50%, which is to be expected given the nature of the labels data. It is now in line with an uninformed baseline algorithm:
In [77]: factor = 50

In [78]: big = pd.DataFrame(np.random.randint(0, 2, (factor * n, f)),
                            columns=fcols)

In [79]: big['l'] = np.random.randint(0, 2, factor * n)

In [80]: train = big[:split]
         test = big[split:]

In [81]: model.fit(train[fcols], train['l'])

Out[81]: MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
                       random_state=100)

In [82]: accuracy_score(train['l'], model.predict(train[fcols]))

Out[82]: 0.9657142857142857

In [83]: accuracy_score(test['l'], model.predict(test[fcols]))

Out[83]: 0.5043407707910751
A quick analysis of the available data, as shown next, explains the increase in the prediction accuracy. First, all possible patterns are now represented in the data set. Second, all patterns have an average frequency of above 10 in the data set. In other words, the neural network sees basically all the patterns multiple times. This allows the neural network to “learn” that both labels 0 and 1 are equally likely for all possible patterns. Of course, it is a rather involved way of learning this, but it is a good illustration of the fact that a relatively small data set might often be too small in the context of neural networks:
In [84]: grouped = big.groupby(list(data.columns))

In [85]: freq = grouped['l'].size().unstack(fill_value=0)

In [86]: freq['sum'] = freq[0] + freq[1]

In [87]: freq.head(6)

Out[87]: l                               0  1  sum
         f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
         0  0  0  0  0  0  0  0  0  0  10  9   19
                                    1   5  4    9
                                 1  0   2  5    7
                                    1   6  6   12
                              1  0  0   9  8   17
                                    1   7  4   11

In [88]: freq['sum'].describe().astype(int)

Out[88]: count    1024
         mean       12
         std         3
         min         2
         25%        10
         50%        12
         75%        15
         max        26
         Name: sum, dtype: int64
Volume and Variety
In the context of neural networks that perform prediction tasks, the volume and variety of the available data used to train the neural network are decisive for its prediction performance. The numerical, hypothetical examples in this section show that the same neural network trained on a relatively small and not-as-varied data set underperforms its counterpart trained on a relatively large and varied data set by more than 10 percentage points. This difference can be considered huge given that AI practitioners and companies often fight for improvements as small as a tenth of a percentage point.
Big Data
What is the difference between a larger data set and a big data set? The term big data has been used for more than a decade now to mean a number of things. For the purposes of this book, one might say that a big data set is large enough—in terms of volume, variety, and also maybe velocity—for an AI algorithm to be trained properly such that the algorithm performs better at a prediction task as compared to a baseline algorithm.
The larger data set used before is still small in practical terms. However, it is large enough to accomplish the specified goal. The required volume and variety of the data set are mainly driven by the structure and characteristics of the features and labels data.
In this context, assume that a retail bank implements a neural network–based classification approach for credit scoring. Given in-house data, the responsible data scientist designs 25 categorical features, every one of which can take on 8 different values. The resulting number of patterns is astronomically large:
In [89]: 8 ** 25

Out[89]: 37778931862957161709568
It is clear that no single data set can provide a neural network with exposure to every single one of these patterns.9 Fortunately, in practice this is not necessary for the neural network to learn about the creditworthiness based on data for regular, defaulting, and/or rejected debtors. It is also not necessary in general to generate “good” predictions with regard to the creditworthiness of every potential debtor.
This is due to a number of reasons. To name only a few: first, not every pattern will be relevant in practice—some patterns might simply not exist, might be impossible, and so forth. Second, not all features might be equally important, reducing the number of relevant features and thereby the number of possible patterns. Third, a value of 4 or 5 for feature number 7, say, might not make a difference at all, further reducing the number of relevant patterns.
Conclusions
For this book, artificial intelligence, or AI, encompasses methods, techniques, algorithms, and so on that are able to learn relationships, rules, probabilities, and more from data. The focus lies on supervised learning algorithms, such as those for estimation and classification. With regard to algorithms, neural networks and deep learning approaches are at the core.
The central theme of this book is the application of neural networks to one of the core problems in finance: the prediction of future market movements. More specifically, the problem might be to predict the direction of movement for a stock index or the exchange rate for a currency pair. The prediction of the future market direction (that is, whether a target level or price goes up or down) is a problem that can be easily cast into a classification setting.
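As a minimal sketch of this casting (the price numbers are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical closing prices of some financial instrument.
prices = pd.Series([100.0, 101.5, 101.0, 102.3, 101.8])
returns = np.log(prices / prices.shift(1))  # log returns
direction = np.where(returns > 0, 1, 0)  # 1 = up, 0 = down (binary labels)
print(direction[1:])  # [1 0 1 0]

Labels of this kind, combined with suitable features data, put the market prediction problem in exactly the supervised classification setting used throughout this chapter.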
Before diving deeper into the core theme itself, the next chapter first discusses selected topics related to what is called superintelligence and technological singularity. That discussion will provide useful background for the chapters that follow, which focus on finance and the application of AI to the financial domain.
References
Books and papers cited in this chapter:
1 For details, see sklearn.cluster.KMeans and VanderPlas (2017, ch. 5).
2 For details, see VanderPlas (2017, ch. 5).
3 For details, see sklearn.neural_network.MLPRegressor. For more background, see Goodfellow et al. (2016, ch. 6).
4 For details, see Chollet (2017, ch. 3).
5 For details on activation functions with Keras, see https://keras.io/activations.
6 The loss function calculates the prediction error of the neural network (or other ML algorithms). Binary cross entropy is an appropriate loss function for binary classification problems, while the mean squared error (MSE) is, for example, appropriate for estimation problems. For details on loss functions with Keras, see https://keras.io/losses.
7 See, for example, Kratsios (2019).
8 For details, see sklearn.neural_network.MLPClassifier.
9 Nor would current compute technology allow one to model and train a neural network based on such a data set if it were available. In this context, the next chapter discusses the importance of hardware for AI.