Chapter 1. Artificial Intelligence
This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
David Silver et al. (2016)
This chapter introduces general notions, ideas, and definitions from the field of artificial intelligence (AI) for the purposes of this book. It also provides worked-out examples of the major types of learning algorithms. In particular, “Algorithms” takes a broad perspective and categorizes types of data, types of learning, and types of problems typically encountered in an AI context. This chapter also presents examples for unsupervised and reinforcement learning. “Neural Networks” jumps right into the world of neural networks, which not only are central to what follows in later chapters of the book but also have proven to be among the most powerful algorithms AI has to offer nowadays. “Importance of Data” discusses the importance of data volume and variety in the context of AI.
Algorithms
This section introduces basic notions from the field of AI relevant to this book. It discusses the different types of data, learning, problems, and approaches that can be subsumed under the general term AI. Alpaydin (2016) provides an informal introduction to and overview of many of the topics covered only briefly in this section, along with many examples.
Types of Data
Data in general has two major components:
- Features: Features data (or input data) is data that is given as input to an algorithm. In a financial context, this might be, for example, the income and the savings of a potential debtor.
- Labels: Labels data (or output data) is data that is given as the relevant output to be learned, for example, by a supervised learning algorithm. In a financial context, this might be the creditworthiness of a potential debtor.
Types of Learning
There are three major types of learning algorithms:
- Supervised learning (SL): These are algorithms that learn from a given sample data set of features (input) and labels (output) values. The next section presents examples of such algorithms, like ordinary least-squares (OLS) regression and neural networks. The purpose of supervised learning is to learn the relationship between the input and output values. In finance, such algorithms might be trained to predict whether a potential debtor is creditworthy or not. For the purposes of this book, these are the most important types of algorithms.
- Unsupervised learning (UL): These are algorithms that learn from a given sample data set of features (input) values only, often with the goal of finding structure in the data. They are supposed to learn about the input data set, given, for example, some guiding parameters. Clustering algorithms fall into that category. In a financial context, such algorithms might cluster stocks into certain groups.
- Reinforcement learning (RL): These are algorithms that learn from trial and error by receiving a reward for taking an action. They update an optimal action policy according to what rewards and punishments they receive. Such algorithms are used, for example, in environments where actions need to be taken continuously and rewards are received immediately, such as in a computer game.
Because supervised learning is addressed in the subsequent section in some detail, brief examples will illustrate unsupervised learning and reinforcement learning.
Unsupervised Learning
Simply speaking, a k-means clustering algorithm sorts observations into clusters: each observation belongs to the cluster to which its mean (center) is nearest.
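To make the assignment step concrete before turning to the library, here is a minimal sketch with made-up centers and observations (a toy illustration, not the scikit-learn implementation):

import numpy as np

# Two fixed cluster centers and three observations in two dimensions.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
obs = np.array([[0.5, 1.0], [4.0, 5.5], [1.0, 0.0]])

# Euclidean distance of every observation to every center.
dist = np.linalg.norm(obs[:, None, :] - centers[None, :, :], axis=2)

# Each observation is assigned to the cluster whose center is nearest.
labels = dist.argmin(axis=1)
print(labels)  # [0 1 0]

The full k-means algorithm alternates such assignment steps with recomputing the centers as the means of the assigned observations; the scikit-learn code below takes care of this.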
The following Python code generates sample data for which the features data is clustered. Figure 1-1 visualizes the clustered sample data and also shows that the scikit-learn KMeans algorithm used here has identified the clusters perfectly. The coloring of the dots is based on what the algorithm has learned.1
In [1]: import numpy as np
        import pandas as pd
        from pylab import plt, mpl
        plt.style.use('seaborn')
        mpl.rcParams['savefig.dpi'] = 300
        mpl.rcParams['font.family'] = 'serif'
        np.set_printoptions(precision=4, suppress=True)

In [2]: from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

In [3]: x, y = make_blobs(n_samples=100, centers=4,
                          random_state=500, cluster_std=1.25)

In [4]: model = KMeans(n_clusters=4, random_state=0)

In [5]: model.fit(x)

Out[5]: KMeans(n_clusters=4, random_state=0)

In [6]: y_ = model.predict(x)

In [7]: y_

Out[7]: array([3, 3, 1, 2, 1, 1, 3, 2, 1, 2, 2, 3, 2, 0, 0, 3, 2, 0, 2, 0, 0, 3,
               1, 2, 1, 1, 0, 0, 1, 3, 2, 1, 1, 0, 1, 3, 1, 3, 2, 2, 2, 1, 0, 0,
               3, 1, 2, 0, 2, 0, 3, 0, 1, 0, 1, 3, 1, 2, 0, 3, 1, 0, 3, 2, 3, 0,
               1, 1, 1, 2, 3, 1, 2, 0, 2, 3, 2, 0, 2, 2, 1, 3, 1, 3, 2, 2, 3, 2,
               0, 0, 0, 3, 3, 3, 3, 0, 3, 1, 0, 0], dtype=int32)

In [8]: plt.figure(figsize=(10, 6))
        plt.scatter(x[:, 0], x[:, 1], c=y_, cmap='coolwarm');
A sample data set is created with clustered features data.
A KMeans model object is instantiated, fixing the number of clusters.
The model is fitted to the features data.
The predictions are generated given the fitted model.
The predictions are numbers from 0 to 3, each representing one cluster.
Once an algorithm such as KMeans is trained, it can, for instance, predict the cluster for a new (not yet seen) combination of features values. Assume that such an algorithm is trained on features data that describes potential and real debtors of a bank. It might learn about the creditworthiness of potential debtors by generating two clusters. New potential debtors can then be sorted into a certain cluster: “creditworthy” versus “not creditworthy.”
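As a minimal sketch of this idea, reusing the fitted model object from above (the new feature values are made up; which cluster number comes back depends on the fit):

# Two hypothetical, not yet seen combinations of the two features.
new_points = np.array([[1.0, 2.0], [-4.0, -4.0]])
model.predict(new_points)  # assigns each new point to one of the four clusters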
Reinforcement Learning
The following example is based on a coin tossing game that is played with a coin that lands on heads 80% of the time and on tails 20% of the time. The coin tossing game is heavily biased to emphasize the benefits of learning as compared to an uninformed baseline algorithm. The baseline algorithm, which randomly bets on heads and tails with equal probability, achieves a total reward of around 50, on average, per epoch of 100 bets played:
In [9]: ssp = [1, 1, 1, 1, 0]

In [10]: asp = [1, 0]

In [11]: def epoch():
             tr = 0
             for _ in range(100):
                 a = np.random.choice(asp)
                 s = np.random.choice(ssp)
                 if a == s:
                     tr += 1
             return tr

In [12]: rl = np.array([epoch() for _ in range(15)])
         rl

Out[12]: array([53, 55, 50, 48, 46, 41, 51, 49, 50, 52, 46, 47, 43, 51, 52])

In [13]: rl.mean()

Out[13]: 48.93333333333333
The state space (1 = heads, 0 = tails).
The action space (1 = bet on heads, 0 = bet on tails).
An action is randomly chosen from the action space.
A state is randomly chosen from the state space.
The total reward tr is increased by one if the bet is correct.
The game is played for a number of epochs; each epoch is 100 bets.
The average total reward of the epochs played is calculated.
Reinforcement learning tries to learn from what is observed after an action is taken, usually based on a reward. To keep things simple, the following learning algorithm only keeps track of the states that are observed in each round insofar as they are appended to the action space list object. In this way, the algorithm learns the bias in the game, though maybe not perfectly. By randomly sampling from the updated action space, the bias is reflected because naturally the bet will more often be heads. Over time, heads is chosen, on average, around 80% of the time. The average total reward of around 65 reflects the improvement of the learning algorithm as compared to the uninformed baseline algorithm:
In [14]: ssp = [1, 1, 1, 1, 0]

In [15]: def epoch():
             tr = 0
             asp = [0, 1]
             for _ in range(100):
                 a = np.random.choice(asp)
                 s = np.random.choice(ssp)
                 if a == s:
                     tr += 1
                 asp.append(s)
             return tr

In [16]: rl = np.array([epoch() for _ in range(15)])
         rl

Out[16]: array([64, 65, 77, 65, 54, 64, 71, 64, 57, 62, 69, 63, 61, 66, 75])

In [17]: rl.mean()

Out[17]: 65.13333333333334
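That the updated action space really mirrors the 80/20 bias of the game can be checked by counting how often the sampled action is heads; a small sketch along the lines of the epoch function above (the number of rounds is chosen arbitrarily):

# Count how often the bet is on heads (= 1) when the action space
# is updated with every observed state.
np.random.seed(0)
asp = [0, 1]
heads = 0
for _ in range(10000):
    a = np.random.choice(asp)
    s = np.random.choice(ssp)
    heads += a
    asp.append(s)
print(heads / 10000)  # approaches 0.8 as the action space fills up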
Types of Tasks
Depending on the type of labels data and the problem at hand, two types of tasks to be learned are important:
- Estimation: Estimation (or approximation, regression) refers to the cases in which the labels data is real-valued (continuous); that is, it is technically represented by floating point numbers.
- Classification: Classification refers to the cases in which the labels data consists of a finite number of classes or categories that are typically represented by discrete values (positive natural numbers), which in turn are represented technically as integers.
The following section provides examples for both types of tasks.
Types of Approaches
Some more definitions might be in order before finishing this section. This book follows the common differentiation between the following three major terms:
- Artificial intelligence (AI): AI encompasses all types of learning (algorithms), as defined before, and some more (for example, expert systems).
- Machine learning (ML): ML is the discipline of learning relationships and other information about given data sets based on an algorithm and a measure of success; such a measure might, for example, be the mean-squared error (MSE) given the labels values to be estimated and the predicted values from the algorithm (see the short sketch after this list). ML is a subset of AI.
- Deep learning (DL): DL encompasses all algorithms based on neural networks. The term deep is usually only used when the neural network has more than one hidden layer. DL is a subset of machine learning and therefore also a subset of AI.
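As a brief illustration of such a measure of success, the MSE mentioned in the ML definition can be written as

$$\text{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2$$

and coded in a single line (a generic sketch, independent of any particular algorithm):

def mse(y, y_pred):
    # Mean-squared error between the labels y and the predictions y_pred.
    return ((y - y_pred) ** 2).mean()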
DL has proven useful for a number of broad problem areas. It is suited for estimation and classification tasks, as well as for RL. In many cases, DL-based approaches perform better than alternative algorithms, such as logistic regression or kernel-based ones, like support vector machines.2 That is why this book mainly focuses on DL. DL approaches used include dense neural networks (DNNs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs). More details appear in later chapters, particularly in Part III.
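For orientation only, these three network types are instantiated along the following lines with Keras (a schematic sketch; the layer sizes and input shapes are arbitrary placeholders):

from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Conv1D

dnn = Sequential([Dense(32, activation='relu', input_dim=10)])  # dense layer
rnn = Sequential([SimpleRNN(32, input_shape=(5, 10))])  # recurrent layer
cnn = Sequential([Conv1D(32, 3, input_shape=(5, 10))])  # 1D convolutional layer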
Neural Networks
The previous sections provide a broad overview of algorithms in AI. This section shows how neural networks fit in. A simple example will illustrate what characterizes neural networks in comparison to traditional statistical methods, such as ordinary least-squares (OLS) regression. The example starts with mathematics and then uses linear regression for estimation (or function approximation) and finally applies neural networks to accomplish the estimation. The approach taken here is a supervised learning approach where the task is to estimate labels data based on features data. This section also illustrates the use of neural networks in the context of classification problems.
OLS Regression
Assume that a mathematical function is given as follows:
Such a function transforms an input value to an output value . Or it transforms a series of input values into a series of output values . The following Python code implements the mathematical function as a Python function and creates a number of input and output values. Figure 1-2 plots the output values against the input values:
In [18]: def f(x):
             return 2 * x ** 2 - x ** 3 / 3

In [19]: x = np.linspace(-2, 4, 25)
         x

Out[19]: array([-2.  , -1.75, -1.5 , -1.25, -1.  , -0.75, -0.5 , -0.25,  0.  ,
                 0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ,  1.75,  2.  ,  2.25,
                 2.5 ,  2.75,  3.  ,  3.25,  3.5 ,  3.75,  4.  ])

In [20]: y = f(x)
         y

Out[20]: array([10.6667,  7.9115,  5.625 ,  3.776 ,  2.3333,  1.2656,  0.5417,
                 0.1302,  0.    ,  0.1198,  0.4583,  0.9844,  1.6667,  2.474 ,
                 3.375 ,  4.3385,  5.3333,  6.3281,  7.2917,  8.1927,  9.    ,
                 9.6823, 10.2083, 10.5469, 10.6667])

In [21]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro');
Whereas in the mathematical example the function comes first, the input data second, and the output data third, the sequence is different in statistical learning. Assume that the previous input values $x$ and output values $y$ are given. They represent the sample (data). The problem in statistical regression is to find a function $\hat{f}(x)$ that approximates the functional relationship between the input values (also called the independent values) and the output values (also called the dependent values) as well as possible.

Assume simple OLS linear regression. In this case, the functional relationship between the input and output values is assumed to be linear, and the problem is to find optimal parameters $\alpha$ and $\beta$ for the following linear equation:

$$\hat{f}(x) = \alpha + \beta x$$

For given input values $x_1, x_2, \ldots, x_N$ and output values $y_1, y_2, \ldots, y_N$, optimal in this case means that the parameters minimize the mean squared error (MSE) between the real output values and the approximated output values:

$$\min_{\alpha, \beta} \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{f}(x_n) \right)^2$$

For the case of simple linear regression, the solution is known in closed form, as shown in the following equations. Bars on the variables indicate sample mean values:

$$\beta = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}, \qquad \alpha = \bar{y} - \beta \bar{x}$$
The following Python code calculates the optimal parameter values, linearly estimates (approximates) the output values, and plots the linear regression line alongside the sample data (see Figure 1-3). The linear regression approach does not work too well here in approximating the functional relationship. This is confirmed by the relatively high MSE value:
In [22]: beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
         beta

Out[22]: 1.0541666666666667

In [23]: alpha = y.mean() - beta * x.mean()
         alpha

Out[23]: 3.8625000000000003

In [24]: y_ = alpha + beta * x

In [25]: MSE = ((y - y_) ** 2).mean()
         MSE

Out[25]: 10.721953125

In [26]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         plt.plot(x, y_, lw=3.0, label='linear regression')
         plt.legend();
Calculation of optimal $\beta$.
Calculation of optimal $\alpha$.
Calculation of the estimated output values $\hat{y}$.
Calculation of the MSE given the approximation.
How can the MSE value be improved (decreased)—maybe even to 0, that is, to a “perfect estimation?” Of course, OLS regression is not constrained to a simple linear relationship. In addition to the constant and linear terms, higher order monomials, for instance, can be easily added as basis functions. To this end, compare the regression results shown in Figure 1-4 and the following code that creates the figure. The improvements that come from using quadratic and cubic monomials as basis functions are obvious and also are numerically confirmed by the calculated MSE values. For basis functions up to and including the cubic monomial, the estimation is perfect, and the functional relationship is perfectly recovered:
In [27]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for deg in [1, 2, 3]:
             reg = np.polyfit(x, y, deg=deg)
             y_ = np.polyval(reg, x)
             MSE = ((y - y_) ** 2).mean()
             print(f'deg={deg} | MSE={MSE:.5f}')
             plt.plot(x, np.polyval(reg, x), label=f'deg={deg}')
         plt.legend();

         deg=1 | MSE=10.72195
         deg=2 | MSE=2.31258
         deg=3 | MSE=0.00000

In [28]: reg

Out[28]: array([-0.3333,  2.    ,  0.    , -0.    ])
Exploiting the knowledge of the form of the mathematical function to be approximated and accordingly adding more basis functions to the regression leads to a “perfect approximation.” That is, the OLS regression recovers the exact factors of the quadratic and cubic part, respectively, of the original function.
Estimation with Neural Networks
However, not all relationships are of this kind. This is where, for instance, neural networks can help. Without going into the details, neural networks can approximate a wide range of functional relationships. Knowledge of the form of the relationship is generally not required.
Scikit-learn
The following Python code uses the MLPRegressor class of scikit-learn, which implements a DNN for estimation. DNNs are sometimes also called multi-layer perceptrons (MLPs).3 The results are not perfect, as Figure 1-5 and the MSE illustrate. However, they are quite good already for the simple configuration used:
In [29]: from sklearn.neural_network import MLPRegressor

In [30]: model = MLPRegressor(hidden_layer_sizes=3 * [256],
                              learning_rate_init=0.03,
                              max_iter=5000)

In [31]: model.fit(x.reshape(-1, 1), y)

Out[31]: MLPRegressor(hidden_layer_sizes=[256, 256, 256],
                      learning_rate_init=0.03, max_iter=5000)

In [32]: y_ = model.predict(x.reshape(-1, 1))

In [33]: MSE = ((y - y_) ** 2).mean()
         MSE

Out[33]: 0.021662355744355866

In [34]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         plt.plot(x, y_, lw=3.0, label='dnn estimation')
         plt.legend();
Instantiates the MLPRegressor object.
Implements the fitting or learning step.
Implements the prediction step.
Just having a look at the results in Figure 1-4 and Figure 1-5, one might assume that the methods and approaches are not too dissimilar after all. However, there is a fundamental difference worth highlighting. Although the OLS regression approach, as shown explicitly for the simple linear regression, is based on the calculation of certain well-specified quantities and parameters, the neural network approach relies on incremental learning. This in turn means that a set of parameters, the weights within the neural network, are first initialized randomly and then adjusted gradually given the differences between the neural network output and the sample output values. This approach lets you retrain (update) a neural network incrementally.
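A toy example may make this incremental logic concrete. The following sketch fits a single weight and bias by plain gradient descent on the MSE; it illustrates the principle only and is not what Keras or scikit-learn do internally in detail:

np.random.seed(1)
w, b = np.random.randn(), np.random.randn()  # random initialization
eta = 0.01  # learning rate

x_t = np.linspace(-1, 1, 20)  # toy features data
y_t = 2 * x_t  # toy labels data for the relationship y = 2x

for _ in range(2000):
    y_hat = w * x_t + b  # current output given the parameters
    grad_w = (2 * (y_hat - y_t) * x_t).mean()  # MSE gradient for w
    grad_b = (2 * (y_hat - y_t)).mean()  # MSE gradient for b
    w -= eta * grad_w  # incremental weight update
    b -= eta * grad_b  # incremental bias update
print(round(w, 4), round(b, 4))  # close to 2.0 and 0.0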
Keras
The next example uses a sequential model with the Keras deep learning package.4 The model is fitted, or trained, for 100 epochs. The procedure is repeated for five rounds. After every such round, the approximation by the neural network is updated and plotted. Figure 1-6 shows how the approximation gradually improves with every round. This is also reflected in the decreasing MSE values. The end result is not perfect, but again, it is quite good given the simplicity of the model:
In [35]: import tensorflow as tf
         tf.random.set_seed(100)

In [36]: from keras.layers import Dense
         from keras.models import Sequential

         Using TensorFlow backend.

In [37]: model = Sequential()
         model.add(Dense(256, activation='relu', input_dim=1))
         model.add(Dense(1, activation='linear'))
         model.compile(loss='mse', optimizer='rmsprop')

In [38]: ((y - y_) ** 2).mean()

Out[38]: 0.021662355744355866

In [39]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for _ in range(1, 6):
             model.fit(x, y, epochs=100, verbose=False)
             y_ = model.predict(x)
             MSE = ((y - y_.flatten()) ** 2).mean()
             print(f'round={_} | MSE={MSE:.5f}')
             plt.plot(x, y_, '--', label=f'round={_}')
         plt.legend();

         round=1 | MSE=3.09714
         round=2 | MSE=0.75603
         round=3 | MSE=0.22814
         round=4 | MSE=0.11861
         round=5 | MSE=0.09029
Instantiates the Sequential model object.
Adds a densely connected hidden layer with rectified linear unit (ReLU) activation.5
Adds the output layer with linear activation.
Compiles the model for usage.
Trains the neural network for a fixed number of epochs.
Implements the approximation step.
Calculates the current MSE.
Plots the current approximation results.
Roughly speaking, one can say that the neural network does almost as well in the estimation as the OLS regression, which delivers a perfect result. So why use neural networks at all? A more comprehensive answer might need to come later in this book, but a somewhat different example might give some hint.
Consider, instead of the previous sample data set generated from a well-defined mathematical function, a random sample data set for which both features and labels are randomly chosen. Of course, this example is for illustration and does not allow for a deep interpretation.
The following code generates the random sample data set and creates the OLS regression estimation based on a varying number of monomial basis functions. Figure 1-7 visualizes the results. Even for the highest number of monomials in the example, the estimation results are still not too good. The MSE value is accordingly relatively high:
In [40]: np.random.seed(0)
         x = np.linspace(-1, 1)
         y = np.random.random(len(x)) * 2 - 1

In [41]: plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for deg in [1, 5, 9, 11, 13, 15]:
             reg = np.polyfit(x, y, deg=deg)
             y_ = np.polyval(reg, x)
             MSE = ((y - y_) ** 2).mean()
             print(f'deg={deg:2d} | MSE={MSE:.5f}')
             plt.plot(x, np.polyval(reg, x), label=f'deg={deg}')
         plt.legend();

         deg= 1 | MSE=0.28153
         deg= 5 | MSE=0.27331
         deg= 9 | MSE=0.25442
         deg=11 | MSE=0.23458
         deg=13 | MSE=0.22989
         deg=15 | MSE=0.21672
The results for the OLS regression are not too surprising. OLS regression in this case assumes that the approximation can be achieved through an appropriate combination of a finite number of basis functions. Since the sample data set has been generated randomly, the OLS regression does not perform well in this case.
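Formally, OLS regression with monomials assumes that the approximation can be written as a linear combination of basis functions

$$\hat{f}(x) = \sum_{k=0}^{D} \alpha_k x^k$$

for some maximum degree $D$, with the coefficients $\alpha_k$ chosen to minimize the MSE. For purely random labels, as the output above shows, adding more such basis functions reduces the MSE only marginally.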
What about neural networks? The application is as straightforward as before and yields estimations as shown in Figure 1-8. While the end result is not perfect, it is obvious that the neural network performs better than the OLS regression in estimating the random label values from the random features values. Given its architecture, however, the neural network has almost 200,000 trainable parameters (weights), which offers relatively high flexibility, particularly when compared to the OLS regression, for which a maximum of 15 + 1 parameters are used:
In [42]: model = Sequential()
         model.add(Dense(256, activation='relu', input_dim=1))
         for _ in range(3):
             model.add(Dense(256, activation='relu'))
         model.add(Dense(1, activation='linear'))
         model.compile(loss='mse', optimizer='rmsprop')

In [43]: model.summary()

         Model: "sequential_2"
         _________________________________________________________________
         Layer (type)                 Output Shape              Param #
         =================================================================
         dense_3 (Dense)              (None, 256)               512
         _________________________________________________________________
         dense_4 (Dense)              (None, 256)               65792
         _________________________________________________________________
         dense_5 (Dense)              (None, 256)               65792
         _________________________________________________________________
         dense_6 (Dense)              (None, 256)               65792
         _________________________________________________________________
         dense_7 (Dense)              (None, 1)                 257
         =================================================================
         Total params: 198,145
         Trainable params: 198,145
         Non-trainable params: 0
         _________________________________________________________________

In [44]: %%time
         plt.figure(figsize=(10, 6))
         plt.plot(x, y, 'ro', label='sample data')
         for _ in range(1, 8):
             model.fit(x, y, epochs=500, verbose=False)
             y_ = model.predict(x)
             MSE = ((y - y_.flatten()) ** 2).mean()
             print(f'round={_} | MSE={MSE:.5f}')
             plt.plot(x, y_, '--', label=f'round={_}')
         plt.legend();

         round=1 | MSE=0.13560
         round=2 | MSE=0.08337
         round=3 | MSE=0.06281
         round=4 | MSE=0.04419
         round=5 | MSE=0.03329
         round=6 | MSE=0.07676
         round=7 | MSE=0.00431
         CPU times: user 30.4 s, sys: 4.7 s, total: 35.1 s
         Wall time: 13.6 s
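The parameter counts in the summary follow from simple arithmetic for densely connected layers, namely inputs times units plus one bias per unit; a quick check:

first = 1 * 256 + 256  # input to first hidden layer: 512
hidden = 256 * 256 + 256  # each further hidden layer: 65,792
output = 256 * 1 + 1  # output layer: 257
print(first + 3 * hidden + output)  # 198,145 trainable parameters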
Classification with Neural Networks
Another benefit of neural networks is that they can be easily used for classification tasks as well. Consider the following Python code that implements a classification using a neural network based on Keras. The binary features data and labels data are generated randomly. The major adjustment to be made modeling-wise is to change the activation function of the output layer from linear to sigmoid. More details on this appear in later chapters. The classification is not perfect. However, it reaches a high level of accuracy. How the accuracy, expressed as the ratio of correct results to all label values, changes with the number of training epochs is shown in Figure 1-9. The accuracy starts out low and then improves step-wise, though not necessarily with every step:
In [45]: f = 5
         n = 10

In [46]: np.random.seed(100)

In [47]: x = np.random.randint(0, 2, (n, f))
         x

Out[47]: array([[0, 0, 1, 1, 1],
                [1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [0, 1, 0, 0, 0],
                [1, 1, 1, 0, 0],
                [1, 0, 0, 1, 1],
                [1, 1, 1, 0, 0],
                [1, 1, 1, 1, 1],
                [1, 1, 1, 0, 1]])

In [48]: y = np.random.randint(0, 2, n)
         y

Out[48]: array([1, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [49]: model = Sequential()
         model.add(Dense(256, activation='relu', input_dim=f))
         model.add(Dense(1, activation='sigmoid'))
         model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                       metrics=['acc'])

In [50]: h = model.fit(x, y, epochs=50, verbose=False)

Out[50]: <keras.callbacks.callbacks.History at 0x7fde09dd1cd0>

In [51]: y_ = np.where(model.predict(x).flatten() > 0.5, 1, 0)
         y_

Out[51]: array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1], dtype=int32)

In [52]: y == y_

Out[52]: array([ True,  True,  True,  True, False,  True,  True,  True,  True,
                 True])

In [53]: res = pd.DataFrame(h.history)

In [54]: res.plot(figsize=(10, 6));
Creates random features data.
Creates random labels data.
Defines the activation function for the output layer as sigmoid.
Defines the loss function to be binary_crossentropy.6
Compares the predicted values with the labels data.
Plots the loss function and accuracy values for every training step.
The examples in this section illustrate some fundamental characteristics of neural networks as compared to OLS regression:
- Problem-agnostic: The neural network approach is agnostic when it comes to estimating and classifying label values, given a set of feature values. Statistical methods, such as OLS regression, might perform well for a smaller set of problems, but not too well or not at all for others.
- Incremental learning: The optimal weights within a neural network, given a target measure of success, are learned incrementally based on a random initialization and incremental improvements. These incremental improvements are achieved by considering the differences between the predicted values and the sample label values and backpropagating weight updates through the neural network.
- Universal approximation: There are strong mathematical theorems showing that neural networks (even with one hidden layer only) can approximate almost any function.7
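One classical formulation of such a theorem, stated informally here for orientation (the precise assumptions vary across the literature): for every continuous function $f$ on a compact set $K \subset \mathbb{R}^d$, every nonconstant, bounded, continuous activation function $\sigma$, and every $\epsilon > 0$, there are a number $N$ and parameters $v_i, b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^d$ such that

$$\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\left(w_i^\top x + b_i\right) \right| < \epsilon$$

In other words, a single hidden layer with enough units can approximate $f$ uniformly to any desired accuracy.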
These characteristics might justify why this book puts neural networks at the core with regard to the algorithms used. Chapter 2 discusses more good reasons.
Importance of Data
The example at the end of the previous section shows that neural networks are capable of solving classification problems quite well. The neural network with one hidden layer reaches a high degree of accuracy on the given data set, or in-sample. However, what about the predictive power of a neural network? This hinges significantly on the volume and variety of the data available to train the neural network. Another numerical example, based on larger data sets, will illustrate this point.
Small Data Set
Consider a random sample data set similar to the one used before in the classification example, but with more features and more samples. Most algorithms used in AI are about pattern recognition. In the following Python code, the number of binary features defines the number of possible patterns about which the algorithm can learn something. Given that the labels data is also binary, the algorithm tries to learn whether a 0 or 1 is more likely given a certain pattern, say [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]. Because all numbers are randomly chosen with equal probability, there is not that much to learn beyond the fact that the labels 0 and 1 are equally likely no matter what (random) pattern is observed. Therefore, a baseline prediction algorithm should be accurate about 50% of the time, no matter what (random) pattern it is presented with:
In [55]: f = 10
         n = 250

In [56]: np.random.seed(100)

In [57]: x = np.random.randint(0, 2, (n, f))
         x[:4]

Out[57]: array([[0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
                [0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
                [1, 0, 0, 1, 1, 1, 1, 1, 0, 0]])

In [58]: y = np.random.randint(0, 2, n)
         y[:4]

Out[58]: array([0, 1, 0, 0])

In [59]: 2 ** f

Out[59]: 1024
In order to proceed, the raw data is put into a pandas DataFrame object, which simplifies certain operations and analyses:
In [60]: fcols = [f'f{_}' for _ in range(f)]
         fcols

Out[60]: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9']

In [61]: data = pd.DataFrame(x, columns=fcols)
         data['l'] = y

In [62]: data.info()

         <class 'pandas.core.frame.DataFrame'>
         RangeIndex: 250 entries, 0 to 249
         Data columns (total 11 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   f0      250 non-null    int64
          1   f1      250 non-null    int64
          2   f2      250 non-null    int64
          3   f3      250 non-null    int64
          4   f4      250 non-null    int64
          5   f5      250 non-null    int64
          6   f6      250 non-null    int64
          7   f7      250 non-null    int64
          8   f8      250 non-null    int64
          9   f9      250 non-null    int64
          10  l       250 non-null    int64
         dtypes: int64(11)
         memory usage: 21.6 KB
Defines column names for the features data.
Puts the features data into a DataFrame object.
Puts the labels data into the same DataFrame object.
Shows the meta information for the data set.
Two major problems can be identified given the results from executing the following Python code. First, not all patterns are in the sample data set. Second, the sample size is much too small per observed pattern. Even without digging deeper, it is clear that no classification algorithm can really learn about all the possible patterns in a meaningful way:
In [63]: grouped = data.groupby(list(data.columns))

In [64]: freq = grouped['l'].size().unstack(fill_value=0)

In [65]: freq['sum'] = freq[0] + freq[1]

In [66]: freq.head(10)

Out[66]: l                               0  1  sum
         f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
         0  0  0  0  0  0  0  1  1  1   0  1    1
                           1  0  1  0   1  1    2
                                    1   0  1    1
         ...

In [67]: freq['sum'].describe().astype(int)

Out[67]: count    227
         mean       1
         std        0
         min        1
         25%        1
         50%        1
         75%        1
         max        2
         Name: sum, dtype: int64
Groups the data along all columns.
Unstacks the grouped data for the labels column.
Adds up the frequencies for a 0 and a 1.
Shows the frequencies for a 0 and a 1 given a certain pattern.
Provides statistics for the sum of the frequencies.
The following Python code uses the MLPClassifier model from scikit-learn.8 The model is trained on the whole data set. What about the ability of a neural network to learn about the relationships within a given data set? The ability is pretty high, as the in-sample accuracy score shows. It is in fact above 95%, a result driven to a large extent by the relatively high neural network capacity given the relatively small data set:
In [68]: from sklearn.neural_network import MLPClassifier
         from sklearn.metrics import accuracy_score

In [69]: model = MLPClassifier(hidden_layer_sizes=[128, 128, 128],
                               max_iter=1000, random_state=100)

In [70]: model.fit(data[fcols], data['l'])

Out[70]: MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
                       random_state=100)

In [71]: accuracy_score(data['l'], model.predict(data[fcols]))

Out[71]: 0.952
But what about the predictive power of a trained neural network? To this end, the given data set can be split into a training and a test data sub-set. The model is trained on the training data sub-set only and then tested with regard to its predictive power on the test data set. As before, the accuracy of the trained neural network is pretty high in-sample (that is, on the training data set). However, it is more than 10 percentage points worse than an uninformed baseline algorithm on the test data set:
In [72]: split = int(len(data) * 0.7)

In [73]: train = data[:split]
         test = data[split:]

In [74]: model.fit(train[fcols], train['l'])

Out[74]: MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
                       random_state=100)

In [75]: accuracy_score(train['l'], model.predict(train[fcols]))

Out[75]: 0.9714285714285714

In [76]: accuracy_score(test['l'], model.predict(test[fcols]))

Out[76]: 0.38666666666666666
Splits the data into train and test data subsets.
Trains the model on the training data set only.
Reports the accuracy in-sample (training data set).
Reports the accuracy out-of-sample (test data set).
Roughly speaking, the neural network, trained on a small data set only, learns wrong relationships due to the two major problem areas identified before. The problems are not really relevant in the context of learning relationships in-sample. On the contrary, the smaller a data set is, the more easily in-sample relationships can be learned in general. However, the problem areas are highly relevant when using the trained neural network to generate predictions out-of-sample.
Larger Data Set
Fortunately, there is often a clear way out of this problematic situation: a larger data set. In the face of real-world problems, this theoretical insight is just as valid. From a practical point of view, though, such larger data sets are not always available, nor can they often be generated so easily. However, in the context of the example of this section, a larger data set is indeed easily created.
The following Python code increases the number of samples in the initial sample data set significantly. The result is that the prediction accuracy of the trained neural network increases by more than 10 percentage points, to a level of about 50%, which is to be expected given the nature of the labels data. It is now in line with an uninformed baseline algorithm:
In [77]: factor = 50

In [78]: big = pd.DataFrame(np.random.randint(0, 2, (factor * n, f)),
                            columns=fcols)

In [79]: big['l'] = np.random.randint(0, 2, factor * n)

In [80]: train = big[:split]
         test = big[split:]

In [81]: model.fit(train[fcols], train['l'])

Out[81]: MLPClassifier(hidden_layer_sizes=[128, 128, 128], max_iter=1000,
                       random_state=100)

In [82]: accuracy_score(train['l'], model.predict(train[fcols]))

Out[82]: 0.9657142857142857

In [83]: accuracy_score(test['l'], model.predict(test[fcols]))

Out[83]: 0.5043407707910751
A quick analysis of the available data, as shown next, explains the increase in the prediction accuracy. First, all possible patterns are now represented in the data set. Second, all patterns have an average frequency of above 10 in the data set. In other words, the neural network sees basically all the patterns multiple times. This allows the neural network to “learn” that both labels 0 and 1 are equally likely for all possible patterns. Of course, it is a rather involved way of learning this, but it is a good illustration of the fact that a relatively small data set might often be too small in the context of neural networks:
In [84]: grouped = big.groupby(list(data.columns))

In [85]: freq = grouped['l'].size().unstack(fill_value=0)

In [86]: freq['sum'] = freq[0] + freq[1]

In [87]: freq.head(6)

Out[87]: l                               0  1  sum
         f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
         0  0  0  0  0  0  0  0  0  0  10  9   19
                                    1   5  4    9
                                 1  0   2  5    7
                                    1   6  6   12
                              1  0  0   9  8   17
                                    1   7  4   11

In [88]: freq['sum'].describe().astype(int)

Out[88]: count    1024
         mean       12
         std         3
         min         2
         25%        10
         50%        12
         75%        15
         max        26
         Name: sum, dtype: int64
Volume and Variety
In the context of neural networks that perform prediction tasks, the volume and variety of the available data used to train the neural network are decisive for its prediction performance. The numerical, hypothetical examples in this section show that the same neural network trained on a relatively small and not-as-varied data set underperforms its counterpart trained on a relatively large and varied data set by more than 10 percentage points. This difference can be considered huge given that AI practitioners and companies often fight for improvements as small as a tenth of a percentage point.
Big Data
What is the difference between a larger data set and a big data set? The term big data has been used for more than a decade now to mean a number of things. For the purposes of this book, one might say that a big data set is large enough—in terms of volume, variety, and also maybe velocity—for an AI algorithm to be trained properly such that the algorithm performs better at a prediction task as compared to a baseline algorithm.
The larger data set used before is still small in practical terms. However, it is large enough to accomplish the specified goal. The required volume and variety of the data set are mainly driven by the structure and characteristics of the features and labels data.
In this context, assume that a retail bank implements a neural network–based classification approach for credit scoring. Given in-house data, the responsible data scientist designs 25 categorical features, every one of which can take on 8 different values. The resulting number of patterns is astronomically large:
In [89]: 8 ** 25

Out[89]: 37778931862957161709568
It is clear that no single data set can provide a neural network with exposure to every single one of these patterns.9 Fortunately, in practice this is not necessary for the neural network to learn about the creditworthiness based on data for regular, defaulting, and/or rejected debtors. It is also not necessary in general to generate “good” predictions with regard to the creditworthiness of every potential debtor.
This is due to a number of reasons. To name only a few: first, not every pattern will be relevant in practice—some patterns might simply not exist, might be impossible, and so forth. Second, not all features might be equally important, reducing the number of relevant features and thereby the number of possible patterns. Third, a value of 4 or 5 for feature number 7, say, might not make a difference at all, further reducing the number of relevant patterns.
Conclusions
For this book, artificial intelligence, or AI, encompasses methods, techniques, algorithms, and so on that are able to learn relationships, rules, probabilities, and more from data. The focus lies on supervised learning algorithms, such as those for estimation and classification. With regard to algorithms, neural networks and deep learning approaches are at the core.
The central theme of this book is the application of neural networks to one of the core problems in finance: the prediction of future market movements. More specifically, the problem might be to predict the direction of movement for a stock index or the exchange rate for a currency pair. The prediction of the future market direction (that is, whether a target level or price goes up or down) is a problem that can be easily cast into a classification setting.
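As a minimal sketch of this casting (the price numbers are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical closing prices of some financial instrument.
prices = pd.Series([100.0, 101.5, 101.0, 102.3, 101.8])
returns = np.log(prices / prices.shift(1))  # log returns
direction = np.where(returns > 0, 1, 0)  # 1 = up, 0 = down (binary labels)
print(direction[1:])  # [1 0 1 0]

Labels of this kind, combined with suitable features data, put the market prediction problem in exactly the supervised classification setting used throughout this chapter.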
Before diving deeper into the core theme itself, the next chapter first discusses selected topics related to what is called superintelligence and technological singularity. That discussion will provide useful background for the chapters that follow, which focus on finance and the application of AI to the financial domain.
References
Books and papers cited in this chapter:
1 For details, see sklearn.cluster.KMeans and VanderPlas (2017, ch. 5).
2 For details, see VanderPlas (2017, ch. 5).
3 For details, see sklearn.neural_network.MLPRegressor. For more background, see Goodfellow et al. (2016, ch. 6).
4 For details, see Chollet (2017, ch. 3).
5 For details on activation functions with Keras, see https://keras.io/activations.
6 The loss function calculates the prediction error of the neural network (or other ML algorithms). Binary cross entropy is an appropriate loss function for binary classification problems, while the mean squared error (MSE) is, for example, appropriate for estimation problems. For details on loss functions with Keras, see https://keras.io/losses.
7 See, for example, Kratsios (2019).
8 For details, see sklearn.neural_network.MLPClassifier.
9 Nor would current compute technology allow one to model and train a neural network based on such a data set if it were available. In this context, the next chapter discusses the importance of hardware for AI.