Chapter 4. Machine Learning-Based Volatility Prediction
The most critical feature of the conditional return distribution is arguably its second moment structure, which is empirically the dominant time-varying characteristic of the distribution. This fact has spurred an enormous literature on the modeling and forecasting of return volatility.
Andersen et al. (2003)
“Some concepts are easy to understand but hard to define. This also holds true for volatility.” This could be a quote from someone living before Markowitz, because his way of modeling volatility is very clear and intuitive. Markowitz proposed his celebrated portfolio theory, in which he defined volatility as standard deviation, and from then onward finance became more intertwined with mathematics.
Volatility is the backbone of finance in the sense that it not only provides an information signal to investors but also serves as an input to various financial models. What makes volatility so important? The answer lies in the importance of uncertainty, which is the main characteristic of financial modeling.
Increased integration of financial markets has led to prolonged uncertainty in those markets, which in turn stresses the importance of volatility, the degree to which the values of financial assets change. Volatility, used as a proxy for risk, is among the most important variables in many fields, including asset pricing and risk management. Its strong presence and latency make it all the more necessary to model. Volatility as a risk measure has taken on a key role in risk management following the Basel Accord that came into effect in 1996 (Karasan and Gaygisiz 2020).
A large and growing body of literature on the estimation of volatility has emerged after the ground-breaking study of Black (1976), including Andersen and Bollerslev (1997), Raju and Ghosh (2004), Dokuchaev (2014), and De Stefani et al. (2017). We are talking about a long tradition of volatility prediction using ARCH- and GARCH-type models, which have certain drawbacks that might cause failures in the presence of phenomena such as volatility clustering, information asymmetry, and so on. Even though these issues are addressed by different models, recent fluctuations in financial markets coupled with developments in ML have made researchers rethink volatility estimation.
In this chapter, our aim is to show how we can enhance predictive performance using ML-based models. We will visit various ML algorithms, namely support vector regression, neural networks, and deep learning, so that we can compare their predictive performance.
Modeling volatility amounts to modeling uncertainty so that we better understand and approach uncertainty, enabling us to have good enough approximations of the real world. To gauge the extent to which proposed models account for the real-world situation, we need to calculate the return volatility, which is also known as realized volatility. Realized volatility is the square root of realized variance, which is the sum of squared returns. Realized volatility is used to measure the performance of the volatility prediction methods. Here is the formula for return volatility:

$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(r_i - \mu)^2}$$

where $r_i$ and $\mu$ are the return and mean return, and n is the number of observations.
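To make the formula concrete, here is a minimal sketch (not from the book) that applies it to a handful of hypothetical daily returns and checks that it matches the sample standard deviation reported by NumPy; the chapter then computes a rolling version of the same quantity from real data:

```python
import numpy as np

# Hypothetical daily returns in percent (illustrative values only)
r = np.array([0.5, -1.2, 0.3, 0.8, -0.4])
n = len(r)
mu = r.mean()

# Realized volatility as defined above: square root of the bias-corrected
# average squared deviation from the mean return
realized_vol = np.sqrt(((r - mu) ** 2).sum() / (n - 1))

# NumPy's sample standard deviation (ddof=1) gives the same number
print(realized_vol, np.std(r, ddof=1))
```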
Let’s see how return volatility is computed in Python:
In [1]: import numpy as np
        from scipy.stats import norm
        import scipy.optimize as opt
        import yfinance as yf
        import pandas as pd
        import datetime
        import time
        from arch import arch_model
        import matplotlib.pyplot as plt
        from numba import jit
        from sklearn.metrics import mean_squared_error as mse
        import warnings
        warnings.filterwarnings('ignore')
In [2]: stocks = '^GSPC'
        start = datetime.datetime(2010, 1, 1)
        end = datetime.datetime(2021, 8, 1)
        s_p500 = yf.download(stocks, start=start, end=end, interval='1d')
[*********************100%***********************]  1 of 1 completed
In [3]: ret = 100 * (s_p500.pct_change()[1:]['Adj Close'])
        realized_vol = ret.rolling(5).std()
In [4]: plt.figure(figsize=(10, 6))
        plt.plot(realized_vol.index, realized_vol)
        plt.title('Realized Volatility- S&P-500')
        plt.ylabel('Volatility')
        plt.xlabel('Date')
        plt.show()
Figure 4-1 shows the realized volatility of S&P 500 over the period of 2010–2021. The most striking observation is the spikes around the COVID-19 pandemic.
The way volatility is estimated has an undeniable impact on the reliability and accuracy of the related analysis. So this chapter deals with both classical and ML-based volatility prediction techniques, with a view to showing the superior prediction performance of the ML-based models. To compare against the brand-new ML-based models, we start by estimating the classical volatility models. Some very well-known classical volatility models include, but are not limited to, the following:
- ARCH
- GARCH
- GJR-GARCH
- EGARCH
It’s time to dig into the classical volatility models. Let’s start off with the ARCH model.
ARCH Model
One of the early attempts to model volatility was proposed by Engle (1982) and is known as the ARCH model. The ARCH model is a univariate model and is based on historical asset returns. The ARCH(p) model has the following form:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k (r_{t-k})^2$$

where the mean model is:

$$r_t = \sigma_t \epsilon_t$$

where $\epsilon_t$ is assumed to be normally distributed. In this parametric model, we need to satisfy some assumptions to have strictly positive variance. In this respect, the following conditions should hold:

- $\omega > 0$
- $\alpha_k \geq 0$
All of these equations tell us that ARCH is a univariate and nonlinear model in which volatility is estimated with the square of past returns. One of the most distinctive features of ARCH is that it has the property of time-varying conditional variance1 so that ARCH is able to model the phenomenon known as volatility clustering—that is, large changes tend to be followed by large changes of either sign, and small changes tend to be followed by small changes, as described by Mandelbrot (1963). Hence, once an important announcement is made to the market, it might result in huge volatility.
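Before plotting the real data, a short simulation sketch (not part of the original code, with hypothetical parameter values) may help: it generates returns from an ARCH(1) process and shows that squared returns are autocorrelated even though the returns themselves are not, which is exactly the clustering described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ARCH(1) parameters (chosen for illustration only)
omega, alpha = 0.2, 0.7
T = 1000

sigma2 = np.zeros(T)
r = np.zeros(T)
sigma2[0] = omega / (1 - alpha)  # unconditional variance as a starting point

for t in range(1, T):
    # Conditional variance depends on the previous squared return
    sigma2[t] = omega + alpha * r[t - 1] ** 2
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

def lag1_autocorr(x):
    # Simple lag-1 autocorrelation
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print('lag-1 autocorrelation of returns:        ', round(lag1_autocorr(r), 3))
print('lag-1 autocorrelation of squared returns:', round(lag1_autocorr(r ** 2), 3))
```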
The following code block shows how to plot clustering and what it looks like:
In [5]: retv = ret.values

In [6]: plt.figure(figsize=(10, 6))
        plt.plot(s_p500.index[1:], ret)
        plt.title('Volatility clustering of S&P-500')
        plt.ylabel('Daily returns')
        plt.xlabel('Date')
        plt.show()
Similar to spikes in realized volatility, Figure 4-2 suggests some large movements, and, unsurprisingly, these ups and downs happen around important events such as the COVID-19 pandemic in mid-2020.
Despite its appealing features, such as simplicity, nonlinearity, and ease of adjustment for forecasting, the ARCH model has certain drawbacks:
- Equal response to positive and negative shocks
- Strong assumptions such as restrictions on parameters
- Possible misprediction due to slow adjustments to large movements
These drawbacks motivated researchers to work on extensions of the ARCH model, notably the GARCH model proposed by Bollerslev (1986) and Taylor (1986), which we will discuss shortly.
Now let's employ the ARCH model to predict volatility. First, let's generate our own Python code, and then compare it with a built-in function from the arch library to see the differences:
In [7]: n = 252
        split_date = ret.iloc[-n:].index
In [8]: sgm2 = ret.var()
        K = ret.kurtosis()
        alpha = (-3.0 * sgm2 + np.sqrt(9.0 * sgm2 ** 2 - 12.0 *
                                       (3.0 * sgm2 - K) * K)) / (6 * K)
        omega = (1 - alpha) * sgm2
        initial_parameters = [alpha, omega]
        omega, alpha
Out[8]: (0.6345749196895419, 0.46656704131150534)
In [9]: @jit(nopython=True, parallel=True)
        def arch_likelihood(initial_parameters, retv):
            omega = abs(initial_parameters[0])
            alpha = abs(initial_parameters[1])
            T = len(retv)
            logliks = 0
            sigma2 = np.zeros(T)
            sigma2[0] = np.var(retv)
            for t in range(1, T):
                sigma2[t] = omega + alpha * (retv[t - 1]) ** 2
            logliks = np.sum(0.5 * (np.log(sigma2) + retv ** 2 / sigma2))
            return logliks
In [10]: logliks = arch_likelihood(initial_parameters, retv)
         logliks
Out[10]: 1453.127184488521
In [11]: def opt_params(x0, retv):
             opt_result = opt.minimize(arch_likelihood, x0=x0, args=(retv),
                                       method='Nelder-Mead',
                                       options={'maxiter': 5000})
             params = opt_result.x
             print('\nResults of Nelder-Mead minimization\n{}\n{}'
                   .format(''.join(['-'] * 28), opt_result))
             print('\nResulting params = {}'.format(params))
             return params
In [12]: params = opt_params(initial_parameters, retv)

Results of Nelder-Mead minimization
----------------------------
 final_simplex: (array([[0.70168795, 0.39039044],
       [0.70163494, 0.3904423 ],
       [0.70163928, 0.39033154]]), array([1385.79241695, 1385.792417  ,
       1385.79241907]))
           fun: 1385.7924169507244
       message: 'Optimization terminated successfully.'
          nfev: 62
           nit: 33
        status: 0
       success: True
             x: array([0.70168795, 0.39039044])

Resulting params = [0.70168795 0.39039044]
In [13]: def arch_apply(ret):
             omega = params[0]
             alpha = params[1]
             T = len(ret)
             sigma2_arch = np.zeros(T + 1)
             sigma2_arch[0] = np.var(ret)
             for t in range(1, T):
                 sigma2_arch[t] = omega + alpha * ret[t - 1] ** 2
             return sigma2_arch

In [14]: sigma2_arch = arch_apply(ret)
- Defining the split location and assigning the split data to the split_date variable
- Calculating the variance of the S&P 500
- Calculating the kurtosis of the S&P 500
- Identifying the initial value for the slope coefficient
- Identifying the initial value for the constant term
- Using parallel processing to decrease the processing time
- Taking absolute values and assigning the initial values to the related variables
- Identifying the initial values of volatility
- Iterating the variance of the S&P 500
- Calculating the log-likelihood
- Minimizing the log-likelihood function
- Creating a variable params for the optimized parameters
Well, we modeled volatility via ARCH using our own optimization method and ARCH equation. But how about comparing it with the built-in Python code? This built-in function can be imported from the arch library and is extremely easy to apply. The result of the built-in function follows; it turns out that the two results are very similar to each other:
In [15]: arch = arch_model(ret, mean='zero', vol='ARCH', p=1).fit(disp='off')
         print(arch.summary())
                       Zero Mean - ARCH Model Results
=============================================================================
Dep. Variable:              Adj Close   R-squared:                      0.000
Mean Model:                 Zero Mean   Adj. R-squared:                 0.000
Vol Model:                       ARCH   Log-Likelihood:              -4063.63
Distribution:                  Normal   AIC:                          8131.25
Method:            Maximum Likelihood   BIC:                          8143.21
                                        No. Observations:                2914
Date:                Mon, Sep 13 2021   Df Residuals:                    2914
Time:                        21:56:56   Df Model:                           0
                             Volatility Model
========================================================================
                 coef    std err          t      P>|t|  95.0% Conf. Int.
------------------------------------------------------------------------
omega          0.7018  5.006e-02     14.018  1.214e-44   [ 0.604, 0.800]
alpha[1]       0.3910  7.016e-02      5.573  2.506e-08   [ 0.253, 0.529]
========================================================================
Covariance estimator: robust
Although developing our own code is always helpful and improves our understanding, it does not necessarily mean that there's no need to use built-in functions or libraries. Rather, these functions make our lives easier in terms of efficiency and ease of use.
All we need is to create a for loop and define a proper information criterion. Here, we'll choose the Bayesian information criterion (BIC) as the model selection method and use it to select the lag. The reason BIC is used is that, as long as we have large enough samples, BIC is a reliable tool for model selection, as per Burnham and Anderson (2002 and 2004). Now, we iterate the ARCH model over lags 1 through 4:
In [16]: bic_arch = []

         for p in range(1, 5):
             arch = arch_model(ret, mean='zero', vol='ARCH', p=p)\
                     .fit(disp='off')
             bic_arch.append(arch.bic)
             if arch.bic == np.min(bic_arch):
                 best_param = p
         arch = arch_model(ret, mean='zero', vol='ARCH', p=best_param)\
                 .fit(disp='off')
         print(arch.summary())
         forecast = arch.forecast(start=split_date[0])
         forecast_arch = forecast
                       Zero Mean - ARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                       ARCH   Log-Likelihood:               -3712.38
Distribution:                  Normal   AIC:                           7434.75
Method:            Maximum Likelihood   BIC:                           7464.64
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:56:58   Df Model:                            0
                              Volatility Model
==========================================================================
                 coef    std err          t      P>|t|    95.0% Conf. Int.
--------------------------------------------------------------------------
omega          0.2798  2.584e-02     10.826  2.580e-27     [ 0.229, 0.330]
alpha[1]       0.1519  3.460e-02      4.390  1.136e-05  [8.406e-02, 0.220]
alpha[2]       0.2329  3.620e-02      6.433  1.249e-10     [ 0.162, 0.304]
alpha[3]       0.1917  3.707e-02      5.170  2.337e-07     [ 0.119, 0.264]
alpha[4]       0.1922  4.158e-02      4.623  3.780e-06     [ 0.111, 0.274]
==========================================================================
Covariance estimator: robust
In [17]: rmse_arch = np.sqrt(mse(realized_vol[-n:] / 100,
                                 np.sqrt(forecast_arch\
                                 .variance.iloc[-len(split_date):] / 100)))
         print('The RMSE value of ARCH model is {:.4f}'.format(rmse_arch))
The RMSE value of ARCH model is 0.0896
In [18]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_arch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-ARCH')
         plt.title('Volatility Prediction with ARCH', fontsize=12)
         plt.legend()
         plt.show()
- Iterating the ARCH parameter p over the specified interval
- Running the ARCH model with different p values
- Finding the minimum BIC score to select the best model
- Running the ARCH model with the best p value
- Forecasting the volatility based on the optimized ARCH model
- Calculating the root mean square error (RMSE) score
The result of volatility prediction based on our first model is shown in Figure 4-3.
GARCH Model
The GARCH model is an extension of the ARCH model that incorporates lagged conditional variance. In other words, ARCH is improved by adding lagged conditional variance terms, so that the conditional variance follows an autoregressive moving average structure with p lagged squared returns and q lagged conditional variances. GARCH(p, q) can be formulated as:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k r_{t-k}^2 + \sum_{k=1}^{q} \beta_k \sigma_{t-k}^2$$

where $\omega$, $\alpha_k$, and $\beta_k$ are parameters to be estimated and p and q are the maximum lags in the model. To have a consistent GARCH process, the following conditions should hold:

- $\omega > 0$ and $\alpha_k, \beta_k \geq 0$
- $\sum_{k} \alpha_k + \sum_{k} \beta_k < 1$
The ARCH model is unable to capture the influence of historical innovations. However, as a more parsimonious model, the GARCH model can account for the change in historical innovations because GARCH can be expressed as an infinite-order ARCH. Let's see how GARCH(1, 1) can be shown as an infinite-order ARCH:

$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$

Then replace $\sigma_{t-1}^2$ by $\omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2$:

$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta(\omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2) = \omega(1 + \beta) + \alpha r_{t-1}^2 + \alpha\beta r_{t-2}^2 + \beta^2 \sigma_{t-2}^2$$

Now, let's substitute $\sigma_{t-2}^2$ with $\omega + \alpha r_{t-3}^2 + \beta \sigma_{t-3}^2$ and keep iterating, so that we end up with:

$$\sigma_t^2 = \omega(1 + \beta + \beta^2 + \ldots) + \alpha \sum_{k=1}^{\infty} \beta^{k-1} r_{t-k}^2$$
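The derivation above can be checked numerically. The following sketch (with made-up parameters and simulated returns, purely for illustration) compares the GARCH(1, 1) recursion with its truncated infinite-order ARCH representation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical GARCH(1, 1) parameters and a random return history (illustration only)
omega, alpha, beta = 0.1, 0.15, 0.8
r = rng.standard_normal(500)

# Conditional variance via the usual GARCH(1, 1) recursion
sigma2 = np.zeros(len(r))
sigma2[0] = omega / (1 - alpha - beta)
for t in range(1, len(r)):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]

# The same quantity via the (truncated) infinite-order ARCH representation:
# sigma2_t = omega / (1 - beta) + alpha * sum_k beta^(k-1) * r_{t-k}^2
t = len(r) - 1
K = 200  # truncation point; beta**K is negligible here
infinite_arch = omega / (1 - beta) + alpha * sum(
    beta ** (k - 1) * r[t - k] ** 2 for k in range(1, K + 1))

print(sigma2[t], infinite_arch)  # the two values are nearly identical
```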
Similar to the ARCH model, there is more than one way to model volatility using GARCH in Python. Let us try to develop our own Python-based code using the optimization technique first. In what follows, the arch library will then be used to predict volatility:
In [19]: a0 = 0.0001
         sgm2 = ret.var()
         K = ret.kurtosis()
         h = 1 - alpha / sgm2
         alpha = np.sqrt(K * (1 - h ** 2) / (2.0 * (K + 3)))
         beta = np.abs(h - omega)
         omega = (1 - omega) * sgm2
         initial_parameters = np.array([omega, alpha, beta])
         print('Initial parameters for omega, alpha, and beta are \n{}\n{}\n{}'
               .format(omega, alpha, beta))
Initial parameters for omega, alpha, and beta are
0.43471178001576827
0.512827280537482
0.02677799855546381
In [20]: retv = ret.values

In [21]: @jit(nopython=True, parallel=True)
         def garch_likelihood(initial_parameters, retv):
             omega = initial_parameters[0]
             alpha = initial_parameters[1]
             beta = initial_parameters[2]
             T = len(retv)
             logliks = 0
             sigma2 = np.zeros(T)
             sigma2[0] = np.var(retv)
             for t in range(1, T):
                 sigma2[t] = omega + alpha * (retv[t - 1]) ** 2 + \
                             beta * sigma2[t - 1]
             logliks = np.sum(0.5 * (np.log(sigma2) + retv ** 2 / sigma2))
             return logliks
In [22]: logliks = garch_likelihood(initial_parameters, retv)
         print('The Log likelihood is {:.4f}'.format(logliks))
The Log likelihood is 1387.7215
In [23]: def garch_constraint(initial_parameters):
             alpha = initial_parameters[0]
             gamma = initial_parameters[1]
             beta = initial_parameters[2]
             return np.array([1 - alpha - beta])

In [24]: bounds = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
In [25]: def opt_paramsG(initial_parameters, retv):
             opt_result = opt.minimize(garch_likelihood,
                                       x0=initial_parameters,
                                       constraints=np.array([1 - alpha - beta]),
                                       bounds=bounds, args=(retv),
                                       method='Nelder-Mead',
                                       options={'maxiter': 5000})
             params = opt_result.x
             print('\nResults of Nelder-Mead minimization\n{}\n{}'
                   .format('-' * 35, opt_result))
             print('-' * 35)
             print('\nResulting parameters = {}'.format(params))
             return params
In [26]: params = opt_paramsG(initial_parameters, retv)

Results of Nelder-Mead minimization
-----------------------------------
 final_simplex: (array([[0.03918956, 0.17370549, 0.78991502],
       [0.03920507, 0.17374466, 0.78987403],
       [0.03916671, 0.17377319, 0.78993078],
       [0.03917324, 0.17364595, 0.78998753]]), array([979.87109624,
       979.8710967 , 979.87109865, 979.8711147 ]))
           fun: 979.8710962352685
       message: 'Optimization terminated successfully.'
          nfev: 178
           nit: 102
        status: 0
       success: True
             x: array([0.03918956, 0.17370549, 0.78991502])
-----------------------------------

Resulting parameters = [0.03918956 0.17370549 0.78991502]
In [27]: def garch_apply(ret):
             omega = params[0]
             alpha = params[1]
             beta = params[2]
             T = len(ret)
             sigma2 = np.zeros(T + 1)
             sigma2[0] = np.var(ret)
             for t in range(1, T):
                 sigma2[t] = omega + alpha * ret[t - 1] ** 2 + \
                             beta * sigma2[t - 1]
             return sigma2
The parameters we get from our own GARCH code are approximately:

- $\omega$ = 0.0392
- $\alpha$ = 0.1737
- $\beta$ = 0.7899
Now, let’s try it with the built-in Python function:
In [28]: garch = arch_model(ret, mean='zero', vol='GARCH', p=1, o=0, q=1)\
                 .fit(disp='off')
         print(garch.summary())
                      Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:               -3657.62
Distribution:                  Normal   AIC:                           7321.23
Method:            Maximum Likelihood   BIC:                           7339.16
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:08   Df Model:                            0
                               Volatility Model
============================================================================
                 coef    std err          t      P>|t|        95.0% Conf. Int.
----------------------------------------------------------------------------
omega          0.0392  8.422e-03      4.652  3.280e-06  [2.268e-02, 5.569e-02]
alpha[1]       0.1738  2.275e-02      7.637  2.225e-14       [ 0.129, 0.218]
beta[1]        0.7899  2.275e-02     34.715 4.607e-264       [ 0.745, 0.835]
============================================================================
Covariance estimator: robust
The built-in function confirms that we did a great job, as the parameters obtained via the built-in code are almost the same as ours, so we have learned how to code GARCH and ARCH models to predict volatility.
It’s apparent that it is easy to work with GARCH(1, 1), but how do we know that the parameters are the optimum ones? Let’s decide the optimum parameter set given the lowest BIC value (and in doing so, generate Figure 4-4):
In [29]: bic_garch = []

         for p in range(1, 5):
             for q in range(1, 5):
                 garch = arch_model(ret, mean='zero', vol='GARCH', p=p, o=0, q=q)\
                         .fit(disp='off')
                 bic_garch.append(garch.bic)
                 if garch.bic == np.min(bic_garch):
                     best_param = p, q
         garch = arch_model(ret, mean='zero', vol='GARCH',
                            p=best_param[0], o=0, q=best_param[1])\
                 .fit(disp='off')
         print(garch.summary())
         forecast = garch.forecast(start=split_date[0])
         forecast_garch = forecast
                      Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:               -3657.62
Distribution:                  Normal   AIC:                           7321.23
Method:            Maximum Likelihood   BIC:                           7339.16
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:10   Df Model:                            0
                               Volatility Model
============================================================================
                 coef    std err          t      P>|t|        95.0% Conf. Int.
----------------------------------------------------------------------------
omega          0.0392  8.422e-03      4.652  3.280e-06  [2.268e-02, 5.569e-02]
alpha[1]       0.1738  2.275e-02      7.637  2.225e-14       [ 0.129, 0.218]
beta[1]        0.7899  2.275e-02     34.715 4.607e-264       [ 0.745, 0.835]
============================================================================
Covariance estimator: robust
In [30]: rmse_garch = np.sqrt(mse(realized_vol[-n:] / 100,
                                  np.sqrt(forecast_garch\
                                  .variance.iloc[-len(split_date):] / 100)))
         print('The RMSE value of GARCH model is {:.4f}'.format(rmse_garch))
The RMSE value of GARCH model is 0.0878
In [31]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_garch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-GARCH')
         plt.title('Volatility Prediction with GARCH', fontsize=12)
         plt.legend()
         plt.show()
The volatility of returns is well fitted by the GARCH model, partly because it captures volatility clustering and partly because it does not assume that the returns are independent, which allows it to account for the leptokurtic property of returns. However, despite these useful properties and its intuitiveness, GARCH is not able to model the asymmetric response of volatility to shocks (Karasan and Gaygisiz 2020). To remedy this issue, GJR-GARCH was proposed by Glosten, Jagannathan, and Runkle (1993).
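The leptokurtosis point can be illustrated with a quick sketch (hypothetical parameters, not the fitted ones from the book): returns simulated from a GARCH(1, 1) process with Gaussian innovations exhibit positive excess kurtosis, i.e., fatter tails than the normal distribution driving them.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)

# Hypothetical GARCH(1, 1) parameters (illustration only)
omega, alpha, beta = 0.1, 0.15, 0.8
T = 100_000

sigma2 = np.zeros(T)
r = np.zeros(T)
sigma2[0] = omega / (1 - alpha - beta)

for t in range(1, T):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Excess kurtosis of the simulated returns is positive, even though
# the innovations are Gaussian (excess kurtosis of 0)
print('excess kurtosis of simulated GARCH returns:', round(kurtosis(r), 2))
```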
GJR-GARCH
The GJR-GARCH model performs well in modeling the asymmetric effects of announcements, in the sense that bad news has a larger impact than good news. In other words, in the presence of asymmetry, the distribution of losses has a fatter tail than the distribution of gains. The equation of the model includes one more parameter, $\gamma$, and it takes the following form:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} (\alpha_k + \gamma I_{t-k}) r_{t-k}^2 + \sum_{k=1}^{q} \beta_k \sigma_{t-k}^2$$

where $I_{t-k} = 1$ if $r_{t-k} < 0$ and 0 otherwise, and $\gamma$ controls for the asymmetry of the announcements:

- If $\gamma = 0$, the response to a past shock is the same regardless of its sign.
- If $\gamma > 0$, the response to a past negative shock is stronger than to a positive one.
- If $\gamma < 0$, the response to a past positive shock is stronger than to a negative one.
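As a minimal sketch of how the asymmetry term works (with hypothetical parameter values, not the fitted ones), the following code computes a GJR-GARCH(1, 1) conditional variance path in which the extra $\gamma$ term only activates after negative returns:

```python
import numpy as np

def gjr_garch_variance(returns, omega=0.05, alpha=0.05, gamma=0.25, beta=0.8):
    """Conditional variance of a GJR-GARCH(1, 1) with illustrative parameters."""
    T = len(returns)
    sigma2 = np.zeros(T)
    sigma2[0] = np.var(returns)
    for t in range(1, T):
        # Indicator is 1 only when the previous return is negative, so bad news
        # adds gamma * r^2 on top of the symmetric alpha term
        indicator = 1.0 if returns[t - 1] < 0 else 0.0
        sigma2[t] = (omega
                     + (alpha + gamma * indicator) * returns[t - 1] ** 2
                     + beta * sigma2[t - 1])
    return sigma2

rng = np.random.default_rng(0)
r = rng.standard_normal(10)
print(gjr_garch_variance(r))
```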
Let’s now run the GJR-GARCH model by finding the optimum parameter values using BIC, and producing Figure 4-5 as a result:
In [32]: bic_gjr_garch = []

         for p in range(1, 5):
             for q in range(1, 5):
                 gjrgarch = arch_model(ret, mean='zero', p=p, o=1, q=q)\
                            .fit(disp='off')
                 bic_gjr_garch.append(gjrgarch.bic)
                 if gjrgarch.bic == np.min(bic_gjr_garch):
                     best_param = p, q
         gjrgarch = arch_model(ret, mean='zero', p=best_param[0], o=1,
                               q=best_param[1]).fit(disp='off')
         print(gjrgarch.summary())
         forecast = gjrgarch.forecast(start=split_date[0])
         forecast_gjrgarch = forecast
                    Zero Mean - GJR-GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                  GJR-GARCH   Log-Likelihood:               -3593.36
Distribution:                  Normal   AIC:                           7194.73
Method:            Maximum Likelihood   BIC:                           7218.64
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:14   Df Model:                            0
                                Volatility Model
=============================================================================
                 coef    std err          t      P>|t|         95.0% Conf. Int.
-----------------------------------------------------------------------------
omega          0.0431  7.770e-03      5.542  2.983e-08   [2.784e-02, 5.829e-02]
alpha[1]       0.0386  3.060e-02      1.261      0.207  [-2.139e-02, 9.855e-02]
gamma[1]       0.2806  4.818e-02      5.824  5.740e-09        [ 0.186, 0.375]
beta[1]        0.7907  2.702e-02     29.263 3.029e-188        [ 0.738, 0.844]
=============================================================================
Covariance estimator: robust
In [33]: rmse_gjr_garch = np.sqrt(mse(realized_vol[-n:] / 100,
                                      np.sqrt(forecast_gjrgarch\
                                      .variance.iloc[-len(split_date):] / 100)))
         print('The RMSE value of GJR-GARCH models is {:.4f}'
               .format(rmse_gjr_garch))
The RMSE value of GJR-GARCH models is 0.0882
In [34]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_gjrgarch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-GJR-GARCH')
         plt.title('Volatility Prediction with GJR-GARCH', fontsize=12)
         plt.legend()
         plt.show()
EGARCH
Together with the GJR-GARCH model, the EGARCH model, proposed by Nelson (1991), is another tool for controlling for the effect of asymmetric announcements. Additionally, it is specified in logarithmic form, so there is no need to add restrictions to avoid negative volatility:

$$\log(\sigma_t^2) = \omega + \sum_{k=1}^{q} \beta_k \log(\sigma_{t-k}^2) + \sum_{k=1}^{p} \alpha_k \frac{|r_{t-k}|}{\sqrt{\sigma_{t-k}^2}} + \sum_{k=1}^{p} \gamma_k \frac{r_{t-k}}{\sqrt{\sigma_{t-k}^2}}$$

The main difference in the EGARCH equation is that the logarithm is taken of the variance on the left-hand side of the equation. This allows for the leverage effect, meaning that there exists a negative correlation between past asset returns and volatility. If $\gamma < 0$, it implies a leverage effect, and if $\gamma \neq 0$, that shows asymmetry in volatility.
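A small sketch of a simplified EGARCH(1, 1) recursion (again with made-up parameters) makes the logarithmic specification explicit; note that the variance obtained by exponentiating is positive by construction, regardless of the parameter signs:

```python
import numpy as np

def egarch_variance(returns, omega=0.0, alpha=0.2, gamma=-0.1, beta=0.95):
    """Conditional variance from a simplified EGARCH(1, 1) recursion (illustrative parameters)."""
    T = len(returns)
    log_sigma2 = np.zeros(T)
    log_sigma2[0] = np.log(np.var(returns))
    for t in range(1, T):
        z = returns[t - 1] / np.sqrt(np.exp(log_sigma2[t - 1]))  # standardized return
        # Size effect (alpha * |z|) plus sign/leverage effect (gamma * z)
        log_sigma2[t] = omega + beta * log_sigma2[t - 1] + alpha * np.abs(z) + gamma * z
    return np.exp(log_sigma2)  # always positive, no parameter restrictions needed

rng = np.random.default_rng(0)
r = rng.standard_normal(10)
print(egarch_variance(r))
```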
Following the same procedure we used previously, let’s model the volatility using the EGARCH model (resulting in Figure 4-6):
In [35]: bic_egarch = []

         for p in range(1, 5):
             for q in range(1, 5):
                 egarch = arch_model(ret, mean='zero', vol='EGARCH', p=p, q=q)\
                          .fit(disp='off')
                 bic_egarch.append(egarch.bic)
                 if egarch.bic == np.min(bic_egarch):
                     best_param = p, q
         egarch = arch_model(ret, mean='zero', vol='EGARCH',
                             p=best_param[0], q=best_param[1])\
                  .fit(disp='off')
         print(egarch.summary())
         forecast = egarch.forecast(start=split_date[0])
         forecast_egarch = forecast
                     Zero Mean - EGARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                     EGARCH   Log-Likelihood:               -3676.18
Distribution:                  Normal   AIC:                           7358.37
Method:            Maximum Likelihood   BIC:                           7376.30
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:19   Df Model:                            0
                                Volatility Model
=============================================================================
                 coef    std err          t      P>|t|         95.0% Conf. Int.
-----------------------------------------------------------------------------
omega      2.3596e-03  6.747e-03      0.350      0.727  [-1.086e-02, 1.558e-02]
alpha[1]       0.3266  3.427e-02      9.530  1.567e-21        [ 0.259, 0.394]
beta[1]        0.9456  1.153e-02     82.023      0.000        [ 0.923, 0.968]
=============================================================================
Covariance estimator: robust
In [36]: rmse_egarch = np.sqrt(mse(realized_vol[-n:] / 100,
                                   np.sqrt(forecast_egarch.variance\
                                   .iloc[-len(split_date):] / 100)))
         print('The RMSE value of EGARCH models is {:.4f}'.format(rmse_egarch))
The RMSE value of EGARCH models is 0.0904
In [37]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_egarch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-EGARCH')
         plt.title('Volatility Prediction with EGARCH', fontsize=12)
         plt.legend()
         plt.show()
Given the RMSE results shown in Table 4-1, the best and worst performing models are GARCH and EGARCH, respectively. But there are no big differences in the performance of the models we have used here. In particular, during bad news/good news announcements, the performances of EGARCH and GJR-GARCH might be different due to the asymmetry in the market.
Table 4-1. RMSE results of the classical volatility models

Model | RMSE |
---|---|
ARCH | 0.0896 |
GARCH | 0.0878 |
GJR-GARCH | 0.0882 |
EGARCH | 0.0904 |
Up to now, we have discussed the classical volatility models, but from this point on, we will see how ML and the Bayesian approach can be used to model volatility. In the context of ML, support vector machines and neural networks will be the first models to explore. Let’s get started.
Support Vector Regression: GARCH
Support vector machine (SVM) is a supervised learning algorithm that is applicable to both classification and regression. The aim of SVM is to find a line that separates two classes. It sounds easy, but here is the challenging part: there is an almost infinite number of lines that can be used to distinguish the classes. What we are looking for is the optimal line, by which the classes can be perfectly discriminated.
In linear algebra, the optimal line is called a hyperplane, which maximizes the distance to the points that are closest to the hyperplane but belong to different classes. The distance between these closest points (the support vectors) is known as the margin. So, in SVM, what we are trying to do is maximize the margin between the support vectors.
SVM for classification is known as support vector classification (SVC). Keeping all the characteristics of SVM, the method can also be applied to regression. Again, in regression, the aim is to find the hyperplane that minimizes the error and maximizes the margin. This method is called support vector regression (SVR) and, in this part, we will apply this method to the GARCH model. Combining these two models gets us SVR-GARCH.
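Before combining SVR with GARCH-type features, a tiny standalone sketch of scikit-learn's SVR on synthetic data may help fix ideas; epsilon controls the width of the margin tube and C the penalty for points falling outside it (the values and data here are arbitrary, for illustration only):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic one-feature regression problem (illustration only)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

# epsilon-insensitive tube of width 0.1, moderate regularization
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X, y)

print('number of support vectors:', len(svr.support_))
print('prediction at x=2.5:', svr.predict([[2.5]])[0])
```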
The following code shows the preparations before running SVR-GARCH in Python. The most crucial step here is to obtain the independent variables, which are realized volatility and the squares of historical returns:
In [38]: from sklearn.svm import SVR
         from scipy.stats import uniform as sp_rand
         from sklearn.model_selection import RandomizedSearchCV

In [39]: realized_vol = ret.rolling(5).std()
         realized_vol = pd.DataFrame(realized_vol)
         realized_vol.reset_index(drop=True, inplace=True)

In [40]: returns_svm = ret ** 2
         returns_svm = returns_svm.reset_index()
         del returns_svm['Date']

In [41]: X = pd.concat([realized_vol, returns_svm], axis=1, ignore_index=True)
         X = X[4:].copy()
         X = X.reset_index()
         X.drop('index', axis=1, inplace=True)

In [42]: realized_vol = realized_vol.dropna().reset_index()
         realized_vol.drop('index', axis=1, inplace=True)

In [43]: svr_poly = SVR(kernel='poly', degree=2)
         svr_lin = SVR(kernel='linear')
         svr_rbf = SVR(kernel='rbf')
- Computing realized volatility and assigning it to a new variable named realized_vol
- Creating new variables for each SVR kernel
Let’s run and see our first SVR-GARCH application with linear kernel (and produce Figure 4-7); we’ll use the RMSE metric to compare the applications:
In [44]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_lin, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_lin = clf.predict(X.iloc[-n:])

In [45]: predict_svr_lin = pd.DataFrame(predict_svr_lin)
         predict_svr_lin.index = ret.iloc[-n:].index

In [46]: rmse_svr = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                predict_svr_lin / 100))
         print('The RMSE value of SVR with Linear Kernel is {:.6f}'
               .format(rmse_svr))
The RMSE value of SVR with Linear Kernel is 0.000462

In [47]: realized_vol.index = ret.iloc[4:].index

In [48]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_lin / 100, label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Linear)', fontsize=12)
         plt.legend()
         plt.show()
- Identifying the hyperparameter space for tuning
- Applying hyperparameter tuning with RandomizedSearchCV
- Fitting SVR-GARCH with the linear kernel to the data
- Predicting the volatilities based on the last 252 observations and storing them in the predict_svr_lin variable
Figure 4-7 exhibits the predicted values and actual observation. By eyeballing it, we can tell that SVR-GARCH performs well. As you can guess, the linear kernel works fine if the dataset is linearly separable; it is also suggested by Occam’s razor.3 But what if the dataset isn’t linearly separable? Let’s continue with the radial basis function (RBF) and polynomial kernels. The former uses elliptical curves around the observations, and the latter, unlike the first two, focuses on the combinations of samples. Let’s now see how they work.
Let’s start with an SVR-GARCH application using the RBF kernel, a function that projects data into a new vector space. From a practical standpoint, SVR-GARCH application with different kernels is not a labor-intensive process; all we need to do is switch the kernel name, as shown in the following (and resulting in Figure 4-8):
In [49]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_rbf, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_rbf = clf.predict(X.iloc[-n:])

In [50]: predict_svr_rbf = pd.DataFrame(predict_svr_rbf)
         predict_svr_rbf.index = ret.iloc[-n:].index

In [51]: rmse_svr_rbf = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                    predict_svr_rbf / 100))
         print('The RMSE value of SVR with RBF Kernel is {:.6f}'
               .format(rmse_svr_rbf))
The RMSE value of SVR with RBF Kernel is 0.000970

In [52]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_rbf / 100, label='Volatility Prediction-SVR_GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (RBF)', fontsize=12)
         plt.legend()
         plt.show()
Both the RMSE score and the visualization suggest that SVR-GARCH with linear kernel outperforms SVR-GARCH with RBF kernel. The RMSEs of SVR-GARCH with linear and RBF kernels are 0.000462 and 0.000970, respectively. So SVR with linear kernel performs well.
Lastly, let’s try SVR-GARCH with the polynomial kernel. It will turn out that it has the highest RMSE (0.002386), implying that it is the worst-performing kernel among these three different applications. The predictive performance of SVR-GARCH with polynomial kernel can be found in Figure 4-9:
In [53]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_poly, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_poly = clf.predict(X.iloc[-n:])

In [54]: predict_svr_poly = pd.DataFrame(predict_svr_poly)
         predict_svr_poly.index = ret.iloc[-n:].index

In [55]: rmse_svr_poly = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                     predict_svr_poly / 100))
         print('The RMSE value of SVR with Polynomial Kernel is {:.6f}'
               .format(rmse_svr_poly))
The RMSE value of SVR with Polynomial Kernel is 0.002386

In [56]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_poly / 100,
                  label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Polynomial)',
                   fontsize=12)
         plt.legend()
         plt.show()
Neural Networks
Neural networks are the building block of deep learning. In an NN, data is processed in multiple stages to make a decision. Each neuron takes the result of a dot product as input and feeds it into an activation function to make a decision:

$$z = w^{\top} x + b$$

where b is the bias, w is the weight vector, and x is the input data.
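A neuron's computation can be written out in a few lines of NumPy; the weights, bias, and input below are arbitrary illustrative numbers, and the sigmoid is just one possible activation function:

```python
import numpy as np

x = np.array([0.5, -1.2, 0.3])   # input data
w = np.array([0.4, 0.1, -0.6])   # weights
b = 0.05                         # bias

z = w @ x + b                    # dot product plus bias
output = 1 / (1 + np.exp(-z))    # e.g., a sigmoid activation

print(z, output)
```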
During this process, input data is mathematically manipulated in various ways in hidden and output layers. Generally speaking, an NN has three types of layers:
- Input layers
- Hidden layers
- Output layers
Figure 4-10 can help to illustrate the relationships among layers.
The input layer includes raw data. In going from the input layer to the hidden layer, we learn coefficients. There may be one or more than one hidden layers depending on the network structure. The more hidden layers the network has, the more complicated it is. Hidden layers, located between input and output layers, perform nonlinear transformations via activation functions.
Finally, the output layer is the layer in which output is produced and decisions are made.
In ML, gradient descent is applied to find the optimum parameters that minimize the cost function, but employing gradient descent alone in an NN is not feasible due to the chain-like structure within the NN. Thus, a concept known as backpropagation is used to minimize the cost function. The idea of backpropagation rests on calculating the error between the observed and predicted output and then propagating this error backward to the hidden layers. So we move backward, and the main equation takes the form of:

$$\delta_j^l = \frac{\partial J}{\partial z_j^l}$$

where J is the cost function, z is the linear transformation, and $\delta$ represents the error. There is much more to say here, but to keep us on track we'll stop here. For those who want to dig more into the math behind NNs, please refer to Wilmott (2013) and Alpaydin (2020).
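To see backpropagation in miniature, the sketch below (a toy example with made-up numbers, not the book's code) differentiates a squared-error cost through a single sigmoid neuron using the chain rule, which is essentially what backpropagation does layer by layer:

```python
import numpy as np

# Toy data: one input, one target (arbitrary values)
x, y_true = 1.5, 0.2
w, b = 0.8, -0.1                      # initial parameters

# Forward pass
z = w * x + b                         # linear transformation
y_hat = 1 / (1 + np.exp(-z))          # sigmoid activation
J = 0.5 * (y_hat - y_true) ** 2       # squared-error cost

# Backward pass: chain rule dJ/dw = dJ/dy_hat * dy_hat/dz * dz/dw
dJ_dyhat = y_hat - y_true
dyhat_dz = y_hat * (1 - y_hat)
delta = dJ_dyhat * dyhat_dz           # the "error" term dJ/dz
dJ_dw = delta * x
dJ_db = delta

# One gradient-descent step
lr = 0.1
w -= lr * dJ_dw
b -= lr * dJ_db
print(J, dJ_dw, dJ_db)
```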
Now, we apply NN-based volatility prediction using the MLPRegressor
module from scikit-learn, even though we have various options to run NNs in Python.4 Given the NN structure we’ve introduced, the result follows:
In [57]: from sklearn.neural_network import MLPRegressor
         NN_vol = MLPRegressor(learning_rate_init=0.001, random_state=1)
         para_grid_NN = {'hidden_layer_sizes': [(100, 50), (50, 50), (10, 100)],
                         'max_iter': [500, 1000],
                         'alpha': [0.00005, 0.0005]}
         clf = RandomizedSearchCV(NN_vol, para_grid_NN)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         NN_predictions = clf.predict(X.iloc[-n:])
In [58]: NN_predictions = pd.DataFrame(NN_predictions)
         NN_predictions.index = ret.iloc[-n:].index

In [59]: rmse_NN = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                               NN_predictions / 100))
         print('The RMSE value of NN is {:.6f}'.format(rmse_NN))
The RMSE value of NN is 0.000583
In [60]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(NN_predictions / 100, label='Volatility Prediction-NN')
         plt.title('Volatility Prediction with Neural Network', fontsize=12)
         plt.legend()
         plt.show()
- Importing the MLPRegressor module
- Configuring the NN model with three candidate hidden-layer configurations and varying numbers of neurons
- Fitting the NN model to the training data5
- Predicting the volatilities based on the last 252 observations and storing them in the NN_predictions variable
Figure 4-11 shows the volatility prediction result based on the NN model. Despite its reasonable performance, we can play with the number of hidden neurons to generate a deep learning model. To do that, we can apply the Keras library, Python’s interface for artificial neural networks.
Now it's time to predict volatility using deep learning. Based on Keras, it is easy to configure the network structure. All we need to do is determine the number of neurons in each layer. Here, the numbers of neurons for the first and second hidden layers are 256 and 128, respectively. As volatility is a continuous variable, we have only one output neuron:
In [61]: import tensorflow as tf
         from tensorflow import keras
         from tensorflow.keras import layers

In [62]: model = keras.Sequential(
             [layers.Dense(256, activation="relu"),
              layers.Dense(128, activation="relu"),
              layers.Dense(1, activation="linear"),])

In [63]: model.compile(loss='mse', optimizer='rmsprop')
In [64]: epochs_trial = np.arange(100, 400, 4)
         batch_trial = np.arange(100, 400, 4)
         DL_pred = []
         DL_RMSE = []
         for i, j, k in zip(range(4), epochs_trial, batch_trial):
             model.fit(X.iloc[:-n].values,
                       realized_vol.iloc[1:-(n-1)].values.reshape(-1,),
                       batch_size=k, epochs=j, verbose=False)
             DL_predict = model.predict(np.asarray(X.iloc[-n:]))
             DL_RMSE.append(np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                        DL_predict.flatten() / 100)))
             DL_pred.append(DL_predict)
             print('DL_RMSE_{}:{:.6f}'.format(i + 1, DL_RMSE[i]))
DL_RMSE_1:0.000551
DL_RMSE_2:0.000714
DL_RMSE_3:0.000627
DL_RMSE_4:0.000739
In [65]: DL_predict = pd.DataFrame(DL_pred[DL_RMSE.index(min(DL_RMSE))])
         DL_predict.index = ret.iloc[-n:].index
In [66]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(DL_predict / 100, label='Volatility Prediction-DL')
         plt.title('Volatility Prediction with Deep Learning', fontsize=12)
         plt.legend()
         plt.show()
- Configuring the network structure by deciding the number of layers and neurons
- Compiling the model with loss and optimizer
- Deciding the epoch and batch size using np.arange
- Fitting the deep learning model
- Predicting the volatility based on the weights obtained from the training phase
- Calculating the RMSE score by flattening the predictions
It turns out that we get the minimum RMSE score when the number of epochs and the batch size are both 100. This shows that increasing the complexity of the model does not necessarily imply high predictive performance. The key is to find a sweet spot between complexity and predictive performance; otherwise, the model can easily overfit.
Figure 4-12 shows the volatility prediction result derived from the preceding code, and it implies that deep learning provides a strong tool for modeling volatility, too.
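One common way to guard against the overfitting risk just mentioned is to hold out a validation split and stop training once the validation loss stops improving. A hedged sketch with Keras' EarlyStopping callback, reusing the model and data names from the preceding cells (so it assumes those cells have been run), could look like this:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 10 epochs
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

model.fit(X.iloc[:-n].values,
          realized_vol.iloc[1:-(n - 1)].values.reshape(-1,),
          validation_split=0.2,     # hold out 20% of the training data
          batch_size=100, epochs=400,
          callbacks=[early_stop], verbose=False)
```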
The Bayesian Approach
The way we approach probability is of central importance in the sense that it distinguishes the classical (or Frequentist) and Bayesian approaches. According to the former, the relative frequency will converge to the true probability. However, a Bayesian application is based on the subjective interpretation. Unlike the Frequentists, Bayesian statisticians consider the probability distribution as uncertain, and it is revised as new information comes in.
Due to the different interpretation in the probability of these two approaches, likelihood—defined as the probability of an observed event given a set of parameters—is computed differently.
Starting from the joint density function, we can give the mathematical representation of the likelihood function:

$$\mathcal{L}(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Among the possible $\theta$ values, what we are trying to do is decide which one is more likely. Under the statistical model proposed by the likelihood function, the observed data is the most probable.
In fact, you are familiar with the method based on this approach, which is maximum likelihood estimation. Having defined the main difference between Bayesian and Frequentist approaches, it is time to delve more into Bayes’ theorem.
The Bayesian approach is based on conditional distribution, which states that probability gauges the degree of belief one has about an uncertain event. So the Bayesian application suggests a rule that can be used to update the beliefs that one holds in light of new information:
Bayesian estimation is used when we have some prior information regarding a parameter. For example, before looking at a sample to estimate the mean of a distribution, we may have some prior belief that it is close to 2, between 1 and 3. Such prior beliefs are especially important when we have a small sample. In such a case, we are interested in combining what the data tells us, namely, the value calculated from the sample, and our prior information.
Rachev et al., 2008
Similar to the Frequentist application, Bayesian estimation is based on the probability density $f(x \mid \theta)$. However, as we have discussed previously, Bayesian and Frequentist methods treat the parameter set $\theta$ differently. A Frequentist assumes $\theta$ to be fixed, whereas in a Bayesian setting, $\theta$ is taken as a random variable whose probability distribution is known as the prior density $\pi(\theta)$. Well, we have another unknown term, but no worries—it is easy to understand.
In light of this information, we can combine the likelihood with the prior density $\pi(\theta)$ and come up with the following formula. The prior is employed when we need to estimate the conditional distribution of the parameters given the observations:

$$\pi(\theta \mid x_1, x_2, \ldots, x_n) = \frac{\mathcal{L}(x_1, x_2, \ldots, x_n \mid \theta)\, \pi(\theta)}{f(x_1, x_2, \ldots, x_n)}$$

or

$$\pi(\theta \mid \text{data}) = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

where

- $\pi(\theta \mid \text{data})$ is the posterior density, which gives us information about the parameters given the observed data.
- $\mathcal{L}(\text{data} \mid \theta)$ is the likelihood function, which estimates the probability of the data given the parameters.
- $\pi(\theta)$ is the prior probability of the parameters. The prior is basically the initial belief about the estimates.
- Finally, $f(\text{data})$ is the evidence, which is used to update the prior.
Consequently, Bayes' theorem suggests that the posterior density is directly proportional to the prior and likelihood terms but inversely related to the evidence term. As the evidence is there only for scaling, we can describe this process as:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

where $\propto$ means "is proportional to."
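The proportionality statement can be made concrete with a small grid sketch (an illustrative coin-flip example, not from the book): multiplying a prior by a likelihood on a grid of parameter values and renormalizing gives the posterior, with no need to compute the evidence explicitly.

```python
import numpy as np
from scipy.stats import binom

# Grid of candidate values for an unknown success probability theta
theta = np.linspace(0.01, 0.99, 99)

# Prior: flat over the grid (every value equally likely a priori)
prior = np.ones_like(theta) / len(theta)

# Hypothetical data: 7 successes out of 10 trials
likelihood = binom.pmf(7, 10, theta)

# Posterior is proportional to likelihood x prior; normalize over the grid
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print('posterior mean of theta:', (theta * posterior).sum())
```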
Within this context, Bayes’ theorem sounds attractive, doesn’t it? Well, it does, but it comes with a cost, which is analytical intractability. Even if Bayes’ theorem is theoretically intuitive, it is, by and large, hard to solve analytically. This is the major drawback in wide applicability of Bayes’ theorem. However, the good news is that numerical methods provide solid methods to solve this probabilistic model.
Some methods proposed to deal with the computational issues in Bayes’ theorem provide solutions with approximation, including:
- Quadrature approximation
- Maximum a posteriori estimation (MAP) (discussed in Chapter 6)
- Grid approach
- Sampling-based approach:
  - Metropolis–Hastings
  - Gibbs sampler
  - No U-Turn sampler
Of these approaches, let us restrict our attention to the Metropolis–Hastings algorithm (M-H), which will be our method for modeling Bayes’ theorem. The M-H method rests on the Markov chain Monte Carlo (MCMC) method. So before moving forward, let’s talk about the MCMC method.
Markov Chain Monte Carlo
The Markov chain is a model used to describe the transition probabilities among states. A chain is called Markovian if the probability of the current state $s_t$ depends only on the most recent state $s_{t-1}$:

$$P(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1) = P(s_t \mid s_{t-1})$$

Thus, MCMC relies on the Markov chain to find the region of the parameter space with the highest posterior probability. As the sample size grows, the sampled parameter values approximate the posterior density:

$$\theta^{t} \xrightarrow{D} \pi(\theta \mid x)$$

where D refers to distributional approximation. Realized values of the parameter space can be used to make inferences about the posterior. In a nutshell, the MCMC method helps us gather IID samples from the posterior density so that we can calculate the posterior probability.
To illustrate this, we can refer to Figure 4-13. This figure shows the probability of moving from one state to another. For the sake of simplicity, we'll set the probability to be 0.2, indicating that the transition from "studying" to "travelling" has a probability of 0.2:
In [67]: import quantecon as qe
         from quantecon import MarkovChain
         import networkx as nx
         from pprint import pprint

In [68]: P = [[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]]

         mc = qe.MarkovChain(P, ('studying', 'travelling', 'sleeping'))
         mc.is_irreducible
Out[68]: True

In [69]: states = ['studying', 'travelling', 'sleeping']
         initial_probs = [0.5, 0.3, 0.6]
         state_space = pd.Series(initial_probs, index=states, name='states')

In [70]: q_df = pd.DataFrame(columns=states, index=states)
         q_df.loc[states[0]] = [0.5, 0.2, 0.3]
         q_df.loc[states[1]] = [0.2, 0.3, 0.5]
         q_df.loc[states[2]] = [0.2, 0.2, 0.6]

In [71]: def _get_markov_edges(Q):
             edges = {}
             for col in Q.columns:
                 for idx in Q.index:
                     edges[(idx, col)] = Q.loc[idx, col]
             return edges
         edges_wts = _get_markov_edges(q_df)
         pprint(edges_wts)
{('sleeping', 'sleeping'): 0.6,
 ('sleeping', 'studying'): 0.2,
 ('sleeping', 'travelling'): 0.2,
 ('studying', 'sleeping'): 0.3,
 ('studying', 'studying'): 0.5,
 ('studying', 'travelling'): 0.2,
 ('travelling', 'sleeping'): 0.5,
 ('travelling', 'studying'): 0.2,
 ('travelling', 'travelling'): 0.3}

In [72]: G = nx.MultiDiGraph()
         G.add_nodes_from(states)
         for k, v in edges_wts.items():
             tmp_origin, tmp_destination = k[0], k[1]
             G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
         pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
         nx.draw_networkx(G, pos)
         edge_labels = {(n1, n2): d['label'] for n1, n2, d in G.edges(data=True)}
         nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
         nx.drawing.nx_pydot.write_dot(G, 'mc_states.dot')
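As a side note, the long-run (stationary) distribution of such a chain can be obtained directly from the transition matrix. A short sketch using the same hypothetical probabilities as above:

```python
import numpy as np

# Transition matrix for the studying / travelling / sleeping example
P = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]])

# The stationary distribution pi satisfies pi = pi @ P; iterating the chain
# from any starting distribution converges to it
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P

print('stationary distribution (studying, travelling, sleeping):', pi.round(4))
```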
There are two common MCMC methods: M–H and Gibbs sampler. Here, we delve into the former.
Metropolis–Hastings
M-H allows us to have an efficient sampling procedure with two steps. First, we draw a sample from proposal density, then we decide either to accept or reject it.
Let $q(\theta^{*} \mid \theta^{t-1})$ be a proposal density and $\Theta$ be the parameter space. The entire M-H algorithm can be summarized as:
1. Select an initial value $\theta^{1}$ from the parameter space $\Theta$.
2. Select a new candidate parameter value $\theta^{*}$ from the proposal density, which can be, for the sake of simplicity, a Gaussian or uniform distribution.
3. Compute the following acceptance probability:

$$p_{\text{acc}}(\theta^{*} \mid \theta^{t-1}) = \min\left(1, \frac{p(\theta^{*} \mid x)\, q(\theta^{t-1} \mid \theta^{*})}{p(\theta^{t-1} \mid x)\, q(\theta^{*} \mid \theta^{t-1})}\right)$$

4. If $p_{\text{acc}}$ is greater than a sample value drawn from the uniform distribution U(0, 1), accept the candidate $\theta^{*}$; then repeat this process from step 2.
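Before turning to the library implementation, a from-scratch sketch of these four steps for a toy target (a standard normal posterior, chosen only for illustration) shows how little machinery M-H actually needs; the symmetric random-walk proposal makes the q terms cancel in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(theta):
    # Toy target: standard normal log-density (up to a constant)
    return -0.5 * theta ** 2

n_iter, scale = 5000, 1.0
theta = 0.0                       # step 1: initial value
samples = []

for _ in range(n_iter):
    theta_star = theta + scale * rng.standard_normal()   # step 2: symmetric proposal
    # step 3: acceptance probability; the symmetric proposal terms cancel
    p_acc = min(1.0, np.exp(log_posterior(theta_star) - log_posterior(theta)))
    # step 4: accept the candidate with probability p_acc
    if rng.uniform() < p_acc:
        theta = theta_star
    samples.append(theta)

samples = np.array(samples[500:])  # drop a short burn-in period
print('sample mean and std:', samples.mean().round(3), samples.std().round(3))
```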
Well, it appears intimidating, but don’t worry; we have built-in code in Python that makes the applicability of the M-H algorithm much easier. We use the PyFlux library to make use of Bayes’ theorem. Let’s apply the M-H algorithm to predict volatility:
In [73]: import pyflux as pf
         from scipy.stats import kurtosis

In [74]: model = pf.GARCH(ret.values, p=1, q=1)
         print(model.latent_variables)
         model.adjust_prior(1, pf.Normal())
         model.adjust_prior(2, pf.Normal())
         x = model.fit(method='M-H', iterations='1000')
         print(x.summary())

Index    Latent Variable     Prior     Prior Hyperparameters   V.I. Dist   Transform
======== =================== ========= ======================= =========== ==========
0        Vol Constant        Normal    mu0: 0, sigma0: 3       Normal      exp
1        q(1)                Normal    mu0: 0, sigma0: 0.5     Normal      logit
2        p(1)                Normal    mu0: 0, sigma0: 0.5     Normal      logit
3        Returns Constant    Normal    mu0: 0, sigma0: 3       Normal      None

Acceptance rate of Metropolis-Hastings is 0.0023
Acceptance rate of Metropolis-Hastings is 0.23925

Tuning complete! Now sampling.
Acceptance rate of Metropolis-Hastings is 0.239175

GARCH(1,1)
==================================================== ====================================================
Dependent Variable: Series                           Method: Metropolis Hastings
Start Date: 1                                        Unnormalized Log Posterior: -3635.1348
End Date: 2913                                       AIC: 7278.269645045323
Number of observations: 2913                         BIC: 7302.177400073161
==========================================================================================================
Latent Variable                    Median             Mean               95% Credibility Interval
================================== ================== ================== ================================
Vol Constant                       0.04               0.0398             (0.0315 | 0.0501)
q(1)                               0.1936             0.194              (0.1638 | 0.2251)
p(1)                               0.7736             0.7737             (0.7438 | 0.8026)
Returns Constant                   0.0866             0.0855             (0.0646 | 0.1038)
==========================================================================================================
None

In [75]: model.plot_z([1, 2])
         model.plot_fit(figsize=(15, 5))
         model.plot_ppc(T=kurtosis, nsims=1000)
- Configuring the GARCH model using the PyFlux library
- Printing the estimation of the latent variables (parameters)
- Adjusting the priors for the model's latent variables
- Fitting the model using the M-H process
- Plotting the latent variables
- Plotting the fitted model
- Plotting the histogram for the posterior check
It is worthwhile to visualize the results of what we have done so far for volatility prediction with a Bayesian-based GARCH Model.
Figure 4-14 exhibits the distribution of latent variables. Latent variable q gathers around 0.2, and the other latent variable, p, mostly takes values between 0.7 and 0.8.
Figure 4-15 indicates the demeaned volatility series and the GARCH prediction result based on the Bayesian approach.
Figure 4-16 visualizes the posterior predictions of the Bayesian model with the data so that we are able to detect systematic discrepancies, if any. The vertical line represents the test statistic, and it turns out the observed value is larger than that of our model.
After we are done with the training part, we are all set to move on to the next phase, which is prediction. Prediction analysis is done for the 252 steps ahead, and the RMSE is calculated given the realized volatility:
In [76]: bayesian_prediction = model.predict_is(n, fit_method='M-H')
Acceptance rate of Metropolis-Hastings is 0.11515
Acceptance rate of Metropolis-Hastings is 0.1787
Acceptance rate of Metropolis-Hastings is 0.2675

Tuning complete! Now sampling.
Acceptance rate of Metropolis-Hastings is 0.2579

In [77]: bayesian_RMSE = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                     bayesian_prediction.values / 100))
         print('The RMSE of Bayesian model is {:.6f}'.format(bayesian_RMSE))
The RMSE of Bayesian model is 0.004047

In [78]: bayesian_prediction.index = ret.iloc[-n:].index
Eventually, we are ready to observe the prediction result of the Bayesian approach, and the following code does it for us, generating Figure 4-17:
In [79]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(bayesian_prediction['Series'] / 100,
                  label='Volatility Prediction-Bayesian')
         plt.title('Volatility Prediction with M-H Approach', fontsize=12)
         plt.legend()
         plt.show()
Figure 4-17 visualizes the volatility prediction based on an M-H–based Bayesian approach, and it seems to overshoot toward the end of 2020. The overall performance of this method shows that it is not among the best methods.
Conclusion
Volatility prediction is a key to understanding the dynamics of the financial market in the sense that it helps us to gauge uncertainty. With that being said, it is used as input in many financial models, including risk models. These facts emphasize the importance of having accurate volatility prediction. Traditionally, parametric methods such as ARCH, GARCH, and their extensions have been extensively used, but these models suffer from being inflexible. To remedy this issue, data-driven models are promising, and this chapter attempted to make use of these models, namely, SVMs, NNs, and deep learning-based models. It turns out that the data-driven models outperform the parametric models.
In the next chapter, market risk, a core financial risk topic, will be discussed both from theoretical and empirical standpoints, and the ML models will be incorporated to further improve the estimation of this risk.
References
Articles cited in this chapter:
Books cited in this chapter:
1 Conditional variance means that volatility estimation is a function of the past values of asset returns.
2 For more information on these functions, see Andrew Ng’s lecture notes.
3 Occam’s razor, also known as law of parsimony, states that given a set of explanations, simpler explanation is the most plausible and likely one.
4 Of these alternatives, TensorFlow, PyTorch, and NeuroLab are the most prominent libraries.
5 For more detailed information, please see the MLPClassifier
documentation.