Chapter 4. Machine Learning-Based Volatility Prediction
The most critical feature of the conditional return distribution is arguably its second moment structure, which is empirically the dominant time-varying characteristic of the distribution. This fact has spurred an enormous literature on the modeling and forecasting of return volatility.
Andersen et al. (2003)
“Some concepts are easy to understand but hard to define. This also holds true for volatility.” This could be a quote from someone living before Markowitz, because his way of modeling volatility is very clear and intuitive. Markowitz proposed his celebrated portfolio theory, in which he defined volatility as standard deviation, and from then onward finance became more intertwined with mathematics.
Volatility is the backbone of finance in the sense that it not only provides an information signal to investors but also serves as an input to various financial models. What makes volatility so important? The answer lies in the importance of uncertainty, which is the main characteristic of financial modeling.
Increased integration of financial markets has led to prolonged uncertainty in those markets, which in turn stresses the importance of volatility, the degree to which the values of financial assets change. Volatility, used as a proxy for risk, is among the most important variables in many fields, including asset pricing and risk management. Its strong presence and latency make it all the more necessary to model. Volatility as a risk measure has taken on a key role in risk management following the Basel Accord that came into effect in 1996 (Karasan and Gaygisiz 2020).
A large and growing body of literature on the estimation of volatility has emerged after the ground-breaking study of Black (1976), including Andersen and Bollerslev (1997), Raju and Ghosh (2004), Dokuchaev (2014), and De Stefani et al. (2017). We are talking about a long tradition of volatility prediction using ARCH- and GARCH-type models, which have certain drawbacks that might cause failures in the presence of phenomena such as volatility clustering, information asymmetry, and so on. Even though these issues are addressed by different models, recent fluctuations in financial markets coupled with developments in ML have made researchers rethink volatility estimation.
In this chapter, our aim is to show how we can enhance predictive performance using ML-based models. We will visit various ML algorithms, namely support vector regression, neural networks, and deep learning, so that we can compare their predictive performance.
Modeling volatility amounts to modeling uncertainty so that we better understand and approach uncertainty, enabling us to have good enough approximations of the real world. To gauge the extent to which proposed models account for the real-world situation, we need to calculate the return volatility, which is also known as realized volatility. Realized volatility is the square root of realized variance, which is the sum of squared returns. Realized volatility is used to measure the performance of the volatility prediction methods. Here is the formula for return volatility:

$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(r_i - \mu)^2}$$

where $r_i$ and $\mu$ are the return and mean return, and n is the number of observations.
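To make the formula concrete, here is a minimal sketch (not from the book) that applies it to a handful of hypothetical daily returns and checks that it matches the sample standard deviation reported by NumPy; the chapter then computes a rolling version of the same quantity from real data:

```python
import numpy as np

# Hypothetical daily returns in percent (illustrative values only)
r = np.array([0.5, -1.2, 0.3, 0.8, -0.4])
n = len(r)
mu = r.mean()

# Realized volatility as defined above: square root of the bias-corrected
# average squared deviation from the mean return
realized_vol = np.sqrt(((r - mu) ** 2).sum() / (n - 1))

# NumPy's sample standard deviation (ddof=1) gives the same number
print(realized_vol, np.std(r, ddof=1))
```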
Let’s see how return volatility is computed in Python:
In [1]: import numpy as np
        from scipy.stats import norm
        import scipy.optimize as opt
        import yfinance as yf
        import pandas as pd
        import datetime
        import time
        from arch import arch_model
        import matplotlib.pyplot as plt
        from numba import jit
        from sklearn.metrics import mean_squared_error as mse
        import warnings
        warnings.filterwarnings('ignore')
In [2]: stocks = '^GSPC'
        start = datetime.datetime(2010, 1, 1)
        end = datetime.datetime(2021, 8, 1)
        s_p500 = yf.download(stocks, start=start, end=end, interval='1d')
[*********************100%***********************]  1 of 1 completed
In [3]: ret = 100 * (s_p500.pct_change()[1:]['Adj Close'])
        realized_vol = ret.rolling(5).std()
In [4]: plt.figure(figsize=(10, 6))
        plt.plot(realized_vol.index, realized_vol)
        plt.title('Realized Volatility- S&P-500')
        plt.ylabel('Volatility')
        plt.xlabel('Date')
        plt.show()
Figure 4-1 shows the realized volatility of S&P 500 over the period of 2010–2021. The most striking observation is the spikes around the COVID-19 pandemic.
The way volatility is estimated has an undeniable impact on the reliability and accuracy of the related analysis. So this chapter deals with both classical and ML-based volatility prediction techniques, with a view to showing the superior prediction performance of the ML-based models. To compare against the brand-new ML-based models, we start by estimating the classical volatility models. Some very well-known classical volatility models include, but are not limited to, the following:
- ARCH
- GARCH
- GJR-GARCH
- EGARCH
It’s time to dig into the classical volatility models. Let’s start off with the ARCH model.
ARCH Model
One of the early attempts to model volatility was proposed by Engle (1982) and is known as the ARCH model. The ARCH model is a univariate model and is based on historical asset returns. The ARCH(p) model has the following form:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k (r_{t-k})^2$$

where the mean model is:

$$r_t = \sigma_t \epsilon_t$$

where $\epsilon_t$ is assumed to be normally distributed. In this parametric model, we need to satisfy some assumptions to have strictly positive variance. In this respect, the following conditions should hold:

- $\omega > 0$
- $\alpha_k \geq 0$
All of these equations tell us that ARCH is a univariate and nonlinear model in which volatility is estimated with the square of past returns. One of the most distinctive features of ARCH is that it has the property of time-varying conditional variance1 so that ARCH is able to model the phenomenon known as volatility clustering—that is, large changes tend to be followed by large changes of either sign, and small changes tend to be followed by small changes, as described by Mandelbrot (1963). Hence, once an important announcement is made to the market, it might result in huge volatility.
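Before plotting the real data, a short simulation sketch (not part of the original code, with hypothetical parameter values) may help: it generates returns from an ARCH(1) process and shows that squared returns are autocorrelated even though the returns themselves are not, which is exactly the clustering described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ARCH(1) parameters (chosen for illustration only)
omega, alpha = 0.2, 0.7
T = 1000

sigma2 = np.zeros(T)
r = np.zeros(T)
sigma2[0] = omega / (1 - alpha)  # unconditional variance as a starting point

for t in range(1, T):
    # Conditional variance depends on the previous squared return
    sigma2[t] = omega + alpha * r[t - 1] ** 2
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

def lag1_autocorr(x):
    # Simple lag-1 autocorrelation
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print('lag-1 autocorrelation of returns:        ', round(lag1_autocorr(r), 3))
print('lag-1 autocorrelation of squared returns:', round(lag1_autocorr(r ** 2), 3))
```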
The following code block shows how to plot clustering and what it looks like:
In [5]: retv = ret.values

In [6]: plt.figure(figsize=(10, 6))
        plt.plot(s_p500.index[1:], ret)
        plt.title('Volatility clustering of S&P-500')
        plt.ylabel('Daily returns')
        plt.xlabel('Date')
        plt.show()
Similar to spikes in realized volatility, Figure 4-2 suggests some large movements, and, unsurprisingly, these ups and downs happen around important events such as the COVID-19 pandemic in mid-2020.
Despite its appealing features, such as simplicity, nonlinearity, and ease of adjustment for forecasting, the ARCH model has certain drawbacks:
- Equal response to positive and negative shocks
- Strong assumptions such as restrictions on parameters
- Possible misprediction due to slow adjustments to large movements
These drawbacks motivated researchers to work on extensions of the ARCH model, notably the GARCH model proposed by Bollerslev (1986) and Taylor (1986), which we will discuss shortly.
Now let's employ the ARCH model to predict volatility. First, let's generate our own Python code, and then compare it with a built-in function from the arch library to see the differences:
In [7]: n = 252
        split_date = ret.iloc[-n:].index
In [8]: sgm2 = ret.var()
        K = ret.kurtosis()
        alpha = (-3.0 * sgm2 + np.sqrt(9.0 * sgm2 ** 2 - 12.0 *
                                       (3.0 * sgm2 - K) * K)) / (6 * K)
        omega = (1 - alpha) * sgm2
        initial_parameters = [alpha, omega]
        omega, alpha
Out[8]: (0.6345749196895419, 0.46656704131150534)
In [9]: @jit(nopython=True, parallel=True)
        def arch_likelihood(initial_parameters, retv):
            omega = abs(initial_parameters[0])
            alpha = abs(initial_parameters[1])
            T = len(retv)
            logliks = 0
            sigma2 = np.zeros(T)
            sigma2[0] = np.var(retv)
            for t in range(1, T):
                sigma2[t] = omega + alpha * (retv[t - 1]) ** 2
            logliks = np.sum(0.5 * (np.log(sigma2) + retv ** 2 / sigma2))
            return logliks
In [10]: logliks = arch_likelihood(initial_parameters, retv)
         logliks
Out[10]: 1453.127184488521
In [11]: def opt_params(x0, retv):
             opt_result = opt.minimize(arch_likelihood, x0=x0, args=(retv),
                                       method='Nelder-Mead',
                                       options={'maxiter': 5000})
             params = opt_result.x
             print('\nResults of Nelder-Mead minimization\n{}\n{}'
                   .format(''.join(['-'] * 28), opt_result))
             print('\nResulting params = {}'.format(params))
             return params
In [12]: params = opt_params(initial_parameters, retv)

Results of Nelder-Mead minimization
----------------------------
 final_simplex: (array([[0.70168795, 0.39039044],
       [0.70163494, 0.3904423 ],
       [0.70163928, 0.39033154]]), array([1385.79241695, 1385.792417  ,
       1385.79241907]))
           fun: 1385.7924169507244
       message: 'Optimization terminated successfully.'
          nfev: 62
           nit: 33
        status: 0
       success: True
             x: array([0.70168795, 0.39039044])

Resulting params = [0.70168795 0.39039044]
In [13]: def arch_apply(ret):
             omega = params[0]
             alpha = params[1]
             T = len(ret)
             sigma2_arch = np.zeros(T + 1)
             sigma2_arch[0] = np.var(ret)
             for t in range(1, T):
                 sigma2_arch[t] = omega + alpha * ret[t - 1] ** 2
             return sigma2_arch

In [14]: sigma2_arch = arch_apply(ret)
- Defining the split location and assigning the split data to the split_date variable
- Calculating the variance of the S&P 500
- Calculating the kurtosis of the S&P 500
- Identifying the initial value for the slope coefficient
- Identifying the initial value for the constant term
- Using parallel processing to decrease the processing time
- Taking absolute values and assigning the initial values to the related variables
- Identifying the initial values of volatility
- Iterating the variance of the S&P 500
- Calculating the log-likelihood
- Minimizing the log-likelihood function
- Creating a variable params for the optimized parameters
Well, we modeled volatility via ARCH using our own optimization method and ARCH equation. But how about comparing it with the built-in Python code? This built-in function can be imported from the arch library and is extremely easy to apply. The result of the built-in function follows; it turns out that the two results are very similar to each other:
In [15]: arch = arch_model(ret, mean='zero', vol='ARCH', p=1).fit(disp='off')
         print(arch.summary())
                       Zero Mean - ARCH Model Results
=============================================================================
Dep. Variable:              Adj Close   R-squared:                      0.000
Mean Model:                 Zero Mean   Adj. R-squared:                 0.000
Vol Model:                       ARCH   Log-Likelihood:              -4063.63
Distribution:                  Normal   AIC:                          8131.25
Method:            Maximum Likelihood   BIC:                          8143.21
                                        No. Observations:                2914
Date:                Mon, Sep 13 2021   Df Residuals:                    2914
Time:                        21:56:56   Df Model:                           0
                             Volatility Model
========================================================================
                 coef    std err          t      P>|t|  95.0% Conf. Int.
------------------------------------------------------------------------
omega          0.7018  5.006e-02     14.018  1.214e-44   [ 0.604, 0.800]
alpha[1]       0.3910  7.016e-02      5.573  2.506e-08   [ 0.253, 0.529]
========================================================================
Covariance estimator: robust
Although developing our own code is always helpful and improves our understanding, it does not necessarily mean that there's no need to use built-in functions or libraries. Rather, these functions make our lives easier in terms of efficiency and ease of use.
All we need is to create a for loop and define a proper information criterion. Here, we'll choose the Bayesian information criterion (BIC) as the model selection method and use it to select the lag. The reason BIC is used is that, as long as we have large enough samples, BIC is a reliable tool for model selection, as per Burnham and Anderson (2002 and 2004). Now, we iterate the ARCH model over lags 1 through 4:
In [16]: bic_arch = []

         for p in range(1, 5):
             arch = arch_model(ret, mean='zero', vol='ARCH', p=p)\
                     .fit(disp='off')
             bic_arch.append(arch.bic)
             if arch.bic == np.min(bic_arch):
                 best_param = p
         arch = arch_model(ret, mean='zero', vol='ARCH', p=best_param)\
                 .fit(disp='off')
         print(arch.summary())
         forecast = arch.forecast(start=split_date[0])
         forecast_arch = forecast
                       Zero Mean - ARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                       ARCH   Log-Likelihood:               -3712.38
Distribution:                  Normal   AIC:                           7434.75
Method:            Maximum Likelihood   BIC:                           7464.64
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:56:58   Df Model:                            0
                              Volatility Model
==========================================================================
                 coef    std err          t      P>|t|    95.0% Conf. Int.
--------------------------------------------------------------------------
omega          0.2798  2.584e-02     10.826  2.580e-27     [ 0.229, 0.330]
alpha[1]       0.1519  3.460e-02      4.390  1.136e-05  [8.406e-02, 0.220]
alpha[2]       0.2329  3.620e-02      6.433  1.249e-10     [ 0.162, 0.304]
alpha[3]       0.1917  3.707e-02      5.170  2.337e-07     [ 0.119, 0.264]
alpha[4]       0.1922  4.158e-02      4.623  3.780e-06     [ 0.111, 0.274]
==========================================================================
Covariance estimator: robust
In [17]: rmse_arch = np.sqrt(mse(realized_vol[-n:] / 100,
                                 np.sqrt(forecast_arch\
                                 .variance.iloc[-len(split_date):] / 100)))
         print('The RMSE value of ARCH model is {:.4f}'.format(rmse_arch))
The RMSE value of ARCH model is 0.0896
In [18]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_arch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-ARCH')
         plt.title('Volatility Prediction with ARCH', fontsize=12)
         plt.legend()
         plt.show()
- Iterating the ARCH parameter p over the specified interval
- Running the ARCH model with different p values
- Finding the minimum BIC score to select the best model
- Running the ARCH model with the best p value
- Forecasting the volatility based on the optimized ARCH model
- Calculating the root mean square error (RMSE) score
The result of volatility prediction based on our first model is shown in Figure 4-3.
GARCH Model
The GARCH model is an extension of the ARCH model that incorporates lagged conditional variance. In other words, ARCH is improved by adding lagged conditional variance terms, so that the conditional variance follows an autoregressive moving average structure with p lagged squared returns and q lagged conditional variances. GARCH(p, q) can be formulated as:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k r_{t-k}^2 + \sum_{k=1}^{q} \beta_k \sigma_{t-k}^2$$

where $\omega$, $\alpha_k$, and $\beta_k$ are parameters to be estimated and p and q are the maximum lags in the model. To have a consistent GARCH process, the following conditions should hold:

- $\omega > 0$ and $\alpha_k, \beta_k \geq 0$
- $\sum_{k} \alpha_k + \sum_{k} \beta_k < 1$
The ARCH model is unable to capture the influence of historical innovations. However, as a more parsimonious model, the GARCH model can account for the change in historical innovations because GARCH can be expressed as an infinite-order ARCH. Let's see how GARCH(1, 1) can be shown as an infinite-order ARCH:

$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$

Then replace $\sigma_{t-1}^2$ by $\omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2$:

$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta(\omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2) = \omega(1 + \beta) + \alpha r_{t-1}^2 + \alpha\beta r_{t-2}^2 + \beta^2 \sigma_{t-2}^2$$

Now, let's substitute $\sigma_{t-2}^2$ with $\omega + \alpha r_{t-3}^2 + \beta \sigma_{t-3}^2$ and keep iterating, so that we end up with:

$$\sigma_t^2 = \omega(1 + \beta + \beta^2 + \ldots) + \alpha \sum_{k=1}^{\infty} \beta^{k-1} r_{t-k}^2$$
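The derivation above can be checked numerically. The following sketch (with made-up parameters and simulated returns, purely for illustration) compares the GARCH(1, 1) recursion with its truncated infinite-order ARCH representation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical GARCH(1, 1) parameters and a random return history (illustration only)
omega, alpha, beta = 0.1, 0.15, 0.8
r = rng.standard_normal(500)

# Conditional variance via the usual GARCH(1, 1) recursion
sigma2 = np.zeros(len(r))
sigma2[0] = omega / (1 - alpha - beta)
for t in range(1, len(r)):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]

# The same quantity via the (truncated) infinite-order ARCH representation:
# sigma2_t = omega / (1 - beta) + alpha * sum_k beta^(k-1) * r_{t-k}^2
t = len(r) - 1
K = 200  # truncation point; beta**K is negligible here
infinite_arch = omega / (1 - beta) + alpha * sum(
    beta ** (k - 1) * r[t - k] ** 2 for k in range(1, K + 1))

print(sigma2[t], infinite_arch)  # the two values are nearly identical
```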
Similar to the ARCH model, there is more than one way to model volatility using GARCH in Python. Let us try to develop our own Python-based code using the optimization technique first. In what follows, the arch library will then be used to predict volatility:
In [19]: a0 = 0.0001
         sgm2 = ret.var()
         K = ret.kurtosis()
         h = 1 - alpha / sgm2
         alpha = np.sqrt(K * (1 - h ** 2) / (2.0 * (K + 3)))
         beta = np.abs(h - omega)
         omega = (1 - omega) * sgm2
         initial_parameters = np.array([omega, alpha, beta])
         print('Initial parameters for omega, alpha, and beta are \n{}\n{}\n{}'
               .format(omega, alpha, beta))
Initial parameters for omega, alpha, and beta are
0.43471178001576827
0.512827280537482
0.02677799855546381
In [20]: retv = ret.values

In [21]: @jit(nopython=True, parallel=True)
         def garch_likelihood(initial_parameters, retv):
             omega = initial_parameters[0]
             alpha = initial_parameters[1]
             beta = initial_parameters[2]
             T = len(retv)
             logliks = 0
             sigma2 = np.zeros(T)
             sigma2[0] = np.var(retv)
             for t in range(1, T):
                 sigma2[t] = omega + alpha * (retv[t - 1]) ** 2 + \
                             beta * sigma2[t - 1]
             logliks = np.sum(0.5 * (np.log(sigma2) + retv ** 2 / sigma2))
             return logliks
In [22]: logliks = garch_likelihood(initial_parameters, retv)
         print('The Log likelihood is {:.4f}'.format(logliks))
The Log likelihood is 1387.7215
In [23]: def garch_constraint(initial_parameters):
             alpha = initial_parameters[0]
             gamma = initial_parameters[1]
             beta = initial_parameters[2]
             return np.array([1 - alpha - beta])

In [24]: bounds = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
In [25]: def opt_paramsG(initial_parameters, retv):
             opt_result = opt.minimize(garch_likelihood,
                                       x0=initial_parameters,
                                       constraints=np.array([1 - alpha - beta]),
                                       bounds=bounds, args=(retv),
                                       method='Nelder-Mead',
                                       options={'maxiter': 5000})
             params = opt_result.x
             print('\nResults of Nelder-Mead minimization\n{}\n{}'
                   .format('-' * 35, opt_result))
             print('-' * 35)
             print('\nResulting parameters = {}'.format(params))
             return params
In [26]: params = opt_paramsG(initial_parameters, retv)

Results of Nelder-Mead minimization
-----------------------------------
 final_simplex: (array([[0.03918956, 0.17370549, 0.78991502],
       [0.03920507, 0.17374466, 0.78987403],
       [0.03916671, 0.17377319, 0.78993078],
       [0.03917324, 0.17364595, 0.78998753]]), array([979.87109624,
       979.8710967 , 979.87109865, 979.8711147 ]))
           fun: 979.8710962352685
       message: 'Optimization terminated successfully.'
          nfev: 178
           nit: 102
        status: 0
       success: True
             x: array([0.03918956, 0.17370549, 0.78991502])
-----------------------------------

Resulting parameters = [0.03918956 0.17370549 0.78991502]
In [27]: def garch_apply(ret):
             omega = params[0]
             alpha = params[1]
             beta = params[2]
             T = len(ret)
             sigma2 = np.zeros(T + 1)
             sigma2[0] = np.var(ret)
             for t in range(1, T):
                 sigma2[t] = omega + alpha * ret[t - 1] ** 2 + \
                             beta * sigma2[t - 1]
             return sigma2
The parameters we get from our own GARCH code are approximately:

- $\omega$ = 0.0392
- $\alpha$ = 0.1737
- $\beta$ = 0.7899
Now, let’s try it with the built-in Python function:
In [28]: garch = arch_model(ret, mean='zero', vol='GARCH', p=1, o=0, q=1)\
                 .fit(disp='off')
         print(garch.summary())
                      Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:               -3657.62
Distribution:                  Normal   AIC:                           7321.23
Method:            Maximum Likelihood   BIC:                           7339.16
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:08   Df Model:                            0
                               Volatility Model
============================================================================
                 coef    std err          t      P>|t|        95.0% Conf. Int.
----------------------------------------------------------------------------
omega          0.0392  8.422e-03      4.652  3.280e-06  [2.268e-02, 5.569e-02]
alpha[1]       0.1738  2.275e-02      7.637  2.225e-14       [ 0.129, 0.218]
beta[1]        0.7899  2.275e-02     34.715 4.607e-264       [ 0.745, 0.835]
============================================================================
Covariance estimator: robust
The built-in function confirms that we did a great job, as the parameters obtained via the built-in code are almost the same as ours, so we have learned how to code GARCH and ARCH models to predict volatility.
It’s apparent that it is easy to work with GARCH(1, 1), but how do we know that the parameters are the optimum ones? Let’s decide the optimum parameter set given the lowest BIC value (and in doing so, generate Figure 4-4):
In [29]: bic_garch = []

         for p in range(1, 5):
             for q in range(1, 5):
                 garch = arch_model(ret, mean='zero', vol='GARCH', p=p, o=0, q=q)\
                         .fit(disp='off')
                 bic_garch.append(garch.bic)
                 if garch.bic == np.min(bic_garch):
                     best_param = p, q
         garch = arch_model(ret, mean='zero', vol='GARCH',
                            p=best_param[0], o=0, q=best_param[1])\
                 .fit(disp='off')
         print(garch.summary())
         forecast = garch.forecast(start=split_date[0])
         forecast_garch = forecast
                      Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:               -3657.62
Distribution:                  Normal   AIC:                           7321.23
Method:            Maximum Likelihood   BIC:                           7339.16
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:10   Df Model:                            0
                               Volatility Model
============================================================================
                 coef    std err          t      P>|t|        95.0% Conf. Int.
----------------------------------------------------------------------------
omega          0.0392  8.422e-03      4.652  3.280e-06  [2.268e-02, 5.569e-02]
alpha[1]       0.1738  2.275e-02      7.637  2.225e-14       [ 0.129, 0.218]
beta[1]        0.7899  2.275e-02     34.715 4.607e-264       [ 0.745, 0.835]
============================================================================
Covariance estimator: robust
In [30]: rmse_garch = np.sqrt(mse(realized_vol[-n:] / 100,
                                  np.sqrt(forecast_garch\
                                  .variance.iloc[-len(split_date):] / 100)))
         print('The RMSE value of GARCH model is {:.4f}'.format(rmse_garch))
The RMSE value of GARCH model is 0.0878
In [31]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_garch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-GARCH')
         plt.title('Volatility Prediction with GARCH', fontsize=12)
         plt.legend()
         plt.show()
The volatility of returns is well fitted by the GARCH model, partly because it captures volatility clustering and partly because it does not assume that the returns are independent, which allows it to account for the leptokurtic property of returns. However, despite these useful properties and its intuitiveness, GARCH is not able to model the asymmetric response of volatility to shocks (Karasan and Gaygisiz 2020). To remedy this issue, GJR-GARCH was proposed by Glosten, Jagannathan, and Runkle (1993).
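The leptokurtosis point can be illustrated with a quick sketch (hypothetical parameters, not the fitted ones from the book): returns simulated from a GARCH(1, 1) process with Gaussian innovations exhibit positive excess kurtosis, i.e., fatter tails than the normal distribution driving them.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)

# Hypothetical GARCH(1, 1) parameters (illustration only)
omega, alpha, beta = 0.1, 0.15, 0.8
T = 100_000

sigma2 = np.zeros(T)
r = np.zeros(T)
sigma2[0] = omega / (1 - alpha - beta)

for t in range(1, T):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Excess kurtosis of the simulated returns is positive, even though
# the innovations are Gaussian (excess kurtosis of 0)
print('excess kurtosis of simulated GARCH returns:', round(kurtosis(r), 2))
```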
GJR-GARCH
The GJR-GARCH model performs well in modeling the asymmetric effects of announcements, in the sense that bad news has a larger impact than good news. In other words, in the presence of asymmetry, the distribution of losses has a fatter tail than the distribution of gains. The equation of the model includes one more parameter, $\gamma$, and it takes the following form:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} (\alpha_k + \gamma I_{t-k}) r_{t-k}^2 + \sum_{k=1}^{q} \beta_k \sigma_{t-k}^2$$

where $I_{t-k} = 1$ if $r_{t-k} < 0$ and 0 otherwise, and $\gamma$ controls for the asymmetry of the announcements:

- If $\gamma = 0$, the response to a past shock is the same regardless of its sign.
- If $\gamma > 0$, the response to a past negative shock is stronger than to a positive one.
- If $\gamma < 0$, the response to a past positive shock is stronger than to a negative one.
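As a minimal sketch of how the asymmetry term works (with hypothetical parameter values, not the fitted ones), the following code computes a GJR-GARCH(1, 1) conditional variance path in which the extra $\gamma$ term only activates after negative returns:

```python
import numpy as np

def gjr_garch_variance(returns, omega=0.05, alpha=0.05, gamma=0.25, beta=0.8):
    """Conditional variance of a GJR-GARCH(1, 1) with illustrative parameters."""
    T = len(returns)
    sigma2 = np.zeros(T)
    sigma2[0] = np.var(returns)
    for t in range(1, T):
        # Indicator is 1 only when the previous return is negative, so bad news
        # adds gamma * r^2 on top of the symmetric alpha term
        indicator = 1.0 if returns[t - 1] < 0 else 0.0
        sigma2[t] = (omega
                     + (alpha + gamma * indicator) * returns[t - 1] ** 2
                     + beta * sigma2[t - 1])
    return sigma2

rng = np.random.default_rng(0)
r = rng.standard_normal(10)
print(gjr_garch_variance(r))
```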
Let’s now run the GJR-GARCH model by finding the optimum parameter values using BIC, and producing Figure 4-5 as a result:
In [32]: bic_gjr_garch = []

         for p in range(1, 5):
             for q in range(1, 5):
                 gjrgarch = arch_model(ret, mean='zero', p=p, o=1, q=q)\
                            .fit(disp='off')
                 bic_gjr_garch.append(gjrgarch.bic)
                 if gjrgarch.bic == np.min(bic_gjr_garch):
                     best_param = p, q
         gjrgarch = arch_model(ret, mean='zero', p=best_param[0], o=1,
                               q=best_param[1]).fit(disp='off')
         print(gjrgarch.summary())
         forecast = gjrgarch.forecast(start=split_date[0])
         forecast_gjrgarch = forecast
                    Zero Mean - GJR-GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                  GJR-GARCH   Log-Likelihood:               -3593.36
Distribution:                  Normal   AIC:                           7194.73
Method:            Maximum Likelihood   BIC:                           7218.64
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:14   Df Model:                            0
                                Volatility Model
=============================================================================
                 coef    std err          t      P>|t|         95.0% Conf. Int.
-----------------------------------------------------------------------------
omega          0.0431  7.770e-03      5.542  2.983e-08   [2.784e-02, 5.829e-02]
alpha[1]       0.0386  3.060e-02      1.261      0.207  [-2.139e-02, 9.855e-02]
gamma[1]       0.2806  4.818e-02      5.824  5.740e-09        [ 0.186, 0.375]
beta[1]        0.7907  2.702e-02     29.263 3.029e-188        [ 0.738, 0.844]
=============================================================================
Covariance estimator: robust
In [33]: rmse_gjr_garch = np.sqrt(mse(realized_vol[-n:] / 100,
                                      np.sqrt(forecast_gjrgarch\
                                      .variance.iloc[-len(split_date):] / 100)))
         print('The RMSE value of GJR-GARCH models is {:.4f}'
               .format(rmse_gjr_garch))
The RMSE value of GJR-GARCH models is 0.0882
In [34]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_gjrgarch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-GJR-GARCH')
         plt.title('Volatility Prediction with GJR-GARCH', fontsize=12)
         plt.legend()
         plt.show()
EGARCH
Together with the GJR-GARCH model, the EGARCH model, proposed by Nelson (1991), is another tool for controlling for the effect of asymmetric announcements. Additionally, it is specified in logarithmic form, so there is no need to add restrictions to avoid negative volatility:

$$\log(\sigma_t^2) = \omega + \sum_{k=1}^{q} \beta_k \log(\sigma_{t-k}^2) + \sum_{k=1}^{p} \alpha_k \frac{|r_{t-k}|}{\sqrt{\sigma_{t-k}^2}} + \sum_{k=1}^{p} \gamma_k \frac{r_{t-k}}{\sqrt{\sigma_{t-k}^2}}$$

The main difference in the EGARCH equation is that the logarithm is taken of the variance on the left-hand side of the equation. This allows for the leverage effect, meaning that there exists a negative correlation between past asset returns and volatility. If $\gamma < 0$, it implies a leverage effect, and if $\gamma \neq 0$, that shows asymmetry in volatility.
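A small sketch of a simplified EGARCH(1, 1) recursion (again with made-up parameters) makes the logarithmic specification explicit; note that the variance obtained by exponentiating is positive by construction, regardless of the parameter signs:

```python
import numpy as np

def egarch_variance(returns, omega=0.0, alpha=0.2, gamma=-0.1, beta=0.95):
    """Conditional variance from a simplified EGARCH(1, 1) recursion (illustrative parameters)."""
    T = len(returns)
    log_sigma2 = np.zeros(T)
    log_sigma2[0] = np.log(np.var(returns))
    for t in range(1, T):
        z = returns[t - 1] / np.sqrt(np.exp(log_sigma2[t - 1]))  # standardized return
        # Size effect (alpha * |z|) plus sign/leverage effect (gamma * z)
        log_sigma2[t] = omega + beta * log_sigma2[t - 1] + alpha * np.abs(z) + gamma * z
    return np.exp(log_sigma2)  # always positive, no parameter restrictions needed

rng = np.random.default_rng(0)
r = rng.standard_normal(10)
print(egarch_variance(r))
```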
Following the same procedure we used previously, let’s model the volatility using the EGARCH model (resulting in Figure 4-6):
In [35]: bic_egarch = []

         for p in range(1, 5):
             for q in range(1, 5):
                 egarch = arch_model(ret, mean='zero', vol='EGARCH', p=p, q=q)\
                          .fit(disp='off')
                 bic_egarch.append(egarch.bic)
                 if egarch.bic == np.min(bic_egarch):
                     best_param = p, q
         egarch = arch_model(ret, mean='zero', vol='EGARCH',
                             p=best_param[0], q=best_param[1])\
                  .fit(disp='off')
         print(egarch.summary())
         forecast = egarch.forecast(start=split_date[0])
         forecast_egarch = forecast
                     Zero Mean - EGARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                     EGARCH   Log-Likelihood:               -3676.18
Distribution:                  Normal   AIC:                           7358.37
Method:            Maximum Likelihood   BIC:                           7376.30
                                        No. Observations:                 2914
Date:                Mon, Sep 13 2021   Df Residuals:                     2914
Time:                        21:57:19   Df Model:                            0
                                Volatility Model
=============================================================================
                 coef    std err          t      P>|t|         95.0% Conf. Int.
-----------------------------------------------------------------------------
omega      2.3596e-03  6.747e-03      0.350      0.727  [-1.086e-02, 1.558e-02]
alpha[1]       0.3266  3.427e-02      9.530  1.567e-21        [ 0.259, 0.394]
beta[1]        0.9456  1.153e-02     82.023      0.000        [ 0.923, 0.968]
=============================================================================
Covariance estimator: robust
In [36]: rmse_egarch = np.sqrt(mse(realized_vol[-n:] / 100,
                                   np.sqrt(forecast_egarch.variance\
                                   .iloc[-len(split_date):] / 100)))
         print('The RMSE value of EGARCH models is {:.4f}'.format(rmse_egarch))
The RMSE value of EGARCH models is 0.0904
In [37]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(forecast_egarch.variance.iloc[-len(split_date):] / 100,
                  label='Volatility Prediction-EGARCH')
         plt.title('Volatility Prediction with EGARCH', fontsize=12)
         plt.legend()
         plt.show()
Given the RMSE results shown in Table 4-1, the best and worst performing models are GARCH and EGARCH, respectively. But there are no big differences in the performance of the models we have used here. In particular, during bad news/good news announcements, the performances of EGARCH and GJR-GARCH might be different due to the asymmetry in the market.
Table 4-1. RMSE results of the classical volatility models

Model | RMSE |
---|---|
ARCH | 0.0896 |
GARCH | 0.0878 |
GJR-GARCH | 0.0882 |
EGARCH | 0.0904 |
Up to now, we have discussed the classical volatility models, but from this point on, we will see how ML and the Bayesian approach can be used to model volatility. In the context of ML, support vector machines and neural networks will be the first models to explore. Let’s get started.
Support Vector Regression: GARCH
Support vector machine (SVM) is a supervised learning algorithm that is applicable to both classification and regression. The aim of SVM is to find a line that separates two classes. It sounds easy, but here is the challenging part: there is an almost infinite number of lines that can be used to distinguish the classes. What we are looking for is the optimal line, by which the classes can be perfectly discriminated.
In linear algebra, the optimal line is called a hyperplane, which maximizes the distance to the points that are closest to the hyperplane but belong to different classes. The distance between these closest points (the support vectors) is known as the margin. So, in SVM, what we are trying to do is maximize the margin between the support vectors.
SVM for classification is known as support vector classification (SVC). Keeping all the characteristics of SVM, the method can also be applied to regression. Again, in regression, the aim is to find the hyperplane that minimizes the error and maximizes the margin. This method is called support vector regression (SVR) and, in this part, we will apply this method to the GARCH model. Combining these two models gets us SVR-GARCH.
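Before combining SVR with GARCH-type features, a tiny standalone sketch of scikit-learn's SVR on synthetic data may help fix ideas; epsilon controls the width of the margin tube and C the penalty for points falling outside it (the values and data here are arbitrary, for illustration only):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic one-feature regression problem (illustration only)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

# epsilon-insensitive tube of width 0.1, moderate regularization
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X, y)

print('number of support vectors:', len(svr.support_))
print('prediction at x=2.5:', svr.predict([[2.5]])[0])
```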
The following code shows the preparations before running SVR-GARCH in Python. The most crucial step here is to obtain the independent variables, which are realized volatility and the squares of historical returns:
In [38]: from sklearn.svm import SVR
         from scipy.stats import uniform as sp_rand
         from sklearn.model_selection import RandomizedSearchCV

In [39]: realized_vol = ret.rolling(5).std()
         realized_vol = pd.DataFrame(realized_vol)
         realized_vol.reset_index(drop=True, inplace=True)

In [40]: returns_svm = ret ** 2
         returns_svm = returns_svm.reset_index()
         del returns_svm['Date']

In [41]: X = pd.concat([realized_vol, returns_svm], axis=1, ignore_index=True)
         X = X[4:].copy()
         X = X.reset_index()
         X.drop('index', axis=1, inplace=True)

In [42]: realized_vol = realized_vol.dropna().reset_index()
         realized_vol.drop('index', axis=1, inplace=True)

In [43]: svr_poly = SVR(kernel='poly', degree=2)
         svr_lin = SVR(kernel='linear')
         svr_rbf = SVR(kernel='rbf')
- Computing realized volatility and assigning it to a new variable named realized_vol
- Creating new variables for each SVR kernel
Let’s run and see our first SVR-GARCH application with linear kernel (and produce Figure 4-7); we’ll use the RMSE metric to compare the applications:
In [44]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_lin, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_lin = clf.predict(X.iloc[-n:])

In [45]: predict_svr_lin = pd.DataFrame(predict_svr_lin)
         predict_svr_lin.index = ret.iloc[-n:].index

In [46]: rmse_svr = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                predict_svr_lin / 100))
         print('The RMSE value of SVR with Linear Kernel is {:.6f}'
               .format(rmse_svr))
The RMSE value of SVR with Linear Kernel is 0.000462

In [47]: realized_vol.index = ret.iloc[4:].index

In [48]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_lin / 100, label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Linear)', fontsize=12)
         plt.legend()
         plt.show()
- Identifying the hyperparameter space for tuning
- Applying hyperparameter tuning with RandomizedSearchCV
- Fitting SVR-GARCH with the linear kernel to the data
- Predicting the volatilities based on the last 252 observations and storing them in the predict_svr_lin variable
Figure 4-7 exhibits the predicted values and actual observation. By eyeballing it, we can tell that SVR-GARCH performs well. As you can guess, the linear kernel works fine if the dataset is linearly separable; it is also suggested by Occam’s razor.3 But what if the dataset isn’t linearly separable? Let’s continue with the radial basis function (RBF) and polynomial kernels. The former uses elliptical curves around the observations, and the latter, unlike the first two, focuses on the combinations of samples. Let’s now see how they work.
Let’s start with an SVR-GARCH application using the RBF kernel, a function that projects data into a new vector space. From a practical standpoint, SVR-GARCH application with different kernels is not a labor-intensive process; all we need to do is switch the kernel name, as shown in the following (and resulting in Figure 4-8):
In [49]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_rbf, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_rbf = clf.predict(X.iloc[-n:])

In [50]: predict_svr_rbf = pd.DataFrame(predict_svr_rbf)
         predict_svr_rbf.index = ret.iloc[-n:].index

In [51]: rmse_svr_rbf = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                    predict_svr_rbf / 100))
         print('The RMSE value of SVR with RBF Kernel is {:.6f}'
               .format(rmse_svr_rbf))
The RMSE value of SVR with RBF Kernel is 0.000970

In [52]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_rbf / 100, label='Volatility Prediction-SVR_GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (RBF)', fontsize=12)
         plt.legend()
         plt.show()
Both the RMSE score and the visualization suggest that SVR-GARCH with linear kernel outperforms SVR-GARCH with RBF kernel. The RMSEs of SVR-GARCH with linear and RBF kernels are 0.000462 and 0.000970, respectively. So SVR with linear kernel performs well.
Lastly, let’s try SVR-GARCH with the polynomial kernel. It will turn out that it has the highest RMSE (0.002386), implying that it is the worst-performing kernel among these three different applications. The predictive performance of SVR-GARCH with polynomial kernel can be found in Figure 4-9:
In [53]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_poly, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_poly = clf.predict(X.iloc[-n:])

In [54]: predict_svr_poly = pd.DataFrame(predict_svr_poly)
         predict_svr_poly.index = ret.iloc[-n:].index

In [55]: rmse_svr_poly = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                     predict_svr_poly / 100))
         print('The RMSE value of SVR with Polynomial Kernel is {:.6f}'
               .format(rmse_svr_poly))
The RMSE value of SVR with Polynomial Kernel is 0.002386

In [56]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_poly / 100,
                  label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Polynomial)',
                   fontsize=12)
         plt.legend()
         plt.show()
Neural Networks
Neural networks are the building block of deep learning. In an NN, data is processed in multiple stages to make a decision. Each neuron takes the result of a dot product as input and feeds it into an activation function to make a decision:

$$z = w^{\top} x + b$$

where b is the bias, w is the weight vector, and x is the input data.
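A neuron's computation can be written out in a few lines of NumPy; the weights, bias, and input below are arbitrary illustrative numbers, and the sigmoid is just one possible activation function:

```python
import numpy as np

x = np.array([0.5, -1.2, 0.3])   # input data
w = np.array([0.4, 0.1, -0.6])   # weights
b = 0.05                         # bias

z = w @ x + b                    # dot product plus bias
output = 1 / (1 + np.exp(-z))    # e.g., a sigmoid activation

print(z, output)
```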
During this process, input data is mathematically manipulated in various ways in hidden and output layers. Generally speaking, an NN has three types of layers:
- Input layers
- Hidden layers
- Output layers
Figure 4-10 can help to illustrate the relationships among layers.
The input layer includes raw data. In going from the input layer to the hidden layer, we learn coefficients. There may be one or more than one hidden layers depending on the network structure. The more hidden layers the network has, the more complicated it is. Hidden layers, located between input and output layers, perform nonlinear transformations via activation functions.
Finally, the output layer is the layer in which output is produced and decisions are made.
In ML, gradient descent is applied to find the optimum parameters that minimize the cost function, but employing gradient descent alone in an NN is not feasible due to the chain-like structure within the NN. Thus, a concept known as backpropagation is used to minimize the cost function. The idea of backpropagation rests on calculating the error between the observed and predicted output and then propagating this error backward to the hidden layers. So we move backward, and the main equation takes the form of:

$$\delta_j^l = \frac{\partial J}{\partial z_j^l}$$

where J is the cost function, z is the linear transformation, and $\delta$ represents the error. There is much more to say here, but to keep us on track we'll stop here. For those who want to dig more into the math behind NNs, please refer to Wilmott (2013) and Alpaydin (2020).
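To see backpropagation in miniature, the sketch below (a toy example with made-up numbers, not the book's code) differentiates a squared-error cost through a single sigmoid neuron using the chain rule, which is essentially what backpropagation does layer by layer:

```python
import numpy as np

# Toy data: one input, one target (arbitrary values)
x, y_true = 1.5, 0.2
w, b = 0.8, -0.1                      # initial parameters

# Forward pass
z = w * x + b                         # linear transformation
y_hat = 1 / (1 + np.exp(-z))          # sigmoid activation
J = 0.5 * (y_hat - y_true) ** 2       # squared-error cost

# Backward pass: chain rule dJ/dw = dJ/dy_hat * dy_hat/dz * dz/dw
dJ_dyhat = y_hat - y_true
dyhat_dz = y_hat * (1 - y_hat)
delta = dJ_dyhat * dyhat_dz           # the "error" term dJ/dz
dJ_dw = delta * x
dJ_db = delta

# One gradient-descent step
lr = 0.1
w -= lr * dJ_dw
b -= lr * dJ_db
print(J, dJ_dw, dJ_db)
```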
Now, we apply NN-based volatility prediction using the MLPRegressor
module from scikit-learn, even though we have various options to run NNs in Python.4 Given the NN structure we’ve introduced, the result follows:
In [57]: from sklearn.neural_network import MLPRegressor
         NN_vol = MLPRegressor(learning_rate_init=0.001, random_state=1)
         para_grid_NN = {'hidden_layer_sizes': [(100, 50), (50, 50), (10, 100)],
                         'max_iter': [500, 1000],
                         'alpha': [0.00005, 0.0005]}
         clf = RandomizedSearchCV(NN_vol, para_grid_NN)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         NN_predictions = clf.predict(X.iloc[-n:])
In [58]: NN_predictions = pd.DataFrame(NN_predictions)
         NN_predictions.index = ret.iloc[-n:].index

In [59]: rmse_NN = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                               NN_predictions / 100))
         print('The RMSE value of NN is {:.6f}'.format(rmse_NN))
The RMSE value of NN is 0.000583
In [60]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(NN_predictions / 100, label='Volatility Prediction-NN')
         plt.title('Volatility Prediction with Neural Network', fontsize=12)
         plt.legend()
         plt.show()
- Importing the MLPRegressor module
- Configuring the NN model with three candidate hidden-layer configurations and varying numbers of neurons
- Fitting the NN model to the training data5
- Predicting the volatilities based on the last 252 observations and storing them in the NN_predictions variable
Figure 4-11 shows the volatility prediction result based on the NN model. Despite its reasonable performance, we can play with the number of hidden neurons to generate a deep learning model. To do that, we can apply the Keras library, Python’s interface for artificial neural networks.
Now it's time to predict volatility using deep learning. Based on Keras, it is easy to configure the network structure. All we need to do is determine the number of neurons in each layer. Here, the numbers of neurons for the first and second hidden layers are 256 and 128, respectively. As volatility is a continuous variable, we have only one output neuron:
In [61]: import tensorflow as tf
         from tensorflow import keras
         from tensorflow.keras import layers

In [62]: model = keras.Sequential(
             [layers.Dense(256, activation="relu"),
              layers.Dense(128, activation="relu"),
              layers.Dense(1, activation="linear"),])

In [63]: model.compile(loss='mse', optimizer='rmsprop')
In [64]: epochs_trial = np.arange(100, 400, 4)
         batch_trial = np.arange(100, 400, 4)
         DL_pred = []
         DL_RMSE = []
         for i, j, k in zip(range(4), epochs_trial, batch_trial):
             model.fit(X.iloc[:-n].values,
                       realized_vol.iloc[1:-(n-1)].values.reshape(-1,),
                       batch_size=k, epochs=j, verbose=False)
             DL_predict = model.predict(np.asarray(X.iloc[-n:]))
             DL_RMSE.append(np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                        DL_predict.flatten() / 100)))
             DL_pred.append(DL_predict)
             print('DL_RMSE_{}:{:.6f}'.format(i + 1, DL_RMSE[i]))
DL_RMSE_1:0.000551
DL_RMSE_2:0.000714
DL_RMSE_3:0.000627
DL_RMSE_4:0.000739
In [65]: DL_predict = pd.DataFrame(DL_pred[DL_RMSE.index(min(DL_RMSE))])
         DL_predict.index = ret.iloc[-n:].index
In [66]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(DL_predict / 100, label='Volatility Prediction-DL')
         plt.title('Volatility Prediction with Deep Learning', fontsize=12)
         plt.legend()
         plt.show()
- Configuring the network structure by deciding the number of layers and neurons
- Compiling the model with loss and optimizer
- Deciding the epoch and batch size using np.arange
- Fitting the deep learning model
- Predicting the volatility based on the weights obtained from the training phase
- Calculating the RMSE score by flattening the predictions
It turns out that we get the minimum RMSE score when the number of epochs and the batch size are both 100. This shows that increasing the complexity of the model does not necessarily imply high predictive performance. The key is to find a sweet spot between complexity and predictive performance; otherwise, the model can easily overfit.
Figure 4-12 shows the volatility prediction result derived from the preceding code, and it implies that deep learning provides a strong tool for modeling volatility, too.
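One common way to guard against the overfitting risk just mentioned is to hold out a validation split and stop training once the validation loss stops improving. A hedged sketch with Keras' EarlyStopping callback, reusing the model and data names from the preceding cells (so it assumes those cells have been run), could look like this:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 10 epochs
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

model.fit(X.iloc[:-n].values,
          realized_vol.iloc[1:-(n - 1)].values.reshape(-1,),
          validation_split=0.2,     # hold out 20% of the training data
          batch_size=100, epochs=400,
          callbacks=[early_stop], verbose=False)
```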
The Bayesian Approach
The way we approach probability is of central importance in the sense that it distinguishes the classical (or Frequentist) and Bayesian approaches. According to the former, the relative frequency will converge to the true probability. However, a Bayesian application is based on the subjective interpretation. Unlike the Frequentists, Bayesian statisticians consider the probability distribution as uncertain, and it is revised as new information comes in.
Due to the different interpretation in the probability of these two approaches, likelihood—defined as the probability of an observed event given a set of parameters—is computed differently.
Starting from the joint density function, we can give the mathematical representation of the likelihood function:

$$\mathcal{L}(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Among the possible $\theta$ values, what we are trying to do is decide which one is more likely. Under the statistical model proposed by the likelihood function, the observed data is the most probable.
In fact, you are familiar with the method based on this approach, which is maximum likelihood estimation. Having defined the main difference between Bayesian and Frequentist approaches, it is time to delve more into Bayes’ theorem.
The Bayesian approach is based on conditional distribution, which states that probability gauges the degree of belief one has about an uncertain event. So the Bayesian application suggests a rule that can be used to update the beliefs that one holds in light of new information:
Bayesian estimation is used when we have some prior information regarding a parameter. For example, before looking at a sample to estimate the mean of a distribution, we may have some prior belief that it is close to 2, between 1 and 3. Such prior beliefs are especially important when we have a small sample. In such a case, we are interested in combining what the data tells us, namely, the value calculated from the sample, and our prior information.
Rachev et al., 2008
Similar to the Frequentist application, Bayesian estimation is based on the probability density $f(x \mid \theta)$. However, as we have discussed previously, Bayesian and Frequentist methods treat the parameter set $\theta$ differently. A Frequentist assumes $\theta$ to be fixed, whereas in a Bayesian setting, $\theta$ is taken as a random variable whose probability distribution is known as the prior density $\pi(\theta)$. Well, we have another unknown term, but no worries—it is easy to understand.
In light of this information, we can combine the likelihood with the prior density $\pi(\theta)$ and come up with the following formula. The prior is employed when we need to estimate the conditional distribution of the parameters given the observations:

$$\pi(\theta \mid x_1, x_2, \ldots, x_n) = \frac{\mathcal{L}(x_1, x_2, \ldots, x_n \mid \theta)\, \pi(\theta)}{f(x_1, x_2, \ldots, x_n)}$$

or

$$\pi(\theta \mid \text{data}) = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

where

- $\pi(\theta \mid \text{data})$ is the posterior density, which gives us information about the parameters given the observed data.
- $\mathcal{L}(\text{data} \mid \theta)$ is the likelihood function, which estimates the probability of the data given the parameters.
- $\pi(\theta)$ is the prior probability of the parameters. The prior is basically the initial belief about the estimates.
- Finally, $f(\text{data})$ is the evidence, which is used to update the prior.
Consequently, Bayes' theorem suggests that the posterior density is directly proportional to the prior and likelihood terms but inversely related to the evidence term. As the evidence is there only for scaling, we can describe this process as:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

where $\propto$ means "is proportional to."
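The proportionality statement can be made concrete with a small grid sketch (an illustrative coin-flip example, not from the book): multiplying a prior by a likelihood on a grid of parameter values and renormalizing gives the posterior, with no need to compute the evidence explicitly.

```python
import numpy as np
from scipy.stats import binom

# Grid of candidate values for an unknown success probability theta
theta = np.linspace(0.01, 0.99, 99)

# Prior: flat over the grid (every value equally likely a priori)
prior = np.ones_like(theta) / len(theta)

# Hypothetical data: 7 successes out of 10 trials
likelihood = binom.pmf(7, 10, theta)

# Posterior is proportional to likelihood x prior; normalize over the grid
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print('posterior mean of theta:', (theta * posterior).sum())
```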
Within this context, Bayes’ theorem sounds attractive, doesn’t it? Well, it does, but it comes with a cost, which is analytical intractability. Even if Bayes’ theorem is theoretically intuitive, it is, by and large, hard to solve analytically. This is the major drawback in wide applicability of Bayes’ theorem. However, the good news is that numerical methods provide solid methods to solve this probabilistic model.
Some methods proposed to deal with the computational issues in Bayes’ theorem provide solutions with approximation, including:
- Quadrature approximation
- Maximum a posteriori estimation (MAP) (discussed in Chapter 6)
- Grid approach
- Sampling-based approach:
  - Metropolis–Hastings
  - Gibbs sampler
  - No U-Turn sampler
Of these approaches, let us restrict our attention to the Metropolis–Hastings algorithm (M-H), which will be our method for modeling Bayes’ theorem. The M-H method rests on the Markov chain Monte Carlo (MCMC) method. So before moving forward, let’s talk about the MCMC method.
Markov Chain Monte Carlo
The Markov chain is a model used to describe the transition probabilities among states. A chain is called Markovian if the probability of the current state $s_t$ depends only on the most recent state $s_{t-1}$:

$$P(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1) = P(s_t \mid s_{t-1})$$

Thus, MCMC relies on the Markov chain to find the region of the parameter space with the highest posterior probability. As the sample size grows, the sampled parameter values approximate the posterior density:

$$\theta^{t} \xrightarrow{D} \pi(\theta \mid x)$$

where D refers to distributional approximation. Realized values of the parameter space can be used to make inferences about the posterior. In a nutshell, the MCMC method helps us gather IID samples from the posterior density so that we can calculate the posterior probability.
To illustrate this, we can refer to Figure 4-13. This figure shows the probability of moving from one state to another. For the sake of simplicity, we'll set the probability to be 0.2, indicating that the transition from "studying" to "travelling" has a probability of 0.2:
In [67]: import quantecon as qe
         from quantecon import MarkovChain
         import networkx as nx
         from pprint import pprint

In [68]: P = [[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]]

         mc = qe.MarkovChain(P, ('studying', 'travelling', 'sleeping'))
         mc.is_irreducible
Out[68]: True

In [69]: states = ['studying', 'travelling', 'sleeping']
         initial_probs = [0.5, 0.3, 0.6]
         state_space = pd.Series(initial_probs, index=states, name='states')

In [70]: q_df = pd.DataFrame(columns=states, index=states)
         q_df.loc[states[0]] = [0.5, 0.2, 0.3]
         q_df.loc[states[1]] = [0.2, 0.3, 0.5]
         q_df.loc[states[2]] = [0.2, 0.2, 0.6]

In [71]: def _get_markov_edges(Q):
             edges = {}
             for col in Q.columns:
                 for idx in Q.index:
                     edges[(idx, col)] = Q.loc[idx, col]
             return edges
         edges_wts = _get_markov_edges(q_df)
         pprint(edges_wts)
{('sleeping', 'sleeping'): 0.6,
 ('sleeping', 'studying'): 0.2,
 ('sleeping', 'travelling'): 0.2,
 ('studying', 'sleeping'): 0.3,
 ('studying', 'studying'): 0.5,
 ('studying', 'travelling'): 0.2,
 ('travelling', 'sleeping'): 0.5,
 ('travelling', 'studying'): 0.2,
 ('travelling', 'travelling'): 0.3}

In [72]: G = nx.MultiDiGraph()
         G.add_nodes_from(states)
         for k, v in edges_wts.items():
             tmp_origin, tmp_destination = k[0], k[1]
             G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
         pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
         nx.draw_networkx(G, pos)
         edge_labels = {(n1, n2): d['label'] for n1, n2, d in G.edges(data=True)}
         nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
         nx.drawing.nx_pydot.write_dot(G, 'mc_states.dot')
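As a side note, the long-run (stationary) distribution of such a chain can be obtained directly from the transition matrix. A short sketch using the same hypothetical probabilities as above:

```python
import numpy as np

# Transition matrix for the studying / travelling / sleeping example
P = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]])

# The stationary distribution pi satisfies pi = pi @ P; iterating the chain
# from any starting distribution converges to it
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P

print('stationary distribution (studying, travelling, sleeping):', pi.round(4))
```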
There are two common MCMC methods: M–H and Gibbs sampler. Here, we delve into the former.
Metropolis–Hastings
M-H allows us to have an efficient sampling procedure with two steps. First, we draw a sample from proposal density, then we decide either to accept or reject it.
Let $q(\theta^{*} \mid \theta^{t-1})$ be a proposal density and $\Theta$ be the parameter space. The entire M-H algorithm can be summarized as:
1. Select an initial value $\theta^{1}$ from the parameter space $\Theta$.
2. Select a new candidate parameter value $\theta^{*}$ from the proposal density, which can be, for the sake of simplicity, a Gaussian or uniform distribution.
3. Compute the following acceptance probability:

$$p_{\text{acc}}(\theta^{*} \mid \theta^{t-1}) = \min\left(1, \frac{p(\theta^{*} \mid x)\, q(\theta^{t-1} \mid \theta^{*})}{p(\theta^{t-1} \mid x)\, q(\theta^{*} \mid \theta^{t-1})}\right)$$

4. If $p_{\text{acc}}$ is greater than a sample value drawn from the uniform distribution U(0, 1), accept the candidate $\theta^{*}$; then repeat this process from step 2.
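Before turning to the library implementation, a from-scratch sketch of these four steps for a toy target (a standard normal posterior, chosen only for illustration) shows how little machinery M-H actually needs; the symmetric random-walk proposal makes the q terms cancel in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(theta):
    # Toy target: standard normal log-density (up to a constant)
    return -0.5 * theta ** 2

n_iter, scale = 5000, 1.0
theta = 0.0                       # step 1: initial value
samples = []

for _ in range(n_iter):
    theta_star = theta + scale * rng.standard_normal()   # step 2: symmetric proposal
    # step 3: acceptance probability; the symmetric proposal terms cancel
    p_acc = min(1.0, np.exp(log_posterior(theta_star) - log_posterior(theta)))
    # step 4: accept the candidate with probability p_acc
    if rng.uniform() < p_acc:
        theta = theta_star
    samples.append(theta)

samples = np.array(samples[500:])  # drop a short burn-in period
print('sample mean and std:', samples.mean().round(3), samples.std().round(3))
```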
Well, it appears intimidating, but don’t worry; we have built-in code in Python that makes the applicability of the M-H algorithm much easier. We use the PyFlux library to make use of Bayes’ theorem. Let’s apply the M-H algorithm to predict volatility:
In [73]: import pyflux as pf
         from scipy.stats import kurtosis

In [74]: model = pf.GARCH(ret.values, p=1, q=1)
         print(model.latent_variables)
         model.adjust_prior(1, pf.Normal())
         model.adjust_prior(2, pf.Normal())
         x = model.fit(method='M-H', iterations='1000')
         print(x.summary())

Index    Latent Variable     Prior     Prior Hyperparameters   V.I. Dist   Transform
======== =================== ========= ======================= =========== ==========
0        Vol Constant        Normal    mu0: 0, sigma0: 3       Normal      exp
1        q(1)                Normal    mu0: 0, sigma0: 0.5     Normal      logit
2        p(1)                Normal    mu0: 0, sigma0: 0.5     Normal      logit
3        Returns Constant    Normal    mu0: 0, sigma0: 3       Normal      None

Acceptance rate of Metropolis-Hastings is 0.0023
Acceptance rate of Metropolis-Hastings is 0.23925

Tuning complete! Now sampling.
Acceptance rate of Metropolis-Hastings is 0.239175

GARCH(1,1)
==================================================== ====================================================
Dependent Variable: Series                           Method: Metropolis Hastings
Start Date: 1                                        Unnormalized Log Posterior: -3635.1348
End Date: 2913                                       AIC: 7278.269645045323
Number of observations: 2913                         BIC: 7302.177400073161
==========================================================================================================
Latent Variable                    Median             Mean               95% Credibility Interval
================================== ================== ================== ================================
Vol Constant                       0.04               0.0398             (0.0315 | 0.0501)
q(1)                               0.1936             0.194              (0.1638 | 0.2251)
p(1)                               0.7736             0.7737             (0.7438 | 0.8026)
Returns Constant                   0.0866             0.0855             (0.0646 | 0.1038)
==========================================================================================================
None

In [75]: model.plot_z([1, 2])
         model.plot_fit(figsize=(15, 5))
         model.plot_ppc(T=kurtosis, nsims=1000)
- Configuring the GARCH model using the PyFlux library
- Printing the estimation of the latent variables (parameters)
- Adjusting the priors for the model's latent variables
- Fitting the model using the M-H process
- Plotting the latent variables
- Plotting the fitted model
- Plotting the histogram for the posterior check
It is worthwhile to visualize the results of what we have done so far for volatility prediction with a Bayesian-based GARCH Model.
Figure 4-14 exhibits the distribution of latent variables. Latent variable q gathers around 0.2, and the other latent variable, p, mostly takes values between 0.7 and 0.8.
Figure 4-15 indicates the demeaned volatility series and the GARCH prediction result based on the Bayesian approach.
Figure 4-16 visualizes the posterior predictions of the Bayesian model with the data so that we are able to detect systematic discrepancies, if any. The vertical line represents the test statistic, and it turns out the observed value is larger than that of our model.
After we are done with the training part, we are all set to move on to the next phase, which is prediction. Prediction analysis is done for the 252 steps ahead, and the RMSE is calculated given the realized volatility:
In [76]: bayesian_prediction = model.predict_is(n, fit_method='M-H')
Acceptance rate of Metropolis-Hastings is 0.11515
Acceptance rate of Metropolis-Hastings is 0.1787
Acceptance rate of Metropolis-Hastings is 0.2675

Tuning complete! Now sampling.
Acceptance rate of Metropolis-Hastings is 0.2579

In [77]: bayesian_RMSE = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                                     bayesian_prediction.values / 100))
         print('The RMSE of Bayesian model is {:.6f}'.format(bayesian_RMSE))
The RMSE of Bayesian model is 0.004047

In [78]: bayesian_prediction.index = ret.iloc[-n:].index
Eventually, we are ready to observe the prediction result of the Bayesian approach, and the following code does it for us, generating Figure 4-17:
In [79]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(bayesian_prediction['Series'] / 100,
                  label='Volatility Prediction-Bayesian')
         plt.title('Volatility Prediction with M-H Approach', fontsize=12)
         plt.legend()
         plt.show()
Figure 4-17 visualizes the volatility prediction based on an M-H–based Bayesian approach, and it seems to overshoot toward the end of 2020. The overall performance of this method shows that it is not among the best methods.
Conclusion
Volatility prediction is a key to understanding the dynamics of the financial market in the sense that it helps us to gauge uncertainty. With that being said, it is used as input in many financial models, including risk models. These facts emphasize the importance of having accurate volatility prediction. Traditionally, parametric methods such as ARCH, GARCH, and their extensions have been extensively used, but these models suffer from being inflexible. To remedy this issue, data-driven models are promising, and this chapter attempted to make use of these models, namely, SVMs, NNs, and deep learning-based models. It turns out that the data-driven models outperform the parametric models.
In the next chapter, market risk, a core financial risk topic, will be discussed both from theoretical and empirical standpoints, and the ML models will be incorporated to further improve the estimation of this risk.
References
Articles cited in this chapter:
Books cited in this chapter:
1 Conditional variance means that volatility estimation is a function of the past values of asset returns.
2 For more information on these functions, see Andrew Ng’s lecture notes.
3 Occam’s razor, also known as law of parsimony, states that given a set of explanations, simpler explanation is the most plausible and likely one.
4 Of these alternatives, TensorFlow, PyTorch, and NeuroLab are the most prominent libraries.
5 For more detailed information, please see the MLPClassifier
documentation.