Appendix A. Deep Dives
In this section, we dive deep into a few technical areas that are worth understanding for completeness but are not essential.
Matrix Chain Rule
First up is an explanation of why we can substitute $W^T$ for $\frac{\partial \nu}{\partial u}(X)$ in the chain rule expression from Chapter 1.
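Remember that $L$ is literally:
$$L = \sigma(XW_{11}) + \sigma(XW_{12}) + \sigma(XW_{21}) + \sigma(XW_{22}) + \sigma(XW_{31}) + \sigma(XW_{32})$$
where this is shorthand for the fact that:
$$\sigma(XW_{11}) = \sigma(x_{11} \times w_{11} + x_{12} \times w_{21} + x_{13} \times w_{31})$$
$$\sigma(XW_{12}) = \sigma(x_{11} \times w_{12} + x_{12} \times w_{22} + x_{13} \times w_{32})$$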
and so on. Let’s zoom in on just one of these expressions. What would it look like if we took the partial derivative of, say, $\sigma(XW_{11})$ with respect to every element of $X$ (which is ultimately what we’ll want to do with all six components of $L$)?
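Well, since:
$$\sigma(XW_{11}) = \sigma(x_{11} \times w_{11} + x_{12} \times w_{21} + x_{13} \times w_{31})$$
it isn’t too hard to see that the partial derivative of this with respect to $x_{11}$, via a very simple application of the chain rule, is:
$$\frac{\partial \sigma}{\partial u}(XW_{11}) \times w_{11}$$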
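Since the only thing that $x_{11}$ is multiplied by in the $XW_{11}$ expression is $w_{11}$, the other terms in the sum contribute nothing to this derivative. The same reasoning gives the partial derivatives with respect to $x_{12}$ and $x_{13}$, with $w_{21}$ and $w_{31}$ in place of $w_{11}$. And because $XW_{11}$ involves only the first row of $X$, its partial derivative with respect to every element in the other rows is 0.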
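So, computing the partial derivative of $\sigma(XW_{11})$ with respect to all of the elements of $X$ gives us the following overall expression for $\frac{\partial \sigma(XW_{11})}{\partial X}$:
$$\frac{\partial \sigma(XW_{11})}{\partial X} = \begin{bmatrix} \frac{\partial \sigma}{\partial u}(XW_{11}) \times w_{11} & \frac{\partial \sigma}{\partial u}(XW_{11}) \times w_{21} & \frac{\partial \sigma}{\partial u}(XW_{11}) \times w_{31} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$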
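One way to sanity-check this result is to compare it against a finite-difference estimate. The sketch below is illustrative only and rests on a few assumptions: $\sigma$ is taken to be the sigmoid function, $X$ is 3×3 and $W$ is 3×2, and the numeric values and helper names (`xw11`, `analytic_grad`, `numeric_grad`) are made up for the example. It uses only the Python standard library.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sigmoid_deriv(u):
    # Derivative of the sigmoid: sigma(u) * (1 - sigma(u))
    s = sigmoid(u)
    return s * (1.0 - s)

# Hypothetical example values: X is 3x3, W is 3x2
X = [[0.5, -1.0, 2.0],
     [1.5, 0.3, -0.7],
     [-0.2, 0.8, 1.1]]
W = [[0.4, -0.6],
     [1.2, 0.9],
     [-0.5, 0.1]]

def xw11(X, W):
    # x11*w11 + x12*w21 + x13*w31 (first row of X, first column of W)
    return sum(X[0][k] * W[k][0] for k in range(3))

def f(X, W):
    return sigmoid(xw11(X, W))

def analytic_grad(X, W):
    # First row: dsigma/du(XW11) * (w11, w21, w31); all other rows are zero
    d = sigmoid_deriv(xw11(X, W))
    grad = [[0.0] * 3 for _ in range(3)]
    for k in range(3):
        grad[0][k] = d * W[k][0]
    return grad

def numeric_grad(X, W, eps=1e-6):
    # Central difference with respect to each element of X
    grad = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            X_plus = [row[:] for row in X]
            X_minus = [row[:] for row in X]
            X_plus[i][j] += eps
            X_minus[i][j] -= eps
            grad[i][j] = (f(X_plus, W) - f(X_minus, W)) / (2 * eps)
    return grad

analytic = analytic_grad(X, W)
numeric = numeric_grad(X, W)
max_diff = max(abs(analytic[i][j] - numeric[i][j])
               for i in range(3) for j in range(3))
print("largest absolute difference:", max_diff)
```

The two matrices should agree up to finite-difference error, and only the first row should be nonzero, since $XW_{11}$ does not depend on the second or third row of $X$.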