Математическое исчисление, необходимое для глубокого обучения

Матричное исчисление, необходимое для глубокого обучения

Теренция Парр и Джереми Говард

(Мы преподаем в MS University of San Francisco в программе Data Science и проводим другие гнусные проекты. Возможно, вы знаете Terence как создателя генератора парсеров ANTLR.

Версия для печати (Этот HTML-код был создан из разметки с использованием bookish)

Реферат

Настоящая статья представляет собой попытку объяснить все необходимые математические расчеты, чтобы понять, как тренироваться глубокие нейронные сети. Мы не предполагаем, что математические знания превышают то, что вы изучили в исчислении 1, и предоставляете ссылки, которые помогут вам обновить необходимую математику там, где это необходимо. Обратите внимание, что вы не должны понимать этот материал, прежде чем начинать учиться обучать и использовать глубокое обучение на практике; скорее, этот материал предназначен для тех, кто уже знаком с основами нейронных сетей и желает углубить свое понимание основной математики. Не беспокойтесь, если вы застряли в какой-то момент на пути — просто вернитесь назад и перечитайте предыдущий раздел, и попробуйте записать и проработать некоторые примеры. И если вы все еще застряли, мы будем рады ответить на ваши вопросы в категории Theory на forums.fast.ai. Примечание : В конце статьи имеется ссылочный раздел, в котором суммируются все основные правила и терминология основных матричных матриц.

Содержание статьи

1 Введение
2 Отзыв: правила производной скалярной
3 Введение в векторное исчисление и частные производные
4 Матричное исчисление
5 The gradient of neuron activation
6 The gradient of the neural network loss function
- 6.1 The gradient with respect to the weights
- 6.2 The derivative with respect to the bias
7 Summary
8 Matrix Calculus Reference
9 Notation
10 Resources

Введение

Большинство из нас изучали исчисление в школе, но производные являются важной частью машинного обучения, особенно глубоких нейронных сетей, которые обучаются путем оптимизации функции потерь. Возьмите бумагу для машинного обучения или документацию библиотеки, такую как PyTorch, и исчисление приходит в вашу жизнь, как далекие родственники во время праздников. И появляется не просто старое скалярное исчисление — вам нужно дифференциальное матричное исчисление свадьба дробовика линейной алгебры и многомерное исчисление.

Ну … может быть, нужно — это не правильное слово; Курсы Джереми показывают, как стать практиком глубокого обучения мирового уровня с минимальным уровнем скалярного исчисления, благодаря использованию автоматической дифференциации, встроенной в современные библиотеки глубокого обучения. Но если вы действительно хотите понять, что происходит под капотом этих библиотек, и загляните в научные статьи, обсуждая последние достижения в методах обучения модели, вам нужно будет понять некоторые биты поля матричного исчисления.

Например, активация одного вычислительного блока в нейронной сети обычно вычисляется с использованием точечного произведения (из линейной алгебры) вектора веса края w с входным вектором x ] плюс скалярное смещение (порог): . Функция называется аффинной функцией устройства за которой следует выпрямленная линейная единица, которая фиксирует отрицательные значения до нуля: . Такая вычислительная единица иногда упоминается как «искусственный нейрон» и выглядит следующим образом:

Нейронные сети состоят из многих из этих единиц, организованных в несколько коллекций нейронов, называемых слоев . Активация единиц одного уровня становится входом в единицы следующего уровня.

Тренировка Этот нейрон означает выбор весов w и смещение b ]чтобы получить желаемый выход для всех N входов x . Для этого мы минимизируем функцию потерь которая сравнивает конечную сеть с (желаемый выход x ) для всех входных данных x векторы. Чтобы свести к минимуму потерю, мы используем некоторые вариации на градиентном спуске, такие как простой стохастический градиентный спуск (SGD), SGD с импульсом или Adam. Все они требуют частичной производной (градиент) относительно параметров модели w и b . Наша цель — постепенно подстраивать w и b так что общая функция потерь продолжает уменьшаться по всем входам x .

(19459011)

Но это всего лишь один нейрон, а нейронные сети должны тренировать массы и смещения всех нейронов во всех слоях одновременно. Поскольку существует несколько входов и (потенциально) множественных сетевых выходов, нам действительно нужны общие правила для производной функции по отношению к векторному и четному правилам для производной вектор-функции по отношению к вектору.

В этой статье рассматриваются некоторые важные правила вычисления частных производных по векторам, особенно те, которые полезны для обучения нейронных сетей. Это поле известно как матричное исчисление и хорошей новостью является то, что нам нужно только небольшое подмножество этого поля, которое мы здесь приводим. Хотя существует много онлайн-материалов по многомерному исчислению и линейной алгебре, их обычно преподают как два отдельных курса бакалавриата, поэтому большинство материалов рассматривает их изолированно. Страницы, которые обсуждают матричное исчисление, часто представляют собой просто списки правил с минимальным объяснением или просто фрагменты истории. Они также имеют тенденцию быть довольно неясными для всех, кроме узкой аудитории математиков, благодаря использованию плотной нотации и минимальному обсуждению основополагающих концепций. (См. Аннотированный список ресурсов в конце.)

. Напротив, мы собираемся использовать и заново раскрывать некоторые правила расчета ключевых матриц, чтобы объяснить их. Оказывается, матричное исчисление действительно не так уж и тяжело! Для изучения не существует десятков новых правил; просто пара ключевых понятий. Мы надеемся, что эта короткая статья поможет вам быстро начать работу в мире матричного исчисления, поскольку это касается обучения нейронных сетей. Мы предполагаем, что вы уже знакомы с основами архитектуры нейронной сети и обучения. Если вы этого не сделаете, перейдите к курсу Джереми и заполните часть 1 этого, и мы увидим вас здесь, когда вы закончите. (Обратите внимание, что в отличие от многих других академических подходов мы настоятельно рекомендуем сначала учиться обучать и использовать нейронные сети на практике, а тогда изучают основную математику. Математика будет намного понятнее

Примечание к нотации : Курс Джереми исключительно использует код вместо математической записи, но не обязательно, объяснить понятия, поскольку незнакомые функции в коде легко искать и экспериментировать. В этой статье мы делаем обратное: есть много математических заметок, потому что одна из целей этой статьи — помочь вам понять обозначения, которые вы увидите в документах и книгах по глубокому изучению. В конце статьи вы найдете краткую таблицу использованных обозначений, в том числе слово или фразу, которую вы можете использовать для поиска более подробной информации.

Отзыв: правила производной скалярной

Надеюсь, вы помните некоторые из этих основных правил скалярной производной. Если ваша память немного нечеткая, взгляните на академию хана на правила скалярных производных.

Существуют другие правила тригонометрии, экспоненты и т. Д., Которые вы можете найти на курсе дифференциального исчисления Академии Хан.

Когда функция имеет единственный параметр, вы будете часто см. и используемые в качестве сокращений для . Мы рекомендуем против этой нотации, поскольку она не дает ясности переменной, которую мы берем с производной по отношению к.

Вы можете представить себе как оператор, который отображает функцию одного параметра на другую функцию. Это означает, что отображает на свою производную по отношению к х что является тем же самым, что и . Кроме того, если то . Мысль о производной как операторе помогает упростить сложные производные, потому что оператор является дистрибутивным и позволяет вытащить константы. Например, в следующем уравнении мы можем вытащить константу 9 и распределить производный оператор по элементам в круглых скобках.

. Эта процедура уменьшила производную от до некоторой части арифметики и производных x и которые намного легче решить, чем исходная производная.

Введение в векторное исчисление и частные производные

Нейронные сетевые слои не являются одиночными функциями одного параметра . Итак, перейдем к функциям множества параметров, таких как . Например, что такое производная от xy (т. Е. Умножение x и y )? Другими словами, как изменяется произведение xy когда мы покачиваем переменные? Ну, это зависит от того, изменяемся ли мы x или y . Мы вычисляем производные по одной переменной (параметру) за раз, давая нам две различные частные производные для этой двухпараметрической функции (одна для x и одна для y ). Вместо использования оператора оператор частной производной (стилизованный d а не греческая буква ). Итак, и являются частными производными xy ; часто они называются частицами . Для функций одного параметра оператор эквивалентен (для достаточно гладких функций). Тем не менее, лучше использовать чтобы было ясно, что вы имеете в виду скалярную производную.

Частная производная по x является просто обычной скалярной производной, просто обрабатывающей любая другая переменная в уравнении как константа. Рассмотрим функцию . Частная производная по х написана . Существует три константы с точки зрения : 3, 2 и y . Поэтому . Частная производная по у рассматривает х как константу: . Это хорошая идея, чтобы вывести их сами, прежде чем продолжить, остальная часть статьи не имеет смысла. Если вы нуждаетесь в помощи, обратитесь к фильму Академии Хана о частицах.

Чтобы сделать это, мы делаем векторное исчисление, а не только многомерное исчисление, рассмотрим, что мы делаем с частными производными и ( другой способ сказать и ), которые мы вычислили для . Вместо того, чтобы они просто плавали вокруг и не организовывались каким-либо образом, давайте организуем их в горизонтальный вектор. Мы называем этот вектор градиентом из и записываем его как:

Таким образом, градиент является просто вектором его частичных , Градиенты являются частью мира векторного исчисления, в котором рассматриваются функции, которые сопоставляют скалярные параметры n с одним скаляром. Теперь

Матричное исчисление

Когда мы переходим от производных от одной функции к производным от многих функций, мы переходим из мира векторного исчисления в матричное исчисление. Вычислим частные производные для двух функций, оба из которых принимают два параметра. Мы можем сохранить тот же из последнего раздела, но давайте также приведем . Градиент для g имеет две записи, частную производную для каждого параметра:

давая нам градиент .

Градиентные векторы организуют все частные производные для конкретной скалярной функции. Если у нас есть две функции, мы можем также организовать их градиенты в матрицу путем укладки градиентов. Когда мы это делаем, мы получаем матрицу Якоби (или просто Якобиан ), где градиенты являются рядами:

Добро пожаловать к матричному исчислению!

Обратите внимание, что существует множество способов представления якобиана. Мы используем так называемый макет числителя, но многие документы и программное обеспечение будут использовать макет знаменателя . Это просто переносит компоновку числителя Якобиан (переверните его по диагонали):

Обобщение якобиана

. До сих пор мы рассматривали конкретный пример матрицы Якоби. Чтобы более детально определить матрицу якобиана, объединим несколько параметров в один векторный аргумент: . (Иногда вы также видите обозначения для векторов в литературе.) Строчные буквы жирным шрифтом, такие как x являются векторами, а шрифты курсивом, такие как x являются скалярами. x _i является элементом вектора x и выделен курсивом, поскольку один векторный элемент является скаляром. Нам также необходимо определить ориентацию для вектора x . Будем считать, что все векторы по умолчанию по вертикали по размеру :

С помощью нескольких скалярнозначных функций мы можем объединить их все в вектор так же, как мы с параметрами. Пусть — вектор скалярнозначных функций m каждый из которых принимает вектор x длины где — мощность (счет) элементов в х . Каждая функция f _i в f возвращает скаляр так же, как в предыдущем разделе:

Например, мы представим и из последнего раздела в качестве

Очень часто бывает, что потому что мы будем иметь результат скалярной функции для каждого элемента вектора x . Например, рассмотрим функцию тождества :

Таким образом, в этом случае мы имеем функции и параметры . Вообще говоря, матрица Якоби — это совокупность всех возможных частных производных ( м ) и n ), которая представляет собой стек м x :

Каждый является горизонтальным n -вектором, поскольку частная производная с относительно вектора х длина которого . Ширина якобиана равна n если мы возьмем частную производную по отношению к x так как есть n параметры, которые мы можем изменять, каждый из которых потенциально меняет значение функции. Поэтому якобиан всегда м рядов для уравнений м . Это помогает визуально визуализировать возможные формы Якоби:

$Начать {табличной} {с | CCl} & begin {tabular} [t] {c} скаляр \ Framebox (18,18) {$ х $} \ end {tabular} & begin {tabular} {c} вектор\ Framebox (18,40) {$ mathbf {х} $} Конец {табличной} \ HLine \ [dimexpr-normalbaselineskip+5pt] Начать {табличные} [b] {C} скаляр \ Framebox (18,18) {$ F $} \ end {tabular} & framebox (18,18) {$ frac { partial f} { partial {x}} $} & framebox (40,18) {$ frac { partial f} { partial { mathbf {x}}} $} & \ Начать {табличные} [b] {C} вектор\ Framebox (18,40) {$ mathbf {F} $} \ end {tabular} & framebox (18,40) {$ frac { partial mathbf {f}} { partial {x}} $} & framebox (40,40) {$ frac { partial mathbf {f}} { partial mathbf {x}} $} \ Конец {табличной}$

Якобиан тождественной функции с имеет функции n и каждая функция имеет параметры n удерживаемые в одном векторе х . Поэтому якобиан является квадратной матрицей, поскольку :

$Начать {eqnarray *} frac { partial mathbf {y}} { partial mathbf {x}} = begin {bmatrix} frac { partial} { partial {x}} f_1 ( mathbf {x}) \ frac { partial} { partial {x}} f_2 ( mathbf {x}) \ Ldots \ frac { partial} { partial {x}} f_m ( mathbf {x}) end {bmatrix} & = & begin {bmatrix} frac { partial} { partial {x_1}} f_1 ( mathbf {x}) ~ frac { partial} { partial {x_2}} f_1 ( mathbf {x}) ~ ldots ~ frac { partial} { partial {x_n}} f_1 ( mathbf {x}) \ frac { partial} { partial {x_1}} f_2 ( mathbf {x}) ~ frac { partial} { partial {x_2}} f_2 ( mathbf {x}) ~ ldots ~ frac { partial} { partial {x_n}} f_2 ( mathbf {x}) \ Ldots \ ~ frac { partial} { partial {x_1}} f_m ( mathbf {x}) ~ frac { partial} { partial {x_2}} f_m ( mathbf {x}) ~ ldots ~ frac { partial} { partial {x_n}} f_m ( mathbf {x}) \ Конец {} bmatrix \\ & = & begin {bmatrix} frac { partial} { partial {x_1}} x_1 ~ frac { partial} { partial {x_2}} x_1 ~ ldots ~ frac { partial} { partial {x_n}} x_1 \ frac { partial} { partial {x_1}} x_2 ~ frac { partial} { partial {x_2}} x_2 ~ ldots ~ frac { partial} { partial {x_n}} x_2 \ Ldots \ ~ frac { partial} { partial {x_1}} x_n ~ frac { partial} { partial {x_2}} x_n ~ ldots ~ frac { partial} { partial {x_n}} x_n \ Конец {} bmatrix \\ & & ( text {и since} frac { partial} { partial {x_j}} x_i = 0 text {for} j neq i) \ & = & begin {bmatrix} frac { partial} { partial {x_1}} x_1 & 0 & ldots & 0 \ 0 & frac { partial} { partial {x_2}} x_2 & ldots & 0 \ & & ddots \ 0 & 0 & ldots & frac { partial} { partial {x_n}} x_n \ Конец {} bmatrix \\ & = & begin {bmatrix} 1 & 0 & ldots & 0 \ 0 & 1 & ldots & 0 \ & & ddots \ 0 & 0 & ldots & 1 \ Конец {} bmatrix \\ & = & I ~~~ (I text {- единичная матрица с единицами по диагонали}) \ Конец {eqnarray *}$

Убедитесь, что вы можете вывести каждый шаг выше, прежде чем двигаться дальше. Если вы застряли, просто рассмотрите каждый элемент матрицы отдельно и примените обычные правила скалярной производной. Это обычно полезный трюк: уменьшите векторные выражения до набора скалярных выражений, а затем возьмите все частичные части, соответственно совместив результаты с векторами и матрицами.

Также будьте осторожны, матрица вертикальная, x или горизонтальная, где означает x транспонировать. Также убедитесь, что вы обращаете внимание на то, что-то скалярнозначная функция, или вектор функций (или векторнозначная функция), .

Производные векторных элементов бинарные операторы

Элементарные двоичные операции над векторами, такие как сложение векторов важны, поскольку мы можем выразить много общих векторных операций, таких как умножение вектора на скаляр, как элементное двоичные операции. Под «элементарными двоичными операциями» мы просто подразумеваем применение оператора к первому элементу каждого вектора для получения первого элемента вывода, затем ко второму элементу входов для второго элемента вывода и т. Д. Например, все базовые математические операторы применяются по умолчанию в numpy или tensorflow, например. Примерами, которые часто возникают в глубоком обучении, являются и (возвращает вектор единиц и нулей).

Мы можем обобщить элементарные двоичные операции с обозначениями где . (Напоминание: — количество элементов в x .) Символ представляет любой элементный оператор (такой как ), а не оператор композиции функции . Вот как выглядит уравнение когда мы приближаемся к скалярным уравнениям:

где мы пишем n (не м ), чтобы подчеркнуть тот факт, что результат элементарных операторов дает векторные результаты .

Используя идеи из последнего параграфа, мы видим, что общий случай для якобиана с w — квадратная матрица:

и якобиан относительно x :

Это довольно furball, но, к счастью, якобиан очень часто является диагональной матрицей, матрицей которой является нуль всюду, но диагональ. Поскольку это значительно упрощает якобиан, рассмотрим подробно, когда якобиан сводится к диагональной матрице для элементарных операций.

В диагональном якобиане все элементы от диагонали равны нулю, где . (Заметим, что мы берем частную производную по отношению к w _j не w _i .) В каких условиях эти вне- диагональные элементы равны нулю? Именно, когда f _i и g _i являются контингентами относительно w _j , . Независимо от оператора, если эти частные производные стремятся к нулю, операция переходит в нуль, независимо от того, что и частная производная от константы равна нулю.

Те частичные числа обращаются к нулю, когда _i и g _i не являются функциями w _j . Мы знаем, что элементарные операции подразумевают, что f _i является чисто функцией w _i и g _i является чисто функцией x _i . Например, суммы . Следовательно, сводится к и цель становится . и выглядят как константы для оператора частичного дифференцирования по отношению к w _j когда поэтому частицы являются нулевыми по диагонали. (Обозначение технически является злоупотреблением нашими обозначениями, поскольку f _i и g _i являются функциями векторов, а не отдельных элементов. Мы действительно должны написать что-то вроде но это еще глубже уравнений, а программисты — удобные функции перегрузки, поэтому мы все равно примем обозначение.)

Мы воспользуемся этим упрощением позднее и w _i и x _i соответственно, как (19459011)

При этом условии элементы по диагонали якобиана :

] (Большие «0» — это сокращение, указывающее на все недиагональные 0).

Более кратко, мы можем написать: [19459[1945901]

где построена матрица, диагональные элементы которой взяты из вектора x .

. Поскольку мы выполняем много простых векторных арифметических операций, общая функция в двоичной элементной операции часто является просто вектором w . Всякий раз, когда общая функция является вектором, мы знаем, что сводится к . Например, векторное дополнение соответствует нашему поэтапному диагональному условию, поскольку имеет скалярные уравнения которые сводятся к простому с частными производными:

Это дает нам единичную матрицу, поскольку каждый элемент по диагонали равен 1. I представляет квадратную единичную матрицу соответствующих размеров, всюду нуль диагональ, которая содержит все единицы.

Учитывая простоту этого частного случая сводящуюся к вы должны иметь возможность вывести якобиан для общих элементарных двоичных операций над векторами:

Операторы и являются элементарным умножением и делением; иногда называют продуктом Адамара . Нет стандартного обозначения для умножения и деления по типу, поэтому мы используем подход, соответствующий нашей общей бинарной нотации.

Производные с использованием скалярного расширения

Когда мы умножаем или добавляем скаляры к векторам, мы неявно расширяем скаляр до вектора, а затем выполняем двоичную операцию по элементам. Например, добавление скалярного z к вектору x действительно где и . (Обозначение представляет собой вектор из соответствующей длины.) z — любой скаляр, который не зависит от x что полезно, потому что тогда для любого x _i что упростит наши частные производные вычисления. (Здесь можно подумать об переменной z как константе нашего обсуждения здесь.) Аналогичным образом, умножение на скаляр, действительно где является умножением по элементам ( Адамара) двух векторов.

Частные производные вектор-скалярного сложения и умножения по отношению к вектору x используют наше элементарное правило:

Это следует из того, что функции и явно удовлетворяют нашему элементарному диагональному условию для якобиана ( относятся не более к x _i и относится к значению для вектора ).

Используя обычные правила для скалярных частных производных, мы приходим к следующим диагональным элементам якобиана для вектор-скалярного сложения:

Итак, ]

. Вычисление частной производной по скалярному параметру z однако, приводит к вертикальному вектору, а не диагональной матрице. Элементы вектора:

Поэтому .

Диагональные элементы якобиана для векторно-скалярного умножения включают в себя правило произведения для скалярные производные:

Итак, .

Частная производная по скалярному параметру z представляет собой вертикальный вектор, элементами которого являются:

Это дает нам

Уменьшение векторной суммы

Подведение итогов элементов вектора является важной операцией в глубоком обучении, такой как функция потери сети, но мы также можем использовать ее как способ упрощения вычислений производная векторного точечного произведения и другие операции, которые сводят векторы к скалярам.

Пусть . Обратите внимание, что мы были осторожны, чтобы оставить параметр в виде вектора x поскольку каждая функция f _i может использовать все значения в векторе, а не только x _я . Сумма вычисляется по результатам функции, а не параметру. Градиент ( Якобиан) векторного суммирования:

(Суммирование внутри элементов градиента может быть сложным, поэтому следите за тем, чтобы ваши обозначения были согласованными.)

Давайте посмотрим на градиент простого . Функция внутри суммирования — это просто а затем градиент:

Поскольку для мы можем упростить:

Обратите внимание, что результатом является горизонтальный вектор, полный 1s, а не вертикальный вектор, поэтому градиент . (Показатель T представляет собой транспонирование указанного вектора. В этом случае он переворачивает вертикальный вектор в горизонтальный вектор.) Очень важно сохранять форму всех ваших векторов и матриц, иначе невозможно вычислить производные комплексных функций.

В качестве другого примера суммируем результат умножения вектора на постоянный скаляр. Если то . Градиент:

Производная по отношению к скалярной переменной z :

Цепочные правила

Мы не можем вычислять частные производные очень сложных функций, используя только основные правила вычисления матрицы, которые мы видели до сих пор. Например, мы не можем взять производную от вложенных выражений типа непосредственно, не сводя ее к ее скалярному эквиваленту. Мы должны иметь возможность комбинировать наши основные векторные правила, используя то, что мы можем назвать [векторнойцепью] . К сожалению, существует ряд правил для дифференциации, которые подпадают под название «правила цепи», поэтому мы должны быть осторожны в отношении правила цепи, о котором мы говорим. Часть нашей цели здесь состоит в том, чтобы четко определить и назвать три разных правила цепи и указать, в какой ситуации они подходят. To get warmed up, we'll start with what we'll call the single-variable chain rulewhere we want the derivative of a scalar function with respect to a scalar. Then we'll move on to an important concept called the total derivative and use it to define what we'll pedantically call the single-variable total-derivative chain rule. Then, we'll be ready for the vector chain rule in its full glory as needed for neural networks.

The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple subexpression in isolation yet still combine the intermediate results to get the correct overall result.

The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like . The outermost expression takes the sin of an intermediate result, a nested subexpression that squares x. Specifically, we need the single-variable chain rule, so let's start by digging into that in more detail.

Single-variable chain rule

Let's start with the solution to the derivative of our nested expression: . It doesn't take a mathematical genius to recognize components of the solution that smack of scalar differentiation rules, and . It looks like the solution is to multiply the derivative of the outer expression by the derivative of the inner expression or “chain the pieces together,” which is exactly right. In this section, we'll explore the general principle at work and provide a process that works for highly-nested expressions of a single variable.

Chain rules are typically defined in terms of nested functions, such as for single-variable chain rules. (You will also see the chain rule defined using function composition which is the same thing.) Some sources write the derivative using shorthand notation but that hides the fact that we are introducing an intermediate variable: which we'll see shortly. It's better to define the single-variable chain rule of explicitly so we never take the derivative with respect to the wrong variable. Here is the formulation of the single-variable chain rule we recommend:

To deploy the single-variable chain rule, follow these steps:

Introduce intermediate variables for nested subexpressions and subexpressions for both binary and unary operators; e.g., is binary, and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
Compute derivatives of the intermediate variables with respect to their parameters.
Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
Substitute intermediate variables back in if any are referenced in the derivative equation.

The third step puts the “chain” in “chain rule” because it chains together intermediate results. Multiplying the intermediate derivatives together is the common theme among all variations of the chain rule.

Let's try this process on :

Introduce intermediate variables. Let represent subexpression (shorthand for ). This gives us:

The order of these subexpressions does not affect the answer, but we recommend working in the reverse order of operations dictated by the nesting (innermost to outermost). That way, expressions and derivatives are always functions of previously-computed elements.
Compute derivatives.
Combine.
Substitute.

Notice how easy it is to compute the derivatives of the intermediate variables in isolation! The chain rule says it's legal to do that and tells us how to combine the intermediate results to get .

You can think of the combining step of the chain rule in terms of units canceling. If we let y be miles, x be the gallons in a gas tank, and u as gallons we can interpret as . The gallon denominator and numerator cancel.

Another way to to think about the single-variable chain rule is to visualize the overall expression as a dataflow diagram or chain of operations (or abstract syntax tree for compiler people):

Changes to function parameter x bubble up through a squaring operation then through a sin operation to change result y. You can think of as “getting changes from x to u” and as “getting changes from u to y.” Getting from x to y requires an intermediate hop. The chain rule is, by convention, usually written from the output variable down to the parameter(s), . But, the x-to-y perspective would be more clear if we reversed the flow and used the equivalent .

Conditions under which the single-variable chain rule applies. Notice that there is a single dataflow path from x to the root y. Changes in x can influence output y in only one way. That is the condition under which we can apply the single-variable chain rule. An easier condition to remember, though one that's a bit looser, is that none of the intermediate subexpression functions, and have more than one parameter. Consider which would become after introducing intermediate variable u. As we'll see in the next section, has multiple paths from x to y. To handle that situation, we'll deploy the single-variable total-derivative chain rule.

As an aside for those interested in automatic differentiation, papers and library documentation use terminology forward differentiation and backward differentiation (for use in the back-propagation algorithm). From a dataflow perspective, we are computing a forward differentiation because it follows the normal data flow direction. Backward differentiation, naturally, goes the other direction and we're asking how a change in the output would affect function parameter x. Because backward differentiation can determine changes in all function parameters at once, it turns out to be much more efficient for computing the derivative of functions with lots of parameters. Forward differentiation, on the other hand, must consider how a change in each parameter, in turn, affects the function output y. The following table emphasizes the order in which partial derivatives are computed for the two techniques.

Forward differentiation from x to y	Backward differentiation from y to x

Automatic differentiation is beyond the scope of this article, but we're setting the stage for a future article.

Many readers can solve in their heads, but our goal is a process that will work even for very complicated expressions. This process is also how automatic differentiation works in libraries like PyTorch. So, by solving derivatives manually in this way, you're also learning how to define functions for custom neural networks in PyTorch.

With deeply nested expressions, it helps to think about deploying the chain rule the way a compiler unravels nested function calls like into a sequence (chain) of calls. The result of calling function f_i is saved to a temporary variable called a register, which is then passed as a parameter to . Let's see how that looks in practice by using our process on a highly-nested equation like :

Introduce intermediate variables.
Compute derivatives.
Combine four intermediate values.
Substitute.

Here is a visualization of the data flow through the chain of operations from x to y:

At this point, we can handle derivatives of nested expressions of a single variable, xusing the chain rule but only if x can affect y through a single data flow path. To handle more complicated expressions, we need to extend our technique, which we'll do next.

Single-variable total-derivative chain rule

Our single-variable chain rule has limited applicability because all intermediate variables must be functions of single variables. But, it demonstrates the core mechanism of the chain rule, that of multiplying out all derivatives of intermediate subexpressions. To handle more general expressions such as however, we need to augment that basic chain rule.

Of course, we immediately see but that is using the scalar addition derivative rule, not the chain rule. If we tried to apply the single-variable chain rule, we'd get the wrong answer. In fact, the previous chain rule is meaningless in this case because derivative operator does not apply to multivariate functions, such as among our intermediate variables:

Let's try it anyway to see what happens. If we pretend that and then instead of the right answer .

Because has multiple parameters, partial derivatives come into play. Let's blindly apply the partial derivative operator to all of our equations and see what we get:

Ooops! The partial is wrong because it violates a key assumption for partial derivatives. When taking the partial derivative with respect to xthe other variables must not vary as x varies. Otherwise, we could not act as if the other variables were constants. Clearly, though, is a function of x and therefore varies with x. because . A quick look at the data flow diagram for shows multiple paths from x to ythus, making it clear we need to consider direct and indirect (through ) dependencies on x:

A change in x affects y both as an operand of the addition and as the operand of the square operator. Here's an equation that describes how tweaks to x affect the output:

Then, which we can read as “the change in y is the difference between the original y and y at a tweaked x.”

If we let then . If we bump x by 1, then . The change in y is not as would lead us to believe, but !

Enter the “law” of total derivatives, which basically says that to compute we need to sum up all possible contributions from changes in x to the change in y. The total derivative with respect to x assumes all variables, such as in this case, are functions of x and potentially vary as x varies. The total derivative of that depends on x directly and indirectly via intermediate variable is given by:

Using this formula, we get the proper answer:

That is an application of what we can call the single-variable total-derivative chain rule:

The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but x are constants.

There is something subtle going on here with the notation. All of the derivatives are shown as partial derivatives because f and u_i are functions of multiple variables. This notation mirrors that of MathWorld's notation but differs from Wikipedia, which uses instead (possibly to emphasize the total derivative nature of the equation). We'll stick with the partial derivative notation so that it's consistent with our discussion of the vector chain rule in the next section.

In practice, just keep in mind that when you take the total derivative with respect to xother variables might also be functions of x so add in their contributions as well. The left side of the equation looks like a typical partial derivative but the right-hand side is actually the total derivative. It's common, however, that many temporary variables are functions of a single parameter, which means that the single-variable total-derivative chain rule degenerates to the single-variable chain rule.

Let's look at a nested subexpression, such as . We introduce three intermediate variables:

and partials:

where both and have terms that take into account the total derivative.

Also notice that the total derivative formula always sums versus, say, multiplies terms . It's tempting to think that summing up terms in the derivative makes sense because, for example, adds two terms. Nope. The total derivative is adding terms because it represents a weighted sum of all x contributions to the change in y. For example, given instead of the total-derivative chain rule formula still adds partial derivative terms. ( simplifies to but for this demonstration, let's not combine the terms.) Here are the intermediate variables and partial derivatives:

The form of the total derivative remains the same, however:

It's the partials (weights) that change, not the formula, when the intermediate variable operators change.

Those readers with a strong calculus background might wonder why we aggressively introduce intermediate variables even for the non-nested subexpressions such as in . We use this process for three reasons: (i) computing the derivatives for the simplified subexpressions is usually trivial, (ii) we can simplify the chain rule, and (iii) the process mirrors how automatic differentiation works in neural network libraries.

Using the intermediate variables even more aggressively, let's see how we can simplify our single-variable total-derivative chain rule to its final form. The goal is to get rid of the sticking out on the front like a sore thumb:

We can achieve that by simply introducing a new temporary variable as an alias for x: . Then, the formula reduces to our final form:

This chain rule that takes into consideration the total derivative degenerates to the single-variable chain rule when all intermediate variables are functions of a single variable. Consequently, you can remember this more general formula to cover both cases. As a bit of dramatic foreshadowing, notice that the summation sure looks like a vector dot product, or a vector multiply .

Before we move on, a word of caution about terminology on the web. Unfortunately, the chain rule given in this section, based upon the total derivative, is universally called “multivariable chain rule” in calculus discussions, which is highly misleading! Only the intermediate variables are multivariate functions. The overall function, say, is a scalar function that accepts a single parameter x. The derivative and parameter are scalars, not vectors, as one would expect with a so-called multivariate chain rule. (Within the context of a non-matrix calculus class, “multivariate chain rule” is likely unambiguous.) To reduce confusion, we use “single-variable total-derivative chain rule” to spell out the distinguishing feature between the simple single-variable chain rule, and this one.

Vector chain rule

Now that we've got a good handle on the total-derivative chain rule, we're ready to tackle the chain rule for vectors of functions and vector variables. Surprisingly, this more general chain rule is just as simple looking as the single-variable chain rule for scalars. Rather than just presenting the vector chain rule, let's rediscover it ourselves so we get a firm grip on it. We can start by computing the derivative of a sample vector function with respect to a scalar, to see if we can abstract a general formula.

Let's introduce two intermediate variables, and one for each f_i so that y looks more like :

The derivative of vector y with respect to scalar x is a vertical vector with elements computed using the single-variable total-derivative chain rule:

Ok, so now we have the answer using just the scalar rules, albeit with the derivatives grouped into a vector. Let's try to abstract from that result what it looks like in vector form. The goal is to convert the following vector of scalar operations to a vector operation.

If we split the terms, isolating the terms into a vector, we get a matrix by vector multiplication:

That means that the Jacobian is the multiplication of two other Jacobians, which is kinda cool. Let's check our results:

Whew! We get the same answer as the scalar approach. This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule. Compare the vector rule:

with the single-variable chain rule:

To make this formula work for multiple parameters or vector xwe just have to change x to vector x in the equation. The effect is that and the resulting Jacobian, are now matrices instead of vertical vectors. Our complete vector chain rule is:

The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of f_i with respect to g_j and g_i with respect to x_j. For completeness, here are the two Jacobian components in their full glory:

where and . The resulting Jacobian is (an matrix multiplied by a matrix).

Even within this formula, we can simplify further because, for many applications, the Jacobians are square () and the off-diagonal entries are zero. It is the nature of neural networks that the associated mathematics deals with functions of vectors not vectors of functions. For example, the neuron affine function has term and the activation function is ; we'll consider derivatives of these functions in the next section.

As we saw in a previous section, element-wise operations on vectors w and x yield diagonal matrices with elements because w_i is a function purely of x_i but not x_j for . The same thing happens here when f_i is purely a function of g_i and g_i is purely a function of x_i:

In this situation, the vector chain rule simplifies to:

Therefore, the Jacobian reduces to a diagonal matrix whose elements are the single-variable chain rule values.

After slogging through all of that mathematics, here's the payoff. All you need is the vector chain rule because the single-variable formulas are special cases of the vector chain rule. The following table summarizes the appropriate components to multiply in order to get the Jacobian.

$begin{tabular}[t]{c|cccc} & multicolumn{2}{c}{ begin{tabular}[t]{c} scalar\ framebox(18,18){$x$}\ end{tabular}} & &begin{tabular}{c} vector\ framebox(18,40){$mathbf{x}$}\ end{tabular} \ begin{tabular}{c}$frac{partial}{partial mathbf{x}} mathbf{f}(mathbf{g}(mathbf{x}))$ = $frac{partial mathbf{f}}{partial mathbf{g}}frac{partialmathbf{g}}{partial mathbf{x}}$ \ end{tabular} & begin{tabular}[t]{c} scalar\ framebox(18,18){$u$}\ end{tabular} & begin{tabular}{c} vector\ framebox(18,40){$mathbf{u}$} end{tabular}& & begin{tabular}{c} vector\ framebox(18,40){$mathbf{u}$}\ end{tabular} \ hline \[dimexpr-normalbaselineskip+5pt] begin{tabular}[b]{c} scalar\ framebox(18,18){$f$}\ end{tabular} &framebox(18,18){$frac{partial f}{partial {u}}$} framebox(18,18){$frac{partial u}{partial {x}}$} ~~~& raisebox{22pt}{framebox(40,18){$frac{partial f}{partial {mathbf{u}}}$}} framebox(18,40){$frac{partial mathbf{u}}{partial x}$} & ~~~& raisebox{22pt}{framebox(40,18){$frac{partial f}{partial {mathbf{u}}}$}} framebox(40,40){$frac{partial mathbf{u}}{partial mathbf{x}}$} \ begin{tabular}[b]{c} vector\ framebox(18,40){$mathbf{f}$}\ end{tabular} & framebox(18,40){$frac{partial mathbf{f}}{partial {u}}$} raisebox{22pt}{framebox(18,18){$frac{partial u}{partial {x}}$}} & framebox(40,40){$frac{partial mathbf{f}}{partial mathbf{u}}$} framebox(18,40){$frac{partial mathbf{u}}{partial x}$} & & framebox(40,40){$frac{partial mathbf{f}}{partial mathbf{u}}$} framebox(40,40){$frac{partial mathbf{u}}{partial mathbf{x}}$}\ end{tabular}$

The gradient of neuron activation

We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, w and b:

(This represents a neuron with fully connected weights and rectified linear unit activation. There are, however, other affine functions such as convolution and other activation functions, such as exponential linear units, that follow similar logic.)

Let's worry about max later and focus on computing and . (Recall that neural networks learn through optimization of their weights and biases.) We haven't discussed the derivative of the dot product yet, but we can use the chain rule to avoid having to memorize yet another rule. (Note notation y not y as the result is a scalar not a vector.)

The dot product is just the summation of the element-wise multiplication of the elements: . (You might also find it useful to remember the linear algebra notation .) We know how to compute the partial derivatives of and but haven't looked at partial derivatives for . We need the chain rule for that and so we can introduce an intermediate vector variable u just as we did using the single-variable chain rule:

Once we've rephrased ywe recognize two subexpressions for which we already know the partial derivatives:

The vector chain rule says to multiply the partials:

To check our results, we can grind the dot product down into a pure scalar function:

Then:

Hooray! Our scalar results match the vector chain rule results.

Now, let the full expression within the max activation function call. We have two different partials to compute, but we don't need the chain rule:

Let's tackle the partials of the neuron activation, . The use of the function call on scalar z just says to treat all negative z values as 0. The derivative of the max function is a piecewise function. When the derivative is 0 because z is a constant. When the derivative of the max function is just the derivative of zwhich is :

An aside on broadcasting functions across scalars. When one or both of the max arguments are vectors, such as

we broadcast the single-variable function max across the elements. This is an example of an element-wise unary operator. Just to be clear:

For the derivative of the broadcast version then, we get a vector of zeros and ones where:

To get the derivative of the function, we need the chain rule because of the nested subexpression, . Following our process, let's introduce intermediate scalar variable z to represent the affine function giving:

The vector chain rule tells us:

which we can rewrite as follows:

and then substitute back in:

That equation matches our intuition. When the activation function clips affine function output z to 0, the derivative is zero with respect to any weight w_i. When it's as if the max function disappears and we get just the derivative of z with respect to the weights.

Turning now to the derivative of the neuron activation with respect to bwe get:

Let's use these partial derivatives now to handle the entire loss function.

The gradient of the neural network loss function

Training a neuron requires that we take the derivative of our loss or “cost” function with respect to the parameters of our model, w and b. Because we train with multiple vector inputs (e.g., multiple images) and scalar targets (e.g., one classification per image), we need some more notation. Let

where and then let

where y_i is a scalar. Then the cost equation becomes:

Following our chain rule process introduces these intermediate variables:

Let's compute the gradient with respect to w first.

The gradient with respect to the weights

From before, we know:

and

Then, for the overall gradient, we get:

$begin{eqnarray*} frac{partial C(v)}{partial mathbf{w}} & = & frac{partial }{partial mathbf{w}}frac{1}{N} sum_{i=1}^N v^2\\ & = & frac{1}{N} sum_{i=1}^N frac{partial}{partial mathbf{w}} v^2\\ & = & frac{1}{N} sum_{i=1}^N frac{partial v^2}{partial v} frac{partial v}{partial mathbf{w}} \\ & = & frac{1}{N} sum_{i=1}^N 2v frac{partial v}{partial mathbf{w}} \\ & = & frac{1}{N} sum_{i=1}^N begin{cases} 2vvec{0}^T = vec{0}^T & mathbf{w} cdot mathbf{x}_i + b leq 0\ -2vmathbf{x}^T & mathbf{w} cdot mathbf{x}_i + b > 0\ end{cases}\\ & = & frac{1}{N} sum_{i=1}^N begin{cases} vec{0}^T & mathbf{w} cdot mathbf{x}_i + b leq 0\ -2(y_i-u)mathbf{x}_i^T & mathbf{w} cdot mathbf{x}_i + b > 0\ end{cases}\\ & = & frac{1}{N} sum_{i=1}^N begin{cases} vec{0}^T & mathbf{w} cdot mathbf{x}_i + b leq 0\ -2(y_i-max(0, mathbf{w}cdotmathbf{x}_i+b))mathbf{x}_i^T & mathbf{w} cdot mathbf{x}_i + b > 0\ end{cases}\ phantom{frac{partial C(v)}{partial mathbf{w}}} & = & frac{1}{N} sum_{i=1}^N begin{cases} vec{0}^T & mathbf{w} cdot mathbf{x}_i + b leq 0\ -2(y_i-(mathbf{w}cdotmathbf{x}_i+b))mathbf{x}_i^T & mathbf{w} cdot mathbf{x}_i + b > 0\ end{cases}\\ & = & begin{cases} vec{0}^T & mathbf{w} cdot mathbf{x}_i + b leq 0\ frac{-2}{N} sum_{i=1}^N (y_i-(mathbf{w}cdotmathbf{x}_i+b))mathbf{x}_i^T & mathbf{w} cdot mathbf{x}_i + b > 0\ end{cases}\\ & = & begin{cases} vec{0}^T & mathbf{w} cdot mathbf{x}_i + b leq 0\ frac{2}{N} sum_{i=1}^N (mathbf{w}cdotmathbf{x}_i+b-y_i)mathbf{x}_i^T & mathbf{w} cdot mathbf{x}_i + b > 0\ end{cases} end{eqnarray*}$

To interpret that equation, we can substitute an error term yielding:

From there, notice that this computation is a weighted average across all x_i in X. The weights are the error terms, the difference between the target output and the actual neuron output for each x_i input. The resulting gradient will, on average, point in the direction of higher cost or loss because large e_i emphasize their associated x_i. Imagine we only had one input vector, then the gradient is just . If the error is 0, then the gradient is zero and we have arrived at the minimum loss. If is some small positive difference, the gradient is a small step in the direction of . If is large, the gradient is a large step in that direction. If is negative, the gradient is reversed, meaning the highest cost is in the negative direction.

Of course, we want to reduce, not increase, the loss, which is why the gradient descent recurrence relation takes the negative of the gradient to update the current position (for scalar learning rate ):

Because the gradient indicates the direction of higher cost, we want to update x in the opposite direction.

The derivative with respect to the bias

To optimize the bias, bwe also need the partial with respect to b. Here are the intermediate variables again:

We computed the partial with respect to the bias for equation previously:

For vthe partial is:

And for the partial of the cost function itself we get:

<img class=»blkeqn» src=»http://explained.ai/images/latex-D2EB5E709D4D4474EDB3DE6699F91F9A.svg» alt=»
begin{eqnarray*}
frac{partial C(v)}{partial b} & = & frac{partial }{partial b}frac{1}{N} sum_{i=1}^N v^2\\
& = & frac{1}{N} sum_{i=1}^N frac{partial}{partial b} v^2\\
& = & frac{1}{N} sum_{i=1}^N frac{partial v^2}{partial v} frac{partial v}{partial b} \\
& = & frac{1}{N} sum_{i=1}^N 2v frac{partial v}{partial b} \\
& = & frac{1}{N} sum_{i=1}^N begin{cases}
0 & mathbf{w} cdot mathbf{x} + b leq 0\
-2v & mathbf{w} cdot mathbf{x} + b > 0\
end{cases}\\
& = & frac{1}{N} sum_{i=1}^N begin{cases}
0 & mathbf{w} cdot mathbf{x} + b leq 0\
-2(y_i-max(0, mathbf{w}cdotmathbf{x}_i+b)) & mathbf{w} cdot mathbf{x} + b > 0\
end{cases}\\
& = & frac{1}{N} sum_{i=1}^N begin{cases}
0 & mathbf{w} cdot mathbf{x} + b leq 0\
2(mathbf{w}cdotmathbf{x}_i+b-y_i) & mathbf{w} cdot mathbf{x} + b > 0\
end{cases}\\
& = & begin{cases}
0 & mathbf{w} cdot mathbf{x}_i + b leq 0\
frac{2}{N} sum_{i=1}^N (mathbf{w}cdotmathbf{x}_i+b-y_i) & mathbf{w} cdot mathbf{x}_i + b > 0\
end{cases}
end{eqnarray*}
«/>

As before, we can substitute an error term:

The partial derivative is then just the average error or zero, according to the activation level. To update the neuron bias, we nudge it in the opposite direction of increased cost:

In practice, it is convenient to combine w and b into a single vector parameter rather than having to deal with two different partials: . This requires a tweak to the input vector x as well but simplifies the activation function. By tacking a 1 onto the end of x becomes .

This finishes off the optimization of the neural network loss function because we have the two partials necessary to perform a gradient descent.

Summary

Hopefully you've made it all the way through to this point. You're well on your way to understanding matrix calculus! We've included a reference that summarizes all of the rules from this article in the next section. Also check out the annotated resource link below.

Your next step would be to learn about the partial derivatives of matrices not just vectors. For example, you can take a look at the matrix differentiation section of Matrix calculus.

Acknowledgements. We thank Yannet Interian (Faculty in MS data science program at University of San Francisco) and David Uminsky (Faculty/director of MS data science) for their help with the notation presented here.

Matrix Calculus Reference

Gradients and Jacobians

The gradient of a function of two variables is a horizontal 2-vector:

The Jacobian of a vector-valued function that is a function of a vector is an ( and ) matrix containing all possible scalar partial derivatives:

The Jacobian of the identity function is I.

Element-wise operations on vectors

Define generic element-wise operations on vectors w and x using operator such as :

The Jacobian with respect to w (similar for x) is:

Given the constraint (element-wise diagonal condition) that and access at most w_i and x_irespectively, the Jacobian simplifies to a diagonal matrix:

Here are some sample element-wise operators:

Scalar expansion

Adding scalar z to vector xis really where and .

Scalar multiplication yields:

Vector reductions

The partial derivative of a vector sum with respect to one of the vectors is:

For :

For and we get:

Vector dot product . Substituting and using the vector chain rule, we get:

Similarly, .

Chain rules

The vector chain rule is the general form as it degenerates to the others. When f is a function of a single variable x and all intermediate variables u are functions of a single variable, the single-variable chain rule applies. When some or all of the intermediate variables are functions of multiple variables, the single-variable total-derivative chain rule applies. In all other cases, the vector chain rule applies.

Single-variable rule	Single-variable total-derivative rule	Vector rule

Notation

Lowercase letters in bold font such as x are vectors and those in italics font like x are scalars. x_i is the element of vector x and is in italics because a single vector element is a scalar. means “length of vector x.”

The T exponent of represents the transpose of the indicated vector.

is just a for-loop that iterates i from a to bsumming all the x_i.

Notation refers to a function called f with an argument of x.

I represents the square “identity matrix” of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones.

constructs a matrix whose diagonal elements are taken from vector x.

The dot product is the summation of the element-wise multiplication of the elements: . Or, you can look at it as .

Differentiation is an operator that maps a function of one parameter to another function. That means that maps to its derivative with respect to xwhich is the same thing as . Also, if then .

The partial derivative of the function with respect to xperforms the usual scalar derivative holding all other variables constant.

The gradient of f with respect to vector xorganizes all of the partial derivatives for a specific scalar function.

The Jacobian organizes the gradients of multiple functions into a matrix by stacking them:

The following notation means that y has the value a upon and value b upon .

Resources

Wolfram Alpha can do symbolic matrix algebra and there is also a cool dedicated matrix calculus differentiator.

When looking for resources on the web, search for “matrix calculus” not “vector calculus.” Here are some comments on the top links that come up from a Google search:

To learn more about neural networks and the mathematics behind optimization and back propagation, we highly recommend Michael Nielsen's book.

For those interested specifically in convolutional neural networks, check out A guide to convolution arithmetic for deep learning.

We reference the law of total derivative, which is an important concept that just means derivatives with respect to x must take into consideration the derivative with respect x of all variables that are a function of x.