After one step, the reparametrized linear model adjusts the slope more than the original model. This can be seen analytically. Consider the weight $w_0$ and bias $b_0$ of the linear model $f(x) = wx + b$ at initialization. The reparametrized model $g(x) = \alpha \tilde{w} x + \tilde{b}$ has initial parameters $\tilde{w}_0 = w_0 / \alpha$, $\tilde{b}_0 = b_0$. Its effective slope is $\alpha \tilde{w}$, so by the chain rule:

$$\frac{\partial L}{\partial \tilde{w}} = \frac{\partial L}{\partial (\alpha \tilde{w})} \cdot \frac{\partial (\alpha \tilde{w})}{\partial \tilde{w}} = \alpha \, \frac{\partial L}{\partial (\alpha \tilde{w})}.$$
At initialization, $\alpha \tilde{w}_0 = w_0$ and $\tilde{b}_0 = b_0$, so the two models compute the same function and therefore the same loss gradients. The gradient update for the biases will be the same, but the gradient update for the effective slope $\alpha \tilde{w}$ is scaled by $\alpha^2$: a step of size $\eta$ changes $\tilde{w}$ by $-\eta \alpha \frac{\partial L}{\partial w}$, which changes $\alpha \tilde{w}$ by $-\eta \alpha^2 \frac{\partial L}{\partial w}$, versus $-\eta \frac{\partial L}{\partial w}$ for $w$ in the original model. This explains the behaviour in Figure 1: the reparametrized model updates the slope more than the original model. Note that this analysis does not necessarily hold after initialization, as $g \neq f$ after the first step, so the two models no longer compute the same gradients.
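This scaling is easy to check with autograd. Below is a minimal sketch (PyTorch; the toy data, the initialization $w_0 = 0.5$, $b_0 = 0$, and the factor $\alpha = 10$ are assumptions for illustration, not the values behind the figures): at initialization the bias gradients of the two models coincide, and the one-step change to the effective slope $\alpha \tilde{w}$ is $\alpha^2$ times the change to $w$.

```python
# Sketch: verify the alpha**2 slope-update scaling at initialization.
import torch

torch.manual_seed(0)
x = torch.randn(100)
y = 3.0 * x + 1.0 + 0.1 * torch.randn(100)  # assumed toy regression data

alpha = 10.0        # assumed reparametrization factor
w0, b0 = 0.5, 0.0   # assumed initialization
lr = 0.01

# Original model: f(x) = w x + b
w = torch.tensor(w0, requires_grad=True)
b = torch.tensor(b0, requires_grad=True)
((w * x + b - y) ** 2).mean().backward()

# Reparametrized model: g(x) = alpha * w_tilde * x + b_tilde,
# with w_tilde0 = w0 / alpha so both models agree at initialization.
w_tilde = torch.tensor(w0 / alpha, requires_grad=True)
b_tilde = torch.tensor(b0, requires_grad=True)
((alpha * w_tilde * x + b_tilde - y) ** 2).mean().backward()

print(b.grad, b_tilde.grad)              # identical bias gradients
dw_orig = -lr * w.grad                   # one-step change in w
dw_repar = alpha * (-lr * w_tilde.grad)  # one-step change in alpha * w_tilde
print(dw_repar / dw_orig)                # ~ alpha**2 = 100
```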
A natural extension is to ask whether appropriately scaling the learning rate could erase this effect. We have $\Delta(\alpha \tilde{w}) = -\eta \alpha^2 \frac{\partial L}{\partial w}$, and $\Delta \tilde{b} = -\eta \frac{\partial L}{\partial b}$.
Setting $\tilde{\eta} = \eta / \alpha^2$ means $\Delta(\alpha \tilde{w}) = -\eta \frac{\partial L}{\partial w} = \Delta w$, which would mean $\Delta \tilde{b} = \Delta b / \alpha^2$: the slope updates now match, but the bias updates are $\alpha^2$ times smaller, so a single rescaled learning rate cannot match both. We can empirically test this:
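One way to run this test is sketched below (PyTorch; the data, initialization, and hyperparameters are assumptions, not necessarily those used for the figure): both models are trained by plain gradient descent, with the reparametrized model using $\eta / \alpha^2$. The slopes should track each other closely, while the reparametrized bias barely moves.

```python
# Sketch: scaling the learning rate by 1/alpha**2 matches the slope
# updates but makes the bias update alpha**2 times smaller.
import torch

torch.manual_seed(0)
x = torch.randn(100)
y = 3.0 * x + 1.0 + 0.1 * torch.randn(100)  # assumed toy regression data

alpha, lr, steps = 10.0, 0.01, 200           # assumed hyperparameters
w = torch.tensor(0.5, requires_grad=True)    # original: f(x) = w x + b
b = torch.tensor(0.0, requires_grad=True)
w_tilde = torch.tensor(0.5 / alpha, requires_grad=True)  # g(x) = alpha w~ x + b~
b_tilde = torch.tensor(0.0, requires_grad=True)

for _ in range(steps):
    for p in (w, b, w_tilde, b_tilde):
        p.grad = None
    ((w * x + b - y) ** 2).mean().backward()
    ((alpha * w_tilde * x + b_tilde - y) ** 2).mean().backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w_tilde -= (lr / alpha**2) * w_tilde.grad  # scaled learning rate
        b_tilde -= (lr / alpha**2) * b_tilde.grad

# Slopes stay close, while the reparametrized bias lags far behind.
print(f"slope: {w.item():.3f} vs {alpha * w_tilde.item():.3f}")
print(f"bias:  {b.item():.3f} vs {b_tilde.item():.3f}")
```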
![[Pasted image 20250220224058.png]]