I understand the basics of how dual numbers work, as well as how they are used for automatic differentiation, as described here: Dual Numbers & Automatic Differentiation
I was wondering, how would you extend this concept to get partial derivatives of a function? Basically I have a multivariable function, and I'd like to calculate its value and gradient for a specific input.
I started off by looking at how multiplication of dual numbers is derived for a function of a single variable of the form $y=f(x)$. (Note the $\epsilon^2$ in the last step turns to 0 which makes that term disappear):
$(a+b\epsilon)*(c+d\epsilon) = \\ ac+(bc+ad)\epsilon+bd\epsilon^2 = \\ ac + (bc+ad)\epsilon$
That made me think that maybe I could just have an $\epsilon$ defined per variable in a $z=f(x,y)$ function, so I gave it a shot. (Note that the $x^2$ and $y^2$ terms disappear below for the same reason as above):
$ x=\epsilon_x \\ y=\epsilon_y \\ (a+bx+cy)*(d+ex+fy)= \\ ad+(ae+bd)x+(af+cd)y+(bf+ce)xy+bex^2+cfy^2= \\ ad+(ae+bd)x+(af+cd)y+(bf+ce)xy $
This looks pretty good except for the $xy$ term, which I have no idea how to account for in the gradient, or how to interpret.
Can anyone help me out towards understanding how to do multivariable automatic differentiation?
I find the notation somewhat confusing.
Let's restate. Say the function is f(x,y) = xy. If x and y both depend on some parameter t, with derivatives x' and y', and you want df/dt (written f'), you pass in f(x + x'e, y + y'e). You get the stated result f + f'e = xy + (x*y' + x'*y)*e.
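To make that concrete, here is a minimal Python sketch of a single-e dual number. The `Dual` class and its field names are made up for illustration, and only `+` and `*` are implemented. Seeding x' = 1, y' = 0 gives the partial derivative with respect to x; any other (x', y') gives the derivative along that direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Hypothetical dual number: real + eps * e, with e^2 = 0."""
    real: float
    eps: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # (a + b e)(c + d e) = ac + (ad + bc) e, because e^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__


def _lift(value):
    """Treat plain numbers as duals with a zero e part."""
    return value if isinstance(value, Dual) else Dual(float(value))


def f(x, y):
    return x * y


# x = 3 with x' = 1, y = 4 with y' = 0: the e part is df/dx = y = 4
out = f(Dual(3.0, 1.0), Dual(4.0, 0.0))

# x' = 2, y' = 5: the e part is x*y' + x'*y = 3*5 + 2*4 = 23
directional = f(Dual(3.0, 2.0), Dual(4.0, 5.0))
```

Getting the whole gradient this way takes one pass per input variable, each with a different seed.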
Now, the gradient you're trying to evaluate is just the vector of partial derivatives.
In your notation the function is f(a,d) = a*d, where a and d are the real parts of the two factors. You could compute each partial derivative separately, first passing only the x derivatives and then only the y derivatives. If you pass both at once, you get the equation you list: the real value plus three different e terms. The two single-e terms are the components of the gradient. The cross term, on the other hand, is a mixed second derivative with respect to x and y, which is of no interest for the gradient, so xy can be set to zero just like x^2 and y^2.
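As a rough sketch of the "one e per variable" idea, the Python below stores the real value plus one e coefficient per input and drops every product of two e's, including the cross term. `MultiDual` and `variables` are hypothetical names, not from any library, and only `+` and `*` are supported.

```python
class MultiDual:
    """Hypothetical multi-e dual number: real + sum_i grad[i] * e_i,
    with every product e_i * e_j (including i != j) set to zero."""

    def __init__(self, real, grad):
        self.real = real
        self.grad = tuple(grad)

    def __add__(self, other):
        return MultiDual(self.real + other.real,
                         tuple(g + h for g, h in zip(self.grad, other.grad)))

    def __mul__(self, other):
        # Product rule applied to each e_i coefficient independently
        return MultiDual(self.real * other.real,
                         tuple(self.real * h + g * other.real
                               for g, h in zip(self.grad, other.grad)))


def variables(*values):
    """Seed each input with its own e: variable i gets a 1 in slot i."""
    n = len(values)
    return [MultiDual(v, [1.0 if i == j else 0.0 for j in range(n)])
            for i, v in enumerate(values)]


def f(x, y):
    return x * x * y + y


x, y = variables(3.0, 4.0)
z = f(x, y)
# z.real is f(3, 4) = 40; z.grad is (2xy, x^2 + 1) = (24, 10)

# The product from the question: (a + b x + c y)(d + e x + f y)
# with a..f = 1..6 should give ad, (ae + bd), (af + cd).
p = MultiDual(1.0, (2.0, 3.0)) * MultiDual(4.0, (5.0, 6.0))
```

One evaluation of `f` then yields the value and the full gradient, at the cost of carrying a vector of coefficients through every operation.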
If you look at what you get out of the full expansion without canceling e^2 anywhere, the terms are just (up to constant factors) higher derivatives with respect to the same variable or to several variables. They only cancel because we decided to restrict ourselves to the first derivative. You could keep track of all of these and only cancel terms of even higher power (e.g. any product of three e's) and get different kinds of higher-order derivatives.
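For illustration, here is a small Python sketch that keeps the cross term instead of dropping it. It still sets e_x^2 = e_y^2 = 0 but tracks the e_x*e_y coefficient, which ends up holding the mixed second derivative. This construction is usually called a hyper-dual number; the class name `HyperDual` and its fields are my own assumptions, and only `+` and `*` are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperDual:
    """Hypothetical number real + ex*e_x + ey*e_y + exey*e_x*e_y,
    with e_x^2 = e_y^2 = 0 but e_x*e_y kept."""
    real: float
    ex: float = 0.0
    ey: float = 0.0
    exey: float = 0.0

    def __add__(self, o):
        return HyperDual(self.real + o.real, self.ex + o.ex,
                         self.ey + o.ey, self.exey + o.exey)

    def __mul__(self, o):
        return HyperDual(
            self.real * o.real,
            self.real * o.ex + self.ex * o.real,
            self.real * o.ey + self.ey * o.real,
            # cross term: every way to pick up one e_x and one e_y
            self.real * o.exey + self.ex * o.ey
            + self.ey * o.ex + self.exey * o.real,
        )


def g(x, y):
    return x * x * y


x = HyperDual(3.0, ex=1.0)   # seed e_x on x
y = HyperDual(4.0, ey=1.0)   # seed e_y on y
z = g(x, y)
# z.real = 36, z.ex = dg/dx = 2xy = 24, z.ey = dg/dy = x^2 = 9,
# z.exey = d^2 g / dx dy = 2x = 6
```

The (bf+ce) coefficient in the question's expansion is exactly the `self.ex * o.ey + self.ey * o.ex` part of the cross term here.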
Now this deserves more thorough derivation (probably just the standard doing math with d/dx_i tricks) and better presentation (mathy text) but it's late and I'm typing on a phone. :)
Thanks, Adrian