Proof of the Softmax Regression gradient

The notation in this proof follows the Softmax Regression article from the UFLDL Tutorial.

    \[ J(\theta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K}  1\left\{y^{(i)} = k\right\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}\right] \]

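To make the cost function concrete, here is a minimal sketch (mine, not code from the tutorial) under assumed conventions: the rows of Theta play the role of \theta^{(1)}, \dots, \theta^{(K)}, X holds the x^{(i)} as rows, and labels are stored as 0, \dots, K-1 instead of 1, \dots, K.

    import numpy as np

    def softmax_cost(Theta, X, y):
        """J(theta) above: Theta is (K, d), X is (m, d), y is (m,) integer labels."""
        logits = X @ Theta.T                          # entry (i, k) is theta^(k).T x^(i)
        logits -= logits.max(axis=1, keepdims=True)   # stabilize exp; the ratio is unchanged
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # the indicator 1{y^(i) = k} keeps exactly one term of the inner sum over k
        return -log_probs[np.arange(X.shape[0]), y].sum()
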
We have:

    \[ \begin{array}{lll} & \nabla_{\theta^{(n)}} J(\theta) &= - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \nabla_{\theta^{(n)}} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}\right]\\ \Leftrightarrow & \nabla_{\theta^{(n)}} J(\theta) &= - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \frac{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}{\exp(\theta^{(k)\top} x^{(i)})}\nabla_{\theta^{(n)}} \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}\right] \text{ [*] } \end{array} \]
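
The step marked [*] applies, summand by summand, the identity

    \[ \nabla_{\theta^{(n)}} \log f = \frac{1}{f} \nabla_{\theta^{(n)}} f \]

with f = \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}.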

 

When k = n:

 

    \[ \nabla_{\theta^{(n)}} \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} = \frac{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})\nabla_{\theta^{(n)}}\exp(\theta^{(n)\top} x^{(i)}) - \exp(\theta^{(n)\top} x^{(i)})\nabla_{\theta^{(n)}} \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}{\left( \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) \right)^{2}} \text{ [1] } \]
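
This is just the quotient rule in gradient form,

    \[ \nabla \frac{u}{v} = \frac{v \nabla u - u \nabla v}{v^{2}}, \]

with u = \exp(\theta^{(n)\top} x^{(i)}) and v = \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}). It remains to compute \nabla u and \nabla v.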

 

    \[ \nabla_{\theta^{(n)}} \exp(\theta^{(n)\top} x^{(i)}) = \exp(\theta^{(n)\top} x^{(i)})\nabla_{\theta^{(n)}} \theta^{(n)\top} x^{(i)} \Leftrightarrow \nabla_{\theta^{(n)}} \exp(\theta^{(n)\top} x^{(i)}) = \exp(\theta^{(n)\top} x^{(i)})x^{(i)} \text{ [2] } \]

 

Only the j = n term of the sum depends on \theta^{(n)}, so by [2]:

    \[ \nabla_{\theta^{(n)}} \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) = \nabla_{\theta^{(n)}} \exp(\theta^{(n)\top} x^{(i)}) = \exp(\theta^{(n)\top} x^{(i)})x^{(i)} \text{ [3] } \]

 

    \[ \text{ [1][2][3] } \Leftrightarrow \nabla_{\theta^{(n)}} \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} = \exp(\theta^{(n)\top} x^{(i)})x^{(i)}\frac{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) - \exp(\theta^{(n)\top} x^{(i)})}{\left( \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) \right)^{2}} \text{ [4] } \]
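
Writing p_k^{(i)} = \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} (shorthand introduced here for readability; the tutorial keeps the explicit fractions), [4] is the familiar

    \[ \nabla_{\theta^{(n)}} p_n^{(i)} = p_n^{(i)} \left( 1 - p_n^{(i)} \right) x^{(i)} \]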

 

    \[ \text{ [4] } \Rightarrow 1\left\{y^{(i)} = k\right\} \frac{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}{\exp(\theta^{(k)\top} x^{(i)})}\nabla_{\theta^{(n)}} \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} = 1\left\{y^{(i)} = n\right\}x^{(i)}\left ( 1 - \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} \right ) \text{ [5] } \]

 

When k ≠ n:

 

    \[ \nabla_{\theta^{(n)}} \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} = \frac{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})\nabla_{\theta^{(n)}} \exp(\theta^{(k)\top} x^{(i)}) - \exp(\theta^{(k)\top} x^{(i)})\nabla_{\theta^{(n)}} \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}{\left ( \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) \right )^{2}} \text{ [6] } \]

 

    \[ \nabla_{\theta^{(n)}} \exp(\theta^{(k)\top} x^{(i)}) = 0 \text{ [7] } \]

 

    \[ \text{ [3][6][7] } \Leftrightarrow \nabla_{\theta^{(n)}} \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} = \exp(\theta^{(n)\top} x^{(i)})x^{(i)}\frac{ - \exp(\theta^{(k)\top} x^{(i)})}{\left ( \sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) \right )^{2}} \text{ [8] } \]
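
In the same p_k^{(i)} shorthand, [8] reads \nabla_{\theta^{(n)}} p_k^{(i)} = -p_k^{(i)} p_n^{(i)} x^{(i)}, which combines with [4] into the single softmax derivative

    \[ \nabla_{\theta^{(n)}} p_k^{(i)} = p_k^{(i)} \left( 1\{k = n\} - p_n^{(i)} \right) x^{(i)} \]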

 

    \[ \text{ [8] } \Rightarrow 1\left\{y^{(i)} = k\right\} \frac{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}{\exp(\theta^{(k)\top} x^{(i)})}\nabla_{\theta^{(n)}} \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} = 1\left\{y^{(i)} = k\right\}x^{(i)}\frac{-\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} \text{ [9] } \]

 

Now look carefully at the results [*], [5], and [9]. Since y^{(i)} takes exactly one value, there is exactly one value of k for which 1\left\{y^{(i)} = k\right\} \neq 0 in [*]. If y^{(i)} = n, that term contributes x^{(i)}\left ( 1 - \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} \right ) by [5]; if y^{(i)} \neq n, it contributes x^{(i)}\frac{-\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} by [9]. Both cases are captured by the single factor 1\{ y^{(i)} = n\} - \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}, so for any value of y^{(i)} we have:

    \[ \text{ [*] } \Leftrightarrow  \nabla_{\theta^{(n)}} J(\theta) = - \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = n\}  - \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})} \right) \right]  } \]

Also notice that P(y^{(i)} = n | x^{(i)}; \theta) = \frac{\exp(\theta^{(n)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}. Renaming the index n to k, we have proved:

    \[ \nabla_{\theta^{(k)}} J(\theta) = - \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = k\} - P(y^{(i)} = k | x^{(i)}; \theta) \right) \right]  } \]

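As a sanity check, the derived gradient can be compared against finite differences of J(\theta). This is a minimal sketch under the same assumed conventions as before (rows of Theta are the \theta^{(k)}, labels in 0, \dots, K-1), with random data invented purely for illustration:

    import numpy as np

    def softmax_probs(Theta, X):
        logits = X @ Theta.T                          # (m, K)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)       # entry (i, k) is P(y^(i) = k | x^(i); theta)

    def cost(Theta, X, y):
        return -np.log(softmax_probs(Theta, X)[np.arange(X.shape[0]), y]).sum()

    def grad(Theta, X, y):
        one_hot = np.eye(Theta.shape[0])[y]           # (m, K): 1{y^(i) = k}
        # row n of the result is the proved formula for the gradient wrt theta^(n)
        return -(one_hot - softmax_probs(Theta, X)).T @ X

    rng = np.random.default_rng(0)
    m, d, K = 20, 5, 4
    X = rng.normal(size=(m, d))
    y = rng.integers(K, size=m)
    Theta = rng.normal(size=(K, d))

    numeric = np.zeros_like(Theta)
    eps = 1e-6
    for idx in np.ndindex(*Theta.shape):              # central finite differences
        E = np.zeros_like(Theta)
        E[idx] = eps
        numeric[idx] = (cost(Theta + E, X, y) - cost(Theta - E, X, y)) / (2 * eps)

    print(np.abs(grad(Theta, X, y) - numeric).max()) # small (~1e-9) if the formula is right

If the analytic and numerical gradients agree to within floating-point error, the formula above is confirmed on that random instance.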
