# Why is the softmax function "off by one"?

Working through the Attention is Off by One article, which was interesting.

Softmax is a function that converts a real-valued vector into a probability vector; it's commonly used to turn a vector of scores into a distribution for selecting one element. It's given by:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

We can write this in code as:

```
import numpy as np

def softmax(xs: list):
    # Exponentiate each element, then normalise by the total.
    return np.exp(xs) / np.sum(np.exp(xs))
```

We can now convert a given list of numbers into probabilities:

```
softmax([2, 5, 3])
```

array([0.04201007, 0.84379473, 0.1141952 ])

These probabilities all sum to one (or close enough):

```
softmax([2, 5, 3]).sum()
```

1.0

One issue with the above implementation is that $e$ raised to a large number overflows to infinity. You then get infinity in both the numerator and denominator, which divide to give `nan` values.

```
softmax([1_000, 1_000, 1_000])
```

/tmp/ipykernel_48829/4267736518.py:4: RuntimeWarning: overflow encountered in exp
  return np.exp(xs) / np.sum(np.exp(xs))
/tmp/ipykernel_48829/4267736518.py:4: RuntimeWarning: invalid value encountered in divide
  return np.exp(xs) / np.sum(np.exp(xs))

array([nan, nan, nan])
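To see exactly where this goes wrong, here's a small sketch of the overflow in isolation (suppressing NumPy's warnings with `np.errstate` so the values themselves are visible):

```python
import numpy as np

# e^1000 overflows float64 (whose max is ~1.8e308), producing inf;
# inf / inf is then nan under IEEE-754 arithmetic.
with np.errstate(over="ignore", invalid="ignore"):
    big = np.exp(np.array([1000.0]))
    print(big)              # [inf]
    print(big / big.sum())  # [nan]
```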

We can solve this by subtracting the maximum value from every element before exponentiating:

```
def softmax(xs: list):
    # Shift by the max so the largest exponent is e^0 = 1.
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()
```

```
softmax([2, 5, 3])
```

array([0.04201007, 0.84379473, 0.1141952 ])

```
softmax([1_000, 1_000, 1_000])
```

array([0.33333333, 0.33333333, 0.33333333])

To show this is equivalent to the original softmax formula, start from:

$$\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

Using the property:

$$a^{b-c} = \frac{a^b}{a^c}$$

and also acknowledging that $\max(x)$ is a constant, we get:

$$ \begin{align*} \text{softmax}(x)_i &= \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}\\[6pt] &=\frac{e^{x_i}/e^{\max(x)}}{\sum_j e^{x_j}/e^{\max(x)}}\\[6pt] &=\frac{e^{x_i}/e^{\max(x)}}{\frac{1}{e^{\max(x)}}\sum_j e^{x_j}}\\[6pt] &=\frac{e^{x_i}}{\sum_j e^{x_j}}\\ \end{align*} $$
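We can also verify the equivalence numerically, as a quick sanity check on inputs small enough not to overflow:

```python
import numpy as np

def softmax_naive(xs):
    return np.exp(xs) / np.sum(np.exp(xs))

def softmax_shifted(xs):
    # Shift by the max before exponentiating.
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()

xs = np.array([2.0, 5.0, 3.0])
# Both implementations agree to floating-point precision.
print(np.allclose(softmax_naive(xs), softmax_shifted(xs)))  # True
```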

The problem raised in the Attention is Off by One article is that softmax always has to make a choice of which element to pick, even if it doesn't want to. It can't return a zero vector and pick nothing.

The solution? Add $1$ to the denominator, which allows the softmax function to output an (effectively) zero vector when all the real-valued elements are large negative numbers.

$$\text{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}$$

```
def softmax_one(xs: list):
    # The extra 1 in the denominator lets all outputs go to ~0.
    return np.exp(xs) / (1 + np.sum(np.exp(xs)))
```

```
softmax_one([-1000, -1000, -1000])
```

array([0., 0., 0.])
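Note that, unlike softmax, the outputs of $\text{softmax}_1$ need no longer sum to one: the $+1$ acts like an implicit extra element with a fixed logit of $0$, absorbing some of the probability mass. A quick illustration:

```python
import numpy as np

def softmax_one(xs):
    return np.exp(xs) / (1 + np.sum(np.exp(xs)))

# The explicit weights sum to strictly less than one;
# the remainder is the mass assigned to "nothing".
total = softmax_one(np.array([2.0, 5.0, 3.0])).sum()
print(total)  # slightly below 1.0
```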

The above implementation has the same numerical instability issues as the initial implementation of softmax. A stable implementation, which I've taken from here, is as follows:

```
def softmax_one(xs: list):
    # Shift by max(max(xs), 0) so neither the exponentials
    # nor the extra e^{-m} term can overflow.
    m = np.max([np.max(xs), 0])
    e_x = np.exp(xs - m)
    return e_x / (e_x.sum() + np.exp(-m))
```

This allows the output vector to tend to zero when the inputs are all large negative numbers:

```
softmax_one([-1000, -1000, -1000])
```

array([0., 0., 0.])

And it retains the numerical stability to work with large positive numbers:

```
softmax_one([1_000, 1_000, 1_000])
```

array([0.33333333, 0.33333333, 0.33333333])

What's the point of a function that can return a zero vector? $\text{softmax}_1$ is designed to be used in the attention calculation of a transformer model, allowing an attention head to pay attention to nothing. The transformer architecture uses residual connections, which let the previous layer's values pass through unaltered when the $\text{softmax}_1$ function outputs a zero vector.
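To make this concrete, here's a toy sketch (not the full attention mechanism, just the weighting step) contrasting the two functions when every score is strongly negative:

```python
import numpy as np

def softmax(xs):
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()

def softmax_one(xs):
    m = np.max([np.max(xs), 0])
    e_x = np.exp(xs - m)
    return e_x / (e_x.sum() + np.exp(-m))

# Toy attention scores for one query over three keys.
scores = np.array([-20.0, -20.0, -20.0])

# Ordinary softmax is forced to spread a total weight of 1
# across the keys, even though none of them is a good match.
print(softmax(scores).sum())      # 1.0

# softmax_1 can let all the weights shrink towards zero,
# so the head effectively attends to nothing.
print(softmax_one(scores).sum())  # ~0
```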