Why is the softmax function "off by one"?

Working through the Attention is Off by One article, which was interesting.

Softmax is a function which converts a real-valued vector into a probability vector. It's typically used to select one of the vector's elements by treating the values as scores. It's given by:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

We can write this in code as:

In [1]:
import numpy as np


def softmax(xs: list):
    return np.exp(xs) / np.sum(np.exp(xs))

We can now convert a given list of numbers into probabilities:

In [2]:
softmax([2, 5, 3])
Out[2]:
array([0.04201007, 0.84379473, 0.1141952 ])

These probabilities all sum to one (or close enough):

In [3]:
softmax([2, 5, 3]).sum()
Out[3]:
1.0

One issue with the above implementation is that $e$ raised to a large number tends towards infinity. This means you get infinities in both the numerator and denominator, which gives you nan values.

In [4]:
softmax([1_000, 1_000, 1_000])
/tmp/ipykernel_48829/4267736518.py:4: RuntimeWarning: overflow encountered in exp
  return np.exp(xs) / np.sum(np.exp(xs))
/tmp/ipykernel_48829/4267736518.py:4: RuntimeWarning: invalid value encountered in divide
  return np.exp(xs) / np.sum(np.exp(xs))
Out[4]:
array([nan, nan, nan])

We can solve this by shifting the vector, subtracting its maximum value from every element:

In [5]:
def softmax(xs: list):
    # Shift so the largest exponent is e^0 = 1, which avoids overflow
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()
In [6]:
softmax([2, 5, 3])
Out[6]:
array([0.04201007, 0.84379473, 0.1141952 ])
In [7]:
softmax([1_000, 1_000, 1_000])
Out[7]:
array([0.33333333, 0.33333333, 0.33333333])

To prove this is equal to the initial softmax formula, we have:

$$\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

Using the property:

$$a^{b-c} = \frac{a^b}{a^c}$$

and noting that $\max(x)$ is a constant (so $e^{\max(x)}$ can be factored out of the sum), we get:

$$ \begin{align*} \text{softmax}(x)_i &= \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}\\[6pt] &=\frac{\frac{e^{x_i}}{e^{\max(x)}}}{\sum_j\frac{e^{x_j}}{e^{\max(x)}}}\\[6pt] &=\frac{\frac{e^{x_i}}{e^{\max(x)}}}{\frac{1}{e^{\max(x)}}\sum_j e^{x_j}}\\[6pt] &=\frac{e^{x_i}}{\sum_j e^{x_j}}\\ \end{align*} $$
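
As a quick numerical sanity check, we can compare the shifted implementation against the original unshifted one (re-defined here under the hypothetical name softmax_naive, purely for comparison):

def softmax_naive(xs: list):
    return np.exp(xs) / np.sum(np.exp(xs))


np.allclose(softmax_naive([2, 5, 3]), softmax([2, 5, 3]))  # True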

The problem raised in the Attention is Off by One article is that softmax always has to make a choice of which element to pick, even if it doesn't want to. It can't return a zero vector and pick nothing.

The proposed solution? Add a $+1$ to the denominator, which allows the softmax function to output a zero vector when all the real-valued elements are large negative numbers.

$$\text{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}$$

In [8]:
def softmax_one(xs: list):
    return np.exp(xs) / (1 + np.sum(np.exp(xs)))
In [9]:
softmax_one([-1000, -1000, -1000])
Out[9]:
array([0., 0., 0.])
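
As with the original softmax, though, this naive softmax_one overflows for large positive inputs, since np.exp saturates to infinity and inf divided by inf is nan; a quick check:

softmax_one([1_000, 1_000, 1_000])  # array([nan, nan, nan]), with the same overflow warnings as before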

So this implementation has the same numerical instability issues as the initial implementation of softmax. The stable implementation, which I've taken from here, is as follows:

In [10]:
def softmax_one(xs: list):
    # Clamp the shift at 0 so that e^(-m) cannot overflow when every input is very negative
    m = np.max([np.max(xs), 0])
    e_x = np.exp(xs - m)
    # After shifting by m, the +1 in the denominator becomes e^(-m)
    return e_x / (e_x.sum() + np.exp(-m))

This allows the output to tend to zero when the inputs are all large negative numbers:

In [11]:
softmax_one([-1000, -1000, -1000])
Out[11]:
array([0., 0., 0.])

And it also has the numerical stability to work with large positive numbers:

In [12]:
softmax_one([1_000, 1_000, 1_000])
Out[12]:
array([0.33333333, 0.33333333, 0.33333333])
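
For inputs small enough not to overflow, this stable version agrees with the plain $\text{softmax}_1$ definition; a quick check:

xs = [2, 5, 3]
np.allclose(softmax_one(xs), np.exp(xs) / (1 + np.sum(np.exp(xs))))  # True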

What's the point of a function that can return a zero vector? $\text{softmax}_1$ is designed to be used in the attention calculation of a transformer model, allowing it to pay attention to nothing. The transformer architecture uses residual connections, which allow the previous layer's values to be passed through unaltered when the $\text{softmax}_1$ function outputs a zero vector.
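
To make that concrete, here's a minimal sketch of a single-query scaled dot-product attention step using softmax_one; the dimensions and random values are made up purely for illustration:

# Toy single-query attention using softmax_one in place of softmax.
# The shapes and random inputs here are illustrative only.
d = 4
rng = np.random.default_rng(0)
q = rng.normal(size=d)         # one query vector
K = rng.normal(size=(3, d))    # three key vectors
V = rng.normal(size=(3, d))    # three value vectors

scores = K @ q / np.sqrt(d)    # scaled dot-product attention scores
weights = softmax_one(scores)  # weights can now sum to less than 1
output = weights @ V           # if every score is very negative, the weights
                               # (and hence the output) are approximately zero

In that case the attention head contributes essentially nothing, and the residual connection simply carries the previous layer's values forward.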