# Why is the softmax function "off by one"?

Working through the Attention is Off by One article, which was interesting.

Softmax is a function that converts a real-valued vector into a probability vector; it's commonly used to turn a vector of scores into a distribution for selecting one element. It's given by:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

We can write this in code as:

```
import numpy as np

def softmax(xs: list):
    # Exponentiate each element, then normalise by the total.
    return np.exp(xs) / np.sum(np.exp(xs))
```

We can now convert a given list of numbers into probabilities:

```
softmax([2, 5, 3])
```

array([0.04201007, 0.84379473, 0.1141952 ])

These probabilities all sum to one (or close enough):

```
softmax([2, 5, 3]).sum()
```

1.0

One issue with the above implementation is that $e$ raised to a large number overflows to infinity. You then get infinity in both the numerator and denominator, which divide to give `nan` values.

```
softmax([1_000, 1_000, 1_000])
```

/tmp/ipykernel_48829/4267736518.py:4: RuntimeWarning: overflow encountered in exp
  return np.exp(xs) / np.sum(np.exp(xs))
/tmp/ipykernel_48829/4267736518.py:4: RuntimeWarning: invalid value encountered in divide
  return np.exp(xs) / np.sum(np.exp(xs))

array([nan, nan, nan])
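To see exactly where this goes wrong, here's a small sketch of the overflow in isolation (suppressing NumPy's warnings with `np.errstate` so the values themselves are visible):

```python
import numpy as np

# e^1000 overflows float64 (whose max is ~1.8e308), producing inf;
# inf / inf is then nan under IEEE-754 arithmetic.
with np.errstate(over="ignore", invalid="ignore"):
    big = np.exp(np.array([1000.0]))
    print(big)              # [inf]
    print(big / big.sum())  # [nan]
```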

We can solve this by subtracting the maximum value from every element before exponentiating:

```
def softmax(xs: list):
    # Shift by the max so the largest exponent is e^0 = 1.
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()
```

```
softmax([2, 5, 3])
```

array([0.04201007, 0.84379473, 0.1141952 ])

```
softmax([1_000, 1_000, 1_000])
```

array([0.33333333, 0.33333333, 0.33333333])

To show this is equivalent to the original softmax formula, start from:

$$\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

Using the property:

$$a^{b-c} = \frac{a^b}{a^c}$$

and also acknowledging that $\max(x)$ is a constant, we get:

$$ \begin{align*} \text{softmax}(x)_i &= \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}\\[6pt] &=\frac{e^{x_i}/e^{\max(x)}}{\sum_j e^{x_j}/e^{\max(x)}}\\[6pt] &=\frac{e^{x_i}/e^{\max(x)}}{\frac{1}{e^{\max(x)}}\sum_j e^{x_j}}\\[6pt] &=\frac{e^{x_i}}{\sum_j e^{x_j}}\\ \end{align*} $$
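We can also verify the equivalence numerically, as a quick sanity check on inputs small enough not to overflow:

```python
import numpy as np

def softmax_naive(xs):
    return np.exp(xs) / np.sum(np.exp(xs))

def softmax_shifted(xs):
    # Shift by the max before exponentiating.
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()

xs = np.array([2.0, 5.0, 3.0])
# Both implementations agree to floating-point precision.
print(np.allclose(softmax_naive(xs), softmax_shifted(xs)))  # True
```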

The problem raised in the Attention is Off by One article is that softmax always has to make a choice of which element to pick, even if it doesn't want to. It can't return a zero vector and pick nothing.

The solution? Add $1$ to the denominator, which allows the softmax function to output an (effectively) zero vector when all the real-valued elements are large negative numbers.

$$\text{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}$$

```
def softmax_one(xs: list):
    # The extra 1 in the denominator lets all outputs go to ~0.
    return np.exp(xs) / (1 + np.sum(np.exp(xs)))
```

```
softmax_one([-1000, -1000, -1000])
```

array([0., 0., 0.])
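Note that, unlike softmax, the outputs of $\text{softmax}_1$ need no longer sum to one: the $+1$ acts like an implicit extra element with a fixed logit of $0$, absorbing some of the probability mass. A quick illustration:

```python
import numpy as np

def softmax_one(xs):
    return np.exp(xs) / (1 + np.sum(np.exp(xs)))

# The explicit weights sum to strictly less than one;
# the remainder is the mass assigned to "nothing".
total = softmax_one(np.array([2.0, 5.0, 3.0])).sum()
print(total)  # slightly below 1.0
```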

The above implementation has the same numerical instability issues as the initial implementation of softmax. A stable implementation, which I've taken from here, is as follows:

```
def softmax_one(xs: list):
    # Shift by max(max(xs), 0) so neither the exponentials
    # nor the extra e^{-m} term can overflow.
    m = np.max([np.max(xs), 0])
    e_x = np.exp(xs - m)
    return e_x / (e_x.sum() + np.exp(-m))
```

This allows the output vector to tend to zero when the inputs are all large negative numbers:

```
softmax_one([-1000, -1000, -1000])
```

array([0., 0., 0.])

And it retains the numerical stability to work with large positive numbers:

```
softmax_one([1_000, 1_000, 1_000])
```

array([0.33333333, 0.33333333, 0.33333333])

What's the point of a function that can return a zero vector? $\text{softmax}_1$ is designed to be used in the attention calculation of a transformer model, allowing an attention head to pay attention to nothing. The transformer architecture uses residual connections, which let the previous layer's values pass through unaltered when the $\text{softmax}_1$ function outputs a zero vector.
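To make this concrete, here's a toy sketch (not the full attention mechanism, just the weighting step) contrasting the two functions when every score is strongly negative:

```python
import numpy as np

def softmax(xs):
    e_x = np.exp(xs - np.max(xs))
    return e_x / e_x.sum()

def softmax_one(xs):
    m = np.max([np.max(xs), 0])
    e_x = np.exp(xs - m)
    return e_x / (e_x.sum() + np.exp(-m))

# Toy attention scores for one query over three keys.
scores = np.array([-20.0, -20.0, -20.0])

# Ordinary softmax is forced to spread a total weight of 1
# across the keys, even though none of them is a good match.
print(softmax(scores).sum())      # 1.0

# softmax_1 can let all the weights shrink towards zero,
# so the head effectively attends to nothing.
print(softmax_one(scores).sum())  # ~0
```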