# Bootstrapping confidence intervals¶

A confidence interval is used to estimate the range within which a population parameter is likely to lie, with a given level of confidence. What do we mean by *population parameter*? It's a value that describes our population (data), such as the mean.

You'll commonly hear about *95% confidence intervals*. These are intervals (ranges) which, if we repeated the sampling process many times, would contain the population parameter 95% of the time.
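That "95% of the time" is a statement about repeated sampling, which we can check directly. Below is a small illustrative simulation (not part of the original notebook) that repeatedly draws samples from a population with a known mean, builds the usual normal-approximation interval for the mean each time, and counts how often the interval contains the true mean:

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 5.0
n_trials, n = 1_000, 50

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, 2.0, size=n)
    # Normal-approximation 95% interval for the mean: mean +/- 1.96 * standard error
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    if lo <= true_mean <= hi:
        covered += 1

coverage = covered / n_trials  # typically close to 0.95
```

Each individual interval either contains the true mean or it doesn't; the 95% refers to the long-run fraction of intervals that do.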

The `seaborn` library has a `regplot` function which creates a scatterplot, fits a line using linear regression, and also displays the 95% confidence interval. We use the `"tips"` dataset provided by `seaborn` as example data. (I've also added some arguments to set the colors in the plot.)

```
import seaborn as sns
tips = sns.load_dataset("tips")
X = tips["total_bill"]
y = tips["tip"]
sns.regplot(x=X, y=y, ci=95, scatter_kws={"color": "blue"}, line_kws={"color": "red"})
```

<AxesSubplot:xlabel='total_bill', ylabel='tip'>

Above, we can see that the confidence interval is narrow where the data is more clustered (low variance), so our estimate of the population parameter is more precise. Where the data is more spread out (high variance), there's greater uncertainty in the estimate of the population parameter, so the confidence interval widens.

How can we create these confidence intervals ourselves? The method we'll look at in this notebook is creating them via *bootstrapping*. Bootstrapping describes a wide variety of techniques, all of which involve resampling with replacement. For example, if we wanted to estimate the mean of some data using bootstrapping, we would:

- sample (with replacement) from our original data, taking as many samples as are in the original data
- calculate the mean from our bootstrapped sample
- repeat this multiple times, keeping track of each mean calculated
- sort the mean values
- if we wanted the 95% confidence interval, we'd take the 2.5th and 97.5th percentiles as the minimum and maximum (respectively)
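The steps above can be sketched in a few lines of NumPy (an illustrative example on synthetic data, not code from the original notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)  # some example data

n_boots = 10_000
boot_means = np.empty(n_boots)
for i in range(n_boots):
    # Sample with replacement, taking as many samples as are in the original data
    sample = rng.choice(data, size=len(data), replace=True)
    boot_means[i] = sample.mean()

# 95% CI: the 2.5th and 97.5th percentiles of the bootstrapped means
ci_low, ci_high = np.percentile(boot_means, (2.5, 97.5))
```

Note that `np.percentile` handles the sorting for us, so we don't need an explicit sort step.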

Why not just use the actual mean from the data? Because bootstrapping is used to calculate the uncertainty around a value, rather than the value itself.

Below, we start recreating the `regplot`. We fit a scikit-learn `LinearRegression` model (`clf`) to the data and then plot both the data and the model with `matplotlib`.

```
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
X = np.array(tips["total_bill"]).reshape(-1, 1)
y = np.array(tips["tip"]).reshape(-1, 1)
clf = LinearRegression()
clf.fit(X, y)
y_pred = clf.predict(X)
fig, ax = plt.subplots()
ax.scatter(X, y, color="blue")
ax.plot(X, y_pred, color="red")
```

[<matplotlib.lines.Line2D at 0x7f13d59b1a60>]

For the actual bootstrapping, we're not estimating the mean, but instead: for each bootstrapped estimate (of which there will be `n_boots`), we resample with replacement (to get `sampled_X` and `sampled_y`), fit a `LinearRegression` model (`_clf`) to the samples, and then use that model to predict values at `n_bins` points (`grid`) evenly spaced between the minimum and maximum of our data (giving `_preds`).

This gives us `n_boots` bootstrapped estimates at each of the `n_bins` points, and we get the bootstrapped 95% confidence interval by calculating the 2.5th and 97.5th percentiles at each of the `n_bins` points.

```
def get_bootstrap_results(X, y, n_bins=100, n_boots=1_000):
    """
    Generate bootstrap estimates for a linear regression model.
    """
    results = np.zeros(shape=(n_boots, n_bins))
    grid = np.linspace(np.min(X), np.max(X), n_bins).reshape(-1, 1)
    for i in range(n_boots):
        sample_indices = np.random.choice(len(X), len(X))
        sampled_X = X[sample_indices]
        sampled_y = y[sample_indices]
        _clf = LinearRegression()
        _clf.fit(sampled_X, sampled_y)
        _preds = _clf.predict(grid).reshape(-1)
        results[i] = _preds
    grid = grid.reshape(-1)
    return grid, results


def get_ci_values(bootstrap_results, ci=95):
    """
    Calculate confidence intervals from bootstrap results.
    """
    lower, upper = 50 - (ci / 2), 50 + (ci / 2)
    return np.nanpercentile(bootstrap_results, (lower, upper), axis=0)


grid, bootstrap_results = get_bootstrap_results(X, y)
ci_values = get_ci_values(bootstrap_results, 95)
```
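The Python loop over `n_boots` is clear but slow. For simple (one-feature) linear regression the whole procedure can be vectorised, since the slope and intercept have closed forms. Here's a sketch on synthetic data (the data, variable names, and closed-form fit are my own illustration, not the notebook's approach):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(3, 50, size=200)
y = 0.1 * X + rng.normal(0, 1, size=200)

n_boots, n_bins = 1_000, 100
grid = np.linspace(X.min(), X.max(), n_bins)

# Draw all bootstrap indices at once: shape (n_boots, len(X))
idx = rng.integers(0, len(X), size=(n_boots, len(X)))
Xb, yb = X[idx], y[idx]  # each row is one bootstrap sample

# Closed-form simple linear regression, computed per row
x_mean = Xb.mean(axis=1, keepdims=True)
y_mean = yb.mean(axis=1, keepdims=True)
slope = ((Xb - x_mean) * (yb - y_mean)).sum(axis=1) / ((Xb - x_mean) ** 2).sum(axis=1)
intercept = y_mean.ravel() - slope * x_mean.ravel()

# Predictions for every bootstrap fit at every grid point: (n_boots, n_bins)
results = slope[:, None] * grid[None, :] + intercept[:, None]
ci_min, ci_max = np.percentile(results, (2.5, 97.5), axis=0)
```

This trades generality for speed: the loop version works with any scikit-learn estimator, while the vectorised version is tied to ordinary least squares with a single feature.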

Our `ci_values` gives us the range of the confidence interval for each value in `grid`.

```
grid[0], ci_values[:, 0]
```

(3.07, array([0.9122646, 1.5724489]))

We can then use `fill_between` to plot the confidence interval.

```
fig, ax = plt.subplots()
ax.scatter(X, y, color="blue")
ax.plot(X, y_pred, color="red")
ci_min, ci_max = ci_values[0, :], ci_values[1, :]
ax.fill_between(grid, ci_min, ci_max, color="red", alpha=0.2)
```

<matplotlib.collections.PolyCollection at 0x7f13d594b550>