TF-IDF from scratch¶
TF-IDF (term frequency-inverse document frequency) is a method of calculating the importance of each word in a set of documents. The TF-IDF values for a document are also commonly used as an embedding (a dense vector representation) for that document. These embeddings can then be used as the input to a machine learning model, e.g. a logistic regression model to classify your documents.
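For example, here's a minimal sketch of that idea using scikit-learn. The documents, sentiment labels, and choice of classifier below are purely hypothetical, just to illustrate the shape of the workflow:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# hypothetical labelled documents, for illustration only
train_docs = ["I love this film", "a terrible waste of time", "great acting and plot", "boring and predictable"]
train_labels = [1, 0, 1, 0]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)  # one TF-IDF embedding per document
clf = LogisticRegression().fit(X, train_labels)
clf.predict(vectorizer.transform(["what a great film"]))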
The sklearn library has an implementation of TF-IDF, but let's try and figure out how to implement it ourselves!
First, we'll compute the embeddings of our documents using sklearn, so we can use them to make sure our own implementation is correct.
Here are our documents:
documents = [
    "I love baseball. It's my favorite sport of all time.",
    "My favorite sport: cricket. I love cricket.",
    "Cricket. There's a sport. A sport for the ages!",
]
TF-IDF in sklearn¶
Here's how to create TF-IDF embeddings for documents using sklearn:
- Create an instance of TfidfVectorizer.
- "Fit" the TfidfVectorizer on your documents to calculate the TF and IDF values.
- "Transform" your documents, which applies the TF-IDF algorithm to them.
- Cast the embeddings to a numpy array with toarray.
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
vec.fit(documents)
tfidf = vec.transform(documents).toarray()
The TF-IDF values are a [n documents, vocabulary size] array, where the vocabulary size is the number of unique tokens across all documents after preprocessing and tokenization.
tfidf
array([[0.        , 0.37571621, 0.37571621, 0.        , 0.28574186,
        0.        , 0.37571621, 0.28574186, 0.28574186, 0.37571621,
        0.22190405, 0.        , 0.        , 0.37571621],
       [0.        , 0.        , 0.        , 0.72532878, 0.36266439,
        0.        , 0.        , 0.36266439, 0.36266439, 0.        ,
        0.28164125, 0.        , 0.        , 0.        ],
       [0.40914568, 0.        , 0.        , 0.31116583, 0.        ,
        0.40914568, 0.        , 0.        , 0.        , 0.        ,
        0.48329606, 0.40914568, 0.40914568, 0.        ]])
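As a quick sanity check, the array has one row per document and one column per vocabulary token:
tfidf.shape, len(vec.get_feature_names_out())
((3, 14), 14)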
We can get the actual vocabulary items using get_feature_names_out.
vec.get_feature_names_out()
array(['ages', 'all', 'baseball', 'cricket', 'favorite', 'for', 'it', 'love', 'my', 'of', 'sport', 'the', 'there', 'time'], dtype=object)
Finally, we can combine the tfidf array and the vocabulary to make things easier to visualize by converting them into a pandas DataFrame:
import pandas as pd
sklearn_df = pd.DataFrame(tfidf, columns=vec.get_feature_names_out())
sklearn_df
|   | ages | all | baseball | cricket | favorite | for | it | love | my | of | sport | the | there | time |
|---|------|-----|----------|---------|----------|-----|----|------|----|----|-------|-----|-------|------|
| 0 | 0.000000 | 0.375716 | 0.375716 | 0.000000 | 0.285742 | 0.000000 | 0.375716 | 0.285742 | 0.285742 | 0.375716 | 0.221904 | 0.000000 | 0.000000 | 0.375716 |
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.725329 | 0.362664 | 0.000000 | 0.000000 | 0.362664 | 0.362664 | 0.000000 | 0.281641 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.409146 | 0.000000 | 0.000000 | 0.311166 | 0.000000 | 0.409146 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.483296 | 0.409146 | 0.409146 | 0.000000 |
Now, we'll go through how to do the above, but from scratch in Python.
First, we need a tokenizer. This is a function that converts our documents (strings) into tokens (a list of strings). We'll do this using a regular expression: the one used as the default token_pattern argument in the TfidfVectorizer documentation.
Note: The TfidfVectorizer also does some preprocessing, namely lowercasing and (optionally) stripping accents from the text. In our implementation we'll only lowercase, for simplicity. This means that technically you may get different results than when using the TfidfVectorizer, depending on what's in your documents.
import re
def tokenizer(s: str) -> list[str]:
    # lowercase and extract tokens of two or more word characters
    return [token.lower() for token in re.findall(r"(?u)\b\w\w+\b", s)]
tokenizer("hello, world")
['hello', 'world']
We can view the tokens in each document:
for i, document in enumerate(documents, start=1):
    print(f"document {i}:", tokenizer(document))
document 1: ['love', 'baseball', 'it', 'my', 'favorite', 'sport', 'of', 'all', 'time']
document 2: ['my', 'favorite', 'sport', 'cricket', 'love', 'cricket']
document 3: ['cricket', 'there', 'sport', 'sport', 'for', 'the', 'ages']
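If you want to compare against sklearn, the fitted vectorizer exposes its combined preprocessing and tokenization step via build_analyzer; with the default settings used above it should produce the same tokens as our tokenizer:
analyzer = vec.build_analyzer()
analyzer(documents[0])  # should match tokenizer(documents[0])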
Vocabulary¶
Next, we want to get the vocabulary of our documents. This is a list of all the unique tokens in our documents.
We do this by using our tokenizer function and Python's collections.Counter class, which keeps a running count of the tokens seen via its update method.
Sorting the vocabulary isn't necessary, but it's done in the TfidfVectorizer, so we do it here too.
import typing
from collections import Counter
def get_vocabulary(documents: list[str], tokenizer: typing.Callable) -> list[str]:
    # count every token across all documents, then return the unique tokens sorted
    cnt = Counter()
    for doc in documents:
        tokens = tokenizer(doc)
        cnt.update(tokens)
    return sorted([k for k, _ in cnt.most_common()])
vocabulary = get_vocabulary(documents, tokenizer)
vocabulary
['ages', 'all', 'baseball', 'cricket', 'favorite', 'for', 'it', 'love', 'my', 'of', 'sport', 'the', 'there', 'time']
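This matches the feature names sklearn extracted earlier, which we can check directly:
list(vec.get_feature_names_out()) == vocabulary
True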
Term Frequency¶
We then calculate the term frequency values. This value tells us the relative frequency of a token within a document.
Term frequency is given by:
$$\text{tf}(t,d) = \frac{f_{t,d}}{|d|}$$
The term frequency for each token, $t$, in each document, $d$, is $\text{tf}(t,d)$, and is calculated by dividing the number of times that $t$ appears in $d$ ($f_{t,d}$) by the total number of tokens in the document, $|d|$.
In the code below, $f_{t,d}$ is given by cnt[token], and $|d|$ is given by tot_tokens. We calculate both for each document individually.
Note: we calculate the term frequency per document for each token in the vocabulary, not just each token in the document. Tokens in the vocabulary but not in the document will get a term frequency value of zero.
def get_tf(
    documents: list[str], vocabulary: list[str], tokenizer: typing.Callable
) -> list[dict[str, float]]:
    tf = []
    for doc in documents:
        tokens = tokenizer(doc)
        cnt = Counter(tokens)
        tot_tokens = sum(cnt.values())
        # term frequency of every vocabulary token in this document
        # (vocabulary tokens not in the document get a count of zero)
        doc_tf = {token: cnt[token] / tot_tokens for token in vocabulary}
        tf.append(doc_tf)
    return tf
tf = get_tf(documents, vocabulary, tokenizer)
tf
[{'ages': 0.0, 'all': 0.1111111111111111, 'baseball': 0.1111111111111111, 'cricket': 0.0, 'favorite': 0.1111111111111111, 'for': 0.0, 'it': 0.1111111111111111, 'love': 0.1111111111111111, 'my': 0.1111111111111111, 'of': 0.1111111111111111, 'sport': 0.1111111111111111, 'the': 0.0, 'there': 0.0, 'time': 0.1111111111111111}, {'ages': 0.0, 'all': 0.0, 'baseball': 0.0, 'cricket': 0.3333333333333333, 'favorite': 0.16666666666666666, 'for': 0.0, 'it': 0.0, 'love': 0.16666666666666666, 'my': 0.16666666666666666, 'of': 0.0, 'sport': 0.16666666666666666, 'the': 0.0, 'there': 0.0, 'time': 0.0}, {'ages': 0.14285714285714285, 'all': 0.0, 'baseball': 0.0, 'cricket': 0.14285714285714285, 'favorite': 0.0, 'for': 0.14285714285714285, 'it': 0.0, 'love': 0.0, 'my': 0.0, 'of': 0.0, 'sport': 0.2857142857142857, 'the': 0.14285714285714285, 'there': 0.14285714285714285, 'time': 0.0}]
Let's verify this is correct. In the first document, the token "baseball" has a term-frequency value of 0.111 (to three decimal places).
This is because the token "baseball" appears once in the document, and there are nine tokens in the document:
$$\text{tf}(\text{"baseball"},D_1) = \frac{1}{9} = 0.111 \dots$$
For the second document, the token "cricket" appears twice, and there are six tokens in the document:
$$\text{tf}(\text{"cricket"},D_2) = \frac{2}{6} = 0.333 \dots$$
For the third document, the token "baseball" doesn't appear at all, and there are seven tokens in the document:
$$\text{tf}(\text{"baseball"},D_3) = \frac{0}{7} = 0$$
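We can confirm these worked examples against the values returned by get_tf:
tf[0]["baseball"], tf[1]["cricket"], tf[2]["baseball"]
(0.1111111111111111, 0.3333333333333333, 0.0)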
Inverse Document Frequency¶
Next, we'll calculate the inverse document frequency. This value tells us how important a token is by how many documents it appears in. If a token appears in many documents, it's probably not important. If it appears in very few documents, then it might be important.
There are a few different ways to calculate it; the most basic is:
$$\text{idf}(t) = \ln\bigg(\frac{|D|}{n_t}\bigg)$$
$|D|$ is the total number of documents, and $n_t$ is the number of documents that contain token, $t$. Note that inverse document frequency, unlike term frequency, is not calculated per document. Also, $n_t$ does not care how many times the token appears in each document, just how many documents contain that token.
The formula for inverse document frequency used in sklearn is a little different (see the sklearn documentation for details). It's calculated as:
$$\text{idf}(t) = 1 + \ln\bigg(\frac{1 + |D|}{1 + n_t}\bigg)$$
In the code below, $|D|$ is given by n, and $n_t$ is given by cnt[v].
from collections import defaultdict
import numpy as np
def get_idf(
    documents: list[str], vocabulary: list[str], tokenizer: typing.Callable
) -> dict[str, float]:
    n = len(documents)
    # count how many documents each vocabulary token appears in
    cnt = defaultdict(int)
    for v in vocabulary:
        for doc in documents:
            tokens = tokenizer(doc)
            if v in tokens:
                cnt[v] += 1
    # smoothed inverse document frequency, as used by sklearn
    idf = {v: 1 + np.log((1 + n) / (1 + cnt[v])) for v in vocabulary}
    return idf
idf = get_idf(documents, vocabulary, tokenizer)
idf
{'ages': 1.6931471805599454, 'all': 1.6931471805599454, 'baseball': 1.6931471805599454, 'cricket': 1.2876820724517808, 'favorite': 1.2876820724517808, 'for': 1.6931471805599454, 'it': 1.6931471805599454, 'love': 1.2876820724517808, 'my': 1.2876820724517808, 'of': 1.6931471805599454, 'sport': 1.0, 'the': 1.6931471805599454, 'there': 1.6931471805599454, 'time': 1.6931471805599454}
To verify, the token "baseball" appears in one document, and we have three documents in total, therefore:
$$\text{idf}(\text{"baseball"})=1 + \ln\bigg(\frac{1 + 3}{1 + 1}\bigg)=1+\ln\bigg(\frac{4}{2}\bigg)=1+\ln\big(2\big)=1+0.693\dots=1.693\dots$$
The token "sport" appears in all three documents, so we have:
$$\text{idf}(\text{"sport"})=1 + \ln\bigg(\frac{1 + 3}{1 + 3}\bigg)=1 + \ln\bigg(\frac{4}{4}\bigg)=1 + \ln\big(1\big)=1+0=1$$
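Again, we can confirm these worked examples against the values returned by get_idf:
idf["baseball"], idf["sport"]
(1.6931471805599454, 1.0)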
TF-IDF (Term Frequency-Inverse Document Frequency)¶
Once we have the term frequency and inverse document frequency, we can calculate the term frequency-inverse document frequency, a measure of how "important" each token is within a document, by simply multiplying them together:
$$\text{tf-idf}(t,d) = \text{tf}(t,d) \cdot \text{idf}(t)$$
This means that:
- When a token doesn't appear in many documents, but appears a lot in the documents it does appear in, it will have a high IDF value and a high TF value in those documents. This gives it a high TF-IDF value in those documents (the TF value, and therefore the TF-IDF value, is zero in documents where the token doesn't appear). We assume this token is important in the documents where it appears.
- When a token appears in every document, and also appears a lot in those documents, it will have a low IDF value and a high TF value. The low IDF value causes it to have a low TF-IDF value. We assume these are common words, e.g. "the", "and", "an", that carry little information.
- When a token doesn't appear in many documents and only appears rarely in those documents, it will have a high IDF value but a low TF value in those documents (and zero in the documents where it doesn't appear). The low TF value causes a low TF-IDF value. The assumption is that these are rare tokens that don't contain much information.
Remember that TF-IDF is calculated for each token in the vocabulary, not just for each token in each document.
This means for each document we get a $V$-dimensional vector (where $V$ is the size of the vocabulary). These vectors are what are referred to as TF-IDF embeddings, and are used as input to machine learning models.
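As a concrete example, the raw TF-IDF value for "baseball" in the first document is just the product of the values we calculated above. Note that it doesn't yet match the 0.375716 sklearn reported; the missing piece is the normalization described next:
tf[0]["baseball"] * idf["baseball"]  # ≈ 0.188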
The sklearn implementation goes a step further and also L2-normalizes each vector, which we replicate by calculating:
$$v_{\text{norm}} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2+v_2^2+\cdots+v_n^2}}$$
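We can check that the rows of the sklearn tfidf array from earlier have already been normalized this way:
np.linalg.norm(tfidf, axis=1)
array([1., 1., 1.])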
We do this in code by calculating the norm and then dividing each individual TF-IDF value by it.
def get_tfidf(
    tf: list[dict[str, float]], idf: dict[str, float]
) -> list[dict[str, float]]:
    tfidf = []
    for doc_tf in tf:
        # multiply the tf and idf values for every vocabulary token
        doc_tfidf = dict()
        for token in doc_tf.keys():
            doc_tfidf[token] = doc_tf[token] * idf[token]
        # L2-normalize the document's TF-IDF vector
        vec = np.array(list(doc_tfidf.values()))
        norm = np.linalg.norm(vec)
        for token, value in doc_tfidf.items():
            doc_tfidf[token] = value / norm
        tfidf.append(doc_tfidf)
    return tfidf
tfidf = get_tfidf(tf, idf)
tfidf
[{'ages': 0.0, 'all': 0.3757162113174268, 'baseball': 0.3757162113174268, 'cricket': 0.0, 'favorite': 0.2857418629625308, 'for': 0.0, 'it': 0.3757162113174268, 'love': 0.2857418629625308, 'my': 0.2857418629625308, 'of': 0.3757162113174268, 'sport': 0.22190404687274298, 'the': 0.0, 'there': 0.0, 'time': 0.3757162113174268}, {'ages': 0.0, 'all': 0.0, 'baseball': 0.0, 'cricket': 0.7253287753645998, 'favorite': 0.3626643876822999, 'for': 0.0, 'it': 0.0, 'love': 0.3626643876822999, 'my': 0.3626643876822999, 'of': 0.0, 'sport': 0.2816412493743718, 'the': 0.0, 'there': 0.0, 'time': 0.0}, {'ages': 0.4091456783838912, 'all': 0.0, 'baseball': 0.0, 'cricket': 0.31116583432624145, 'favorite': 0.0, 'for': 0.4091456783838912, 'it': 0.0, 'love': 0.0, 'my': 0.0, 'of': 0.0, 'sport': 0.4832960572849686, 'the': 0.4091456783838912, 'there': 0.4091456783838912, 'time': 0.0}]
We can also put these values in a DataFrame to visualize them:
scratch_df = pd.DataFrame(tfidf)
scratch_df
|   | ages | all | baseball | cricket | favorite | for | it | love | my | of | sport | the | there | time |
|---|------|-----|----------|---------|----------|-----|----|------|----|----|-------|-----|-------|------|
| 0 | 0.000000 | 0.375716 | 0.375716 | 0.000000 | 0.285742 | 0.000000 | 0.375716 | 0.285742 | 0.285742 | 0.375716 | 0.221904 | 0.000000 | 0.000000 | 0.375716 |
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.725329 | 0.362664 | 0.000000 | 0.000000 | 0.362664 | 0.362664 | 0.000000 | 0.281641 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.409146 | 0.000000 | 0.000000 | 0.311166 | 0.000000 | 0.409146 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.483296 | 0.409146 | 0.409146 | 0.000000 |
Here's the DataFrame of values calculated by sklearn:
sklearn_df
|   | ages | all | baseball | cricket | favorite | for | it | love | my | of | sport | the | there | time |
|---|------|-----|----------|---------|----------|-----|----|------|----|----|-------|-----|-------|------|
| 0 | 0.000000 | 0.375716 | 0.375716 | 0.000000 | 0.285742 | 0.000000 | 0.375716 | 0.285742 | 0.285742 | 0.375716 | 0.221904 | 0.000000 | 0.000000 | 0.375716 |
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.725329 | 0.362664 | 0.000000 | 0.000000 | 0.362664 | 0.362664 | 0.000000 | 0.281641 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.409146 | 0.000000 | 0.000000 | 0.311166 | 0.000000 | 0.409146 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.483296 | 0.409146 | 0.409146 | 0.000000 |
We can verify the values are close by using the np.allclose function:
np.allclose(sklearn_df.values, scratch_df.values)
True
However, the values are not actually equal!
scratch_df.equals(sklearn_df)
False
If we check the values in the first document, we find that some are actually different. These differences are minuscule and are most likely due to floating-point rounding.
abs(scratch_df.iloc[0] - sklearn_df.iloc[0])
ages        0.000000e+00
all         0.000000e+00
baseball    0.000000e+00
cricket     0.000000e+00
favorite    5.551115e-17
for         0.000000e+00
it          0.000000e+00
love        5.551115e-17
my          5.551115e-17
of          0.000000e+00
sport       2.775558e-17
the         0.000000e+00
there       0.000000e+00
time        0.000000e+00
Name: 0, dtype: float64
We can now calculate TF-IDF embeddings for arbitrary documents. We have to calculate the term frequency values for the document, but we re-use the vocabulary and IDF values.
doc = ["cricket is my favorite sport."]
doc_tf = get_tf(doc, vocabulary, tokenizer)
doc_tfidf = get_tfidf(doc_tf, idf)
doc_tfidf
[{'ages': 0.0, 'all': 0.0, 'baseball': 0.0, 'cricket': 0.5268201732399633, 'favorite': 0.5268201732399633, 'for': 0.0, 'it': 0.0, 'love': 0.0, 'my': 0.5268201732399633, 'of': 0.0, 'sport': 0.4091228607670865, 'the': 0.0, 'there': 0.0, 'time': 0.0}]
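As a sanity check, the fitted sklearn vectorizer should produce (up to floating-point error) the same embedding for this new document:
np.allclose(vec.transform(doc).toarray(), pd.DataFrame(doc_tfidf).values)
True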
One thing to note is that tokens not in the vocabulary will not have a value in the output TF-IDF vector, e.g. if you have documents that mention a sport that isn't cricket or baseball. Therefore, it's important that the documents used to fit the TF-IDF transformation are representative of the documents you actually want to apply it to.
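For example, here's a (made-up) document mentioning tennis: the token "tennis" isn't in the vocabulary, so it's silently dropped, and only "love" contributes to the embedding:
oov_doc = ["I love tennis."]
oov_tfidf = get_tfidf(get_tf(oov_doc, vocabulary, tokenizer), idf)
{token: value for token, value in oov_tfidf[0].items() if value > 0}
{'love': 1.0}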