Retrieval augmented generation¶
Retrieval augmented generation (RAG) is a method of adapting LLM (large language model) text generation by giving the model access to external data sources, e.g. PDFs, transcripts, web pages, internal documentation, etc.
One common example of this is asking an LLM questions about a document (or set of documents). Here, we'll ask questions about a specific web page: Paul Graham's "How to Do Great Work" article, found at http://www.paulgraham.com/greatwork.html.
Generally, RAG works in two main steps:
- Given a query, retrieve the relevant information from the document(s)
- Use the retrieved information to generate text
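In other words, the whole pipeline boils down to two functions: a retriever and a generator. As a minimal sketch (the retrieve and generate names here are just placeholders; we implement concrete versions of both below):
def answer_with_rag(query, documents):
    # step 1: retrieve the chunks most relevant to the query
    relevant_chunks = retrieve(query, documents)
    # step 2: generate an answer conditioned on the query and the retrieved chunks
    return generate(query, relevant_chunks)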
Retrieving Information¶
The first thing we need to do is get the text from the web page. We grab the actual HTML content using requests
and then parse the HTML using BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = "http://www.paulgraham.com/greatwork.html"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
We can get the actual text from the parsed HTML using the get_text
method:
text = soup.body.get_text()
text[:435]
'July 2023If you collected lists of techniques for doing great work in a lot\nof different fields, what would the intersection look like? I decided\nto find out by making it.Partly my goal was to create a guide that could be used by someone\nworking in any field. But I was also curious about the shape of the\nintersection. And one thing this exercise shows is that it does\nhave a definite shape; it\'s not just a point labelled "work hard.'
Next, we'll parse the text into chunks. Each chunk is a piece of information that we can potentially retrieve, e.g. a sentence, a paragraph, a line, etc.
There's no right or wrong way to chunk information. It all depends on your use-case. If the information you want to retrieve is generally in self-contained sentences, then you should probably chunk into sentences. If the information spans multiple sentences, you can either have chunks be three sentences long (they can be overlapping or not) or even entire paragraphs.
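For example, here's a minimal sketch of a sliding-window chunker that groups sentences into (optionally overlapping) chunks; the window and stride values are arbitrary and would need tuning for your data:
def window_chunks(sentences, window=3, stride=2):
    # group consecutive sentences into overlapping chunks,
    # e.g. window=3, stride=2 -> sentences [0,1,2], [2,3,4], [4,5,6], ...
    return [
        " ".join(sentences[i : i + window])
        for i in range(0, len(sentences), stride)
    ]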
Here, we'll parse the information into individual sentences by:
- removing the date at the start of the article
- removing newline characters
- removing footnote indicators
- splitting the text into sentences by assuming each sentence ends in "." or "?"
- removing any empty strings
import re
text = text[9:]  # remove the "July 2023" date at the start
text = re.sub("\n", " ", text)  # replace newline characters with spaces
text = re.sub(r"\[\d+\]", "", text)  # remove footnote indicators, e.g. [1]
chunks = [s.strip() for s in re.split(r"\.|\?", text) if len(s.strip())]  # split into sentences, dropping empty strings
len(chunks)
758
for chunk in chunks[:5]:
    print(">", chunk)
> If you collected lists of techniques for doing great work in a lot of different fields, what would the intersection look like
> I decided to find out by making it
> Partly my goal was to create a guide that could be used by someone working in any field
> But I was also curious about the shape of the intersection
> And one thing this exercise shows is that it does have a definite shape; it's not just a point labelled "work hard
We have our chunks, but how do we retrieve "relevant" chunks from a query? Using pre-trained text embedding models!
The models we want to use are those trained on sentence similarity tasks, i.e. they are trained to map similar sentences to nearby points in n-dimensional space. Conversely, dissimilar sentences should be mapped far away from each other in n-dimensional space.
There's a very useful leaderboard for text embedding models; however, we'll use the sentence-transformers/all-MiniLM-L6-v2
model because it generally works well and is relatively small, with low inference time.
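As an aside, if you have the sentence-transformers library installed, it wraps the tokenize/pool/normalize steps we implement manually below into a single call. A rough equivalent (exact arguments may differ slightly between versions) would be:
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# encode returns one embedding per input sentence, normalized to unit length
st_embeddings = st_model.encode(["an example sentence"], normalize_embeddings=True)
Here we'll stick with the transformers library directly, as it makes the pooling and normalization steps explicit.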
The code below defines a get_embeddings
function which tokenizes input sentences (list of strings), passes them through the model, pools them (to go from a [batch size, sequence length, embedding dim.]
tensor to a [batch size, embedding dim.]
tensor), and then normalizes the embeddings.
import transformers
import torch
import torch.nn.functional as F
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
embedding_model = transformers.AutoModel.from_pretrained(model_name)
def mean_pooling(last_hidden_state, attention_mask):
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    )
    return torch.sum(last_hidden_state * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
def get_embeddings(sentences, tokenizer, model):
    if isinstance(sentences, str):
        sentences = [sentences]
    encoded_input = tokenizer(
        sentences, padding=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        model_output = model(**encoded_input)
    embeddings = mean_pooling(
        model_output.last_hidden_state, encoded_input["attention_mask"]
    )
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings
We then get the sentence embedding for each chunk in our document:
chunk_embeddings = get_embeddings(chunks, tokenizer, embedding_model)
chunk_embeddings.shape
torch.Size([758, 384])
Given a query, we calculate the sentence embedding for it:
query = "What should I work on?"
query_embedding = get_embeddings(query, tokenizer, embedding_model)
query_embedding.shape
torch.Size([1, 384])
We then calculate similarity between the query embedding and the chunk embeddings.
There are two ways we can do this. We can either use the cosine similarity, which gives us a value between -1 and +1 (higher = more similar) for each chunk embedding, or we can calculate it using the dot product. However, if our embeddings are normalized (as they are in the get_embeddings
function), then these two methods give the same values!
If the embeddings aren't normalized, then the dot product similarity takes the magnitude of the vectors into account, whilst cosine similarity doesn't. Will that matter in your case? It's hard to tell. Generally, dot product similarity is slightly cheaper to compute, but both are used in practice.
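To see why normalization makes them equal: the cosine similarity of two vectors $a$ and $b$ is $\frac{a \cdot b}{\|a\| \, \|b\|}$. When both embeddings are normalized to unit length, $\|a\| = \|b\| = 1$, so the expression reduces to exactly the dot product $a \cdot b$.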
cos_similarity = torch.cosine_similarity(query_embedding, chunk_embeddings)
dot_similarity = torch.mm(query_embedding, chunk_embeddings.T).squeeze(0)
cos_similarity.shape, dot_similarity.shape
(torch.Size([758]), torch.Size([758]))
torch.allclose(cos_similarity, dot_similarity)
True
Note: they're not exactly equal, due to floating-point precision, but the most they differ by is around 1.2e-7, so a very small amount.
abs(dot_similarity - cos_similarity).max()
tensor(1.1921e-07)
We now have a measure of how similar each chunk is to the query. We can then sort the similarities and get their indices, so indices[0]
will be the index of the most similar chunk embedding (and thus the most similar chunk string).
indices = torch.argsort(cos_similarity, descending=True)
However, we don't want to get just the single most similar chunk, as it might not contain enough (or any) information to help us answer our query. Just because a chunk is similar to the query doesn't mean it will be useful for answering it.
Generally, we take the $k$ chunks with the highest similarity.
Just like the size of the chunks, $k$ is something that should be tuned. If $k$ is too low, you risk not getting information relevant for answering the query. If $k$ is too high, you risk getting lots of irrelevant information which could cause you to answer the query incorrectly.
There's also a balancing act between the size of the chunks, $k$, and the LLM's context size. The larger your chunks, the smaller $k$ has to be for all the tokens to fit into the LLM's context. There is no single correct answer for setting $k$ and the size of your chunks; it all depends on your task and data.
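One way to sanity-check that balance is to count tokens directly. A rough sketch using the tiktoken library (assuming it's installed; here we just count the first 10 chunks as a stand-in for a retrieved set):
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
# total number of tokens that 10 chunks would add to the prompt
sum(len(encoding.encode(chunk)) for chunk in chunks[:10])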
Below, we'll get the 10 most relevant chunks.
k = 10
retrieved_chunks = [chunks[i.item()] for i in indices[:k]]
retrieved_chunks
["If you're not sure what to work on, guess", "What should you do if you're young and ambitious but don't know what to work on", 'The way to figure out what to work on is by working', 'The first step is to decide what to work on', "If in the course of working on one thing you discover another that's more exciting, don't be afraid to switch", 'Most people who do great work have a mix, and the more you have of the former, the harder it will be to decide what to do', 'The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work', "When you're young you don't know what you're good at or what different kinds of work are like", 'What should your projects be', "Let's talk a little more about the complicated business of figuring out what to work on"]
Now we have the chunks, we can format them into a prompt for the LLM.
Generating Text¶
Generally, the prompt should provide the retrieved chunks and the query you want the LLM to answer. If you only want the LLM to answer the query using the retrieved chunks (and not its "internal knowledge"), then you should explicitly tell it to do so, as we do below. I also found it beneficial to tell the model not to mention that the query was answered using extracted text, or else the answers usually started with some variation of "According to the extracted text...".
You'll most probably have to do plenty of prompt engineering here, so it's a good idea to collect a dataset of queries and answers to judge how good your prompt is. Going further, a dataset of queries and relevant chunks helps you evaluate your retrieval, allowing you to compare different text embedding models.
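As a sketch of what that retrieval evaluation could look like, here's a tiny recall@k-style check. The example query and its "relevant" chunk indices are made up purely for illustration; in practice you'd label these by hand:
def recall_at_k(query, relevant_indices, k):
    # embed the query, rank all chunks, and check what fraction of the
    # labelled-relevant chunks appear in the top k retrieved results
    query_embedding = get_embeddings(query, tokenizer, embedding_model)
    similarity = torch.cosine_similarity(query_embedding, chunk_embeddings)
    top_k = set(torch.argsort(similarity, descending=True)[:k].tolist())
    return len(top_k & set(relevant_indices)) / len(relevant_indices)

# hypothetical labelled example: chunk indices you consider relevant to the query
recall_at_k("What should I work on?", relevant_indices=[0, 1, 2], k=10)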
Anyway, let's build our prompt:
def get_prompt(query, retrieved_chunks):
    prompt = "Here's some text extracted from a document:\n"
    for chunk in retrieved_chunks:
        prompt += f"- {chunk}\n"
    prompt += "\n"
    prompt += "Answer the following question by using the above extracted text only:\n"
    prompt += query
    prompt += "\n"
    prompt += "Do not mention the query was answered using extracted text."
    return prompt
prompt = get_prompt(query, retrieved_chunks)
print(prompt)
Here's some text extracted from a document:
- If you're not sure what to work on, guess
- What should you do if you're young and ambitious but don't know what to work on
- The way to figure out what to work on is by working
- The first step is to decide what to work on
- If in the course of working on one thing you discover another that's more exciting, don't be afraid to switch
- Most people who do great work have a mix, and the more you have of the former, the harder it will be to decide what to do
- The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work
- When you're young you don't know what you're good at or what different kinds of work are like
- What should your projects be
- Let's talk a little more about the complicated business of figuring out what to work on

Answer the following question by using the above extracted text only:
What should I work on?
Do not mention the query was answered using extracted text.
Finally, we pass the prompt to the LLM (here, GPT-4) and receive our response:
import openai
openai.api_key = "YOUR_API_KEY_HERE"
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
response["choices"][0]["message"]["content"]
"You should work on something that you have a natural aptitude for, a deep interest in, and that offers scope to do great work. If you're unsure, start working on something and if in the course of working on one thing you discover another that's more exciting, don't be afraid to switch. The first step is to decide what to work on, and the way to figure that out is by actually working."
Now, let's bundle the whole process of query/response into a single function:
def answer_query(query, tokenizer, model, chunk_embeddings, k):
    query_embedding = get_embeddings(query, tokenizer, model)
    similarity = torch.cosine_similarity(query_embedding, chunk_embeddings)
    indices = torch.argsort(similarity, descending=True)
    retrieved_chunks = [chunks[i.item()] for i in indices[:k]]
    prompt = get_prompt(query, retrieved_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"]
answer_query("Should I take risks?", tokenizer, embedding_model, chunk_embeddings, k)
"Yes, you should take risks. It's important to take as much risk as you can afford, especially when you're young. Risk often comes with the fear of rejection and failure, but it's a necessary part of discovering new things. Sharing your ideas, despite the risk, can lead to new discoveries. Remember, in an efficient market, risk is proportionate to reward. So, instead of looking for certainty, look for a bet with high expected value."
RAG is also a step towards giving LLMs the ability to provide attribution/citations, as we can return the retrieved chunks to show a user what information was used to answer a query and where that information came from.
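A minimal way to support this with the code above is to return the retrieved chunks alongside the answer, e.g. as a small variation on answer_query (the answer_query_with_sources name is just for illustration):
def answer_query_with_sources(query, tokenizer, model, chunk_embeddings, k):
    query_embedding = get_embeddings(query, tokenizer, model)
    similarity = torch.cosine_similarity(query_embedding, chunk_embeddings)
    indices = torch.argsort(similarity, descending=True)
    retrieved_chunks = [chunks[i.item()] for i in indices[:k]]
    prompt = get_prompt(query, retrieved_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # return the chunks too, so they can be shown to the user as "sources"
    return response["choices"][0]["message"]["content"], retrieved_chunks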
Note how you only need to create the chunk embeddings once per document. You may have heard of vector databases and companies such as Pinecone, Weaviate, and Chroma (to name a few). These services let you upload your chunks and specify a model used to create text embeddings from them. Then, you can send a query via their API and they will do something similar to what we've done above in the retrieval step: find the chunks relevant to your query. They do a little more than that, but that's basically how they work!
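For example, a rough sketch of that workflow using Chroma's Python client might look something like the following; treat it as illustrative rather than exact, since the API details vary between versions:
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="greatwork")
# upload the chunks; Chroma embeds them with its default embedding model
# (a specific embedding model can also be configured per collection)
collection.add(documents=chunks, ids=[str(i) for i in range(len(chunks))])
# retrieval: find the 10 chunks most relevant to the query
results = collection.query(query_texts=["What should I work on?"], n_results=10)
retrieved_chunks = results["documents"][0]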