Making LLMs smarter with structured generation and Outlines
Structured generation is a method of improving the performance of your LLM without additional training or increased inference time. It constrains the output of your LLM to a specified schema, making it more predictable and machine readable. It may also improve accuracy (though this is contested); today we'll show how using the Outlines library for structured generation produces some improvements on a small, anecdotal example.
import transformers
import outlines
import torch
import string
import pydantic
import enum
import json
import warnings
warnings.filterwarnings("ignore")
We'll use the Qwen2.5-0.5B-Instruct model to perform our experiments. The model is relatively capable for its size and is well suited for testing different libraries.
First, we'll load it using the HuggingFace transformers library and use it to initialise a text generation pipeline.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
device = torch.device("cuda")
pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device,
)
Device set to use cuda
Next, we'll use the pipeline to generate a response to the prompt we'll use to test the different structured generation methods: getting the answer to 40 + 2.
prompt = "What is 40 + 2 ? Give me the answer only. "
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]
results = pipe(
    messages,
    do_sample=False,
    max_new_tokens=25,
)
output = results[0]["generated_text"][-1]["content"]
print(output)
60
Last I checked, 40 + 2 = 42. I guess deep learning is hitting a wall after all. Now let's see if structured generation can save us.
Outlines has a wrapper around models provided by the transformers library.
model = outlines.models.transformers(
    model_name,
    device=device,
)
Generating output using Outlines requires two steps:
- creating a generator which specifies how the text will be generated
- using that generator with the prompt and some parameters to get an output
The simplest generator is generate.text, which simply generates unstructured text. We use samplers.greedy to do greedy sampling (effectively a temperature of zero), which gives deterministic outputs.
generator = outlines.generate.text(
    model,
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
    max_tokens=25,
)
print(output)
22 The answer is 22. To arrive at this answer, simply add the two numbers together: 4
You'll see that the output is still wrong. But why didn't this generate the same output as the transformers text generation pipeline? The pipeline applies the model's chat template to the messages, whereas generate.text feeds the raw prompt string to the model directly.
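If we wanted the raw-prompt generation to see the same input as the pipeline did, one option (a minimal sketch, not from the original post) is to apply the model's chat template ourselves with the tokenizer we already loaded and pass the formatted string to the generator:
# Sketch (not from the post): format the prompt with the chat template,
# which is what the text-generation pipeline does internally.
chat_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
output = generator(
    chat_prompt,
    max_tokens=25,
)
print(output)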
Our first taste of structured generation is generate.choice. We pass a list of strings to the generator, and it can only generate one of them. Here, we allow it to generate any number from 0 to 99, each followed by a period.
generator = outlines.generate.choice(
    model,
    choices=[f"{i}." for i in range(100)],
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
)
print(output)
22.
Unfortunately, it still gets the answer wrong. What if it needs a little more help? We'll do the same thing again but put the equation into the choices.
generator = outlines.generate.choice(
    model,
    [f"40 + 2 = {i}." for i in range(100)],
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
)
print(output)
40 + 2 = 42.
And there we go, that's one way to make a model with 500 million parameters do simple addition.
Let's try a few other generate methods. The first is generate.format, which forces the output to be a valid Python type, here an integer.
generator = outlines.generate.format(
    model,
    int,
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
    max_tokens=25,
)
print(output)
2200000000000000000000000
It's definitely an integer, but not the one we wanted. Personally, I've never found generate.format to be that useful, or been able to get great results out of it. (You may have noticed I had to add max_tokens back; if I didn't, it would generate integers until it hit the full output context length.)
Next up is probably the second most useful method, generate.regex, which allows us to pass a regex pattern that all outputs will conform to (generate.choice uses generate.regex under the hood). Here, we can ensure it only generates outputs with one or more digits, followed by a period.
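To make the "under the hood" claim a little more concrete, here's a rough sketch (my own illustration, not Outlines' actual internals) of how a list of choices can be expressed as a single regex alternation:
import re

# Illustrative only: escape each choice and join them into one alternation.
# The real generate.choice implementation may build its pattern differently.
choices = [f"{i}." for i in range(100)]
choice_pattern = "|".join(re.escape(choice) for choice in choices)
# choice_pattern looks like r"0\.|1\.|2\.|..." and could be passed to
# outlines.generate.regex to mimic generate.choice.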
generator = outlines.generate.regex(
    model,
    r"\d+\.",
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
)
print(output)
22.
Unsurprisingly, this generated the same thing as our first test with generate.choice. What if we repeat our second experiment, this time using generate.regex?
generator = outlines.generate.regex(
    model,
    r"40 \+ 2 = \d+\.",
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
    max_tokens=25,
)
print(output)
40 + 2 = 42.
The definition of insanity is doing the same thing over and over again and expecting a different result.
One trick we can use is to state that the answer is obvious: online (i.e. in the training data), people rarely say something is obvious unless they are giving the actual correct answer. Thus, the LLM is more likely to generate the correct answer after outputting that the answer is obvious.
generator = outlines.generate.regex(
    model,
    r"Obviously, the answer is: \d+\.",
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
)
print(output)
Obviously, the answer is: 42.
And just to confirm it's not a fluke, we double our sample size to two:
output = generator(
    "What is 402 + 420?",
    max_tokens=25,
)
print(output)
Obviously, the answer is: 822.
We've now improved the performance of our LLM and gotten it to generate text in a way we can easily parse.
But what if we could output more complex structured outputs that are even easier to parse? What if we could define the structure in Python, instead of regex? What if we could define a hierarchy of structure? This is where the real strength of Outlines lies.
Below, we use the generate.json method, which takes a JSON schema that the output must conform to. JSON schema syntax is a bit convoluted, so luckily for us we can instead pass a pydantic.BaseModel subclass, which we show how to do below:
class Operator(str, enum.Enum):
    ADDITION = "ADDITION"
    SUBTRACTION = "SUBTRACTION"
    MULTIPLICATION = "MULTIPLICATION"
    DIVISION = "DIVISION"

class Schema(pydantic.BaseModel):
    left_operand: int
    operator: Operator
    right_operand: int
    answer: int

generator = outlines.generate.json(
    model,
    Schema,
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
)
repr(output)
"Schema(left_operand=40, operator=<Operator.ADDITION: 'ADDITION'>, right_operand=2, answer=42)"
Now, not only do we get the correct answer, we get it as a Schema object...
type(output)
__main__.Schema
...we can get the answer directly from the object...
output.operator == Operator.ADDITION == "ADDITION", output.answer
(True, 42)
...and also convert it to a dictionary...
output.dict()
{'left_operand': 40, 'operator': <Operator.ADDITION: 'ADDITION'>, 'right_operand': 2, 'answer': 42}
...or into a JSON string.
output.json()
'{"left_operand":40,"operator":"ADDITION","right_operand":2,"answer":42}'
Another quick example for sentiment classification:
class Sentiment(str, enum.Enum):
    POSITIVE = "POSITIVE"
    NEGATIVE = "NEGATIVE"

class Classification(pydantic.BaseModel):
    sentiment: Sentiment
    sentiment_score: int

prompt = "The movie was great. I loved it!"

generator = outlines.generate.json(
    model,
    Classification,
    sampler=outlines.samplers.greedy(),
)
output = generator(
    prompt,
)
repr(output)
"Classification(sentiment=<Sentiment.POSITIVE: 'POSITIVE'>, sentiment_score=4)"
Using structured generation is a win-win. Your outputs are trivial to parse instead of having to handle whatever prefixes your LLM conjures up, and you get improved performance as a bonus.
Outlines makes it easy to apply structured generation to almost any LLM supported by the transformers library.