Whether you use OpenAI’s API or Hugging Face’s LLMs, you’ve probably been confronted with several strange and mystical options for controlling how the LLM will respond. Today, I will cover three configuration options to help you get the desired results: Temperature, Top K, and Top P.
Background
Before we tackle any of these configuration options, one of the concepts you need to understand is how LLMs pick the next token.
There are several high-level strategies for configuring how an LLM searches for that ideal response. I will avoid getting too deep into the different search options. The current list of options includes greedy decoding (also known as “greedy search”), sampling, contrastive search, and various beam searches.
For this article, we will focus on greedy decoding and sampling.
Sampling comes in two forms: Top K and Nucleus (or Top P) sampling. We’ll explore both in this article and how to use them to find good results.
Contrastive Search
I will briefly mention Contrastive Search because it utilizes Top K. Contrastive Search has been shown to outperform other strategies, but it is slow and not suitable for the many use cases where response time matters. I’m not aware of a means of utilizing Contrastive Search with OpenAI’s API.
Beam Search
Another topic we’ll only briefly touch on is Beam Search, which attempts to find the best result by examining possible options for the best complete response.
While Contrastive Search and Beam Search are relevant and powerful means of getting good results from an LLM, they are slower and used when response time is not critical.
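As a rough sketch of how these strategies are usually selected in practice, here is a plain dictionary of keyword arguments one might pass to Hugging Face’s `generate` method. The parameter names (`do_sample`, `num_beams`, `top_k`, `top_p`, `penalty_alpha`) come from that library; the dictionary name and the specific values are my own assumptions for illustration, not recommendations.

```python
# Hypothetical lookup table: keyword arguments that select each
# decoding strategy when passed to Hugging Face's `model.generate(...)`.
# All numeric values below are illustrative assumptions.
DECODING_STRATEGIES = {
    # Always take the single most likely next token.
    "greedy": {"do_sample": False, "num_beams": 1},
    # Pick randomly among the candidates left after Top K / Top P.
    "sampling": {"do_sample": True, "num_beams": 1, "top_k": 50, "top_p": 0.95},
    # Track several candidate sequences and keep the best complete one.
    "beam_search": {"do_sample": False, "num_beams": 4},
    # Contrastive search is switched on by penalty_alpha together with top_k.
    "contrastive_search": {"penalty_alpha": 0.6, "top_k": 4},
}

for name, kwargs in DECODING_STRATEGIES.items():
    print(f"{name}: {kwargs}")
```

With a loaded model, usage would look something like `model.generate(**inputs, **DECODING_STRATEGIES["sampling"])`.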
What is a Token?
I’m going to use the word “token” a lot. For this LLM discussion, you can think of a token as a word if that helps you conceptualize it. A token can be as little as a single character or symbol and could include a whole word or part of a word. Those details, though, aren’t super important to this discussion.
At the heart of an LLM is the ability to produce a list of probabilities of every token of the vocabulary being the next token. These probabilities are then used by the search mechanism you chose.
Temperature changes how those probabilities are generated, while Top P and Top K change which probabilities we can consider. That consideration step may use “sampling” (because we are picking a sample from a subset of all options based on the chances of that value being next within that subset) or do a beam search.
To my knowledge, there is no way to configure OpenAI’s models to do a Beam Search, so we primarily focus on sampling in this article. Sampling is simply randomly picking between our options using the probabilities we’ve calculated.
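To make “randomly picking between our options” concrete, here is a minimal sketch of sampling with NumPy. The three tokens and their probabilities are made up, and `sample_next_token` is a hypothetical helper name, not part of any LLM library.

```python
import numpy as np

# Made-up vocabulary and next-token probabilities (they must sum to 1).
tokens = ["cat", "dog", "fish"]
probs = np.array([0.5, 0.3, 0.2])

# A seeded generator makes the random draws repeatable.
rng = np.random.default_rng(seed=0)

def sample_next_token(tokens, probs, rng):
    """Pick one token at random, weighted by its probability."""
    index = rng.choice(len(tokens), p=probs)
    return tokens[index]

# Draw many times to see that likely tokens show up more often.
draws = [sample_next_token(tokens, probs, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in tokens}
print(counts)
```

Over many draws the counts land close to the probabilities, but any single draw can still be the least likely token.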
What is LLM Temperature?
Often described as how “creative” an LLM will be, a poorly set Temperature can make your responses repetitive or incoherent.
A high value for Temperature squeezes the probabilities of our options closer together, so less likely tokens get a better chance of being picked. A small value for Temperature stretches them apart, so the most likely tokens dominate even more.
For OpenAI’s API, Temperature can be a value between zero and two. For Hugging Face models, I’m not aware of a limit beyond that you can’t set it to zero.
Math-wise, before computing the probability of the next token (with a function called “softmax”), an LLM divides all of its raw scores (often called logits) by your Temperature value. While the model was training, it effectively had a Temperature of one.
Using a Temperature at this point wouldn’t make any sense because training requires using the probabilities exactly as they are calculated. However, once we get to the point of using the model, we may want to change Temperature to influence our results.
Because you are dividing numbers by the Temperature value you set, you can see why Hugging Face doesn’t allow zero: division by zero is undefined. OpenAI treats zero as shorthand for keeping only the highest-probability option.
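Here is a small sketch of that division in action. The logits are made up, `next_token_probs` is a hypothetical helper, and treating a Temperature of zero as “keep only the top option” is my approximation of the shorthand described above, not a statement about how any provider implements it.

```python
import numpy as np

def next_token_probs(logits, temperature):
    """Turn logits into probabilities after scaling by temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Assumed convention: zero means "keep only the most likely token",
        # which sidesteps the division by zero entirely.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    exps = np.exp(scaled)  # fine for small toy values like these
    return exps / exps.sum()

# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for temperature in (0, 0.5, 1.0, 2.0):
    print(temperature, np.round(next_token_probs(logits, temperature), 3))
```

Reading the output from top to bottom, the top token’s share shrinks as Temperature grows, which is the “squeezing together” effect.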
Bonus: One could emulate the OpenAI behavior of Temperature 0 by disabling sampling, which forces the model to always take the most likely next token. The problem with disabling sampling is that you may not find the best sentence even if you use the best words. Looking down multiple paths to find the best complete path is called a Beam Search. However, even without doing a Beam Search, just doing sampling, we might randomly find our way to better options.
Notice in the following example that if we chose “fast” as the next word, the sentence probability wouldn’t be as good as if we chose “good.” With sampling, we can’t guarantee which path we’ll end up down since we are picking randomly.
Temperature can ultimately be used to add variety to the output of an LLM. In our Talk to Your Data demo, we raise the temperature parameter to generate new results for SQL code generation if our first query generation attempt results in an error.
What is the Top-K in an LLM?
Top K lets us limit how many options we consider while sampling. After Temperature is applied and our probabilities are calculated by a formula called softmax, we might have many thousands of tokens (or words) to pick from.
By specifying a Top K of 50, we’re saying, “Only look at the best 50 tokens.” This may be used to eliminate low-quality options, so we don’t consider them. However, in general, Top K is not a terribly useful parameter. Changing the temperature or changing Top P (see below) is a much more reliable and explainable option.
To the best of my knowledge, the OpenAI API does not let you configure this value, though I strongly suspect they have a default configured for you. Hugging Face’s libraries have a default value of 50 when using generation configuration.
I think there are two reasonable uses for Top K:
1: you only want the very best token and want your results to be repeatable.
50: it’s the default, and there’s nothing special about it beyond that it is plenty big but prevents wildly unlikely words from popping up.
If you are tempted to use any other value, you probably should consider Top P instead.
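For the curious, a Top K filter can be sketched in a few lines. The probabilities are made up and `top_k_filter` is a hypothetical helper; real libraries typically apply the same idea to logits rather than to finished probabilities.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize so they sum to 1."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]  # indices of the k largest values
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Made-up probabilities for five candidate tokens.
probs = [0.4, 0.2, 0.15, 0.15, 0.1]
print(top_k_filter(probs, 2))  # only the first two tokens stay non-zero
print(top_k_filter(probs, 1))  # Top K of 1: only the best token survives
```

After filtering, sampling happens over the survivors only, so the wildly unlikely tokens can never be picked.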
What is the Top-P Parameter in an LLM?
Top P says, “Only consider the smallest set of most likely possibilities whose combined probability equals or exceeds this value.”
This parameter is expressed as a number between 0.0 and 1.0, with 1.0 being 100% and 0 being 0%.
Think back to the above example where the word “fast” had a chance of 40% and “good” had a probability of 20%. If we had a Top P of 0.40 (or 40%), then we’d only consider the word “fast.” If we had a Top P of 0.30 (or 30%), we’d still only consider the word “fast”, since it is the most likely word and it equals or exceeds 30%. However, if we had a Top P of 0.41 (or 41%), then “fast” alone would fall short, so we’d need to consider both “fast” and “good”, which together cover 60%.
Top P gives you more control than Top K because it lets you choose a more intuitive cutoff. It’s a way of focusing on the most probable options.
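The “fast” and “good” numbers above can be checked with a small sketch. Only the 40% and 20% figures come from the example; the other three words and their probabilities are made up to fill out the distribution, and `top_p_filter` and `kept_tokens` are hypothetical helper names.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose combined probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]           # most likely first
    cumulative = np.cumsum(probs[order])      # running total of probability
    # First position where the running total reaches p; always keep at least one.
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# "fast" and "good" follow the example; the rest are made-up filler.
tokens = ["fast", "good", "slow", "nice", "red"]
probs = [0.4, 0.2, 0.15, 0.15, 0.1]

def kept_tokens(p):
    return [tokens[i] for i in np.flatnonzero(top_p_filter(probs, p))]

print(kept_tokens(0.40))  # ['fast']
print(kept_tokens(0.30))  # ['fast']
print(kept_tokens(0.41))  # ['fast', 'good']
```

Unlike Top K, the number of surviving tokens changes from step to step: a confident model keeps one or two, an uncertain one keeps many.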
Generally speaking, one does not modify both Temperature and Top P at the same time. This is mostly because you destroy any hope of intuition if things don’t work. They both heavily influence the outcome, and they could easily cancel each other or amplify each other’s impact to the point where neither is meaningful.
Bonus: A Temperature of 0, a Top K of 1, or a Top P of 0 is the same as replacing softmax with the argmax formula, effectively saying we will not consider more than the most likely next token. There may be a performance difference between these options depending on how many beams you use for searching through samples and the efficiency of the transformer model’s implementation, but the results would be the same.
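A quick way to convince yourself of that equivalence is to compute each option on the same made-up logits. This is only a sketch: a Temperature of exactly zero can’t be divided by, so I use a very small value (0.01) as a stand-in, and all variable names are my own.

```python
import numpy as np

# Made-up logits; the third token is the clear favorite.
logits = np.array([5.0, 10.0, 20.0])
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Argmax: just take the index of the largest logit.
greedy = int(np.argmax(logits))

# Top K of 1: keep the single most likely token.
top_k_1 = np.argsort(probs)[::-1][:1]

# Top P of 0: the smallest set whose combined probability reaches 0
# is still one token, because we always keep at least the best one.
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), 0.0)) + 1
top_p_0 = order[:cutoff]

# Temperature near 0: practically all probability lands on one token.
cold = np.exp((logits - logits.max()) / 0.01)
cold /= cold.sum()

print(greedy, top_k_1.tolist(), top_p_0.tolist(), int(np.argmax(cold)))
```

All four routes leave the same single candidate, so sampling afterwards has nothing left to choose between.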
Example Code
Let’s assume that we are picking between only three words (in a real LLM, we’d likely be picking between around thirty-two thousand tokens, but that’s too much to write out.)
Let’s assume our list of words is: “Dogs”, “like”, and “treats”.
Let’s assume that the numbers for each word after all of the model math is done (but before softmax is applied) are 5, 10, and 20, respectively.
What do those numbers mean? First of all, I made them up. If they came from our model, they would be the result of dozens of math operations and wouldn’t have meaning yet, but they will when we are done.
The following code is for educational purposes; a real softmax implementation needs to deal with overflow and underflow. It demonstrates the calculation of softmax with temperature on the made-up values 5, 10, and 20, which stand in for the numbers that come from the previous steps of the LLM. They aren’t realistic at all.
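import numpy as np

def softmax(logits):
    # Exponentiate each value, then normalize so every row sums to 1.
    numerator = np.exp(logits)
    denominator = np.sum(numerator, axis=1, keepdims=True)
    return numerator / denominator

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by the temperature happens before softmax is applied.
    logits = logits / temperature
    return softmax(logits)

# One row of made-up values for "Dogs", "like", and "treats".
result = softmax_with_temperature(np.array([[5, 10, 20]]), 1.0)
print(result)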
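Since real softmax needs to deal with overflow, here is one common way to handle it: subtract the largest value in each row before exponentiating. This is a sketch of that general trick, not a claim about how any particular LLM library implements it; `stable_softmax` is a hypothetical name, and pairing 5, 10, and 20 with the words in order is my assumption.

```python
import numpy as np

def stable_softmax(logits, temperature=1.0):
    """Softmax with temperature that avoids overflow in np.exp."""
    scaled = np.asarray(logits, dtype=float) / temperature
    # Subtracting the row maximum leaves the result unchanged
    # but keeps every exponent at or below zero.
    shifted = scaled - np.max(scaled, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

words = ["Dogs", "like", "treats"]
probs = stable_softmax([[5, 10, 20]])
for word, prob in zip(words, probs[0]):
    print(f"{word}: {prob:.6f}")

# Values this large would overflow a naive np.exp, but work here.
big = stable_softmax([[1000, 1001, 1002]])
print(big)
```

With these made-up values, “treats” ends up with almost all of the probability, which shows how strongly softmax rewards the largest number.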
Conclusion
Temperature, Top-K, and Top-P are important parameters for tuning the output of your LLMs for various circumstances. LLMs can be used for various use cases, such as our Email Personalization demo, where creativity is a core requirement.
If you’ve got more questions about using LLMs, don’t hesitate to contact us or request a Generative AI Workshop for your team.