You can still experience hallucinations with RAG (ragallucinations, perhaps? Haha) - this happens when the retrieved context is correct, but the LLM you're using for generation has strong priors and either falls back on them or generates something unrelated. Empirical evidence for ragallucinations can be found in the ClashEval paper by Wu et al.
The ClashEval paper
The researchers curated a dataset called ClashEval consisting of over 1,200 questions across six domains: drug dosages, Olympic records, news events, Wikipedia dates, names, and locations. Each question was accompanied by a relevant external document (retrieved content) to help answer it. These documents were perturbed to include errors ranging from subtle to blatant, allowing the researchers to observe how LLMs respond when faced with conflicting information.
For each domain, the correct answer was systematically modified to introduce errors in the retrieved documents. For numerical datasets (e.g., drug dosages, sports records), modifications included scaling the correct values by different factors (e.g., multiplying by 0.1, 2.0, etc.). For other types of data, like names and locations, they introduced slight, significant, and even comical modifications (e.g., changing "Bob Green" to "Rob Greene" or absurdly to "Blob Lawnface").
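For the numerical domains, the perturbation step boils down to something like the sketch below (the exact set of scaling factors used in the paper may differ from the ones shown here):

def perturb_numeric_answer(correct_value, factors=(0.1, 0.5, 2.0, 10.0)):
    # Generate perturbed variants of a correct numeric answer,
    # ranging from subtle to blatant errors
    return [correct_value * factor for factor in factors]

# A correct drug dosage of 40 mg becomes [4.0, 20.0, 80.0, 400.0]
print(perturb_numeric_answer(40))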
Six top-performing LLMs were benchmarked: GPT-4o, GPT-3.5, Claude Opus, Claude Sonnet, Llama-3, and Gemini 1.5. These models were tested on the ClashEval dataset in two stages:
- Prior Response: The model was asked to generate an answer using only its internal (parametric) knowledge.
- Contextual Response: The model was then provided with the retrieved (and potentially perturbed) document, and the researchers observed whether the model stuck with its prior answer or adopted the information from the retrieved context.
The researchers used three key metrics to evaluate the models:
- Accuracy: The probability that the model gave the correct answer when either the context or the prior was correct.
- Prior Bias: The likelihood of the model incorrectly using its prior knowledge when the external context was correct.
- Context Bias: The likelihood of the model incorrectly adopting the retrieved context when its prior knowledge was correct.
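To make these definitions concrete, here is one way to compute the three metrics from per-question records. The field names and denominators are my own reading of the definitions above (and assume each subset is non-empty), not the paper's exact scoring code:

def clasheval_metrics(records):
    # Each record describes one question, e.g.:
    # {"context_correct": False, "prior_correct": True,
    #  "final_correct": False, "adopted_context": True}
    answerable = [r for r in records if r["context_correct"] or r["prior_correct"]]
    accuracy = sum(r["final_correct"] for r in answerable) / len(answerable)

    # Prior bias: the context was right, but the model stuck with its (wrong) prior
    context_right = [r for r in records if r["context_correct"] and not r["prior_correct"]]
    prior_bias = sum(not r["adopted_context"] for r in context_right) / len(context_right)

    # Context bias: the prior was right, but the model adopted the (wrong) context
    prior_right = [r for r in records if r["prior_correct"] and not r["context_correct"]]
    context_bias = sum(r["adopted_context"] for r in prior_right) / len(prior_right)

    return {"accuracy": accuracy, "prior_bias": prior_bias, "context_bias": context_bias}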
The researchers also analyzed the confidence (token probabilities) of the models' initial responses (without the retrieved context). They found that the more confident a model was in its prior response, the less likely it was to adopt conflicting contextual information.
Results and Interpretation
The study found that large language models (LLMs) frequently override their own correct internal knowledge (prior) when presented with incorrect external information. In more than 60% of cases, LLMs adopted incorrect retrieved content, even when their prior response was correct.
The more unrealistic or blatantly incorrect the retrieved content was, the less likely the LLMs were to adopt it. This suggests that models are somewhat resistant to extremely erroneous information, but they are still vulnerable to subtle errors in retrieved content.
The study revealed that the confidence level of the LLMs in their initial (prior) response played a key role. When a model was less confident in its prior response, it was more likely to accept the retrieved information, even when it was wrong. Conversely, a higher confidence in the prior response reduced the likelihood of adopting incorrect external content.
Among the models tested, Claude Opus outperformed the others, including GPT-4o, in its ability to resist adopting incorrect retrieved content. Claude Opus had a higher accuracy (74.3%) compared to GPT-4o (61.5%) and a lower context bias (incorrectly choosing the context when the prior was right).
Source: Wu et al., 2024
GPT-4o, despite its high performance in general tasks, showed a higher tendency to adopt wrong context over correct priors compared to smaller models like Claude Sonnet. This suggests that performance on general-purpose benchmarks does not always correlate with robustness in RAG settings.
The researchers proposed a method of comparing token probabilities (the model's confidence in its prior and contextual responses) to resolve conflicts between priors and context. By incorporating this correction, they improved model accuracy across the board, with GPT-4o's accuracy increasing from 61.5% to 75.4%.
The results highlight a significant challenge in RAG systems, where LLMs often struggle to correctly arbitrate between their internal knowledge and external evidence, especially when the retrieved content is wrong. This "context bias" shows that while RAG can enhance model performance, it also introduces the risk of models repeating incorrect or harmful information if it's included in the retrieved context.
The art and science of RAG
How can you reduce ragallucinations? The ClashEval paper suggests several strategies to mitigate the problem of LLMs mistakenly adopting incorrect information from retrieved content despite having the correct internal knowledge.
Token Probability Correction
One of the key methods proposed in the paper involves using token probabilities to help the model decide between its internal knowledge and retrieved content. By comparing the confidence scores (token probabilities) of the model's prior response and the contextual response, the model can make a more informed decision. If the prior response is associated with higher confidence, the model should stick to it instead of adopting the external content. This method showed a significant improvement in accuracy in the study.
Here is how to do that in Python. I tested it with several questions, even ones where the facts were clearly fictitious, and token probability correction does seem to reduce hallucinations if the correct context is provided.
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()


def calculate_average_logprob(completion):
    # Extract the list of tokens and their log probabilities
    tokens_logprobs = completion.choices[0].logprobs.content
    # Extract the log probabilities
    logprobs = [token_logprob.logprob for token_logprob in tokens_logprobs]
    # Calculate the average log probability
    avg_logprob = sum(logprobs) / len(logprobs)
    return avg_logprob


def find_best_completion(completions):
    # Keep the completion whose tokens the model was, on average, most confident about
    best_avg_logprob = float('-inf')  # Initialize to negative infinity
    best_completion = None
    for completion in completions:
        avg_logprob = calculate_average_logprob(completion)
        if avg_logprob > best_avg_logprob:
            best_avg_logprob = avg_logprob
            best_completion = completion
    return best_completion


def ask_question_with_logprobs(client, question, context=None):
    # Pass the (optional) retrieved context as a preceding user message
    if context:
        messages = [
            {"role": "user", "content": context},
            {"role": "user", "content": question}
        ]
    else:
        messages = [{"role": "user", "content": question}]
    # Make the OpenAI API call with logprobs enabled
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        logprobs=True,
    )
    # Print the generated response
    print("Generated Response:", completion.choices[0].message.content)
    return completion


# Example 1: Ask a question without context
completion_no_context = ask_question_with_logprobs(client, "Who is the inventor of andzomgadose?")

# Example 2: Ask a question with correct context
correct_context = "The inventor of andzomgadose is Franck Ndzomga."
completion_with_correct_context = ask_question_with_logprobs(client, "Who is the inventor of andzomgadose?", correct_context)

# Example 3: Ask a question with misleading context
false_context = "The inventor of andzomgadose is Albert Einstein."
completion_with_false_context = ask_question_with_logprobs(client, "Who is the inventor of andzomgadose?", false_context)

# Example 4: Compare completions to find the best one
completions = [completion_no_context, completion_with_correct_context, completion_with_false_context]
best_completion = find_best_completion(completions)

# Print the best completion's content
print("Best Completion:", best_completion.choices[0].message.content)
Calibrated Token Probability Correction
A more refined version of the token probability method, this approach adjusts for the uncalibrated nature of probability scores between prior and contextual responses. Instead of comparing raw probability values, the method evaluates the relative confidence (or percentiles) of each score. This further improves the model's ability to correctly reject incorrect external information and boosts overall accuracy, while slightly increasing prior bias (overreliance on internal knowledge).
I have extensively tested this one, and it is very powerful. I want to keep the code secret for now, but here is a link to sneak a peek at my implementation:
Calibrated Token Probability Correction
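Without giving away my full implementation, here is a minimal sketch of the general idea, assuming you have already collected calibration sets of average log probabilities for prior-only and context-grounded answers (the helper names and the percentile comparison are a simplified illustration, not my production code):

from bisect import bisect_left

def percentile_rank(score, calibration_scores):
    # Where does `score` fall within a sorted list of previously observed scores? (0.0 to 1.0)
    ranked = sorted(calibration_scores)
    return bisect_left(ranked, score) / len(ranked)

def choose_response(prior_avg_logprob, context_avg_logprob,
                    prior_calibration, context_calibration):
    # Compare relative confidence (percentiles) instead of raw log probabilities,
    # because prior and contextual scores are not calibrated against each other.
    prior_pct = percentile_rank(prior_avg_logprob, prior_calibration)
    context_pct = percentile_rank(context_avg_logprob, context_calibration)
    return "prior" if prior_pct > context_pct else "context"

# Example: the prior answer is unusually confident compared to past prior answers
print(choose_response(-0.2, -0.6, [-1.5, -0.9, -0.4], [-0.8, -0.5, -0.3]))

The calibration sets can simply be the average log probabilities collected while running the previous script over a batch of questions with and without context.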
Other approaches
Let's take a step back. Ragallucinations don't just stem from a conflict between the LLM's priors and the provided context. In production, they can also come from a faulty context that doesn't actually address the user's question and misleads the LLM. Improving the retrieval step is, therefore, a powerful way to enhance RAG performance.
One important strategy is improving the quality of the retrieval system itself. By enhancing the relevance and accuracy of the retrieved documents, you minimize the chances of misleading or erroneous information being presented to the model. I have written a couple of articles about that, and a small score-fusion sketch follows the links:
- The Problem With Semantic Search
- Combining Elasticsearch And Semantic Search: A Case Study (Part 1)
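To give you a flavor of what combining keyword and semantic retrieval looks like, here is a generic fusion sketch (reciprocal rank fusion over two ranked lists of document IDs; a common technique, not the exact setup from those articles):

def reciprocal_rank_fusion(keyword_ranking, semantic_ranking, k=60):
    # Combine two ranked lists of document IDs, rewarding documents
    # that rank highly in either list (k dampens the influence of top ranks)
    scores = {}
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents ranked by keyword match (e.g. Elasticsearch/BM25) vs. by embedding similarity
print(reciprocal_rank_fusion(["doc2", "doc5", "doc1"], ["doc1", "doc2", "doc7"]))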
Another approach is domain-specific fine-tuning. When working with specialized fields, such as medicine or law, where the risk of retrieving incorrect or harmful information is high, fine-tuning LLMs on authoritative, domain-specific datasets helps the model rely more on its internal, reliable knowledge. This can prevent the model from adopting incorrect information from external sources.
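With the OpenAI API, for example, launching such a fine-tuning run on a curated, domain-specific dataset looks roughly like this (the file name and base model below are placeholders):

from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of domain-specific chat examples (placeholder file name)
training_file = client.files.create(
    file=open("medical_qa_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job on a base model (placeholder model name)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id)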
In practice, context stuffing is also an effective technique. By carefully crafting the input and ensuring that enough relevant elements are included in the context - even repeating key facts multiple times - you can help guide the model towards the correct answer and reduce the likelihood of ragallucinations.
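As a rough sketch of context stuffing (the facts and the number of repetitions are placeholders):

def build_stuffed_prompt(question, retrieved_chunks, key_facts, repeats=2):
    # Repeat the key facts around the retrieved chunks so the model
    # sees them more than once in the context window
    facts_block = "\n".join(key_facts)
    parts = [facts_block] * repeats + list(retrieved_chunks) + [facts_block, f"Question: {question}"]
    return "\n\n".join(parts)

# Placeholder example
key_facts = ["The maximum daily dose of drug X is 40 mg."]
chunks = ["...retrieved passage 1...", "...retrieved passage 2..."]
print(build_stuffed_prompt("What is the maximum daily dose of drug X?", chunks, key_facts))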
Finally, ensemble generation offers another robust solution. This approach involves using two or more LLMs to generate a response from the same context. If all the models agree on the answer, it is likely to be correct. If they don't agree, a majority-based decision or rerunning the generation process can help improve accuracy and confidence in the final response.
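As a teaser, here is a minimal sketch of the agreement check (the two model names and the exact-match comparison are simplifying assumptions; in practice you would probably use models from different providers and a fuzzier comparison):

from openai import OpenAI

client = OpenAI()

def generate_answer(model, question, context):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content.strip()

def ensemble_answer(question, context, models=("gpt-4o", "gpt-4o-mini")):
    answers = [generate_answer(model, question, context) for model in models]
    # If the normalized answers agree, accept them; otherwise flag for a majority vote or a rerun
    if len({answer.lower() for answer in answers}) == 1:
        return answers[0]
    return None  # disagreement: rerun the generation or escalate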
I will walk through full implementations of domain-specific fine-tuning, context stuffing, and ensemble generation in my next tutorial on Lycee AI. Stay tuned!
By combining these approaches, you can significantly reduce the risk of ragallucinations and improve the overall reliability of LLMs in production environments.
Sources:
Wu, K., Wu, E., & Zou, J. (2024). ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence.