RAG Evaluation
Pre-release evaluation
Typical use inspection
Start by testing scenarios most relevant to your use case. See if your chatbot can reliably navigate discussions with limited human intervention.
Edge case inspection
Explore the boundaries of typical use, identifying how the chatbot handles less common but plausible scenarios. Before any public release, assess critical boundary conditions that could pose liability risks, such as the potential generation of inappropriate content. Implement well-tested guardrails on all outputs (and possibly inputs) to limit undesired interactions and redirect users into predictable conversation flows.
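As a minimal sketch of that last point (the blocklist, fallback message, and screen_output helper below are illustrative placeholders rather than a recommended guardrail), an output filter can be expressed as one more runnable appended to the end of a chain:
from langchain_core.runnables import RunnableLambda

## Hypothetical blocklist and fallback; a production guardrail would use a tuned
## classifier or a dedicated safety model rather than keyword matching.
BLOCKED_TERMS = ["medical advice", "legal advice"]
FALLBACK = "I can't help with that, but I'm happy to answer questions about the documents."

def screen_output(text: str) -> str:
    """Return a safe redirect if the draft response trips the guardrail."""
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return FALLBACK
    return text

output_guardrail = RunnableLambda(screen_output)
## e.g. guarded_chain = my_chain | output_guardrail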
Progressive Rollout
Roll out your model to a limited audience (first internally, then via A/B testing) and implement analytics features such as usage dashboards and feedback avenues (flag/like/dislike, etc.), as sketched below.
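As a rough sketch of the feedback-avenue idea (not part of this course's deployment; the logging path and the exact Gradio hook behavior are assumptions that may vary by version), Gradio's like/dislike event can be logged for later analysis:
import json
import gradio as gr

def log_feedback(like_data: gr.LikeData):
    """Append a like/dislike event to a local JSONL file for later analysis."""
    record = {"liked": like_data.liked, "message": like_data.value}
    with open("feedback_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    chatbot.like(log_feedback)  ## fires when a user clicks the thumbs-up/down icons
    ## ... wire the chatbot up to your RAG chain here ...

## demo.launch()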
LLM-as-a-Judge Formulation
Using LLMs to test and quantify chatbot quality, an approach known as "LLM-as-a-Judge," makes it easy to write test specifications that align closely with human judgment and can be tuned and replicated at scale.
There are several popular frameworks for off-the-shelf judge formulations including:
RAGAs (short for RAG Assessment), which offers a suite of strong starting points for your own evaluation efforts (see the sketch after this list).
LangChain Evaluators, which are similar first-party options with many pre-built evaluator chains.
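To make the first option concrete, a RAGAs run looks roughly like the sketch below. This is a version-dependent assumption rather than course code: the column names and metric list follow the 0.1-era RAGAs API, the single toy row is made up, and by default RAGAs calls an OpenAI judge unless you pass your own llm/embeddings to evaluate.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

## Toy single-row dataset: question, generated answer, retrieved contexts, reference answer
data = Dataset.from_dict({
    "question": ["What does BERT pre-train on?"],
    "answer": ["BERT pre-trains bidirectional representations from unlabeled text."],
    "contexts": [["BERT pre-trains deep bidirectional representations from unlabeled text..."]],
    "ground_truth": ["BERT pre-trains on unlabeled text using masked-token and next-sentence objectives."],
})

results = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(results)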
[Assessment Prep] Pairwise Evaluator
Does my RAG chain outperform a narrow chatbot with limited document access?
To prepare for our RAG chain evaluation, we will need to:
Pull in our document index (the one saved in the previous notebook) and recreate our RAG pipeline of choice.
We will specifically be implementing a judge formulation with the following steps:
1. Sample two document chunks from the RAG agent's document pool.
2. Use those two chunks to generate a synthetic "baseline" question-answer pair.
3. Use the RAG agent to generate its own answer to the synthetic question.
4. Use a judge LLM to compare the two responses, treating the synthetic answer as "ground-truth correct."
Pull In Document Retrieval Index
# Reload the FAISS index that was saved at the end of the previous notebook
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.vectorstores import FAISS

# The index must be loaded with the same embedding model it was built with
embedder = NVIDIAEmbeddings(model="nvidia/nv-embed-v1", truncate="END")

!tar xzvf docstore_index.tgz

# Load the checkpointed FAISS index from disk
docstore = FAISS.load_local("docstore_index", embedder, allow_dangerous_deserialization=True)
# Extract the raw Document objects from the index
docs = list(docstore.docstore._dict.values())

def format_chunk(doc):
    """Render a chunk with its paper title, summary, and body text."""
    return (
        f"Paper: {doc.metadata.get('Title', 'unknown')}"
        f"\n\nSummary: {doc.metadata.get('Summary', 'unknown')}"
        f"\n\nPage Body: {doc.page_content}"
    )

## NOTE: pprint/pprint2 are assumed to be the styled print helpers defined earlier in the course
pprint(f"Constructed aggregate docstore with {len(docstore.docstore._dict)} chunks")
pprint("Sample Chunk:")
print(format_chunk(docs[len(docs)//2]))
Pull In RAG Chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableBranch
from langchain_core.runnables.passthrough import RunnableAssign
from langchain.document_transformers import LongContextReorder
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

from functools import partial
from operator import itemgetter

import gradio as gr

embedder = NVIDIAEmbeddings(model="nvidia/nv-embed-v1", truncate="END")
instruct_llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
llm = instruct_llm | StrOutputParser()
def docs2str(docs, title="Document"):
    """Useful utility for making chunks into context string. Optional, but useful"""
    out_str = ""
    for doc in docs:
        doc_name = getattr(doc, 'metadata', {}).get('Title', title)
        if doc_name: out_str += f"[Quote from {doc_name}] "
        out_str += getattr(doc, 'page_content', str(doc)) + "\n"
    return out_str
chat_prompt = ChatPromptTemplate.from_template(
    "You are a document chatbot. Help the user as they ask questions about documents."
    " The user just asked you a question: {input}\n\n"
    " The following information may be useful for your response: "
    " Document Retrieval:\n{context}\n\n"
    " (Answer only from retrieval. Only cite sources that are used. Make your response conversational.)"
    "\n\nUser Question: {input}"
)
def output_puller(inputs):
    """Output generator. Useful if your chain returns a dictionary with key 'output'"""
    if isinstance(inputs, dict):
        inputs = [inputs]
    for token in inputs:
        if token.get('output'):
            yield token.get('output')
## Reorder retrieved chunks so the most relevant ones land at the ends of the context
long_reorder = RunnableLambda(LongContextReorder().transform_documents)

## Retrieval: pull the question, retrieve chunks, reorder them, and flatten them into a context string
# context_getter = RunnableLambda(lambda x: x)  ## TODO
context_getter = itemgetter('input') | docstore.as_retriever() | long_reorder | docs2str
retrieval_chain = {'input': (lambda x: x)} | RunnableAssign({'context': context_getter})

## Generation: fill the prompt, call the LLM, and stream out the 'output' key
# generator_chain = RunnableLambda(lambda x: x)  ## TODO
generator_chain = chat_prompt | llm
generator_chain = {"output": generator_chain} | RunnableLambda(output_puller)

rag_chain = retrieval_chain | generator_chain
for token in rag_chain.stream("Tell me something interesting!"):
    print(token, end="")
I've got a fascinating fact for you! Did you know that the Federal Reserve buying bonds in the secondary market can impact your daily life in significant ways? For instance, when the Fed buys bonds, it can drive up interest rates, making it more expensive to borrow money, like for a mortgage or car loan. Additionally, increased money supply can lead to inflation, causing prices of goods and services to rise. Lastly, it can even affect employment rates, as changes in interest rates and inflation can influence businesses' hiring decisions.
Isn't it fascinating how the actions of the Federal Reserve can have such far-reaching effects on our daily lives?
Generating Synthetic Question-Answer Pairs
import random

num_questions = 3
synth_questions = []
synth_answers = []

simple_prompt = ChatPromptTemplate.from_messages([('system', '{system}'), ('user', 'INPUT: {input}')])

for i in range(num_questions):
    doc1, doc2 = random.sample(docs, 2)
    sys_msg = (
        "Use the documents provided by the user to generate an interesting question-answer pair."
        " Try to use both documents if possible, and rely more on the document bodies than the summary."
        " Use the format:\nQuestion: (good question, 1-3 sentences, detailed)\n\nAnswer: (answer derived from the documents)"
        " DO NOT SAY: \"Here is an interesting question pair\" or similar. FOLLOW FORMAT!"
    )
    usr_msg = (
        f"Document1: {format_chunk(doc1)}\n\n"
        f"Document2: {format_chunk(doc2)}"
    )

    qa_pair = (simple_prompt | llm).invoke({'system': sys_msg, 'input': usr_msg})
    synth_questions += [qa_pair.split('\n\n')[0]]
    synth_answers += [qa_pair.split('\n\n')[1]]
    pprint2(f"QA Pair {i+1}")
    pprint2(synth_questions[-1])
    pprint(synth_answers[-1])
    print()
QA Pair 1
Question: How does the BERT model, which uses a Transformer architecture, handle the task of natural language
understanding, especially in relation to other tasks like question answering and language inference?
Answer: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left
and right context in all layers, allowing it to be fine-tuned for various downstream tasks, including question
answering and language inference, without substantial task-specific architecture modifications. This is made
possible by a simple but powerful architecture that leverages the self-attention mechanism and a training procedure
that uses masked token prediction and next sentence prediction tasks.
QA Pair 2
Question: How can pre-trained language models be improved to better handle knowledge-intensive tasks, such as
generating specific and diverse text, while also being able to access and manipulate knowledge?
Answer: By using retrieval-augmented generation (RAG) models, which combine pre-trained parametric and
non-parametric memory, such as pre-trained seq2seq models and a dense vector index of Wikipedia, accessed with a
pre-trained neural retriever. This approach has been shown to set the state-of-the-art on three open domain QA
tasks and outperform parametric seq2seq models and task-specific retrieve-and-extract architectures.
Answer The Synthetic Questions
rag_answers = []
for i, q in enumerate(synth_questions):
    ## Generate the RAG chain's own answer to each synthetic question
    rag_answer = rag_chain.invoke(q)
    rag_answers += [rag_answer]
    pprint2(f"QA Pair {i+1}", q, "", sep="\n")
    pprint(f"RAG Answer: {rag_answer}", "", sep='\n')
QA Pair 1
Question: How does the BERT model, which uses a Transformer architecture, handle the task of natural language
understanding, especially in relation to other tasks like question answering and language inference?
RAG Answer: According to the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding" by Jacob Devlin et al., the BERT model handles the task of natural language understanding by
pre-training a deep bidirectional Transformer model on a large corpus of text. This allows it to learn a single set
of parameters that can be fine-tuned for a wide range of tasks, including question answering and language
inference.
In fact, the authors show that BERT can achieve state-of-the-art results on eleven natural language processing
tasks, including question answering, language inference, and text classification, without substantial task-specific
architecture modifications. This is due to the fact that the Transformer architecture allows BERT to model many
downstream tasks by swapping out the appropriate inputs and outputs.
For example, for question answering tasks, BERT's fine-tuning process involves plugging in the task-specific inputs
and outputs into the pre-trained BERT model and fine-tuning all the parameters end-to-end. Similarly, for language
inference tasks, BERT's architecture remains the same, and it achieves results by leveraging the same pre-trained
parameters.
Overall, BERT's unified architecture and pre-training on a large corpus of text enable it to excel on a wide range
of natural language understanding tasks, making it a powerful tool for many applications (Devlin et al., 2018).
QA Pair 2
Question: How can pre-trained language models be improved to better handle knowledge-intensive tasks, such as
generating specific and diverse text, while also being able to access and manipulate knowledge?
RAG Answer: You're looking to improve pre-trained language models to handle knowledge-intensive tasks, such as
generating specific and diverse text, and also being able to access and manipulate knowledge. That's a great
question!
From my research, I think I can help. One way to improve pre-trained language models is by using a hybrid approach
that combines parametric and non-parametric memory components (Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks). This allows the model to access knowledge from external memory, which can be
directly revised and expanded.
Imagine having a model that can access a vast amount of knowledge from a dense vector index of Wikipedia, like a
giant library at your fingertips! With a pre-trained neural retriever, you can then use this knowledge to generate
text that's specific, diverse, and up-to-date.
Researchers have explored this idea in a paper titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks" by Patrick Lewis et al. (2020). They introduced a fine-tuning recipe for retrieval-augmented generation
(RAG) models, which combines pre-trained parametric and non-parametric memory for language generation.
Their results show that RAG models achieve state-of-the-art results on open Natural Questions, WebQuestions, and
CuratedTREC tasks. This suggests that by combining parametric and non-parametric memory components, we can create
language models that are better equipped to handle knowledge-intensive tasks.
Additionally, using a differentiable access mechanism to explicit non-parametric memory has shown promising results
in extractive downstream tasks (Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks).
What do you think? Would you like to explore more about this concept or ask any follow-up questions?
Implement A Human Preference Metric
eval_prompt = ChatPromptTemplate.from_template("""INSTRUCTION:
Evaluate the following Question-Answer pair for human preference and consistency.
Assume the first answer is a ground truth answer and has to be correct.
Assume the second answer may or may not be true.
[1] The second answer lies, does not answer the question, or is inferior to the first answer.
[2] The second answer is better than the first and does not introduce any inconsistencies.
Output Format:
[Score] Justification
{qa_trio}
EVALUATION:
""")
pref_score = []

trio_gen = zip(synth_questions, synth_answers, rag_answers)
for i, (q, a_synth, a_rag) in enumerate(trio_gen):
    pprint2(f"Set {i+1}\n\nQuestion: {q}\n\n")
    qa_trio = f"Question: {q}\n\nAnswer 1 (Ground Truth): {a_synth}\n\n Answer 2 (New Answer): {a_rag}"
    pref_score += [(eval_prompt | llm).invoke({'qa_trio': qa_trio})]
    pprint(f"Synth Answer: {a_synth}\n\n")
    pprint(f"RAG Answer: {a_rag}\n\n")
    pprint2(f"Synth Evaluation: {pref_score[-1]}\n\n")
Set 1
Question: Question: How does the BERT model, which uses a Transformer architecture, handle the task of natural
language understanding, especially in relation to other tasks like question answering and language inference?
Synth Answer: Answer: BERT pre-trains deep bidirectional representations from unlabeled text by jointly
conditioning on both left and right context in all layers, allowing it to be fine-tuned for various downstream
tasks, including question answering and language inference, without substantial task-specific architecture
modifications. This is made possible by a simple but powerful architecture that leverages the self-attention
mechanism and a training procedure that uses masked token prediction and next sentence prediction tasks.
RAG Answer: According to the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding" by Jacob Devlin et al., the BERT model handles the task of natural language understanding by
pre-training a deep bidirectional Transformer model on a large corpus of text. This allows it to learn a single set
of parameters that can be fine-tuned for a wide range of tasks, including question answering and language
inference.
In fact, the authors show that BERT can achieve state-of-the-art results on eleven natural language processing
tasks, including question answering, language inference, and text classification, without substantial task-specific
architecture modifications. This is due to the fact that the Transformer architecture allows BERT to model many
downstream tasks by swapping out the appropriate inputs and outputs.
For example, for question answering tasks, BERT's fine-tuning process involves plugging in the task-specific inputs
and outputs into the pre-trained BERT model and fine-tuning all the parameters end-to-end. Similarly, for language
inference tasks, BERT's architecture remains the same, and it achieves results by leveraging the same pre-trained
parameters.
Overall, BERT's unified architecture and pre-training on a large corpus of text enable it to excel on a wide range
of natural language understanding tasks, making it a powerful tool for many applications (Devlin et al., 2018).
Synth Evaluation: [Score] 2 Justification
The second answer is consistent with the first answer and provides a more detailed explanation of how BERT handles
natural language understanding tasks. It directly quotes the paper by Jacob Devlin et al., which supports the
ground truth answer. The second answer does not introduce any inconsistencies and provides additional evidence for
BERT's capabilities, making it a more comprehensive and accurate response.
Finally, aggregate the judge verdicts into a single preference score (the fraction of cases where the judge preferred the RAG answer):
## Count the judge outputs that contain the "[2]" (RAG answer preferred) verdict
pref_score = sum(("[2]" in score) for score in pref_score) / len(pref_score)
print(f"Preference Score: {pref_score}")
Preference Score: 0.3333333333333333