RAG Evaluation
Pre-release evaluation
Typical use inspection
Start by testing scenarios most relevant to your use case. See if your chatbot can reliably navigate discussions with limited human intervention.
Edge case inspection
Explore the boundaries of typical use, identifying how the chatbot handles less common but plausible scenarios. Before any public release, assess critical boundary conditions that could pose liability risks, such as the potential generation of inappropriate content. Implement well-tested guardrails on all outputs (and possibly inputs) to limit undesired interactions and redirect users into predictable conversation flows.
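As a minimal sketch of that last point (the blocklist, fallback message, and screen_output helper below are illustrative placeholders rather than a recommended guardrail), an output filter can be expressed as one more runnable appended to the end of a chain:
from langchain_core.runnables import RunnableLambda

## Hypothetical blocklist and fallback; a production guardrail would use a tuned
## classifier or a dedicated safety model rather than keyword matching.
BLOCKED_TERMS = ["medical advice", "legal advice"]
FALLBACK = "I can't help with that, but I'm happy to answer questions about the documents."

def screen_output(text: str) -> str:
    """Return a safe redirect if the draft response trips the guardrail."""
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return FALLBACK
    return text

output_guardrail = RunnableLambda(screen_output)
## e.g. guarded_chain = my_chain | output_guardrail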
Progressive Rollout
Roll out your model to a limited audience (first internally, then via A/B testing) and implement analytics features such as usage dashboards and feedback avenues (flag/like/dislike, etc.), as sketched below.
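As a rough sketch of the feedback-avenue idea (not part of this course's deployment; the logging path and the exact Gradio hook behavior are assumptions that may vary by version), Gradio's like/dislike event can be logged for later analysis:
import json
import gradio as gr

def log_feedback(like_data: gr.LikeData):
    """Append a like/dislike event to a local JSONL file for later analysis."""
    record = {"liked": like_data.liked, "message": like_data.value}
    with open("feedback_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    chatbot.like(log_feedback)  ## fires when a user clicks the thumbs-up/down icons
    ## ... wire the chatbot up to your RAG chain here ...

## demo.launch()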
LLM-as-a-Judge Formulation
Using LLMs to test and quantify chatbot quality, an approach known as "LLM-as-a-Judge," makes it easy to write test specifications that align closely with human judgment and can be tuned and replicated at scale.
There are several popular frameworks for off-the-shelf judge formulations including:
RAGAs (short for RAG Assessment), which offers a suite of strong starting points for your own evaluation efforts (see the sketch after this list).
LangChain Evaluators, which are similar first-party options with many pre-built evaluator chains.
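To make the first option concrete, a RAGAs run looks roughly like the sketch below. This is a version-dependent assumption rather than course code: the column names and metric list follow the 0.1-era RAGAs API, the single toy row is made up, and by default RAGAs calls an OpenAI judge unless you pass your own llm/embeddings to evaluate.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

## Toy single-row dataset: question, generated answer, retrieved contexts, reference answer
data = Dataset.from_dict({
    "question": ["What does BERT pre-train on?"],
    "answer": ["BERT pre-trains bidirectional representations from unlabeled text."],
    "contexts": [["BERT pre-trains deep bidirectional representations from unlabeled text..."]],
    "ground_truth": ["BERT pre-trains on unlabeled text using masked-token and next-sentence objectives."],
})

results = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(results)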
[Assessment Prep] Pairwise Evaluator
Does my RAG chain outperform a narrow chatbot with limited document access?
To prepare for our RAG chain evaluation, we will need to:
Pull in our document index (the one saved in the previous notebook) and recreate our RAG pipeline of choice.
We will specifically be implementing a judge formulation with the following steps:
1. Sample two document chunks from the RAG agent's document pool.
2. Use those two chunks to generate a synthetic "baseline" question-answer pair.
3. Use the RAG agent to generate its own answer to the synthetic question.
4. Use a judge LLM to compare the two responses, treating the synthetic answer as "ground-truth correct."
Pull In Document Retrieval Index
# Reload the FAISS index that was saved at the end of the previous notebook
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.vectorstores import FAISS

# The index must be loaded with the same embedding model it was built with
embedder = NVIDIAEmbeddings(model="nvidia/nv-embed-v1", truncate="END")

!tar xzvf docstore_index.tgz

# Load the checkpointed FAISS index from disk
docstore = FAISS.load_local("docstore_index", embedder, allow_dangerous_deserialization=True)
# Extract the raw Document objects from the index
docs = list(docstore.docstore._dict.values())

def format_chunk(doc):
    """Render a chunk with its paper title, summary, and body text."""
    return (
        f"Paper: {doc.metadata.get('Title', 'unknown')}"
        f"\n\nSummary: {doc.metadata.get('Summary', 'unknown')}"
        f"\n\nPage Body: {doc.page_content}"
    )

## NOTE: pprint/pprint2 are assumed to be the styled print helpers defined earlier in the course
pprint(f"Constructed aggregate docstore with {len(docstore.docstore._dict)} chunks")
pprint("Sample Chunk:")
print(format_chunk(docs[len(docs)//2]))
Pull In RAG Chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableBranch
from langchain_core.runnables.passthrough import RunnableAssign
from langchain.document_transformers import LongContextReorder
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

from functools import partial
from operator import itemgetter

import gradio as gr

embedder = NVIDIAEmbeddings(model="nvidia/nv-embed-v1", truncate="END")
instruct_llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
llm = instruct_llm | StrOutputParser()
def docs2str(docs, title="Document"):
    """Useful utility for making chunks into context string. Optional, but useful"""
    out_str = ""
    for doc in docs:
        doc_name = getattr(doc, 'metadata', {}).get('Title', title)
        if doc_name: out_str += f"[Quote from {doc_name}] "
        out_str += getattr(doc, 'page_content', str(doc)) + "\n"
    return out_str
chat_prompt = ChatPromptTemplate.from_template(
    "You are a document chatbot. Help the user as they ask questions about documents."
    " The user just asked you a question: {input}\n\n"
    " The following information may be useful for your response: "
    " Document Retrieval:\n{context}\n\n"
    " (Answer only from retrieval. Only cite sources that are used. Make your response conversational.)"
    "\n\nUser Question: {input}"
)
def output_puller(inputs):
    """Output generator. Useful if your chain returns a dictionary with key 'output'"""
    if isinstance(inputs, dict):
        inputs = [inputs]
    for token in inputs:
        if token.get('output'):
            yield token.get('output')
## Reorder retrieved chunks so the most relevant ones land at the ends of the context
long_reorder = RunnableLambda(LongContextReorder().transform_documents)

## Retrieval: pull the question, retrieve chunks, reorder them, and flatten them into a context string
# context_getter = RunnableLambda(lambda x: x)  ## TODO
context_getter = itemgetter('input') | docstore.as_retriever() | long_reorder | docs2str
retrieval_chain = {'input': (lambda x: x)} | RunnableAssign({'context': context_getter})

## Generation: fill the prompt, call the LLM, and stream out the 'output' key
# generator_chain = RunnableLambda(lambda x: x)  ## TODO
generator_chain = chat_prompt | llm
generator_chain = {"output": generator_chain} | RunnableLambda(output_puller)

rag_chain = retrieval_chain | generator_chain
for token in rag_chain.stream("Tell me something interesting!"):
    print(token, end="")
I've got a fascinating fact for you! Did you know that the Federal Reserve buying bonds in the secondary market can impact your daily life in significant ways? For instance, when the Fed buys bonds, it can drive up interest rates, making it more expensive to borrow money, like for a mortgage or car loan. Additionally, increased money supply can lead to inflation, causing prices of goods and services to rise. Lastly, it can even affect employment rates, as changes in interest rates and inflation can influence businesses' hiring decisions.
Isn't it fascinating how the actions of the Federal Reserve can have such far-reaching effects on our daily lives?
Generating Synthetic Question-Answer Pairs
import random

num_questions = 3
synth_questions = []
synth_answers = []

simple_prompt = ChatPromptTemplate.from_messages([('system', '{system}'), ('user', 'INPUT: {input}')])

for i in range(num_questions):
    doc1, doc2 = random.sample(docs, 2)
    sys_msg = (
        "Use the documents provided by the user to generate an interesting question-answer pair."
        " Try to use both documents if possible, and rely more on the document bodies than the summary."
        " Use the format:\nQuestion: (good question, 1-3 sentences, detailed)\n\nAnswer: (answer derived from the documents)"
        " DO NOT SAY: \"Here is an interesting question pair\" or similar. FOLLOW FORMAT!"
    )
    usr_msg = (
        f"Document1: {format_chunk(doc1)}\n\n"
        f"Document2: {format_chunk(doc2)}"
    )

    qa_pair = (simple_prompt | llm).invoke({'system': sys_msg, 'input': usr_msg})
    synth_questions += [qa_pair.split('\n\n')[0]]
    synth_answers += [qa_pair.split('\n\n')[1]]
    pprint2(f"QA Pair {i+1}")
    pprint2(synth_questions[-1])
    pprint(synth_answers[-1])
    print()
QA Pair 1
Question: How does the BERT model, which uses a Transformer architecture, handle the task of natural language
understanding, especially in relation to other tasks like question answering and language inference?
Answer: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left
and right context in all layers, allowing it to be fine-tuned for various downstream tasks, including question
answering and language inference, without substantial task-specific architecture modifications. This is made
possible by a simple but powerful architecture that leverages the self-attention mechanism and a training procedure
that uses masked token prediction and next sentence prediction tasks.
QA Pair 2
Question: How can pre-trained language models be improved to better handle knowledge-intensive tasks, such as
generating specific and diverse text, while also being able to access and manipulate knowledge?
Answer: By using retrieval-augmented generation (RAG) models, which combine pre-trained parametric and
non-parametric memory, such as pre-trained seq2seq models and a dense vector index of Wikipedia, accessed with a
pre-trained neural retriever. This approach has been shown to set the state-of-the-art on three open domain QA
tasks and outperform parametric seq2seq models and task-specific retrieve-and-extract architectures.
Answer The Synthetic Questions
rag_answers = []
for i, q in enumerate(synth_questions):
    ## Generate the RAG chain's own answer to each synthetic question
    rag_answer = rag_chain.invoke(q)
    rag_answers += [rag_answer]
    pprint2(f"QA Pair {i+1}", q, "", sep="\n")
    pprint(f"RAG Answer: {rag_answer}", "", sep='\n')
QA Pair 1
Question: How does the BERT model, which uses a Transformer architecture, handle the task of natural language
understanding, especially in relation to other tasks like question answering and language inference?
RAG Answer: According to the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding" by Jacob Devlin et al., the BERT model handles the task of natural language understanding by
pre-training a deep bidirectional Transformer model on a large corpus of text. This allows it to learn a single set
of parameters that can be fine-tuned for a wide range of tasks, including question answering and language
inference.
In fact, the authors show that BERT can achieve state-of-the-art results on eleven natural language processing
tasks, including question answering, language inference, and text classification, without substantial task-specific
architecture modifications. This is due to the fact that the Transformer architecture allows BERT to model many
downstream tasks by swapping out the appropriate inputs and outputs.
For example, for question answering tasks, BERT's fine-tuning process involves plugging in the task-specific inputs
and outputs into the pre-trained BERT model and fine-tuning all the parameters end-to-end. Similarly, for language
inference tasks, BERT's architecture remains the same, and it achieves results by leveraging the same pre-trained
parameters.
Overall, BERT's unified architecture and pre-training on a large corpus of text enable it to excel on a wide range
of natural language understanding tasks, making it a powerful tool for many applications (Devlin et al., 2018).
QA Pair 2
Question: How can pre-trained language models be improved to better handle knowledge-intensive tasks, such as
generating specific and diverse text, while also being able to access and manipulate knowledge?
RAG Answer: You're looking to improve pre-trained language models to handle knowledge-intensive tasks, such as
generating specific and diverse text, and also being able to access and manipulate knowledge. That's a great
question!
From my research, I think I can help. One way to improve pre-trained language models is by using a hybrid approach
that combines parametric and non-parametric memory components (Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks). This allows the model to access knowledge from external memory, which can be
directly revised and expanded.
Imagine having a model that can access a vast amount of knowledge from a dense vector index of Wikipedia, like a
giant library at your fingertips! With a pre-trained neural retriever, you can then use this knowledge to generate
text that's specific, diverse, and up-to-date.
Researchers have explored this idea in a paper titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks" by Patrick Lewis et al. (2020). They introduced a fine-tuning recipe for retrieval-augmented generation
(RAG) models, which combines pre-trained parametric and non-parametric memory for language generation.
Their results show that RAG models achieve state-of-the-art results on open Natural Questions, WebQuestions, and
CuratedTREC tasks. This suggests that by combining parametric and non-parametric memory components, we can create
language models that are better equipped to handle knowledge-intensive tasks.
Additionally, using a differentiable access mechanism to explicit non-parametric memory has shown promising results
in extractive downstream tasks (Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks).
What do you think? Would you like to explore more about this concept or ask any follow-up questions?
Implement A Human Preference Metric
eval_prompt = ChatPromptTemplate.from_template("""INSTRUCTION:
Evaluate the following Question-Answer pair for human preference and consistency.
Assume the first answer is a ground truth answer and has to be correct.
Assume the second answer may or may not be true.
[1] The second answer lies, does not answer the question, or is inferior to the first answer.
[2] The second answer is better than the first and does not introduce any inconsistencies.
Output Format:
[Score] Justification
{qa_trio}
EVALUATION:
""")
pref_score = []

trio_gen = zip(synth_questions, synth_answers, rag_answers)
for i, (q, a_synth, a_rag) in enumerate(trio_gen):
    pprint2(f"Set {i+1}\n\nQuestion: {q}\n\n")
    qa_trio = f"Question: {q}\n\nAnswer 1 (Ground Truth): {a_synth}\n\n Answer 2 (New Answer): {a_rag}"
    pref_score += [(eval_prompt | llm).invoke({'qa_trio': qa_trio})]
    pprint(f"Synth Answer: {a_synth}\n\n")
    pprint(f"RAG Answer: {a_rag}\n\n")
    pprint2(f"Synth Evaluation: {pref_score[-1]}\n\n")
Set 1
Question: Question: How does the BERT model, which uses a Transformer architecture, handle the task of natural
language understanding, especially in relation to other tasks like question answering and language inference?
Synth Answer: Answer: BERT pre-trains deep bidirectional representations from unlabeled text by jointly
conditioning on both left and right context in all layers, allowing it to be fine-tuned for various downstream
tasks, including question answering and language inference, without substantial task-specific architecture
modifications. This is made possible by a simple but powerful architecture that leverages the self-attention
mechanism and a training procedure that uses masked token prediction and next sentence prediction tasks.
RAG Answer: According to the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding" by Jacob Devlin et al., the BERT model handles the task of natural language understanding by
pre-training a deep bidirectional Transformer model on a large corpus of text. This allows it to learn a single set
of parameters that can be fine-tuned for a wide range of tasks, including question answering and language
inference.
In fact, the authors show that BERT can achieve state-of-the-art results on eleven natural language processing
tasks, including question answering, language inference, and text classification, without substantial task-specific
architecture modifications. This is due to the fact that the Transformer architecture allows BERT to model many
downstream tasks by swapping out the appropriate inputs and outputs.
For example, for question answering tasks, BERT's fine-tuning process involves plugging in the task-specific inputs
and outputs into the pre-trained BERT model and fine-tuning all the parameters end-to-end. Similarly, for language
inference tasks, BERT's architecture remains the same, and it achieves results by leveraging the same pre-trained
parameters.
Overall, BERT's unified architecture and pre-training on a large corpus of text enable it to excel on a wide range
of natural language understanding tasks, making it a powerful tool for many applications (Devlin et al., 2018).
Synth Evaluation: [Score] 2 Justification
The second answer is consistent with the first answer and provides a more detailed explanation of how BERT handles
natural language understanding tasks. It directly quotes the paper by Jacob Devlin et al., which supports the
ground truth answer. The second answer does not introduce any inconsistencies and provides additional evidence for
BERT's capabilities, making it a more comprehensive and accurate response.
Finally, aggregate the judge verdicts into a single preference score (the fraction of cases where the judge preferred the RAG answer):
## Count the judge outputs that contain the "[2]" (RAG answer preferred) verdict
pref_score = sum(("[2]" in score) for score in pref_score) / len(pref_score)
print(f"Preference Score: {pref_score}")
Preference Score: 0.3333333333333333