Document Processing
In this note, I'll explain how to use LangChain and other tools to process a document before embedding it, and why we should do this.
1. Loading documents
from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders import ArxivLoader
# ArxivLoader fetches papers from arXiv by ID or search query.
documents = ArxivLoader(query="2404.16130").load()
# 2404.16130 is the arXiv ID of the paper; this one is about Graph RAG.
# documents is a list of Document objects.
Basic data output
print("Number of Documents Retrieved:", len(documents))
print(f"Sample of Document 1 Content (Total Length: {len(documents[0].page_content)}):")
print(documents[0].page_content[:1000])
Number of Documents Retrieved: 1
Sample of Document 1 Content (Total Length: 53880):
From Local to Global: A Graph RAG Approach to
Query-Focused Summarization
Darren Edge1†
Ha Trinh1†
Newman Cheng2
Joshua Bradley2
Alex Chao3
Apurva Mody3
Steven Truitt2
Jonathan Larson1
1Microsoft Research
2Microsoft Strategic Missions and Technologies
3Microsoft Office of the CTO
{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}
@microsoft.com
†These authors contributed equally to this work
Abstract
The use of retrieval-augmented generation (RAG) to retrieve relevant informa-
tion from an external knowledge source enables large language models (LLMs)
to answer questions over private and/or previously unseen document collections.
However, RAG fails on global questions directed at an entire text corpus, such
as “What are the main themes in the dataset?”, since this is inherently a query-
focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical
RAG systems.
The document's metadata is provided by the source and contains the basic bibliographic information, but it is too concise to be useful for querying (and documents from sources other than arXiv do not necessarily have metadata at all):
from pprint import pprint
pprint(documents[0].metadata)
{
'Published': '2024-04-24',
'Title': 'From Local to Global: A Graph RAG Approach to Query-Focused Summarization',
'Authors': 'Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt,
Jonathan Larson',
'Summary': 'The use of retrieval-augmented generation (RAG) to retrieve relevant\ninformation from an external
knowledge source enables large language models\n(LLMs) to answer questions over private and/or previously unseen
document\ncollections. However, RAG fails on global questions directed at an entire text\ncorpus, such as "What are
the main themes in the dataset?", since this is\ninherently a query-focused summarization (QFS) task, rather than
an explicit\nretrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities\nof text indexed by
typical RAG systems. To combine the strengths of these\ncontrasting methods, we propose a Graph RAG approach to
question answering over\nprivate text corpora that scales with both the generality of user questions and\nthe
quantity of source text to be indexed. Our approach uses an LLM to build a\ngraph-based text index in two stages:
first to derive an entity knowledge graph\nfrom the source documents, then to pregenerate community summaries for
all\ngroups of closely-related entities. Given a question, each community summary is\nused to generate a partial
response, before all partial responses are again\nsummarized in a final response to the user. For a class of global
sensemaking\nquestions over datasets in the 1 million token range, we show that Graph RAG\nleads to substantial
improvements over a na\\"ive RAG baseline for both the\ncomprehensiveness and diversity of generated answers. An
open-source,\nPython-based implementation of both global and local Graph RAG approaches is\nforthcoming at
https://aka.ms/graphrag.'
}
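The UnstructuredFileLoader imported above works the same way for local files. A minimal sketch (the file path is a placeholder, and the unstructured package must be installed):
local_docs = UnstructuredFileLoader("./my_paper.pdf").load()  # hypothetical local file
print(len(local_docs), local_docs[0].metadata)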
2. Transforming documents
Once loaded, the document is split into chunks before it can be passed to the LLM as context. Chunking helps optimize the relevance of the content returned from the vector database.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)
# The splitter tries the separators in order, producing chunks of at most
# chunk_size characters with chunk_overlap characters of overlap.
docs_split = text_splitter.split_documents(documents)
# docs_split is a list of Document chunks.
print(len(docs_split))
50
In practice, however, documents are rarely split by raw character counts alone. Better chunking strategies include (see the sketch after this list):
- Identifying logical breaks or synthesis points (manually, automatically, LLM-assisted, etc.) rather than relying on fixed character counts.
- Constructing chunks that are rich in unique, relevant information.
- Including key concepts, keywords, or metadata snippets in each chunk for improved searchability and relevance in the database.
- Continuously assessing chunking effectiveness and adjusting the strategy to balance chunk size against content richness.
- Considering a hierarchy (implicitly generated or explicitly specified) to improve retrieval, e.g. LlamaIndex's TreeIndex.
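As one example of structure-aware splitting, LangChain's MarkdownHeaderTextSplitter cuts a Markdown document at its headings and records the heading path in each chunk's metadata. This is only a sketch with a made-up snippet, not part of the Graph RAG pipeline above:
from langchain.text_splitter import MarkdownHeaderTextSplitter

md_text = "# Graph RAG\n\nOverview...\n\n## Indexing\n\nBuild the entity graph...\n\n## Querying\n\nSummarize communities..."

# Split on the heading levels we care about; each chunk keeps its heading path as metadata.
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
for chunk in md_splitter.split_text(md_text):
    print(chunk.metadata, "->", chunk.page_content[:40])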
3. Document refinement
from typing import List
from pydantic import BaseModel, Field

class DocumentSummaryBase(BaseModel):
    running_summary: str = Field("", description="Running description of the document. Do not override; only update!")
    main_ideas: List[str] = Field([], description="Most important information from the document (max 3)")
    loose_ends: List[str] = Field([], description="Open questions that would be good to incorporate into summary, but that are yet unknown (max 3)")
# All three fields are updated as the chunks are processed, so their values always reflect the part of the document seen so far.
from langchain.prompts import ChatPromptTemplate

summary_prompt = ChatPromptTemplate.from_template(
    "You are generating a running summary of the document. Make it readable by a technical user."
    " After this, the old knowledge base will be replaced by the new one. Make sure a reader can still understand everything."
    " Keep it short, but as dense and useful as possible! The information should flow from chunk to (loose ends or main ideas) to running_summary."
    " The updated knowledge base keep all of the information from running_summary here: {info_base}."
    "\n\n{format_instructions}. Follow the format precisely, including quotations and commas"
    "\n\nWithout losing any of the info, update the knowledge base with the following: {input}"
)
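The {format_instructions} placeholder is filled in by a PydanticOutputParser, which turns the schema above into formatting guidance for the model. To see exactly what gets injected, you can print it (an inspection step only, not part of the pipeline):
from langchain.output_parsers import PydanticOutputParser

print(PydanticOutputParser(pydantic_object=DocumentSummaryBase).get_format_instructions())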
RExtract takes a Pydantic class, a language model (llm), and a prompt, and returns a runnable chain whose output is a knowledge object populated by slot-filling extraction:
def RExtract(pydantic_class, llm, prompt):
    '''
    Runnable Extraction module
    Returns a knowledge dictionary populated by slot-filling extraction
    '''
    parser = PydanticOutputParser(pydantic_object=pydantic_class)
    instruct_merge = RunnableAssign({'format_instructions': lambda x: parser.get_format_instructions()})
    def preparse(string):
        if '{' not in string: string = '{' + string
        if '}' not in string: string = string + '}'
        string = (string
            .replace("\\_", "_")
            .replace("\n", " ")
            .replace("\]", "]")
            .replace("\[", "[")
        )
        return string
    return instruct_merge | prompt | llm | preparse | parser
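A minimal, hypothetical single-step call (assuming instruct_llm is already defined as in section 4) looks like this; the iteration helper in the next section simply repeats it over every chunk:
extractor = RExtract(DocumentSummaryBase, instruct_llm, summary_prompt)
state = extractor.invoke({
    'info_base': DocumentSummaryBase(),    # current (empty) knowledge base
    'input': docs_split[0].page_content,   # one chunk of the paper
})
print(state.running_summary)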
4. Document iteration
latest_summary = ""
def RSummarizer(knowledge, llm, prompt, verbose=False):
# knowledge:知识库模板类,存提取的信息
# verbose:是否启用详细输出
def summarize_docs(docs):
parse_chain = RunnableAssign({'info_base': RExtract(knowledge.__class__, llm, prompt)})
# 把这样一条链:instruct_merge | prompt | llm | preparse | parser 变成Runnable对象
state = {'info_base': knowledge}
global latest_summary
for i, doc in enumerate(docs):
state['input'] = doc.page_content
state = parse_chain.invoke(state)
assert 'info_base' in state
if verbose:
print(f"Considered {i+1} documents")
pprint(state['info_base'])
latest_summary = state['info_base']
clear_output(wait=True)
return state['info_base']
return RunnableLambda(summarize_docs)
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.output_parsers import StrOutputParser

instruct_model = ChatNVIDIA(model="mistralai/mixtral-8x22b-instruct-v0.1").bind(max_tokens=4096)
instruct_llm = instruct_model | StrOutputParser()

# Run the running-summary loop over the first 15 chunks of the paper.
summarizer = RSummarizer(DocumentSummaryBase(), instruct_llm, summary_prompt, verbose=True)
summary = summarizer.invoke(docs_split[:15])
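summary is the final DocumentSummaryBase instance, and latest_summary holds the most recent state in case the loop is interrupted; either can be inspected directly:
pprint(summary)         # final knowledge base after the first 15 chunks
pprint(latest_summary)  # same object here; useful if the loop stopped early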