Using langchain and Elasticsearch to do RAG over Nvidia's annual report. PDF: 2024annualreport.
Some of the package versions:
langchain==0.2.0
elasticsearch==8.0.0
The Elasticsearch service itself runs from Docker images pulled directly, version 8.12.2.
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker pull docker.elastic.co/kibana/kibana:8.12.2
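The pins above pair an 8.0.0 Python client with an 8.12.2 server. The elasticsearch-py project describes its clients as forward compatible within a major version: a client is expected to work with a server of the same major version and an equal or newer minor version. The check below is only a sketch of that rule. `parse_version` and `client_is_forward_compatible` are hypothetical helper names, not part of any library.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Turn a string such as "8.12.2" into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def client_is_forward_compatible(client: str, server: str) -> bool:
    """Same major version, and the server's minor is not older than the client's."""
    c_major, c_minor, _ = parse_version(client)
    s_major, s_minor, _ = parse_version(server)
    return c_major == s_major and s_minor >= c_minor


# The combination used in this post: client 8.0.0, server 8.12.2.
print(client_is_forward_compatible("8.0.0", "8.12.2"))
```

This says nothing about whether newer server features are reachable from an older client; it only captures the stated compatibility direction.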
1. Load the PDF
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(r"./a2024annualreport.pdf")
pages = loader.load_and_split()
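`load_and_split()` loads the PDF and then cuts the text into chunks with LangChain's default text splitter, so `pages` holds chunks rather than raw pages. The sketch below is not that splitter. It is a much simpler fixed-size window with overlap, written only to show why overlap helps: a sentence that straddles a boundary still appears whole in one of the chunks. `split_with_overlap` and its parameter values are assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input; a real page would be thousands of characters long.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With real documents the chunk size is usually tuned to the embedding model and to how much context the LLM prompt can hold.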
2. Build embedding retrieval on ES
# langchain 0.2 keeps third-party integrations in langchain_community
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch
# Certificate fingerprint of the cluster (placeholder value)
finger = "CE:xx"
es = Elasticsearch(
    "https://xx:9200",
    ssl_assert_fingerprint=finger,
    basic_auth=("elastic", "xx"))
embeddings = HuggingFaceEmbeddings(model_name="./all-MiniLM-L6-v2")
esVectorStore = ElasticsearchStore.from_documents(
    pages,
    es_connection=es,
    index_name="annualreport",
    embedding=embeddings,
    strategy=ElasticsearchStore.ExactRetrievalStrategy())
retriever = esVectorStore.as_retriever()
query = "What are the year-over-year changes in the company's revenue and profit?"
print(esVectorStore.similarity_search(query))
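`ExactRetrievalStrategy` has Elasticsearch compare the query vector against every stored vector instead of using an approximate index. Conceptually this is a brute-force nearest-neighbour search. The sketch below shows the idea in plain Python with cosine similarity. The toy two-dimensional vectors and the names `cosine_similarity`, `exact_top_k` and `toy_index` are made up for illustration, and the actual scoring happens inside Elasticsearch, not in client code like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query_vec: list[float],
                doc_vecs: dict[str, list[float]],
                k: int = 4) -> list[tuple[str, float]]:
    """Score every document against the query and keep the k best matches."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk ids with toy embeddings.
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}
for doc_id, score in exact_top_k([1.0, 0.1], toy_index, k=2):
    print(doc_id, round(score, 3))
```

Brute force is exact but scales linearly with the number of chunks, which is fine for one annual report and may become slow for a large corpus.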
3. Answer with an LLM
DeepSeek is used here because it is cheap!
from langchain_openai import ChatOpenAI
deepseek = "xx"  # DeepSeek API key (placeholder)
llms = ChatOpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key=deepseek,
    model="deepseek-chat")
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llms, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
response = rag_chain.invoke({"input": query})
print(response["answer"])
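The chain above works in two steps: the retriever fetches the most similar chunks, and the "stuff" step pastes their text into the `{context}` slot of the system prompt before the question is sent to the model. The sketch below imitates only the pasting step with plain strings, so the final prompt is easy to inspect. `stuff_documents` and `build_messages` are hypothetical helpers, the sample chunks are invented, and joining with a blank line is an assumption about the separator.

```python
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)


def stuff_documents(chunks: list[str], separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string."""
    return separator.join(chunks)


def build_messages(question: str, chunks: list[str]) -> list[tuple[str, str]]:
    """Return (role, text) pairs in the shape of a system + human chat prompt."""
    context = stuff_documents(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Invented chunks standing in for what the retriever would return.
messages = build_messages(
    "How did revenue change?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```

Because every chunk is pasted in verbatim, the number and size of retrieved chunks directly determine how long the prompt gets.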
The answer is shown below; the supporting figures are on page 24 of the report.
The year-over-year changes in the company's revenue and profit are as follows:
- Full-year revenue increased by 126% to a record $60.9 billion.
- GAAP earnings per diluted share was $11.93, up 586% from a year ago.
- Non-GAAP earnings per diluted share were $12.96, up 288% from a year ago.
- Non-GAAP gross margin was 73.8%.
These figures indicate significant growth in both revenue and profit for the company compared to the previous year.
This post only demonstrates a single PDF; multiple PDFs can be indexed in the same way. ES keyword search can also be combined with embedding search to get more accurate retrieval results. Those variations are not shown here.
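On combining keyword retrieval with embedding retrieval: one common way to merge two ranked result lists is reciprocal rank fusion (RRF), where each document earns `1 / (k + rank)` from every list it appears in. The sketch below is a plain-Python illustration of that formula, not the configuration used in this post. The chunk ids are hypothetical and `k = 60` is just a commonly used constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists; higher fused score means better overall rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical result lists, best hit first.
keyword_hits = ["chunk-3", "chunk-7", "chunk-1"]
vector_hits = ["chunk-3", "chunk-9", "chunk-7"]

for doc_id, score in reciprocal_rank_fusion([keyword_hits, vector_hits]):
    print(doc_id, round(score, 4))
```

RRF only needs ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale.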
Posted in: langchain
2024-05-20