Using langchain and Elasticsearch for RAG to analyze Nvidia's annual report. PDF: 2024annualreport.
Some of the package versions:
langchain==0.2.0
elasticsearch==8.0.0
The Elasticsearch service was pulled directly as a Docker image, version 8.12.2.
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker pull docker.elastic.co/kibana/kibana:8.12.2
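Before going further, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the standard library; the `installed_version` helper is a name made up for this illustration, and the expected versions are simply the ones quoted in this post.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Versions quoted in this post; adjust if your environment differs.
expected = {"langchain": "0.2.0", "elasticsearch": "8.0.0"}

for name, wanted in expected.items():
    found = installed_version(name)
    status = "ok" if found == wanted else "differs"
    print(f"{name}: wanted {wanted}, found {found} ({status})")
```

A mismatch is not necessarily a problem, but import paths in langchain have moved between releases, so it is worth knowing which version is actually in use.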
1. Load the PDF
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(r"./a2024annualreport.pdf")
pages = loader.load_and_split()
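`load_and_split()` does two things: it reads the PDF and then cuts the text into smaller chunks that are later embedded one by one. To make the splitting idea concrete, here is a deliberately simplified sketch of overlapping character windows. It is not LangChain's actual splitter (which is more elaborate and tries to break on natural separators); `split_text`, `chunk_size` and `overlap` are names chosen just for this illustration.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
chunks = split_text(sample, chunk_size=12, overlap=4)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.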
2. Build the ES embedding retrieval
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch

# The fingerprint, host and password are placeholders ("xx"); replace them with your own values.
finger = "CE:xx"
es = Elasticsearch(
    "https://xx:9200",
    ssl_assert_fingerprint=finger,
    basic_auth=("elastic", "xx"),
)
# Embedding model loaded from a local directory.
embeddings = HuggingFaceEmbeddings(model_name="./all-MiniLM-L6-v2")
# Embed the chunks and write them into the "annualreport" index.
esVectorStore = ElasticsearchStore.from_documents(
    pages,
    es_connection=es,
    index_name="annualreport",
    embedding=embeddings,
    strategy=ElasticsearchStore.ExactRetrievalStrategy(),
)
retriever = esVectorStore.as_retriever()
query = "What are the year-over-year changes in the company's revenue and profit?"
print(esVectorStore.similarity_search(query))
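As far as I understand it, an exact retrieval strategy scores the query vector against every stored vector instead of using an approximate index. The sketch below shows that idea in plain Python with brute-force cosine similarity. It is only a conceptual illustration, not what Elasticsearch runs internally; the three-dimensional vectors and document names are made-up toy data, and `exact_top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query_vec, doc_vecs, k=4):
    """Score every document against the query and return the k best (doc_id, score) pairs."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy "index": real embeddings have hundreds of dimensions.
toy_index = {
    "revenue_page": [0.9, 0.1, 0.0],
    "risk_page": [0.1, 0.9, 0.1],
    "cover_page": [0.0, 0.1, 0.9],
}
results = exact_top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(results)
```

Scanning everything is fine for a single annual report; for much larger collections an approximate kNN index is usually the more practical choice.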
3. Answer with an LLM
DeepSeek is used here because it is cheap!
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# `deepseek` is a variable holding the DeepSeek API key (defined elsewhere, not shown here).
llms = ChatOpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key=deepseek,
    model="deepseek-chat",
)
# Each fragment ends with a space so the concatenated prompt reads correctly.
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
# Put the retrieved chunks into {context}, then connect the retriever and the LLM.
question_answer_chain = create_stuff_documents_chain(llms, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
response = rag_chain.invoke({"input": query})
print(response["answer"])
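To show what the "stuff" step amounts to, here is a rough plain-Python sketch: the retrieved chunks are joined into one string and substituted for `{context}` in the system prompt, and the user's question becomes the human message. This is a simplification under my own assumptions (for instance, joining with a blank line); `stuff_documents` and `build_messages` are hypothetical helpers, not LangChain functions, and the chunk texts are placeholders.

```python
def stuff_documents(docs, separator="\n\n"):
    """Concatenate all retrieved chunk texts into a single context string."""
    return separator.join(docs)


def build_messages(system_template, context, question):
    """Fill {context} in the system prompt and pair it with the user's question."""
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


system_template = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question.\n\n{context}"
)
retrieved = ["chunk A text", "chunk B text"]  # placeholder chunk contents
messages = build_messages(system_template, stuff_documents(retrieved), "What changed?")
for role, content in messages:
    print(f"[{role}] {content}")
```

Because every retrieved chunk goes into a single prompt, this approach only works while the chunks together fit in the model's context window.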
The answer is shown below; the corresponding content is on page 24 of the report.
The year-over-year changes in the company's revenue and profit are as follows:
- Full-year revenue increased by 126% to a record $60.9 billion.
- GAAP earnings per diluted share was $11.93, up 586% from a year ago.
- Non-GAAP earnings per diluted share were $12.96, up 288% from a year ago.
- Non-GAAP gross margin was 73.8%.
These figures indicate significant growth in both revenue and profit for the company compared to the previous year.
This is only a simple example with a single PDF; multiple PDFs can be indexed in the same way. Elasticsearch keyword search can also be combined with embedding search to get more accurate retrieval results. Those variations are not shown here.
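On combining keyword retrieval with embedding retrieval, as mentioned above: one common way to merge two ranked result lists is reciprocal rank fusion (RRF). The sketch below is a plain-Python illustration of that technique, not the setup used in this post; the page IDs are invented, and `k=60` is just a commonly used smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword query and a vector query.
keyword_hits = ["p24", "p10", "p31"]
vector_hits = ["p12", "p24", "p10"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, and only rank positions are used, so the keyword and vector scores never need to be put on a shared scale.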
Posted in: langchain
2024-05-20