LangChain RAG with Elasticsearch

This post uses LangChain and Elasticsearch to build a RAG pipeline that analyzes Nvidia's annual report. PDF: 2024annualreport

Selected package versions:

  1. langchain==0.2.0
  2. elasticsearch==8.0.0

The Elasticsearch service itself was pulled directly as a Docker image, version 8.12.2.

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker pull docker.elastic.co/kibana/kibana:8.12.2

1. Load the PDF

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"./a2024annualreport.pdf")
pages = loader.load_and_split()
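load_and_split() returns a list of Document chunks rather than whole pages. To give a feel for what chunking with overlap does, here is a minimal pure-Python sketch. It is a simplified stand-in, not LangChain's actual splitter, and the function name and parameters are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Cut text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper for illustration only; LangChain's splitters are
    smarter and prefer to break on separators such as paragraphs.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 25 digits, so the overlap is easy to see by eye.
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which tends to help retrieval.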

2. Build embedding retrieval on Elasticsearch

# In langchain 0.2.x these integrations live in langchain_community.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch


finger = "CE:xx"
es = Elasticsearch(
    "https://xx:9200",
    ssl_assert_fingerprint=finger,
    basic_auth=("elastic", "xx"))
embeddings = HuggingFaceEmbeddings(model_name="./all-MiniLM-L6-v2")
esVectorStore = ElasticsearchStore.from_documents(
    pages,
    es_connection=es,
    index_name="annualreport",
    embedding=embeddings,
    strategy=ElasticsearchStore.ExactRetrievalStrategy())

retriever = esVectorStore.as_retriever()
query = "What are the year-over-year changes in the company's revenue and profit?"
print(esVectorStore.similarity_search(query))
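The exact retrieval strategy used above amounts to brute-force search: every stored vector is scored against the query vector and the best ones are returned. The sketch below shows that idea in plain Python with cosine similarity. The toy vectors, ids, and function names are assumptions for illustration, and the real scoring happens inside Elasticsearch, not in client code like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_search(query_vec: list[float],
                 doc_vecs: dict[str, list[float]],
                 k: int = 4) -> list[tuple[str, float]]:
    """Score every document against the query and return the top k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = {
    "chunk-1": [1.0, 0.0, 0.0],
    "chunk-2": [0.0, 1.0, 0.0],
    "chunk-3": [0.7, 0.7, 0.0],
}
top_hits = exact_search([1.0, 0.1, 0.0], toy_index, k=2)
for doc_id, score in top_hits:
    print(doc_id, round(score, 3))
```

Brute force is fine for a single annual report. For much larger indexes an approximate kNN strategy is usually the better trade-off.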

3. Answer with an LLM

DeepSeek is used here because it is cheap!

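import os

from langchain_openai import ChatOpenAI

# Assumption: the DeepSeek API key is stored in the DEEPSEEK_API_KEY environment variable.
deepseek = os.environ["DEEPSEEK_API_KEY"]
llms = ChatOpenAI(base_url="https://api.deepseek.com/v1",
    api_key=deepseek, model="deepseek-chat")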

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know.\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llms, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": query})
print(response["answer"])
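Besides "answer", the dict returned by the retrieval chain also carries the retrieved documents under "context", which makes it possible to check which PDF pages the answer was built from. Below is a small sketch of that check. It assumes PyPDFLoader's zero-based "page" metadata, and it runs against stand-in objects rather than a live response so it can be tried without Elasticsearch. The helper name and the stub data are made up for illustration.

```python
from types import SimpleNamespace


def source_pages(response: dict) -> list[int]:
    """Return the sorted, 1-based PDF page numbers of the retrieved chunks.

    Hypothetical helper: assumes each retrieved document has a `metadata`
    dict with a zero-based "page" entry, as PyPDFLoader produces.
    """
    pages = set()
    for doc in response.get("context", []):
        page = doc.metadata.get("page")
        if page is not None:
            pages.add(int(page) + 1)  # convert zero-based index to page number
    return sorted(pages)


# Stub that mimics the shape of a retrieval-chain response (not real output).
fake_response = {
    "input": "example question",
    "context": [
        SimpleNamespace(page_content="...", metadata={"source": "report.pdf", "page": 4}),
        SimpleNamespace(page_content="...", metadata={"source": "report.pdf", "page": 4}),
        SimpleNamespace(page_content="...", metadata={"source": "report.pdf", "page": 9}),
    ],
    "answer": "example answer",
}
cited = source_pages(fake_response)
print(cited)
```

With a real run, calling source_pages(response) on the chain's output should give the pages to open in the PDF when verifying the answer by hand.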

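This is only a simple example with a single PDF; multiple PDFs can be indexed in the same way. Elasticsearch's keyword search can also be combined with embedding search to get more accurate retrieval results, which is not covered here.

The answer is shown below; the corresponding content is on page 24 of the report.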

 The year-over-year changes in the company's revenue and profit are as follows:

- Full-year revenue increased by 126% to a record $60.9 billion.
- GAAP earnings per diluted share was $11.93, up 586% from a year ago.
- Non-GAAP earnings per diluted share were $12.96, up 288% from a year ago.
- Non-GAAP gross margin was 73.8%.

These figures indicate significant growth in both revenue and profit for the company compared to the previous year.
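On combining keyword search with embedding search, as mentioned earlier: one common way to merge two ranked result lists is Reciprocal Rank Fusion (RRF). The sketch below shows the fusion step in plain Python. The chunk ids and both rankings are made up, and k=60 is just the constant commonly used with RRF. ElasticsearchStore also has an ApproxRetrievalStrategy with a hybrid option that is worth checking in the LangChain docs. It may depend on your Elasticsearch license.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document gets 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from a keyword (BM25) query and a vector query.
keyword_hits = ["chunk-b", "chunk-a", "chunk-d"]
vector_hits = ["chunk-a", "chunk-c", "chunk-b"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

RRF only looks at ranks, not raw scores, so BM25 scores and cosine similarities never have to be put on the same scale.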

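Copyright notice: original article from this site, published by admin on 2024-05-20, 2318 characters in total.
Reprint notice: unless otherwise stated, articles on this site are published under the CC-4.0 license. For reprints, contact tensortimes@gmail.com.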