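The most talked-about recent piece on RAG is probably A Cheat Sheet and Some Recipes For Building Advanced RAG. All the key points of RAG are shown in the figure (original image).

1. llama index: document node information and chunk size optimization

The figure alone shows how much substance there is. Let's start with the first part: the Chunk-size optimization section of Advanced RAG in the figure above.

The official llama index example: param_optimizer.ipynb.

My own experiment, which I did not run to completion because of API limits: 1.param_optimizer.ipynb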

1. How to add node information

  1. Parse the documents to get base_nodes
  2. Build an index over base_nodes with a vector store index
  3. Persist it locally

import os
from pathlib import Path

from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.node_parser import SimpleNodeParser


def _build_index(chunk_size, docs):
    index_out_path = f"./storage_{chunk_size}"
    if not os.path.exists(index_out_path):
        Path(index_out_path).mkdir(parents=True, exist_ok=True)
        # parse docs
        node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size)
        base_nodes = node_parser.get_nodes_from_documents(docs)
        # build index
        index = VectorStoreIndex(base_nodes)
        # save index to disk
        index.storage_context.persist(index_out_path)
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(persist_dir=index_out_path)
        # load index
        index = load_index_from_storage(storage_context)
    return index

The key piece is get_nodes_from_documents, which parses the documents into nodes and sets up the relationships between them:

# Excerpt from the llama index node parser (a method, hence self)
def get_nodes_from_documents(
    self,
    documents: Sequence[Document],
    show_progress: bool = False,
    **kwargs: Any,
) -> List[BaseNode]:
    """Parse documents into nodes.

    Args:
        documents (Sequence[Document]): documents to parse
        show_progress (bool): whether to show progress bar

    """
    doc_id_to_document = {doc.id_: doc for doc in documents}

    with self.callback_manager.event(
        CBEventType.NODE_PARSING, payload={EventPayload.DOCUMENTS: documents}
    ) as event:
        nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)

        for i, node in enumerate(nodes):
            if (
                node.ref_doc_id is not None
                and node.ref_doc_id in doc_id_to_document
            ):
                ref_doc = doc_id_to_document[node.ref_doc_id]
                start_char_idx = ref_doc.text.find(
                    node.get_content(metadata_mode=MetadataMode.NONE)
                )

                # update start/end char idx
                if start_char_idx >= 0:
                    node.start_char_idx = start_char_idx
                    node.end_char_idx = start_char_idx + len(
                        node.get_content(metadata_mode=MetadataMode.NONE)
                    )

                # update metadata
                if self.include_metadata:
                    node.metadata.update(
                        doc_id_to_document[node.ref_doc_id].metadata
                    )

            if self.include_prev_next_rel:
                if i > 0:
                    node.relationships[NodeRelationship.PREVIOUS] = nodes[
                        i - 1
                    ].as_related_node_info()
                if i < len(nodes) - 1:
                    node.relationships[NodeRelationship.NEXT] = nodes[
                        i + 1
                    ].as_related_node_info()

        event.on_end({EventPayload.NODES: nodes})

    return nodes
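
To make the bookkeeping concrete, here is a minimal sketch in plain Python. It is not llama index code: it splits a text into fixed-size character chunks, locates each chunk in the source with str.find (as the method above does), and links each chunk to its neighbors. ToyNode, split_text, and link_nodes are hypothetical names used only for illustration; the real parser splits by tokens and stores relationship objects rather than bare ids.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a llama index node."""

    node_id: str
    text: str
    start_char_idx: Optional[int] = None
    end_char_idx: Optional[int] = None
    relationships: Dict[str, str] = field(default_factory=dict)


def split_text(text: str, chunk_size: int) -> List[ToyNode]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [
        ToyNode(node_id=f"node-{n}", text=text[i : i + chunk_size])
        for n, i in enumerate(range(0, len(text), chunk_size))
    ]


def link_nodes(text: str, nodes: List[ToyNode]) -> List[ToyNode]:
    """Record character offsets and previous/next links on each node."""
    for i, node in enumerate(nodes):
        # find() returns the first match, so repeated passages would map
        # to the earliest offset; this sketch keeps that simple behavior.
        start = text.find(node.text)
        if start >= 0:
            node.start_char_idx = start
            node.end_char_idx = start + len(node.text)
        if i > 0:
            node.relationships["previous"] = nodes[i - 1].node_id
        if i < len(nodes) - 1:
            node.relationships["next"] = nodes[i + 1].node_id
    return nodes


doc_text = "abcdefghij"
nodes = link_nodes(doc_text, split_text(doc_text, chunk_size=4))
for node in nodes:
    print(node.node_id, node.start_char_idx, node.end_char_idx, node.relationships)
```

The previous/next links are what let a retriever widen the context around a matched chunk, and the character offsets let you map a node back to its position in the source document.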

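2. Chunk size parameter optimization

First define an objective function to optimize. Taking the async version, aobjective_function, as the example, the main flow is:

  1. Read the parameters
  2. Build the index
  3. Run the queries
  4. Collect the responses
  5. Evaluate them
  6. Compute the semantic similarity metric, i.e. the similarity between the predicted responses and the reference (ground-truth) responses

import numpy as np
from llama_index.evaluation.eval_utils import aget_responses
from llama_index.param_tuner.base import RunResult


async def aobjective_function(params_dict):
    chunk_size = params_dict["chunk_size"]
    docs = params_dict["docs"]
    top_k = params_dict["top_k"]
    eval_qs = params_dict["eval_qs"]
    ref_response_strs = params_dict["ref_response_strs"]
    # build index
    index = _build_index(chunk_size, docs)
    # query engine
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    # get predicted responses
    pred_response_objs = await aget_responses(eval_qs, query_engine, show_progress=True)
    # run evaluator (_get_eval_batch_runner is a helper defined in the notebook)
    # NOTE: can uncomment other evaluators
    eval_batch_runner = _get_eval_batch_runner()
    eval_results = await eval_batch_runner.aevaluate_responses(
        eval_qs, responses=pred_response_objs, reference=ref_response_strs
    )
    # get semantic similarity metric: mean score over all evaluated questions
    mean_score = np.array([r.score for r in eval_results["semantic_similarity"]]).mean()
    return RunResult(score=mean_score, params=params_dict)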
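
The last step reduces the per-question similarity scores to a single number. As a rough illustration of that idea, and assuming the metric is a cosine similarity between embeddings of the predicted and reference answers, the sketch below uses a toy bag-of-words embedding in place of a real embedding model. toy_embed, cosine_similarity, and mean_similarity are hypothetical helpers, not llama index APIs.

```python
import math
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical embedding: lowercase word counts instead of a real model."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(predictions: List[str], references: List[str]) -> float:
    """Average the per-pair similarity, mirroring the mean over evaluator scores."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    scores = [
        cosine_similarity(toy_embed(pred), toy_embed(ref))
        for pred, ref in zip(predictions, references)
    ]
    return sum(scores) / len(scores)


predicted = ["the cat sat", "dogs bark"]
reference = ["the cat sat", "birds sing"]
print(round(mean_similarity(predicted, reference), 3))
```

A word-count embedding only rewards exact word overlap; a real embedding model would also give credit to paraphrases, which is the point of using a semantic metric here.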

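The rest works like parameter optimization in AutoML:

  1. Pick the candidate values for the parameters to tune, here {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]}
  2. Use ParamTuner with the objective function (the synchronous counterpart of aobjective_function) and the parameter grid to find the best metric and the corresponding parameter values.

The final result: Score: 0.9521222054806685, Top-k: 2, Chunk size: 512

from llama_index.param_tuner import ParamTuner

param_dict = {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]}
# param_dict = {
#     "chunk_size": [256],
#     "top_k": [1]
# }
fixed_param_dict = {
    "docs": docs,
    "eval_qs": eval_qs[:10],
    "ref_response_strs": ref_response_strs[:10],
}

param_tuner = ParamTuner(
    # objective_function is the synchronous counterpart of
    # aobjective_function (defined in the notebook)
    param_fn=objective_function,
    param_dict=param_dict,
    fixed_param_dict=fixed_param_dict,
    show_progress=True,
)
results = param_tuner.tune()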
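
Conceptually this is an exhaustive grid search: evaluate the objective on every combination of candidate values and keep the best score. The sketch below shows that loop in plain Python. grid_search and toy_objective are hypothetical, and the toy scoring function is made up (its peak is chosen only to mirror the result reported above) so that the example runs without an index or an LLM; it says nothing about real retrieval quality.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(
    param_fn: Callable[[Dict[str, Any]], float],
    param_dict: Dict[str, List[Any]],
    fixed_param_dict: Dict[str, Any],
) -> Tuple[float, Dict[str, Any]]:
    """Try every combination in param_dict and return the best (score, params)."""
    names = list(param_dict)
    best_score, best_params = float("-inf"), {}
    for values in product(*(param_dict[name] for name in names)):
        params = {**fixed_param_dict, **dict(zip(names, values))}
        score = param_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_objective(params: Dict[str, Any]) -> float:
    """Made-up score peaking at chunk_size=512, top_k=2; not a real evaluation."""
    return 1.0 - abs(params["chunk_size"] - 512) / 512 - abs(params["top_k"] - 2) / 2


best_score, best_params = grid_search(
    toy_objective,
    {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]},
    {},
)
print(best_score, best_params)
```

In the real setup, each call to the objective builds or loads an index and runs the full evaluation set, which is where the cost comes from.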


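Drawbacks:

  1. The gold dataset is hard to build: with too little coverage it is of little use, and with too much it becomes too expensive.
  2. Parameter search takes a lot of compute, and the time cost is high as well. It is generally not worth attempting without a budget.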
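
For a feel of the cost, here is a back-of-the-envelope count for the grid used above. It assumes one query-engine call per evaluation question per combination, and one index build per distinct chunk size (since _build_index stores each index on disk under a path keyed by chunk size). Evaluator and embedding calls come on top of this.

```python
from math import prod

param_dict = {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]}
num_eval_questions = 10  # eval_qs[:10] in the run above

# Every combination of candidate values is evaluated once.
num_combinations = prod(len(values) for values in param_dict.values())
# Assumption: one query per evaluation question per combination.
num_queries = num_combinations * num_eval_questions
# Assumption: indexes are reused across top_k values for the same chunk size.
num_index_builds = len(param_dict["chunk_size"])

print(num_combinations, num_queries, num_index_builds)  # 9 90 3
```

Adding one more tuned parameter with three candidate values would triple the number of combinations, which is why the grid grows expensive quickly.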
