1. Obtaining the medical data
The data can be obtained by crawling medical websites. A ready-made JSON data file, medical.json, is also available and can be used directly.
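The two example records below show that the data contains the following fields:
name: disease name
desc
category
prevent
cause
symptom
get_way
acompany
cure_department
cure_lasttime
cured_prob
cost_money
check
recommand_drug
drug_detail
Some records carry additional fields, such as get_prob, easy_get, cure_way, common_drug, do_eat, not_eat, and recommand_eat.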
{"_id" : { "$oid" : "5bb578b6831b973a137e3ee6"}, "name" : "肺泡蛋白质沉积症", "desc" : "肺泡蛋白质沉积症(简称 PAP),又称 Rosen-Castle-man-Liebow 综合征,是一种罕见疾病。该病以肺泡和细支气管腔内充满 PAS 染色阳性,来自肺的富磷脂蛋白质物质为其特征,好发于青中年,男性发病约 3 倍于女性。", "category" : ["疾病百科", "内科", "呼吸内科"], "prevent" : "1、避免感染分支杆菌病,卡氏肺囊肿肺炎,巨细胞病毒等。\n2、注意锻炼身体,提高免疫力。", "cause" : "病因未明,推测与几方面因素有关:如大量粉尘吸入(铝,二氧化硅等),机体免疫功能下降(尤其婴幼儿),遗传因素,酗酒,微生物感染等,而对于感染,有时很难确认是原发致病因素还是继发于肺泡蛋白沉着症,例如巨细胞病毒,卡氏肺孢子虫,组织胞浆菌感染等均发现有肺泡内高蛋白沉着。\n 虽然启动因素尚不明确,但基本上同意发病过程为脂质代谢障碍所致,即由于机体内,外因素作用引起肺泡表面活性物质的代谢异常,到目前为止,研究较多的有肺泡巨噬细胞活力,动物实验证明巨噬细胞吞噬粉尘后其活力明显下降,而病员灌洗液中的巨噬细胞内颗粒可使正常细胞活力下降,经支气管肺泡灌洗治疗后,其肺泡巨噬细胞活力可上升,而研究未发现Ⅱ型细胞生成蛋白增加,全身脂代谢也无异常,因此目前一般认为本病与清除能力下降有关。", "symptom" : ["紫绀", "胸痛", "呼吸困难", "乏力", "毓卓"], "yibao_status" : "否", "get_prob" : "0.00002%", "get_way" : "无传染性", "acompany" : ["多重肺部感染"], "cure_department" : ["内科", "呼吸内科"], "cure_way" : ["支气管肺泡灌洗"], "cure_lasttime" : "约 3 个月", "cured_prob" : "约 40%", "cost_money" : "根据不同医院,收费标准不一致,省市三甲医院约(8000——15000 元)", "check" : ["胸部 CT 检查", "肺活检", "支气管镜检查"], "recommand_drug" : [], "drug_detail" : [] }
{"_id" : { "$oid" : "5bb578ba831b973a137e4059"}, "name" : "结肠扭转", "desc" : "结肠扭转是指以结肠系膜为轴的部分肠襻扭转和肠管本身纵轴为中心扭曲。在不同的地区发病率各不相同,中国山东以及河北等地区较多。该病可以发生于任何年龄,平均发病年龄 40~90 岁。是急性低位肠梗阻的原因之一。乙状结肠是最好发的部位,其次是盲肠、横结肠和脾曲。\n ", "category" : ["疾病百科", "外科", "普外科"], "prevent" : "预防 \n 如非先天性发育因素而引起的结肠扭转,则针对致病因素(如习惯性便秘的老人),手术后腹腔内粘连,吃高纤维素食物,饱餐后前屈运动等进行预防。", "cause" : "一些慢性便秘的病人肠内容物多,积气使肠襻扩张,妊娠和分娩期肠活动增强腹内器官位置变化,先天或后天因素致远端肠管梗阻,腹腔手术史等,这些都是发生结肠扭转的常见因素。\n 老年病人多见,病史较长,多有典型的便秘史及反复发作史,病人对其发作的规律及缓解方式多能较明确描述; 而青年病人病史较短,喜运动或活动,常使乙状结肠扭转在不知不觉中缓解,往往没有明确的病史及发病规律。\n 发病机制 \n 乙状结肠游离度大,肠襻两端固定点相对较近,所以乙状结肠扭转最常见,乙状结肠扭转多为逆时针方向,少数为顺时针方向,乙状结肠扭转多见于老年男性病人,而青年人乙状结肠扭转多见于女性; 盲肠扭转多为先天性盲肠及升结肠系膜游离肠襻冗长,当肠蠕动活跃或剧烈活动可使肠襻发生扭转,以系膜为轴常顺时针方向扭转,也偶见逆时针方向扭转; 横结肠扭转常与手术粘连有关,结肠扭转 180°~360°为非闭襻性肠梗阻; 若扭转 360°以上可形成闭襻性肠梗阻,一般情况下结肠扭转 360°以下,不容易影响肠管血运和肠腔通畅,大于 180°常出现梗阻,超过 360°结肠扭转后系膜的血管容易受到挤压造成扭转肠襻静脉回流障碍使肠管水肿,腹腔内会有血性渗出液,继而动脉血运不畅导致缺血,甚至坏死,扭转的程度越大造成缺血坏死的机会越多,另外,闭襻性梗阻肠内积气,积液,压力增高也会影响血运,因而闭襻性结肠扭转梗阻往往容易发生肠绞窄,有些病人可因肠坏死出现严重感染乃至休克等,要特别注意发生肠系膜血液循环障碍后短时间内即可出现肠坏死以及严重感染和休克表现。", "symptom" : ["休克", "肠鸣", "结肠血管发育不良", "腹痛伴恶心、呕吐", "腹胀", "粪便嵌塞", "便秘", "粪样呕吐物", "恶心", "肠套叠"], "yibao_status" : "否", "get_prob" : "约占肠扭转发病率的 20%", "easy_get" : "多发于老年男性", "get_way" : "无传染性", "acompany" : ["腹膜炎"], "cure_department" : ["外科", "普外科"], "cure_way" : ["药物治疗", "手术治疗", "康复治疗"], "cure_lasttime" : "1 个月", "cured_prob" : "70%", "common_drug" : ["盐酸消旋山莨菪碱氯化钠注射液", "阿托品异丙嗪注射液"], "cost_money" : "根据不同医院,收费标准不一致,市三甲医院约(10000——30000 元)", "check" : ["胃肠道疾病的超声检查", "腹部平片", "腹部 CT", "胃肠道 CT 检查", "直肠指检", "肠鸣音"], "do_eat" : ["鹿肉", "鸡心", "栗子(熟)", "葵花子仁"], "not_eat" : ["白酒", "银鱼", "小龙虾", "虾皮"], "recommand_eat" : ["香粳米粥", "香菇薏米饭", "薏米莲子粥", "薏米汤", "银耳薏米羹", "牛奶花生粥", "牛奶麦片粥", "酿豆腐"], "recommand_drug" : ["阿托品异丙嗪注射液", "盐酸消旋山莨菪碱氯化钠注射液", "氢溴酸山莨菪碱片"], "drug_detail" : ["京坦松(盐酸消旋山莨菪碱氯化钠注射液)", "广东南国阿托品异丙嗪注射液(阿托品异丙嗪注射液)", "福建福抗氢溴酸山莨菪碱片(氢溴酸山莨菪碱片)", "成都第一氢溴酸山莨菪碱片(氢溴酸山莨菪碱片)", "安徽昌宏氢溴酸山莨菪碱片(氢溴酸山莨菪碱片)" ] }
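Each of the two records above is one standalone JSON object on its own line, so the file is best read line by line rather than parsed as a single JSON document. The following is a minimal sketch under that assumption. `load_records` is a hypothetical helper, and the two inline records are shortened stand-ins for real lines of medical.json:

```python
import json


def load_records(lines):
    """Parse an iterable of JSON lines into dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Shortened stand-ins for two lines of medical.json (illustrative only)
sample_lines = [
    '{"name": "肺泡蛋白质沉积症", "symptom": ["紫绀", "胸痛"], "cure_department": ["内科", "呼吸内科"]}',
    '{"name": "结肠扭转", "symptom": ["腹胀", "便秘"], "easy_get": "多发于老年男性"}',
]

records = load_records(sample_lines)
for record in records:
    # Optional fields such as easy_get are not present in every record
    print(record["name"], sorted(record.keys()))
```

With the real file, an open file object can be passed in directly, for example `load_records(open("medical.json", encoding="utf-8"))`. Because optional fields are missing from some records, later code should test for a key before reading it.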
2. Installing Neo4j and building the knowledge graph
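Download the Community Edition, neo4j-community-4.4.19, extract it, and add the bin directory of the extracted folder to the PATH environment variable. Then run neo4j console. Output like the following appears: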
Starting Neo4j.
2023-03-16 03:48:37.307+0000 INFO Starting...
2023-03-16 03:48:41.301+0000 INFO This instance is ServerId{ae639cc2} (ae639cc2-643d-4f26-8b12-2f13571fdaa1)
2023-03-16 03:48:43.621+0000 INFO ======== Neo4j 4.4.19 ========
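This output means the Neo4j service started successfully. Open http://127.0.0.1:7474 in a browser and change the password so that later connections are easier.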
Install py2neo with pip install py2neo to work with Neo4j from Python. The code below is a quick way to try Neo4j out: it creates three nodes and the relationships between them.
from py2neo import Graph, Node, Relationship

# Connect over the Bolt protocol; replace "xxx" with the password set in the browser
g = Graph("bolt://127.0.0.1:7687", auth=("neo4j", "xxx"))

# Three nodes with the label 'Bayern'
node1 = Node('Bayern', name='Tom', id='25', location='Munich')
node2 = Node('Bayern', name='Jay', id='09', location='Munich')
node3 = Node('Bayern', name='Kuku', id='08', location='Munich')
g.create(node1)
g.create(node2)
g.create(node3)

# Two 'teammate' relationships: Tom -> Jay and Jay -> Kuku
teammate1 = Relationship(node1, 'teammate', node2)
teammate2 = Relationship(node2, 'teammate', node3)
g.create(teammate1)
g.create(teammate2)
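To confirm that the nodes and relationships were written, they can be read back with a Cypher query. The sketch below assumes the 'Bayern' label and 'teammate' relationship type created above, and it relies on py2neo's `Graph.run`, which accepts a parameter dictionary, and on the cursor's `data()` method. `teammate_query` and `find_teammates` are hypothetical helper names, not part of py2neo:

```python
def teammate_query(name):
    """Return a parameterized Cypher query and its parameters."""
    cypher = (
        "MATCH (a:Bayern {name: $name})-[:teammate]->(b:Bayern) "
        "RETURN b.name AS teammate"
    )
    return cypher, {"name": name}


def find_teammates(graph, name):
    """Run the query on a py2neo Graph and return the teammate names."""
    cypher, params = teammate_query(name)
    return [row["teammate"] for row in graph.run(cypher, params).data()]


if __name__ == "__main__":
    cypher, params = teammate_query("Tom")
    print(cypher, params)
    # Needs a running Neo4j instance; uncomment to execute against it.
    # from py2neo import Graph
    # g = Graph("bolt://127.0.0.1:7687", auth=("neo4j", "xxx"))
    # print(find_teammates(g, "Tom"))
```

Passing the name as a `$name` parameter instead of formatting it into the query string avoids quoting problems when a value contains a quote character.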
From medical.json we can extract the following 8 lists, which you can think of as 8 different types of nodes.
drugs = []  # drugs
foods = []  # foods
checks = []  # medical checks
departments = []  # hospital departments
producers = []  # drug producers (branded drug products)
diseases = []  # diseases
symptoms = []  # symptoms
disease_infos = []  # disease details
There are also 11 relationship lists, which give us the edge data.
# relationships between node entities
rels_deparment = []  # department - parent department
rels_noteat = []  # disease - food to avoid
rels_doeat = []  # disease - suitable food
rels_recommandeat = []  # disease - recommended food
rels_commonddrug = []  # disease - commonly used drug
rels_recommanddrug = []  # disease - recommended drug
rels_check = []  # disease - check
rels_drug_producer = []  # producer - drug
rels_symptom = []  # disease - symptom
rels_acompany = []  # disease - complication
rels_category = []  # disease - department
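To make the relationship lists concrete, here is a minimal sketch of how a single record could be turned into a few of these pairs. `extract_relations` is a hypothetical helper written for illustration, and the record is a shortened version of the second sample record. It assumes that `cure_department` lists the parent department first and that `drug_detail` entries have the form `product(drug)` with ASCII parentheses, as in the sample data:

```python
def extract_relations(record):
    """Derive a few relationship pairs from one disease record."""
    disease = record["name"]
    rels = {"symptom": [], "department": [], "category": [], "drug_producer": []}

    # disease - symptom
    for symptom in record.get("symptom", []):
        rels["symptom"].append([disease, symptom])

    # department hierarchy and disease - department
    departments = record.get("cure_department", [])
    if len(departments) == 1:
        rels["category"].append([disease, departments[0]])
    elif len(departments) == 2:
        big, small = departments
        rels["department"].append([small, big])  # sub-department -> parent
        rels["category"].append([disease, small])

    # producer - drug, split on the opening parenthesis
    for detail in record.get("drug_detail", []):
        producer = detail.split("(")[0]
        drug = detail.split("(")[-1].replace(")", "")
        rels["drug_producer"].append([producer, drug])
    return rels


record = {
    "name": "结肠扭转",
    "symptom": ["腹胀", "便秘"],
    "cure_department": ["外科", "普外科"],
    "drug_detail": ["京坦松(盐酸消旋山莨菪碱氯化钠注射液)"],
}
rels = extract_relations(record)
print(rels["department"])     # [['普外科', '外科']]
print(rels["drug_producer"])  # [['京坦松', '盐酸消旋山莨菪碱氯化钠注射液']]
```

Every pair is stored as [start node name, end node name], which is the shape the relationship-creation step later expects.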
Build the Neo4j knowledge graph from the medical.json data:
import os
import json
from tqdm import tqdm
from py2neo import Graph, Node, Relationship


class MedicalGraph:
    def __init__(self, data_path):
        self.data_path = data_path
        # Bolt connection; replace "password" with your own Neo4j password
        self.g = Graph("bolt://127.0.0.1:7687", auth=("neo4j", "password"))
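    def read_nodes(self):
        # 8 kinds of nodes
        drugs = []  # drugs
        foods = []  # foods
        checks = []  # medical checks
        departments = []  # hospital departments
        producers = []  # drug producers (branded drug products)
        diseases = []  # diseases
        symptoms = []  # symptoms
        disease_infos = []  # disease details
        # relationships between node entities
        rels_deparment = []  # department - parent department
        rels_noteat = []  # disease - food to avoid
        rels_doeat = []  # disease - suitable food
        rels_recommandeat = []  # disease - recommended food
        rels_commonddrug = []  # disease - commonly used drug
        rels_recommanddrug = []  # disease - recommended drug
        rels_check = []  # disease - check
        rels_drug_producer = []  # producer - drug
        rels_symptom = []  # disease - symptom
        rels_acompany = []  # disease - complication
        rels_category = []  # disease - department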
        # use the path stored on the instance, not a module-level variable
        with open(self.data_path, encoding='utf-8') as f:
            for data in f.readlines():
                # dictionary holding the attributes of one disease
                disease_dict = {}
                # print(data)
                data_json = json.loads(data)
                # name: disease name, desc: description, category: category
                # prevent, cause, symptom
                # get_way, acompany, cure_department
                # cure_lasttime, cured_prob, cost_money
                # check, recommand_drug, drug_detail
                disease = data_json['name']
                disease_dict['name'] = disease
                diseases.append(disease)
                # default every attribute to '' so that missing fields do not raise KeyError later
                disease_dict['desc'] = ''
                disease_dict['prevent'] = ''
                disease_dict['cause'] = ''
                disease_dict['easy_get'] = ''
                disease_dict['cure_department'] = ''
                disease_dict['cure_way'] = ''
                disease_dict['cure_lasttime'] = ''
                disease_dict['symptom'] = ''
                disease_dict['cured_prob'] = ''
                # disease - symptom relationships, collected in rels_symptom
                if 'symptom' in data_json:
                    symptoms += data_json['symptom']
                    disease_dict['symptom'] = data_json['symptom']
                    for sympt in data_json['symptom']:
                        rels_symptom.append([disease, sympt])
                # disease - complication relationships
                if 'acompany' in data_json:
                    for acompany in data_json['acompany']:
                        rels_acompany.append([disease, acompany])
                # disease description
                if 'desc' in data_json:
                    disease_dict['desc'] = data_json['desc']
                # prevention
                if 'prevent' in data_json:
                    disease_dict['prevent'] = data_json['prevent']
                # causes
                if 'cause' in data_json:
                    disease_dict['cause'] = data_json['cause']
                # incidence
                if 'get_prob' in data_json:
                    disease_dict['get_prob'] = data_json['get_prob']
                # susceptible population
                if 'easy_get' in data_json:
                    disease_dict['easy_get'] = data_json['easy_get']
                # treating departments
                if 'cure_department' in data_json:
                    cure_department = data_json['cure_department']
                    # a single department, e.g. ["内科"]
                    if len(cure_department) == 1:
                        rels_category.append([disease, cure_department[0]])
                    # two departments, e.g. ["内科", "呼吸内科"]
                    if len(cure_department) == 2:
                        big = cure_department[0]
                        small = cure_department[1]
                        # sub-department - parent department
                        rels_deparment.append([small, big])
                        # disease - department
                        rels_category.append([disease, small])
                    disease_dict['cure_department'] = cure_department
                    departments += cure_department
                if 'cure_way' in data_json:
                    disease_dict['cure_way'] = data_json['cure_way']
                if 'cure_lasttime' in data_json:
                    disease_dict['cure_lasttime'] = data_json['cure_lasttime']
                if 'cured_prob' in data_json:
                    disease_dict['cured_prob'] = data_json['cured_prob']
                if 'common_drug' in data_json:
                    common_drug = data_json['common_drug']
                    for drug in common_drug:
                        rels_commonddrug.append([disease, drug])
                    drugs += common_drug
                if 'recommand_drug' in data_json:
                    recommand_drug = data_json['recommand_drug']
                    drugs += recommand_drug
                    for drug in recommand_drug:
                        rels_recommanddrug.append([disease, drug])
                if 'not_eat' in data_json:
                    not_eat = data_json['not_eat']
                    for _not in not_eat:
                        rels_noteat.append([disease, _not])
                    foods += not_eat
                if 'do_eat' in data_json:
                    do_eat = data_json['do_eat']
                    for _do in do_eat:
                        rels_doeat.append([disease, _do])
                    foods += do_eat
                # test for the key in the record itself; testing inside the list
                # meant the recommended foods were never collected
                if 'recommand_eat' in data_json:
                    recommand_eat = data_json['recommand_eat']
                    for _recom in recommand_eat:
                        rels_recommandeat.append([disease, _recom])
                    foods += recommand_eat
                if 'check' in data_json:
                    check = data_json['check']
                    for c in check:
                        rels_check.append([disease, c])
                    checks += check
                # sample: {"drug_detail" : [ "桂林南药布美他尼片(布美他尼片)", "雄巴拉曲神水..."}
                if 'drug_detail' in data_json:
                    drug_detail = data_json['drug_detail']
                    producer = [i.split('(')[0] for i in drug_detail]
                    # (producer / branded product name, drug name)
                    rels_drug_producer += [
                        (i.split('(')[0], i.split('(')[-1].replace(')', ''))
                        for i in drug_detail
                    ]
                    producers += producer
                disease_infos.append(disease_dict)
        return set(drugs), set(foods), set(checks), set(departments),\
            set(producers), set(symptoms), set(diseases), disease_infos,\
            rels_check, rels_recommandeat, rels_noteat, rels_doeat, rels_deparment,\
            rels_commonddrug, rels_drug_producer, rels_recommanddrug, rels_symptom,\
            rels_acompany, rels_category
    def creat_node(self, label, nodes):
        # create one node per name under the given label
        for c, node_name in enumerate(nodes):
            node = Node(label, name=node_name)
            self.g.create(node)
            # print(c, len(nodes))
        return None
    def create_disease_nodes(self, disease_infos):
        # create the disease nodes, the center of the knowledge graph
        for disease_info in tqdm(disease_infos):
            node = Node('Disease',
                        name=disease_info['name'],
                        desc=disease_info['desc'],
                        prevent=disease_info['prevent'],
                        cause=disease_info['cause'],
                        easy_get=disease_info['easy_get'],
                        cure_lasttime=disease_info['cure_lasttime'],
                        cure_department=disease_info['cure_department'],
                        cure_way=disease_info['cure_way'],
                        cured_prob=disease_info['cured_prob'],
                        )
            self.g.create(node)
    def create_graphnodes(self):
        drugs, foods, checks, departments, producers, symptoms, diseases, \
            disease_infos, rels_check, rels_recommandeat, rels_noteat, rels_doeat, \
            rels_deparment, rels_commonddrug, rels_drug_producer, rels_recommanddrug,\
            rels_symptom, rels_acompany, rels_category = self.read_nodes()
        self.create_disease_nodes(disease_infos)
        self.creat_node('Drug', drugs)
        self.creat_node('Food', foods)
        self.creat_node('Check', checks)
        self.creat_node('Department', departments)
        self.creat_node('Producer', producers)
        # the label must be 'Symptom' to match the has_symptom relationship query
        self.creat_node('Symptom', symptoms)
        return None
    def create_relationship(self, start_node, end_node, edges, rel_type, rel_name):
        """
        :param start_node: label of the start node
        :param end_node: label of the end node
        :param edges: list of [start name, end name] pairs
        :param rel_type: relationship type
        :param rel_name: relationship name
        :return:
        """
        count = 0
        # join each pair into one string so duplicates can be removed with a set
        set_edges = []
        for edge in edges:
            set_edges.append('###'.join(edge))
        num_edges = len(set(set_edges))
        for edge in set(set_edges):
            edge = edge.split('###')
            p = edge[0]
            q = edge[1]
            # Cypher statement that creates the relationship
            query = "match (p:%s), (q:%s) where p.name='%s' and q.name='%s' create (p)-[rel:%s {name:'%s'}]->(q)" \
                % (start_node, end_node, p, q, rel_type, rel_name)
            try:
                self.g.run(query)
                count += 1
                # print(rel_type, count, num_edges)
            except Exception as e:
                # a name containing a single quote breaks the query; such edges are skipped
                pass
                # print(e)
        return None
    def create_graphrels(self):
        drugs, foods, checks, departments, producers, symptoms, diseases, \
            disease_infos, rels_check, rels_recommandeat, rels_noteat, rels_doeat, \
            rels_deparment, rels_commonddrug, rels_drug_producer, rels_recommanddrug, \
            rels_symptom, rels_acompany, rels_category = self.read_nodes()
        # arguments: start label, end label, edge list, relationship type, relationship name
        self.create_relationship('Disease', 'Food', rels_recommandeat, 'recommand_eat', '推荐食谱')
        self.create_relationship('Disease', 'Food', rels_noteat, 'no_eat', '禁忌食品')
        self.create_relationship('Disease', 'Food', rels_doeat, 'do_eat', '宜吃')
        self.create_relationship('Department', 'Department', rels_deparment, 'belongs_to', '属于')
        self.create_relationship('Disease', 'Drug', rels_commonddrug, 'common_drug', '常用药品')
        self.create_relationship('Producer', 'Drug', rels_drug_producer, 'drugs_of', '生产药品')
        self.create_relationship('Disease', 'Drug', rels_recommanddrug, 'recommand_drug', '推荐药品')
        self.create_relationship('Disease', 'Check', rels_check, 'need_check', '诊断检查')
        self.create_relationship('Disease', 'Symptom', rels_symptom, 'has_symptom', '症状')
        self.create_relationship('Disease', 'Disease', rels_acompany, 'acompany_with', '并发症')
        self.create_relationship('Disease', 'Department', rels_category, 'belongs_to', '所属科室')
    def export_data(self):
        drugs, foods, checks, departments, producers, symptoms, diseases,\
            disease_infos, rels_check, rels_recommandeat, rels_noteat, rels_doeat,\
            rels_deparment, rels_commonddrug, rels_drug_producer, rels_recommanddrug,\
            rels_symptom, rels_acompany, rels_category = self.read_nodes()
        export_path = './exported'
        print('Exporting data...')
        os.makedirs(export_path, exist_ok=True)
        # one text file per node type, one name per line
        self._write2txt(os.path.join(export_path, 'drug.txt'), drugs)
        self._write2txt(os.path.join(export_path, 'food.txt'), foods)
        self._write2txt(os.path.join(export_path, 'check.txt'), checks)
        self._write2txt(os.path.join(export_path, 'department.txt'), departments)
        self._write2txt(os.path.join(export_path, 'producer.txt'), producers)
        self._write2txt(os.path.join(export_path, 'symptoms.txt'), symptoms)
        self._write2txt(os.path.join(export_path, 'diseases.txt'), diseases)
        return None

    def _write2txt(self, path, data):
        # write as UTF-8 explicitly so Chinese names do not depend on the platform default encoding
        with open(path, 'w+', encoding='utf-8') as f:
            f.write('\n'.join(list(data)))
        return None
if __name__ == '__main__':
    data_path = r'medical.json'
    m = MedicalGraph(data_path)
    # drugs, foods, checks, departments, producers, symptoms, diseases, \
    #     disease_infos, rels_check, rels_recommandeat, rels_noteat, rels_doeat, \
    #     rels_deparment, rels_commonddrug, rels_drug_producer, rels_recommanddrug, \
    #     rels_symptom, rels_acompany, rels_category = m.read_nodes()
    # print(disease_infos)
    m.create_graphnodes()
    m.create_graphrels()
    m.export_data()
    print('Done!')
If the script runs successfully, open 127.0.0.1:7474 in a browser to see the resulting relationship graph (remember to start Neo4j first by running neo4j console in cmd).
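Besides browsing the graph visually, a couple of Cypher queries give a quick sanity check of what was imported. The sketch below assumes the 'Disease' and 'Check' labels and the 'need_check' relationship type created by the script above. `check_query`, `LABEL_COUNT_QUERY`, and `summarize` are hypothetical names introduced for illustration:

```python
# Count the nodes under each label, largest first
LABEL_COUNT_QUERY = (
    "MATCH (n) RETURN labels(n)[0] AS label, count(*) AS total "
    "ORDER BY total DESC"
)


def check_query(disease_name):
    """Return a parameterized query listing the checks for one disease."""
    cypher = (
        "MATCH (d:Disease {name: $name})-[:need_check]->(c:Check) "
        "RETURN c.name AS check_name"
    )
    return cypher, {"name": disease_name}


def summarize(graph, disease_name):
    """Run both queries on a py2neo Graph and return their results."""
    counts = graph.run(LABEL_COUNT_QUERY).data()
    cypher, params = check_query(disease_name)
    checks = [row["check_name"] for row in graph.run(cypher, params).data()]
    return counts, checks


if __name__ == "__main__":
    print(check_query("结肠扭转"))
    # Needs a running Neo4j instance with the imported graph; uncomment to run.
    # from py2neo import Graph
    # g = Graph("bolt://127.0.0.1:7687", auth=("neo4j", "password"))
    # print(summarize(g, "结肠扭转"))
```

If a label shows a count of zero, or a disease returns no checks even though its record lists some, the node labels used during import probably do not match the labels used when the relationships were created.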
References
[2] Liu Huanyong (刘焕勇), medical knowledge graph project QABasedOnMedicaKnowledgeGraph
[3] 医疗领域知识图谱构建实战 (Building a Medical-Domain Knowledge Graph in Practice)