数据挖掘Pubmed和Pubchem-医学和生化信息数据库

PubMed代表来自生命科学期刊,在线书籍和MEDLINE的超过2千8百万次引用(摘要和标题)的生物医学文献。 引用也可能包括文章的全文。 在公开的2型糖尿病天然化合物中的典型查询

Pubchem-包含超过1亿种化合物和2.36亿种物质的数据库。 该数据库还显示了125万种化合物的生物活性(例如,化合物的抗癌活性或特定基因的抑制作用)。 目前,已知约有900万种有机化合物(复杂物质)。 无机化学品的数量可能很大-从10 ** 18起

在本文中,我将提供一些示例,以列出导致癌症生存期预后不良基因列表,以及PubChem碱基所有化学分子中有机化合物及其编号的搜索代码 。 本文将不会进行机器学习(在有关糖尿病生物标志物的下一篇文章中将需要机器学习,通过RNA表达确定一个人的年龄,筛选抗癌物质)。

为了继续,我们将安装必要的python软件包Biopython和pubchempy。

sudo conda install biopython pip install pubchempy 

考研


我们将根据基因的过表达和表达不足以及不良的癌症预后来挖掘这些基因-这就是典型标题的样子,公开要求和目标基因:

(“ DEK的高表达预示着胃腺癌的不良预后。”,“ DEK不良预后”,“ DEK”,277,15)

为何需要这样做-该基因可用于计算分子及其组合对与不良癌症预后相关的靶标的药理作用。 (例如,基于pubchem或LINCS)。

我们上传带有基因名称的文件(大约12000): Github

 import csv genes=[]; with open('/Users/andrejeremcuk/Downloads/genes.txt', 'r') as fp : reader = csv.reader(fp, delimiter='\t') for i in range(20000): genes.append(reader.next()) import time import numpy as np genesq=np.genfromtxt('/Users/andrejeremcuk/Downloads/genesq.txt',dtype='str') 

对于发布请求,您必须指出您的电子邮件:

 from Bio import Entrez from Bio import Medline MAX_COUNT = 100 Entrez.email = '*@yandex.ru' articles=[];genes_cancer_poor=[];genes_cancer_poor1=[]; 

查询和结果处理:

 for u in range(0,len(genesq)): print u if u%100==0: np.savetxt('/Users/andrejeremcuk/Downloads/genes_cancer_poor.txt', genes_cancer_poor,fmt='%s'); np.savetxt('/Users/andrejeremcuk/Downloads/genes_cancer_poor1.txt', genes_cancer_poor1, fmt='%s') gene=genesq[u];genefullname=genes[u][2] TERM=gene+' '+'poor prognosis' try: h=Entrez.esearch(db='pubmed', retmax=MAX_COUNT, term=TERM) except: time.sleep(5);h=Entrez.esearch(db='pubmed', retmax=MAX_COUNT, term=TERM) result = Entrez.read(h) ids = result['IdList'] h = Entrez.efetch(db='pubmed', id=ids, rettype='medline', retmode='text') ret = Medline.parse(h) fer=[]; for re in ret: try: tr=re['TI']; except: tr='0'; fer.append(tr); 

在标题文本中查找关键字:

  for i in range(len(fer)): gene1=fer[i].find(gene) gene2=fer[i].find(genefullname) ##### inc=fer[i].find("Increased") highe=fer[i].find("High expression") high=fer[i].find("High") expr=fer[i].find("expression") Overe=fer[i].find("Overexpression") overe=fer[i].find("overexpression") up1=fer[i].find("Up-regulation") el1=fer[i].find("Elevated expression") expr1=fer[i].find("Expression of ") #### decr=fer[i].find("Decreased") loss=fer[i].find("Loss") low1=fer[i].find("Low expression") low2=fer[i].find("Low levels") down1=fer[i].find("Down-regulated") down2=fer[i].find("Down-regulated") down3=fer[i].find("Downregulation") ##### acc=fer[i].find("accelerates") poor=fer[i].find("poor patient prognosis") poor1=fer[i].find("poor prognosis") poor2=fer[i].find("unfavorable clinical outcomes") poor3=fer[i].find("unfavorable prognosis") poor4=fer[i].find("poor outcome") poor5=fer[i].find("poor survival") poor6=fer[i].find("poor patient survival") poor7=fer[i].find("progression and prognosis") ### canc=fer[i].find("cancer") canc1=fer[i].find("carcinoma") 

我们会检查标题中的顺序以及最常见的短语是否存在。

  if (gene1!=-1)or(gene2!=-1): #<poor1,poor,poor2,poor3,poor4,poor5,poor6,poor7 if (canc1!=-1)or(canc!=-1): if (poor!=-1)or(poor1!=-1)or(poor2!=-1)or(poor3!=-1)or(poor4!=-1)or(poor5!=-1)or(poor6!=-1)or(poor7!=-1): # genel=-1; if (gene1!=-1): genel=gene1; if (gene2!=-1): genel=gene2; gene1=genel; if (expr!=-1): #<poor1,poor,poor2,poor3,poor4,poor5,poor6,poor7 if (gene1<expr): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,1)) if (low1!=-1)and(gene1!=-1): if (low1<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,2)) if (el1!=-1)and(gene1!=-1): if (el1<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,3)) if (Overe!=-1)and(gene1!=-1): if (Overe<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,4)) if (overe!=-1)and(gene1!=-1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,5)) if (expr1!=-1)and(gene1!=-1): if (expr1<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,6)) if (up1!=-1)and(gene1!=-1): if (up1<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,7)) if (highe!=-1)and(gene1!=-1): if (highe<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,8)) if (high!=-1)and(gene1!=-1)and(expr!=-1): if (high<gene1<expr): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,9)) if (gene1!=-1)and(expr1!=-1): if (expr1<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,10)) if (gene1!=-1)and(inc!=-1): if (inc<gene1): articles.append((fer[i],TERM,gene,u,i));genes_cancer_poor.append((gene,u,i,11)) ########### if (gene1!=-1)and(decr!=-1): if (decr<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,12)) if (gene1!=-1)and(loss!=-1): if (loss<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,13)) if (gene1!=-1)and(low1!=-1): if (low1<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,14)) if (gene1!=-1)and(low2!=-1): if (low2<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,15)) if (gene1!=-1)and(down1!=-1): if (down1<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,16)) if (gene1!=-1)and(down2!=-1): if (down2<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,17)) if (gene1!=-1)and(down3!=-1): if (down3<gene1): articles.append((fer[i],TERM,gene,u,i,'low'));genes_cancer_poor1.append((gene,u,i,18)) 

结果,我们得到了几个清单:低表达和高表达的基因,预后不良。

总共有913篇文章输入了关键词和目标短语。

PubChem


该数据库提供了两种访问其信息的方式: 通过json格式的REST API,请求如下所示:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2516/description/json

重要的是,通过此路径的请求不能超过每秒5个,但是到目前为止,我还没有检查限制,它们应该保存代理。

并通过pubchempy库:

 import pubchempy as pcp c = pcp.Compound.from_cid(5090) c.canonical_smiles 

导入所需的PUG REST API软件包:

 import re import urllib, json, time import numpy as np 

清除HTML标签中的文本的函数:

 def cleanhtml(raw_html): cleanr = re.compile('<.*?>') cleantext = re.sub(cleanr, '', raw_html) return cleantext 

在下面的代码中,我们将打开pubchem中1到100,000个分子的英语描述,寻找暗示该分子本质上是有机的(来自动植物或作为饮料的一部分),而既无毒也不致癌。

 natural=[]; for i in range(1,100000): url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"+str(i)+"/description/json" time.sleep(0.2) try: response = urllib.urlopen(url) except: time.sleep(12);response = urllib.urlopen(url) data = json.loads(response.read()) op=0;ol=0;ot=0; try: for u in range(1,len(data['InformationList']['Information'])): soup=str(data['InformationList']['Information'][u]['Description']) soup1=cleanhtml(soup) if (soup1.find('carcinogen')!=-1)or(soup1.find('death')!=-1)or(soup1.find('damage')!=-1): break; if (soup1.find('toxic')!=-1): break; if (soup1.find(' plant')!=-1)and(op!=9)and(soup1.find('planting')==-1): natural.append((i,'plant',str(data['InformationList']['Information'][0]['Title'])));op=9; if (soup1.find(' beverages')!=-1)and(ot!=9): natural.append((i,'beverages',str(data['InformationList']['Information'][0]['Title'])));ot=9; if (soup1.find(' animal')!=-1)and(ol!=9): natural.append((i,'animal',str(data['InformationList']['Information'][0]['Title'])));ol=9; except: ii=0; if i%100==0: print i;np.savetxt('/Users/andrejeremcuk/Downloads/natural.txt', natural,fmt='%s', delimiter='<') 

要在植物文本中搜索引用,请使用.find(“植物”)。 最后,将生成的有机化合物及其编号保存在PubChem中。

Github

Source: https://habr.com/ru/post/zh-CN424271/


All Articles