CCPortal
DOI: 10.1080/1062936X.2016.1253611
An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling
Mansouri, K. (1,2); Grulke, C. M. (2); Richard, A. M. (2); Judson, R. S. (2); Williams, A. J. (2)
Publication date: 2016
ISSN: 1062-936X
Volume: 27; Issue: 11; Pages: 911-937
Abstract

The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
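The abstract does not spell out the individual checks, so the following is only a minimal sketch of one kind of identifier validation such a workflow might include: verifying the check digit of a CASRN and flagging records with missing identifiers. The field names (`name`, `casrn`, `smiles`, `molblock`) and both functions are hypothetical and are not taken from the published KNIME workflow; only the CASRN check-digit rule itself is standard.

```python
import re
from typing import Dict, List, Optional

# A CASRN has the form NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
CASRN_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")

# Hypothetical field names for the four identifiers named in the abstract.
IDENTIFIER_FIELDS = ("name", "casrn", "smiles", "molblock")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CASRN_PATTERN.fullmatch(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position counted from the right, starting at 1.
    total = sum(pos * int(d) for pos, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def flag_record(record: Dict[str, Optional[str]]) -> List[str]:
    """List simple identifier problems for one record (illustrative only)."""
    issues = [f"missing {field}" for field in IDENTIFIER_FIELDS
              if not record.get(field)]
    casrn = record.get("casrn")
    if casrn and not is_valid_casrn(casrn):
        issues.append(f"invalid CASRN: {casrn}")
    return issues


if __name__ == "__main__":
    record = {"name": "water", "casrn": "7732-18-4",
              "smiles": "O", "molblock": None}
    print(flag_record(record))
```

Checks of this kind cover only the identifier side. Comparing a SMILES string against a MolBlock, or detecting hypervalency and stereochemistry problems, would need a cheminformatics toolkit and is outside this sketch.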


Keywords: data curation; standardization; QSAR modelling; physicochemical properties; Open Data
Language: English
WOS accession number: WOS:000389355800003
Source journal: SAR and QSAR in Environmental Research
Source institution: US Environmental Protection Agency
Document type: Journal article
Item identifier: http://gcip.llas.ac.cn/handle/2XKMVOVA/58269
Author affiliations:
1. Oak Ridge Inst Sci & Educ ORISE, Oak Ridge, TN 37830 USA
2. US EPA, Off Res & Dev, Natl Ctr Computat Toxicol, Res Triangle Pk, NC USA
Recommended citation
GB/T 7714: Mansouri, K., Grulke, C. M., Richard, A. M., et al. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling[J]. SAR and QSAR in Environmental Research, 2016, 27(11): 911-937.
APA: Mansouri, K., Grulke, C. M., Richard, A. M., Judson, R. S., & Williams, A. J. (2016). An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR and QSAR in Environmental Research, 27(11), 911-937.
MLA: Mansouri, K., et al. "An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling." SAR and QSAR in Environmental Research 27.11 (2016): 911-937.

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.