Integrated Service Portal for the Climate Change Domain (CCPortal): Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

DOI	10.1007/s10661-017-6025-0
Title	Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology
Authors	Fox, Eric W.1; Hill, Ryan A.2; Leibowitz, Scott G.1; Olsen, Anthony R.1; Thornbrugh, Darren J.2,3; Weber, Marc H.1
Publication date	2017-07-01
ISSN	0167-6369
Volume	189, Issue 7
Abstract	Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used, or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
Keywords	Random forest modeling; Variable selection; Model selection bias; National Rivers and Streams Assessment; StreamCat dataset; Benthic macroinvertebrates
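Language	English
WOS accession number	WOS:000404652900013
Source journal	ENVIRONMENTAL MONITORING AND ASSESSMENT
Source institution	U.S. Environmental Protection Agency
Document type	Journal article
Item identifier	http://gcip.llas.ac.cn/handle/2XKMVOVA/57721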
Author affiliations	1. US EPA, Natl Hlth & Environm Effects Res Lab, Western Ecol Div, 200 SW 35th St, Corvallis, OR 97333 USA; 2. US EPA, Natl Hlth & Environm Effects Res Lab, Western Ecol Div, Oak Ridge Inst Sci & Educ ORISE Postdoctoral Part, 200 SW 35th St, Corvallis, OR 97333 USA; 3. Northern Great Plains Network, Natl Pk Serv, 231 East St Joseph St, Rapid City, SD 57701 USA
Recommended citation, GB/T 7714	Fox, Eric W., Hill, Ryan A., Leibowitz, Scott G., et al. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology[J]. ENVIRONMENTAL MONITORING AND ASSESSMENT, 2017, 189(7).
APA	Fox, Eric W., Hill, Ryan A., Leibowitz, Scott G., Olsen, Anthony R., Thornbrugh, Darren J., & Weber, Marc H. (2017). Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. ENVIRONMENTAL MONITORING AND ASSESSMENT, 189(7).
MLA	Fox, Eric W., et al. "Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology." ENVIRONMENTAL MONITORING AND ASSESSMENT 189.7 (2017).