The Journal of Open Source Software
Øystein Sørensen
Many problems in science involve using measured variables to explain an outcome of interest using some statistical regression model. In high-dimensional problems, characterized by having a very large number of variables, one often focuses on finding a subset of variables with good explanatory power. An example from cancer research involves finding gene expressions or other biomarkers which can explain disease progression, from a large set of candidates (Kristensen et al., 2014). Another example is customer analytics, where it may be of interest to find out which variables predict whether customers will return or not, and variables of interest include factors like previous purchasing patterns, demographics, and satisfaction measures (Baesens, 2014).
The lasso (Tibshirani, 1996) and the Dantzig selector (Candes & Tao, 2007; James & Radchenko, 2009) are popular methods for variable selection in this type of problem, combining computational speed with good statistical properties (Bühlmann & Geer, 2011). In many practical applications, the process of measuring the variables of interest is subject to measurement error (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006), but this additional source of noise is neglected by the aforementioned models. Such measurement error has been shown to worsen the variable selection properties of the lasso (Sørensen, Frigessi, & Thoresen, 2015), typically by increasing the number of false positive selections. A corrected lasso has been proposed and analyzed by Loh & Wainwright (2012) for linear models and by Sørensen et al. (2015) for generalized linear models. It has been applied by Vasquez, Hu, Roe, Halonen, & Guerra (2019) in a problem involving measurement of serum biomarkers. For the Dantzig selector, Rosenbaum & Tsybakov (2010) proposed the Matrix Uncertainty Selector (MUS) for linear models, which was extended to the generalized linear model case by Sørensen, Hellton, Frigessi, & Thoresen (2018) with an algorithm named GMUS (Generalized MUS).
hdme is an R (R Core Team, 2018) package containing implementations of both the corrected lasso and the MU selector for high-dimensional measurement error problems. Its main functions are gmus() and corrected_lasso(). Additional functions provide opportunities for hyperparameter tuning using cross-validation or the elbow rule (Rosenbaum & Tsybakov, 2010), as well as plotting tools for visualizing the model fit. The underlying numerical procedures are implemented in C++ using the RcppArmadillo package (Eddelbuettel & Sanderson, 2014) and linear programming with Rglpk (Theussl & Hornik, 2019). hdme is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org, and the latest development version is available at https://github.com/osorensen/hdme. The package vignette, which can be opened in R with the command vignette("hdme"), contains a step-by-step introduction to the models implemented in the package.
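As a minimal sketch of a typical workflow, the following R code simulates covariates observed with additive measurement error and fits both main models; the function signatures are assumed to follow the package documentation, and the simulated data, dimensions, and noise level are illustrative choices rather than recommendations:

```r
# Sketch of an hdme workflow on simulated data with additive measurement error.
library(hdme)

set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), nrow = n)            # true, unobserved covariates
beta <- c(rep(2, 5), rep(0, p - 5))            # sparse coefficient vector
y <- as.numeric(X %*% beta + rnorm(n))         # continuous outcome
sigmaUU <- diag(0.2, p)                        # measurement error covariance, assumed known
W <- X + matrix(rnorm(n * p, sd = sqrt(0.2)), nrow = n)  # noisy measurements

# Corrected lasso (Loh & Wainwright, 2012); requires the error covariance sigmaUU
fit_cl <- corrected_lasso(W, y, sigmaUU, family = "gaussian")
plot(fit_cl)

# Generalized Matrix Uncertainty Selector (GMUS)
fit_gmus <- gmus(W, y, family = "gaussian")
plot(fit_gmus)
```

Tuning parameters can be chosen with the package's cross-validation and elbow-rule helpers described in the vignette, rather than being fixed by hand as above.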