The ability to easily collect large amounts of data from different sources can be seen as an opportunity to better understand many processes, and it has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present important methodological and technical challenges for data analysts.
The aim of this tutorial is to present an overview of the missing-values literature, as well as of recent methods that have caught the attention of the community thanks to their ability to handle large matrices with large numbers of missing entries.
We will touch upon single imputation, with a focus on matrix completion methods based on iterative regularized SVD; notions of confidence intervals, by giving the fundamentals of multiple imputation strategies; as well as issues of visualization with incomplete data. The approaches will be illustrated on real data with continuous, binary and categorical variables using some of the main R packages: Amelia, mice, missForest, missMDA, norm, softImpute and VIM.
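As a minimal sketch of the matrix completion idea, the snippet below imputes a toy numeric matrix with the softImpute package, which fits a low-rank approximation by soft-thresholded (regularized) SVD; the matrix, rank and regularization parameter are illustrative assumptions, not values from the tutorial.

```r
# Sketch: single imputation by regularized SVD (matrix completion) with softImpute
library(softImpute)

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)   # toy complete matrix (assumed data)
X[sample(length(X), 400)] <- NA         # remove ~20% of the entries at random

# Low-rank fit via nuclear-norm regularized SVD (ALS algorithm);
# rank.max and lambda are illustrative and would normally be tuned
fit <- softImpute(X, rank.max = 5, lambda = 1, type = "als")

# Fill the missing cells with their low-rank predictions
X_imputed <- complete(X, fit)
```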
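For the multiple imputation and visualization parts, a hedged example with mice and VIM is sketched below; the airquality data set and the analysis model are chosen purely for illustration and are not taken from the tutorial material.

```r
# Sketch: inspecting missingness with VIM, then multiple imputation with mice
library(mice)
library(VIM)

data(airquality)

# Visualize the amount and pattern of missing values
aggr(airquality, numbers = TRUE, sortVars = TRUE)

# Generate m = 5 completed data sets by chained equations
imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the analysis model on each completed data set and pool with Rubin's rules,
# so the resulting confidence intervals reflect the imputation uncertainty
fit <- with(imp, lm(Ozone ~ Wind + Temp))
summary(pool(fit))
```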
For details, refer to the tutorial description.