Researchers often face the problem of how to handle missing data. Multiple imputation by chained equations (MICE) is one of the most widely used imputation methods. In theory, any imputation model can be used to predict the missing values; however, if the predictive models are misspecified, the result can be biased estimates and invalid inferences. Among the more recent approaches to missing data are machine learning methods and SuperMICE, an ensemble algorithm. In this paper, we present a set of simulations indicating that this approach produces final parameter estimates with lower bias and better coverage than other commonly used imputation methods. We also discuss the application of several machine learning methods and SuperMICE to data from the Industrial Establishment Survey, in which different variables are imputed simultaneously. Finally, we evaluate the competing methods and identify the one that performs best.
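As a rough illustration of the chained-equations idea described above, the following minimal Python sketch cycles through the incomplete variables and re-imputes each one from a model fitted on the others. It is not the authors' implementation and not SuperMICE itself: the function names (`fit_ols`, `chained_imputation`, `multiply_impute`) are hypothetical, and ordinary least squares stands in for whatever learner or ensemble would be plugged in through the `fit` argument. The sketch adds residual noise only and does not draw the model parameters, so it should be read as a simplified approximation of proper multiple imputation.

```python
import random


def fit_ols(X, y):
    """Least-squares fit with intercept; returns (predict, residual_sd)."""
    n, p = len(X), len(X[0]) + 1
    A = [[1.0] + list(row) for row in X]  # design matrix with intercept column
    # Normal equations (A'A) b = A'y, stored as an augmented matrix.
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
         + [sum(A[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(p):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * b for a, b in zip(M[r], M[i])]
    beta = [M[i][p] / M[i][i] for i in range(p)]

    def predict(row):
        return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

    resid = [yi - predict(xi) for xi, yi in zip(X, y)]
    sd = (sum(e * e for e in resid) / max(n - p, 1)) ** 0.5
    return predict, sd


def others(row, j):
    """All values of a row except column j (the predictors for column j)."""
    return [v for k, v in enumerate(row) if k != j]


def chained_imputation(rows, fit=fit_ols, n_iter=10, rng=None):
    """One completed dataset; missing cells are marked with None."""
    rng = rng or random.Random(0)
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Step 1: initialise every missing cell with its column's observed mean.
    for j in range(n_cols):
        obs = [r[j] for r in rows if r[j] is not None]
        mean = sum(obs) / len(obs)
        for i, r in enumerate(data):
            if missing[i][j]:
                r[j] = mean
    # Step 2: cycle through the incomplete columns, refitting each time.
    for _ in range(n_iter):
        for j in range(n_cols):
            miss_idx = [i for i in range(len(data)) if missing[i][j]]
            if not miss_idx:
                continue
            obs_idx = [i for i in range(len(data)) if not missing[i][j]]
            X_obs = [others(data[i], j) for i in obs_idx]
            y_obs = [data[i][j] for i in obs_idx]
            predict, sd = fit(X_obs, y_obs)  # any model with this interface
            for i in miss_idx:
                # Prediction plus residual noise, so imputations vary.
                data[i][j] = predict(others(data[i], j)) + rng.gauss(0.0, sd)
    return data


def multiply_impute(rows, m=5, **kwargs):
    """m completed datasets, each generated with a different random seed."""
    return [chained_imputation(rows, rng=random.Random(seed), **kwargs)
            for seed in range(m)]
```

Under these assumptions, `multiply_impute(rows, m=5)` returns five completed copies of `rows`; an analysis would then be run on each copy and the results pooled. Replacing `fit_ols` with another function that returns a `(predict, residual_sd)` pair is how a different learner would be swapped in.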
Type of Study: Applied | Subject: Applied Statistics
Received: 2024/10/13 | Accepted: 2024/08/31 | Published: 2025/05/18
Ghaderi M, Rezaei Ghahroodi Z, Gandomi M. Advanced Missing Value Imputation Techniques: Machine Learning Methods with an Emphasis on an Ensemble Method for Multiple Imputation by Chained Equations. JSS 2025; 19(1): 161-182. URL: http://jss.irstat.ir/article-1-907-en.html