Multiple imputation to fill in missing data in soil physico-hydrical properties database

Luciana Maria de Oliveira, Herdjania Veras de Lima, Sueli Rodrigues, Eduardo Jorge Maklouf Carvalho, Lorena Chagas Torres

Abstract


Missing values are a common and almost inevitable problem in databases. Multiple imputation (MI) is an efficient statistical method for estimating missing values in an incomplete dataset. To test this approach on a soil database, we hypothesized that imputing the missing data yields a statistically more accurate database than complete case analysis (CCA). The overall goal of our study was to evaluate the efficiency of MI using the MICE (Multivariate Imputation by Chained Equations) algorithm to fill in missing data in a database of soil physico-hydrical properties, and to show that imputation is more feasible than CCA. Preliminary analyses were performed to check the suitability of the proposed algorithm. The imputation model for each variable with missing data was fitted by linear regression: that variable entered the model as the dependent variable, and the other variables correlated with it entered as covariates. The two approaches were compared through the values of the estimates, their standard errors, and their 95% confidence intervals. The missing-data pattern was multivariate and arbitrary, and organic matter was the variable with the largest amount of missing data. The significance of the covariates varied depending on the variable being estimated. The results showed that MICE performed better than CCA: although the two methods gave statistically similar results, multiple imputation maintains the size of the database and preserves the general distribution.
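As a rough illustration of the chained-equations idea described above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes only two variables, uses synthetic data, and the names `fit_line`, `impute_once`, and `pool_mean` are hypothetical helpers invented for this example. Each incomplete variable is regressed on the other, missing cells are replaced by the prediction plus residual noise, and the mean of one variable is pooled across the completed datasets with Rubin's rules. A full MI procedure would also draw the regression parameters from their posterior distribution, which this sketch omits for brevity.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (len(x) - 2)) ** 0.5


def impute_once(data, rng, n_iter=10):
    """Return one completed copy of a two-column dataset (None = missing)."""
    first, second = data  # the two column names
    missing = {k: [v is None for v in data[k]] for k in data}
    filled = {}
    for k in data:  # initialise missing cells with the observed mean
        mean = statistics.fmean(v for v in data[k] if v is not None)
        filled[k] = [mean if v is None else v for v in data[k]]
    for _ in range(n_iter):
        for target, covariate in ((first, second), (second, first)):
            # fit the regression on rows where the target was observed
            observed = [i for i, gone in enumerate(missing[target]) if not gone]
            a, b, sd = fit_line([filled[covariate][i] for i in observed],
                                [filled[target][i] for i in observed])
            for i, gone in enumerate(missing[target]):
                if gone:  # prediction plus residual noise
                    filled[target][i] = a + b * filled[covariate][i] + rng.gauss(0, sd)
    return filled


def pool_mean(completed, name):
    """Pool the mean of one variable across completed datasets (Rubin's rules)."""
    m = len(completed)
    estimates = [statistics.fmean(d[name]) for d in completed]
    variances = [statistics.variance(d[name]) / len(d[name]) for d in completed]
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return statistics.fmean(estimates), total ** 0.5


# Synthetic example data (invented for illustration, not from the study).
rng = random.Random(42)
n = 60
bulk_density = [rng.gauss(1.4, 0.15) for _ in range(n)]
organic_matter = [40 - 20 * bd + rng.gauss(0, 1.5) for bd in bulk_density]
data = {
    "organic_matter": [None if rng.random() < 0.3 else v for v in organic_matter],
    "bulk_density": [None if rng.random() < 0.1 else v for v in bulk_density],
}

# Complete case analysis keeps only rows with no missing value.
complete_cases = sum(1 for om, bd in zip(data["organic_matter"], data["bulk_density"])
                     if om is not None and bd is not None)

completed = [impute_once(data, rng) for _ in range(5)]  # m = 5 imputations
pooled_mean, pooled_se = pool_mean(completed, "organic_matter")
print(f"rows kept by CCA: {complete_cases} of {n}; rows kept by MI: {n}")
print(f"pooled mean = {pooled_mean:.2f}, pooled SE = {pooled_se:.2f}")
```

The row counts printed at the end show why imputation preserves the size of the database, whereas complete case analysis discards every row with at least one missing value.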

Keywords


Soil database; Incomplete data; Markov Chain Monte Carlo; Missing predictors

Revista Ciência Agronômica ISSN 1806-6690 (online) 0045-6888 (print), Website: www.ccarevista.ufc.br, e-mail: ccarev@ufc.br - Phone: (85) 3366.9702 - Office hours: Monday to Friday, 7 a.m. to 5 p.m.