Abstract

Handling missing values is an important step in data analysis that directly affects model prediction performance and research results. However, for building energy-related data, few studies have examined how the choice of missing-value handling method affects the dimensionality reduction rate and model prediction performance. Therefore, this study compared the dimensionality reduction rate and model prediction performance under different missing-value handling methods for weather information datasets, which are related to building energy. Three missing-value handling methods were considered: removal, k-nearest neighbors (KNN) imputation, and no handling. Two dimensionality reduction methods were considered: principal component analysis and model-based feature selection. The eXtreme Gradient Boosting (XGBoost) algorithm, a gradient boosting method with built-in missing-data handling, was used as the prediction model. The results showed that fewer principal components were required to explain 95% of the variance in the raw data when the missing values were removed than when they were imputed with KNN. Moreover, model-based feature selection outperformed principal component analysis in terms of both dimensionality reduction rate and predictive accuracy. In particular, the XGBoost model trained without separate missing-value handling had the highest accuracy, suggesting that XGBoost's built-in missing-value handling may be superior to conventional missing-value handling methods. These results may have important implications for selecting imputation methods in building energy data analysis and could reduce the cost and effort of data preprocessing.
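The comparison of principal component counts under row removal and KNN imputation can be illustrated with a minimal sketch. It assumes scikit-learn and a small synthetic table standing in for the weather data; the helper name `n_components_for_variance`, the 5% missing rate, and `n_neighbors=5` are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def n_components_for_variance(X, threshold=0.95):
    """Return how many principal components are needed to reach `threshold` of the variance."""
    X_scaled = StandardScaler().fit_transform(X)
    # A float n_components in (0, 1) makes PCA keep the smallest number of
    # components whose cumulative explained variance exceeds that fraction.
    pca = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
    return int(pca.n_components_)


# Synthetic stand-in for a weather table (hypothetical; not the study's data):
# 8 correlated features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))

# Knock out roughly 5% of the cells at random (assumed missing rate).
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

# Strategy 1: removal, i.e. drop every row that contains a missing value.
X_removed = X_missing[~np.isnan(X_missing).any(axis=1)]

# Strategy 2: KNN imputation, i.e. fill each gap from the nearest rows.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

k_removed = n_components_for_variance(X_removed)
k_knn = n_components_for_variance(X_knn)
print(f"components for 95% variance: removal={k_removed}, KNN={k_knn}")
```

On synthetic data the two counts need not differ; the sketch only shows how such a comparison could be set up. The third strategy, no handling, would pass `X_missing` directly to a learner that accepts NaN inputs, such as XGBoost.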