Abstract:
Background Soil organic carbon (SOC) is a key component of the global carbon cycle and plays a crucial role in regional carbon sink assessment and climate change studies. In regions with complex topography and strong environmental heterogeneity, traditional SOC prediction methods are often constrained by skewed data and redundant features.

Methods Taking Yunnan Province as the study area, this study proposes an improved Random Forest (RF) modeling framework that integrates feature selection and data distribution optimization, aiming to reduce the effects of redundancy among high-dimensional environmental covariates and of the skewed distribution of SOC data on prediction accuracy. A multistage feature selection strategy combined with Pearson correlation analysis was used to refine the set of input variables, and a logarithmic transformation was applied to normalize the SOC data and improve model performance. Two RF models were developed for comparison: RF, trained on the untransformed SOC values, and RF_ln, trained on the log-transformed values.

Results 1) Soil property-related covariates contributed most to SOC prediction, followed by topographic, biological, and climatic factors. 2) After logarithmic transformation, the coefficient of determination (R²) on the testing dataset rose from 0.65 for the RF model to 0.72 for the RF_ln model, and RMSE and MAE were notably reduced, resulting in an approximately 6.5% improvement in prediction accuracy. 3) The spatial distribution of SOC in Yunnan Province showed a pronounced pattern: along the latitudinal direction, values were higher in the north and south and lower in the central region; along the longitudinal direction, values were higher in the west and lower in the east.

Conclusions In complex mountainous environments, integrating feature optimization with data distribution adjustment can significantly enhance SOC prediction accuracy. This study provides methodological support for regional-scale mountain carbon sink assessment and high-precision SOC mapping.
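The two preprocessing ideas named in the Methods, Pearson-based removal of redundant covariates and logarithmic transformation of SOC, can be illustrated with a minimal sketch. It is not the study's implementation: the data are synthetic, the covariate names (`elevation`, `slope`, `temperature`), the greedy filtering rule, and the 0.8 correlation threshold are all assumptions made for illustration, the natural logarithm is assumed from the model name RF_ln, and the Random Forest fit itself is omitted.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(covariates, target, threshold=0.8):
    """Greedy redundancy filter (an assumed rule, not the paper's).

    Covariates are ranked by |r| with the target; each one is kept only if
    its |r| with every covariate kept so far does not exceed the threshold.
    """
    ranked = sorted(
        covariates,
        key=lambda name: abs(pearson(covariates[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(covariates[name], covariates[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic, hypothetical data: temperature is almost a linear function of
# elevation, so the two covariates are redundant; SOC is right-skewed.
rng = random.Random(0)
n = 500
elevation = [rng.uniform(500, 4000) for _ in range(n)]
slope = [rng.uniform(0, 45) for _ in range(n)]
temperature = [25 - 0.006 * e + rng.gauss(0, 0.3) for e in elevation]
soc = [math.exp(1.0 + 0.0005 * e + 0.01 * s + rng.gauss(0, 0.5))
       for e, s in zip(elevation, slope)]

# Log-transform the target; a model trained on ln(SOC) predicts on the log
# scale, so its predictions are mapped back with exp().
ln_soc = [math.log(v) for v in soc]
back = [math.exp(v) for v in ln_soc]

covariates = {"elevation": elevation, "slope": slope,
              "temperature": temperature}
kept = drop_redundant(covariates, ln_soc, threshold=0.8)

print("kept covariates:", kept)
print("skewness of SOC:", round(skewness(soc), 2))
print("skewness of ln(SOC):", round(skewness(ln_soc), 2))
```

In this sketch only one of the two strongly correlated covariates survives the filter, and the log-transformed target is much closer to symmetric than the raw one. Error metrics such as RMSE and MAE are only comparable between an RF and an RF_ln style model after the log-scale predictions have been back-transformed to the original SOC units.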