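Prediction of Topsoil Organic Carbon in Complex Topography Using a Random Forest Framework with Feature Optimization and Data Transformation: A Case Study of Yunnan Province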

李梓源; 李灿锋; 李海侠; 朱奎

doi:10.16843/j.sswc.2025317


Prediction of Topsoil Organic Carbon in Complex Topography Using a Random Forest Framework with Feature Optimization and Data Transformation: A Case Study of Yunnan Province

Abstract

Background: Soil organic carbon (SOC) is a key component of the global carbon cycle and is important for regional carbon sink assessment and for studies of climate change response. In regions with complex topography and strong environmental heterogeneity, traditional SOC prediction methods are often limited by skewed data and redundant features. Methods: Taking Yunnan Province as the study area, this study proposes an improved Random Forest (RF) modeling framework that combines feature selection with data distribution optimization, aiming to reduce the effects of redundancy among high-dimensional environmental covariates and of the skewed SOC distribution on prediction accuracy. A multistage feature selection strategy combined with Pearson correlation analysis was used to refine the input feature set, and a logarithmic transformation was applied to the skewed SOC data to improve model performance. Two RF models (RF and RF_ln) were built for comparison. Results: (1) Soil property covariates contributed most to SOC prediction, followed by topographic, biological, and climatic factors. (2) With the logarithmic transformation, the RF_ln model increased R² on the test set from 0.65 to 0.72 and markedly reduced RMSE and MAE, an improvement in accuracy of approximately 6.5%. (3) SOC in Yunnan was generally higher in the north and south and lower in the center along the north-south direction, and higher in the west than in the east along the east-west direction. Conclusions: In complex mountainous environments, combining feature structure optimization with data distribution adjustment can significantly improve SOC prediction accuracy. This study provides a methodological reference for regional-scale mountain carbon sink assessment and high-precision SOC mapping.
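The preprocessing steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the greedy filtering rule, the 0.8 correlation cutoff, the covariate names, and all function names are assumptions made for illustration, and the random forest itself is omitted. The sketch shows a Pearson-based redundancy filter, the natural-log transform of SOC with a naive back-transform, and the three reported metrics (R², RMSE, MAE).

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_redundant(features, target, threshold=0.8):
    """Greedy redundancy filter (an assumed rule, not the study's).

    Covariates are visited in decreasing order of |r| with the target;
    one is kept only if its |r| with every already-kept covariate is
    below `threshold` (0.8 is a hypothetical cutoff).
    """
    ranked = sorted(features, key=lambda k: -abs(pearson(features[k], target)))
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def to_log(soc):
    """Natural-log transform; SOC values must be strictly positive."""
    return [math.log(v) for v in soc]


def from_log(log_pred):
    """Naive back-transform to the original SOC scale (no bias correction)."""
    return [math.exp(v) for v in log_pred]


def r2_rmse_mae(obs, pred):
    """R², RMSE and MAE, computed on whatever scale obs/pred are given in."""
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(obs))
    mae = statistics.fmean([abs(o - p) for o, p in zip(obs, pred)])
    return r2, rmse, mae


# Toy data with hypothetical covariate names; not taken from the study.
toy_features = {
    "elevation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "elevation_dup": [2.1, 3.9, 6.2, 7.8, 10.0],  # nearly collinear with elevation
    "ndvi": [0.3, 0.1, 0.4, 0.2, 0.5],
}
toy_soc = [5.0, 8.0, 14.0, 20.0, 40.0]
selected = drop_redundant(toy_features, to_log(toy_soc))
```

In a workflow like the one described, a model fitted on `to_log(soc)` would have its predictions passed through `from_log` before `r2_rmse_mae` is computed, so that both models are compared on the original SOC scale. Whether the study evaluated on that scale or applied a back-transformation bias correction is not stated in the abstract.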
