ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society
   Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica, 2023, Vol. 55, Issue (7): 1192-1206. doi: 10.3724/SP.J.1041.2023.01192

• Research Report •

Missing data analysis in cognitive diagnostic models: Random forest threshold imputation method

YOU Xiaofeng1, YANG Jianqin1, QIN Chunying1, LIU Hongyun2,3

  1 School of Mathematics and Information Science, Nanchang Normal University, Nanchang 330032, China
  2 Beijing Key Laboratory of Applied Experimental Psychology, Beijing Normal University, Beijing 100875, China
  3 Faculty of Psychology, Beijing Normal University, Beijing 100875, China
  • Received: 2022-04-23 Online: 2023-04-21 Published: 2023-07-25
  • Corresponding author: LIU Hongyun, E-mail: hyliu@bnu.edu.cn
  • Funding:
    Science and Technology Key Project of the Jiangxi Provincial Department of Education (GJJ212601); Nanchang Key Laboratory of Intelligent Technology for Educational Big Data (2020-NCZDSY-012); National Natural Science Foundation of China (32071091)

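Abstract (Chinese version):

The handling of missing data in cognitive diagnostic assessment is a research topic of great concern to both theorists and practitioners. Drawing on the fact that random forest imputation (RFI) does not depend on assumptions about the missing-data mechanism, this study improves the existing RFI method and proposes a new method that uses a person-fit index (RCI) to determine the imputation threshold: random forest threshold imputation (RFTI). Simulation studies show that RFTI achieves a clearly higher imputation accuracy than RFI. Compared with RFI and the EM method, RFTI shows a clear advantage in the pattern-level and marginal correct classification rates of examinees' attributes, and this advantage is most pronounced under not-missing-at-random and mixed missing-data mechanisms and when the proportion of missing data is high. For item parameter estimation, however, RFTI has no advantage over the EM method.

Keywords: missing data, cognitive diagnostic assessment, random forest threshold imputation, random forest imputation, EM algorithm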

Abstract:

In recent years, interest in cognitive diagnostic assessments (CDAs), a new form of testing, has increased dramatically. Because of the specific design of these tests, missing data are an inevitable problem in CDAs. Proper handling of missing data in CDAs is important for providing accurate diagnostic feedback to students and teachers. With the growing use of machine learning in education, relevant advances have been made in missing data imputation. Research has shown that machine learning techniques have more desirable features for missing data imputation than traditional approaches. The random forest algorithm has been extended into the random forest imputation (RFI) method for handling missing data in CDAs. The method takes the characteristics of the data into consideration rather than assuming a particular missing-data mechanism. RFI is a new non-parametric method that makes full use of the available response information and the characteristics of response patterns to impute missing data.

Building on the strengths of RFI in classification and prediction and on its independence from the type of missing-data mechanism, we improved it and proposed a new method, random forest threshold imputation (RFTI). RFTI can be used to impute missing responses in the widely used DINA (Deterministic Inputs, Noisy “And” Gate) model. We propose applying the Response Conformity Index (RCI) to set the imputation threshold, which yields a treatment of missing responses in CDAs that does not rely entirely on imputation. Two simulation studies were conducted to compare the performance of the proposed method with that of traditional methods. Study 1 began by introducing the theoretical background and algorithmic implementation of RFTI. RFTI and RFI were then compared in terms of imputation accuracy for data with different proportions of missingness (10%, 20%, 30%, 40%, 50%) and different missing-data mechanisms (MIXED, MNAR, MAR, MCAR), in order to establish the necessity of including the RCI during imputation. Study 2 investigated the performance of RFTI, RFI, and the EM algorithm in handling missing data under different conditions. The manipulated design factors were identical to those in Study 1. We evaluated RFTI in terms of its accuracy in estimating the attributes and item parameters of the model, and compared it with EM and RFI, methods that have traditionally performed well, under various design conditions to explore the advantages of RFTI and the conditions under which it should be used.
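As a rough illustration of two ideas in the paragraph above, the sketch below computes the standard DINA probability of a correct response and applies a generic thresholded imputation rule. The function names, the `p_correct` input (standing in for a classifier's predicted probability of a correct response, for example from a random forest), and the fixed `threshold` argument are assumptions made for illustration only. In particular, the article derives its threshold from the RCI, and that procedure is not reproduced here.

```python
from math import prod


def dina_prob(alpha, q_row, slip, guess):
    """Probability of a correct response under the DINA model.

    alpha : 0/1 attribute mastery pattern of one examinee
    q_row : 0/1 Q-matrix row listing the attributes the item requires
    slip, guess : the item's slipping and guessing parameters
    """
    # eta is 1 only if every required attribute is mastered (0 ** 0 == 1).
    eta = prod(a ** q for a, q in zip(alpha, q_row))
    return 1 - slip if eta == 1 else guess


def threshold_impute(p_correct, threshold):
    """Illustrative rule, not the article's RCI-based procedure.

    Impute a missing response only when the predicted class is confident
    enough; otherwise return None so the cell stays missing.
    A threshold above 0.5 is assumed.
    """
    if p_correct >= threshold:
        return 1
    if 1 - p_correct >= threshold:
        return 0
    return None


# An examinee who masters attributes 1 and 3, answering two items.
alpha = [1, 0, 1]
print(dina_prob(alpha, [1, 0, 1], slip=0.1, guess=0.2))  # item needs 1 and 3
print(dina_prob(alpha, [1, 1, 0], slip=0.1, guess=0.2))  # item needs 1 and 2
print([threshold_impute(p, 0.9) for p in (0.95, 0.60, 0.03)])
```

Leaving low-confidence cells missing is what allows a threshold method to trade imputation rate for imputation accuracy, which is the contrast with plain RFI that Study 1 examines.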

Results of Study 1 showed that RFTI, compared with RFI, improved accuracy when the imputation threshold was one. Across the various design conditions, the imputation rate and accuracy of RFTI were also better. Study 2 showed that RFTI outperformed the other methods (RFI and the EM algorithm) in accurately assessing the attribute pattern and attribute margin. This advantage was affected by the missing-data mechanism and the proportion of missing data. Notably, the advantage of RFTI over the other methods was particularly pronounced for mixed-mechanism or MNAR data and when the proportion of missing data exceeded 30%. However, RFTI was no better than the other methods in the accuracy of item parameter estimates; in most conditions, the EM algorithm provided the most accurate parameter estimates.
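The evaluation criteria mentioned above can be sketched as follows. The definitions used here are assumptions, since the abstract does not spell them out: the imputation rate is taken as the share of missing cells that receive a value, imputation accuracy as the share of imputed values that match the true response, the pattern rate as the share of examinees whose whole attribute pattern is recovered, and the marginal rates as the per-attribute agreement. All function names are hypothetical.

```python
def imputation_rate_and_accuracy(true_vals, imputed_vals):
    """true_vals: true responses at the missing cells.
    imputed_vals: 1, 0, or None (None = cell left missing)."""
    filled = [(t, v) for t, v in zip(true_vals, imputed_vals) if v is not None]
    rate = len(filled) / len(true_vals)
    if not filled:
        return rate, float("nan")
    accuracy = sum(t == v for t, v in filled) / len(filled)
    return rate, accuracy


def pattern_rate(true_patterns, est_patterns):
    """Share of examinees whose full attribute pattern is estimated correctly."""
    hits = sum(list(t) == list(e) for t, e in zip(true_patterns, est_patterns))
    return hits / len(true_patterns)


def marginal_rates(true_patterns, est_patterns):
    """Per-attribute share of examinees classified correctly."""
    n = len(true_patterns)
    k = len(true_patterns[0])
    return [
        sum(t[j] == e[j] for t, e in zip(true_patterns, est_patterns)) / n
        for j in range(k)
    ]


# Toy data: four examinees, three attributes.
true_p = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
est_p = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 1, 0]]
print(pattern_rate(true_p, est_p))
print(marginal_rates(true_p, est_p))
print(imputation_rate_and_accuracy([1, 0, 1, 1], [1, None, 0, 1]))
```

The pattern rate is the stricter criterion, because a single misclassified attribute makes the whole pattern count as wrong, while the marginal rates credit each attribute separately.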

In sum, we propose a method for imputing missing data in CDAs by applying machine learning methods to measurement models. The advantage of this new method is confirmed by its accurate assessment of the attribute pattern and attribute margin under the DINA model. Theoretically, the current study provides a missing data imputation approach with fewer assumptions, which extends the traditional methods for imputing missing data within the CDA framework. Moreover, we investigate how to estimate students' attribute patterns accurately from their responses to only a few items. This sheds light on imputing data that are missing because of particular designs in assessment or teaching.

Key words: missing data, cognitive diagnostic assessment, random forest threshold imputation, random forest imputation, expectation-maximization algorithm

CLC number: