Make adaptive testing know examinees better: The item selection strategies based on recommender systems

doi:10.3724/SP.J.1041.2019.01057

Summary/Abstract

Summary:

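Drawing on the idea of collaborative filtering from recommender systems, this study proposes two CAT item selection strategies that can make use of past examinees’ data: Direct Examinee-Based Recommender (DEBR) and Indirect Examinee-Based Recommender (IEBR). Two simulation studies, covering different item banks and test lengths, compared the two recommender-based strategies with two traditional strategies (FMI and BAS) on measurement accuracy and item exposure control, and examined the factors that affect the performance of the recommender-based strategies. The results show that the two recommender-based strategies controlled item exposure better than the two traditional strategies, with measurement accuracy no worse than that of BAS; DEBR emphasized accuracy, while IEBR gave the best control of item exposure. The characteristics and quality of the past examinees’ data were the main factors affecting the performance of the recommender-based strategies.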

Keywords: item selection strategy, past examinees’ data, recommender system, collaborative filtering recommendation, simulation study

Abstract:

Better CAT item selection strategies may be designed by making fuller use of information from previous examinees’ responses. Past examinees’ data serve as a valuable reference for selecting items more accurately and evenly for new examinees. However, most existing strategies proposed under the theoretical framework of IRT use only information from the current examinee and fail to take full advantage of past examinees’ data. Collaborative filtering, an approach from the recommender system literature, finds the items that best match a user’s preferences by utilizing information from other users, a goal similar to that of item selection in CAT. Therefore, the present study adapted the underlying assumptions of collaborative filtering, proposed new item selection strategies that take advantage of past examinees’ data, and investigated the factors that might affect the performance of the new strategies.

In light of user-based collaborative filtering, we defined similar examinees as a group of examinees who uniformly answered the same items, and proposed two strategies: Direct Examinee-Based Recommender (DEBR) and Indirect Examinee-Based Recommender (IEBR). Two simulation studies were conducted to examine the measurement accuracy and item exposure control of the new strategies under different conditions. Study 1 used a simulated item bank. The recommender-based strategies selected items in two fixed-length CATs using two types of past examinees’ data, generated by FMI and BAS, respectively. Study 2 used a real item bank to test the new strategies in a more realistic setting. The effect of combining two batches of past examinees’ data from different recommender-based strategies was also investigated.

In both studies, when using past examinees’ data with high accuracy but poor item exposure control (generated by FMI), the recommender-based strategies greatly remedied unbalanced item utilization with an acceptable loss of accuracy. When using past examinees’ data with a better tradeoff between measurement precision and test security (generated by BAS), the recommender-based strategies kept accuracy at the same level and further improved item exposure control. More specifically, DEBR focused on maintaining accuracy and had lower measurement error than IEBR; IEBR was good at improving item exposure control and made better use of the whole item bank than all the other strategies. These features of the two recommender-based strategies were stable and consistent across different item banks and CAT lengths. The extent to which DEBR and IEBR demonstrated these features was influenced by the quality of the item bank, the test length, the number of past examinees, and the strategy used to generate the data.

In general, this research combined recommender systems with CAT item selection methods to establish a new, flexible framework that extends traditional item selection strategies. It also provided empirical evidence for the value of past examinees’ data and for the recommender system approach as a feasible alternative for selecting items in CAT. Finally, suggestions for future studies were provided: investigating the proposed strategies in a wider range of situations and upgrading recommender-based strategies for more CAT conditions, including finding diverse measures of similarity between examinees or items and employing more complex recommender system algorithms to meet the demands of large-scale tests.

Key words: selection strategy, past examinees’ data, recommender system, collaborative filtering recommender, simulation study
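
The user-based collaborative filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors’ DEBR or IEBR algorithm, whose details are not given here. It assumes, purely for illustration, that "similar" past examinees are those who gave the same responses as the current examinee on every item answered so far, and that the next item is the one those examinees took most often. The names `similar_examinees` and `recommend_item` and the data layout are hypothetical.

```python
from collections import Counter


def similar_examinees(past, current):
    """Return past records that agree with `current` on every item it has answered.

    Each record is a dict mapping item id -> scored response (0 or 1).
    Assumption for illustration only: similarity means identical responses
    on all items the current examinee has answered so far.
    """
    return [
        rec for rec in past
        if all(rec.get(item) == resp for item, resp in current.items())
    ]


def recommend_item(past, current):
    """Pick the unanswered item most often taken by similar past examinees."""
    counts = Counter(
        item
        for rec in similar_examinees(past, current)
        for item in rec
        if item not in current
    )
    if not counts:
        return None  # no similar examinee found: another rule must take over
    # Highest count wins; ties are broken by the smallest item id.
    return min(counts, key=lambda item: (-counts[item], item))


# Toy data: four past examinees and one current examinee (hypothetical).
past = [
    {1: 1, 2: 0, 5: 1},
    {1: 1, 2: 0, 7: 0},
    {1: 1, 2: 0, 5: 0},
    {1: 0, 3: 1, 9: 1},
]
current = {1: 1, 2: 0}
print(recommend_item(past, current))
```

In this toy data the first three past examinees match the current one, and item 5 appears in two of their records, so it is recommended. When no past examinee matches, the function returns `None`, which is where a practical system would likely fall back to a conventional selection rule.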

CLC number:

B841

WANG Pujue (王璞珏), LIU Hongyun (刘红云). (2019). Make adaptive testing know examinees better: The item selection strategies based on recommender systems. Acta Psychologica Sinica (心理学报), 51(9), 1057-1067.

Figures/Tables: 4


Item selection strategy	MSE	MAE	Correlation of ability estimates	Chi-square	Test overlap rate	Underexposed items	Overexposed items	Examinee call rate
Fixed length: 20 items
Random selection	0.323	0.449	0.829	2.595	5.56%	0	0
FMI	0.090	0.234	0.954	127.852	40.80%	315	41
DEBR (FMI)	0.141	0.291	0.930	66.341	21.83%	22	29	14.12%
IEBR (FMI)	0.242	0.383	0.872	8.712	7.09%	1	2	2.53%
BAS	0.224	0.370	0.882	14.164	9.00%	46	6
DEBR (BAS)	0.217	0.365	0.884	11.246	8.25%	44	4	4.25%
IEBR (BAS)	0.222	0.369	0.882	11.187	8.15%	42	4	4.66%
Fixed length: 40 items
Random selection	0.198	0.354	0.890	4.572	11.05%	0	0
FMI	0.052	0.178	0.974	118.335	45.72%	240	80
DEBR (FMI)	0.089	0.228	0.956	95.045	34.38%	37	78	19.77%
IEBR (FMI)	0.126	0.277	0.937	7.571	11.80%	0	15	5.19%
BAS	0.126	0.278	0.932	18.962	15.03%	14	36
DEBR (BAS)	0.125	0.276	0.933	15.930	14.27%	13	27	6.98%
IEBR (BAS)	0.128	0.280	0.931	12.012	13.25%	14	17	7.22%
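
The accuracy and exposure columns above can be related to common CAT evaluation formulas. The Python sketch below assumes the usual textbook definitions (mean squared error, mean absolute error, and Pearson correlation between true and estimated abilities; a chi-square index comparing item exposure rates with the uniform rate; test overlap as the mean proportion of items shared by pairs of examinees). The page does not state the exact formulas used, so these definitions, the function names, and the toy data are all assumptions.

```python
import math


def accuracy_metrics(true_theta, est_theta):
    """Return (MSE, MAE, Pearson r) between true and estimated abilities."""
    n = len(true_theta)
    errors = [e - t for t, e in zip(true_theta, est_theta)]
    mse = sum(d * d for d in errors) / n
    mae = sum(abs(d) for d in errors) / n
    mean_t = sum(true_theta) / n
    mean_e = sum(est_theta) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_theta, est_theta))
    var_t = sum((t - mean_t) ** 2 for t in true_theta)
    var_e = sum((e - mean_e) ** 2 for e in est_theta)
    return mse, mae, cov / math.sqrt(var_t * var_e)


def exposure_metrics(tests, bank_size):
    """Return (chi-square index, test overlap rate) for fixed-length tests.

    `tests` is a list of sets of 0-based item ids, one set per examinee,
    all of the same length (an assumption of this sketch).
    """
    m = len(tests)
    length = len(tests[0])
    counts = [0] * bank_size
    for administered in tests:
        for item in administered:
            counts[item] += 1
    expected = length / bank_size  # exposure rate if items were used uniformly
    chi_square = sum((c / m - expected) ** 2 / expected for c in counts)
    # An item seen by c examinees is shared by c * (c - 1) / 2 pairs of examinees.
    shared = sum(c * (c - 1) / 2 for c in counts)
    pairs = m * (m - 1) / 2
    overlap = shared / (pairs * length)
    return chi_square, overlap


# Toy data (hypothetical): 3 examinees, 2 items each, bank of 4 items.
tests = [{0, 1}, {0, 2}, {0, 3}]
chi_square, overlap = exposure_metrics(tests, bank_size=4)
mse, mae, corr = accuracy_metrics([0.0, 1.0, 2.0], [0.0, 1.0, 3.0])
print(chi_square, overlap, mse, mae, corr)
```

Under these definitions, perfectly uniform item use gives a chi-square of 0 and an overlap rate equal to test length divided by bank size, so lower values in both columns indicate more balanced use of the item bank.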

Item selection strategy	MSE	MAE	Correlation of ability estimates	Chi-square	Test overlap rate	Underexposed items	Overexposed items	Examinee call rate
Random selection	0.320	0.440	0.830	2.551	8.02%	0	0
FMI	0.152	0.307	0.922	150.511	58.48%	214	33
DEBR (FMI)	0.190	0.341	0.901	101.793	40.81%	53	38	25.04%
DEBR (FMI+DEBR)	0.233	0.380	0.875	47.426	21.10%	29	35	12.69%
IEBR (FMI)	0.265	0.408	0.855	43.395	19.63%	0	24	5.24%
IEBR (FMI+IEBR)	0.274	0.414	0.852	11.830	8.19%	0	0	2.86%
BAS	0.259	0.404	0.861	42.965	19.48%	20	27
DEBR (BAS)	0.253	0.395	0.869	43.449	19.65%	12	33	9.75%
DEBR (BAS+DEBR)	0.262	0.403	0.865	39.684	18.29%	13	26	9.51%
IEBR (BAS)	0.266	0.408	0.858	37.491	17.49%	17	24	9.96%
IEBR (BAS+IEBR)	0.267	0.407	0.855	25.305	13.07%	8	18	5.13%
