ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2022, Vol. 30 ›› Issue (6): 1410-1428. doi: 10.3724/SP.J.1042.2022.01410

• Research Method •

IRT-based scoring methods for multidimensional forced choice tests

LIU Juan1, ZHENG Chanjin2,3, LI Yunchuan1, LIAN Xu1

    1Beijing Insight Online Management Consulting Co., Ltd., Beijing 100102, China
    2Department of Educational Psychology, East China Normal University, Shanghai 200062, China
    3Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
  • Received: 2021-07-06  Online: 2022-06-15  Published: 2022-04-26
  • Contact: ZHENG Chanjin  E-mail: chjzheng@dep.ecnu.edu.cn

Abstract:

The forced-choice (FC) test is widely used in non-cognitive assessment because it resists faking and the response biases associated with the traditional Likert format. Traditional scoring of forced-choice tests produces ipsative data, which has long been criticized as unsuitable for inter-individual comparison. In recent years, the development of multiple forced-choice IRT models that recover normative information from forced-choice responses has re-ignited the interest of researchers and practitioners in forced-choice IRT models. The six prevailing forced-choice IRT models in the existing literature can be classified by their decision model and their item response model. In terms of the decision model, the TIRT, RIM, and BRB-IRT models are built on Thurstone's Law of Comparative Judgment, whereas the MUPP framework and its derivatives adopt the Luce Choice Axiom. In terms of the item response model, the MUPP-GGUM and GGUM-RANK models apply to items with an unfolding response process, while the other forced-choice models apply to items with a dominance response process. The models can also be distinguished by their estimation algorithm and estimation procedure. MUPP-GGUM uses a two-step strategy in which item parameters are calibrated in advance with Likert-format data, which facilitates subsequent item bank management; the other models estimate item and person parameters jointly. Among the joint-estimation approaches, TIRT relies on the traditional algorithms of weighted least squares (WLS) and diagonally weighted least squares (DWLS), which are readily available in Mplus and relatively fast, but which suffer from poor convergence and heavy memory demands in high-dimensional settings. The other models use the Markov chain Monte Carlo (MCMC) algorithm, which avoids these convergence and memory problems but is considerably slower than the traditional algorithms.
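As a rough illustration of the two decision models named above (notation assumed here rather than taken verbatim from the reviewed papers): under Thurstone's Law of Comparative Judgment, as in TIRT, the probability of preferring statement i to statement k in a block is a normal-ogive function of the difference between two latent utilities; under the Luce Choice Axiom, as in the MUPP framework, it is built from the statements' separate endorsement probabilities.

    % Thurstonian pairwise comparison (TIRT-style): statements i and k load on
    % traits eta_a and eta_b with loadings lambda; gamma_{ik} is a threshold and
    % psi^2 are the uniquenesses of the latent utilities.
    P(y_{ik} = 1 \mid \boldsymbol{\eta})
      = \Phi\!\left( \frac{-\gamma_{ik} + \lambda_i \eta_a - \lambda_k \eta_b}
                          {\sqrt{\psi_i^2 + \psi_k^2}} \right)

    % MUPP-style preference (Luce Choice Axiom): P_s(1) is the probability of
    % endorsing statement s on its own (a GGUM probability in MUPP-GGUM),
    % and P_s(0) = 1 - P_s(1).
    P(s \succ t \mid \theta_{d_s}, \theta_{d_t})
      = \frac{P_s(1)\, P_t(0)}{P_s(1)\, P_t(0) + P_s(0)\, P_t(1)}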
Research on applications of forced-choice IRT models is summarized in three areas: parameter invariance testing, computerized adaptive testing (CAT), and validity studies. Parameter invariance testing can be divided into cross-block invariance and cross-population invariance (also known as differential item functioning, DIF); most current research focuses on the latter, and DIF detection methods already exist for TIRT and RIM. Future research should enrich or upgrade these existing DIF methods and develop DIF detection methods for other forced-choice models so that DIF from multiple sources can be detected more sensitively. Non-cognitive tests are usually high-dimensional, and the test length problems caused by high dimensionality can be naturally addressed by CAT; studies have already explored item selection strategies for the MUPP-GGUM, GGUM-RANK, and RIM models. Future research can continue to explore item selection strategies for other forced-choice IRT models so that forced-choice CAT achieves a balance between measurement precision and test length in high-dimensional contexts. Validity studies examine whether scores obtained from forced-choice IRT models reflect examinees' true characteristics, because unvalidated tests carry serious risks in score interpretation. Some studies have compared IRT scores, traditional (ipsative) scores, and Likert-format scores to examine whether IRT scores yield results similar to Likert scores and whether they recover latent traits better than traditional scores. However, using Likert-format scores as the criterion may itself introduce response bias as a source of error, and future research should seek purer, more convincing criteria.
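To make the CAT point concrete, one widely used block selection rule in multidimensional CAT, offered here only as a generic illustration and not as the specific strategies proposed for MUPP-GGUM, GGUM-RANK, or RIM, is the D-optimality rule: administer next the remaining block whose Fisher information most increases the determinant of the information already accumulated at the current trait estimate.

    % D-optimality (determinant) rule: R is the pool of remaining blocks,
    % I(theta-hat) is the information accumulated from administered blocks,
    % and I_b(theta-hat) is the information of candidate block b.
    b^{*} = \arg\max_{b \in R}
            \det\!\left[ \mathbf{I}\big(\hat{\boldsymbol{\theta}}\big)
                       + \mathbf{I}_b\big(\hat{\boldsymbol{\theta}}\big) \right]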

Key words: forced choice test, ipsative data, TIRT, MUPP, GGUM-RANK

CLC Number: