ISSN 0439-755X
CN 11-1911/B

2009, Vol. 41, Issue (08): 773-784.

### A Comparison of GT and IRT: An Analysis of Performance Rating of Men’s 10 Meters Platform Diving in Beijing Olympic Games

YU Zong-Huo;TANG Xiao-Juan;WANG Deng-Feng

(1 Department of Psychology, Peking University, Beijing 100871, China)
(2 College of Mathematics and Information Science, Nanchang Hangkong University, Nanchang 330063, China)
• Received: 2008-11-14 Published: 2009-08-30 Online: 2009-08-30
• Contact: WANG Deng-Feng

Abstract: Generalizability Theory (GT) and Item Response Theory (IRT) each improve on Classical Test Theory (CTT) in different respects: GT focuses on the macro level of measurement, and IRT on the micro level. Both GT and the Multi-Facet Rasch Measurement model (MFRM, one case of the IRT methods) can be applied to decompose the variance from different sources (including error) in performance rating and to estimate the reliability of the rating. The results from both can give researchers recommendations on how to improve performance rating. This paper compares GT and MFRM to find how they differ in what they offer for improving the rating process in the Beijing Olympic Games.
The data analyzed are the athletes’ scores from the men’s 10 meters platform diving at the 2008 Beijing Olympic Games. Twelve athletes participated in the final. Each athlete dived six times and was marked independently by seven referees each time, giving 12 × 6 × 7 = 504 data points in total. Based on this dataset, both GT and MFRM are applied to analyze four facets of these scores (round, person, referee, and difficulty). However, as a hidden facet, difficulty cannot be separated in GT.
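For orientation, one common rating-scale form of a many-facet Rasch model with these four facets can be written as follows. The abstract does not state the exact specification used, so the symbols and the sign convention here are assumptions chosen for illustration only:

$$
\log\frac{P_{nijmk}}{P_{nijm(k-1)}} = B_n - R_i - C_j - D_m - F_k
$$

Here $P_{nijmk}$ is the probability that athlete $n$ receives category $k$ rather than $k-1$ in round $i$ from referee $j$ on a dive of difficulty level $m$; $B_n$ is the athlete's ability, $R_i$ the round effect, $C_j$ the referee's severity, $D_m$ the difficulty effect, and $F_k$ the step calibration between categories $k-1$ and $k$.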
The results from GT and MFRM consistently suggest that the athlete, the round, and their interaction are important sources of variation in these scores, and that the referees do not contribute significantly to the variance in athletes’ scores. The results from MFRM also indicate that difficulty is a significant source of variation. These results point to ways of improving the scoring from different aspects. For example, we find that the g coefficient is influenced significantly not by the number of referees but by the number of rounds. Increasing the number of rounds is therefore a helpful way to improve the reliability of the rating. MFRM gives the measure of individual elements within each facet, the standard error for each element, and diagnostic fit statistics for detecting aberrant responses. Based on the MFRM analysis, we find that the referees’ step calibrations of the scale are disordered around the category of 6.5. The results from MFRM also yield a new ranking that differs markedly from the one given in the 2008 Beijing Olympic Games.
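The dependence of the g coefficient on the number of rounds rather than the number of referees can be illustrated with a small D-study sketch. It assumes a fully crossed person × round × referee design and uses the standard relative g coefficient for that design. The variance components below are hypothetical placeholders, not the values estimated in the paper; they are chosen only so that the person × round component is large and the referee-related components are small, as the abstract describes. The function name `g_coefficient` is likewise an illustrative choice.

```python
def g_coefficient(var_p, var_pr, var_pj, var_prj_e, n_rounds, n_referees):
    """Relative g coefficient for a crossed person x round x referee design.

    var_p      : person (athlete) variance, the object of measurement
    var_pr     : person x round interaction variance
    var_pj     : person x referee interaction variance
    var_prj_e  : residual (three-way interaction confounded with error)
    """
    # Relative error variance: each interaction component is divided by the
    # number of conditions sampled from the facets it involves.
    relative_error = (
        var_pr / n_rounds
        + var_pj / n_referees
        + var_prj_e / (n_rounds * n_referees)
    )
    return var_p / (var_p + relative_error)


# HYPOTHETICAL variance components (assumed for illustration only).
components = dict(var_p=0.30, var_pr=0.50, var_pj=0.01, var_prj_e=0.19)

# Design used in the final: 6 rounds, 7 referees.
base = g_coefficient(**components, n_rounds=6, n_referees=7)
# Alternative designs: double the referees, or double the rounds.
more_referees = g_coefficient(**components, n_rounds=6, n_referees=14)
more_rounds = g_coefficient(**components, n_rounds=12, n_referees=7)

print(f"6 rounds,  7 referees: {base:.3f}")
print(f"6 rounds, 14 referees: {more_referees:.3f}")
print(f"12 rounds, 7 referees: {more_rounds:.3f}")
```

Under these assumed components, doubling the referees barely changes the coefficient, while doubling the rounds raises it noticeably, because the person × round term dominates the relative error.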
In sum, we find that GT and MFRM are fully consistent in estimating the sources of variation. However, each method has its own advantages. GT is more helpful for the design of measurement, and MFRM is more helpful for measuring individual elements within each facet and for detecting aberrant responses. Moreover, MFRM can separate the effects of round, referee, and difficulty more successfully and produce a more precise estimate of the athletes’ ranking than the method used in the 2008 Beijing Olympic Games.