Chapter 1 Introduction 1
1.1 Rationales for studying rater variability 1
1.2 Status quo of studies on rater variability 2
1.3 An overview of this book 5
1.4 Definition of key terms 5
Chapter 2 Literature review: Studies on rater variability in language performance assessment 7
2.1 Rater variability in language performance assessment 7
2.2 Exploring rater variability using statistical analysis 9
2.2.1 Introduction 9
2.2.2 Rater reliability in Classical Test Theory 9
2.2.3 Rater facet as variance component in Generalizability Theory 10
2.2.4 Rater calibration in Many-Facet Rasch Model 13
2.2.5 Summary 21
2.3 Process-oriented approach to investigating rater variability 22
2.3.1 Raters' decision-making: the “black box” behind the final ratings 22
2.3.2 Indirect evidence 23
2.3.3 Direct investigation of rating process: insights from verbal protocols 27
2.4 Factors accounting for rater variability 46
2.4.1 External factors 47
2.4.2 Internal factors 48
2.4.3 Situational factors 52
2.5 A framework for comparison between rater groups 53
2.6 Summary 56
Chapter 3 Study 1: Investigating the scoring reliability of CET-SET using Many-Facet Rasch Model 57
3.1 Issues in second language speaking assessment 57
3.2 Challenges in test validation 57
3.3 The context of the study 63
3.4 Objectives of the study 64
3.5 Methods 65
3.5.1 Data 65
3.5.2 Instrument (MFRM) 66
3.6 Data analyses and findings 67
3.6.1 Facet map 67
3.6.2 Candidates 69
3.6.3 Tasks 74
3.6.4 Items 75
3.6.5 Rating scales 77
3.6.6 Raters 80
3.6.7 Bias analysis 82
3.7 Conclusions 85
3.8 Implications 87
3.9 Further research efforts to be made 88
Chapter 4 Study 2: Exploring how raters' cognitive and meta-cognitive strategies influence rating accuracy in essay scoring 90
4.1 Subjective scoring: A matter of reliability or validity? 91
4.2 Exploring rating process: Looking into rater variability 94
4.3 Rater cognition studies in writing assessment 102
4.4 Methodology 107
4.4.1 The context of the study 107
4.4.2 Participants 108
4.4.3 Materials 108
4.4.4 Data collection 109
4.4.5 Data analysis 111
4.5 Results and discussion 116
4.5.1 General patterns of differences in broad categories 116
4.5.2 In-depth investigation of differences in the major sub-categories 118
4.6 Summary and further discussion 130
4.7 Conclusion 134
Chapter 5 Conclusions 135
5.1 Summary of findings 135
5.2 Comparison of the two studies 136
5.3 Limitations 137
5.4 Further research efforts to be made 138
Appendix Ⅰ CET-SET rating scale 140
Appendix Ⅱ CET4 rating rubrics for the writing task 142
Appendix Ⅲ The writing task of the Dec. 2006 administration of CET4 and range finders 144
Appendix Ⅳ Sample essays 147
Appendix Ⅴ Instructions and training tasks for think-aloud session 149
Appendix Ⅵ Sample transcripts of raters' thinking aloud 151
Appendix Ⅶ Coding protocols for think-aloud verbal reports 154
Appendix Ⅷ The coding scheme for raters' cognitive and meta-cognitive strategies 155
References 157
Index 172