Inter-rater reliability

From Wikipedia

Inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.

There are a number of statistics which can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint-probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient and intra-class correlation.


The philosophy of inter-rater agreement

There are several operational definitions [1] of "inter-rater reliability" in use by Examination Boards, reflecting different viewpoints about what is reliable agreement between raters.

There are three operational definitions of agreement:

1. Reliable raters agree with the "official" rating of a performance.

2. Reliable raters agree with each other.

3. Reliable raters agree about which performance is better and which is worse.

These combine with two operational definitions of behavior:

A. Reliable raters are automatons, behaving like "rating machines". This category includes rating of essays by computer [2]. This behavior can be evaluated by Generalizability theory.

B. Reliable raters behave like independent witnesses. They demonstrate their independence by disagreeing slightly. This behavior can be evaluated by the Rasch model.

Joint probability of agreement

The joint probability of agreement is probably the simplest and least robust measure. It is the number of times the raters assign the same rating (e.g. 1, 2, ... 5) to an item, divided by the total number of items rated. It assumes that the data are entirely nominal. It does not take into account that agreement may happen solely by chance. Some question, though, whether there is a need to 'correct' for chance agreement, and suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.[3]
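
As a minimal sketch, the joint probability of agreement for two raters can be computed as the fraction of items on which both give the same rating. The function name and the sample ratings below are illustrative assumptions, not taken from any cited source.

```python
def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters assign the same rating."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same items")
    if not rater_a:
        raise ValueError("at least one rated item is required")
    # Ratings are compared for equality only, i.e. treated as nominal.
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Hypothetical ratings on a 1-5 scale for eight items.
ratings_a = [1, 2, 3, 3, 5, 4, 2, 1]
ratings_b = [1, 2, 3, 4, 5, 4, 1, 1]
print(percent_agreement(ratings_a, ratings_b))  # 6 of 8 items match -> 0.75
```

Note that a rating of 3 against 4 counts as the same kind of disagreement as 1 against 5, which is what treating the data as nominal means in practice.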

Kappa statistics

Main articles: Cohen's kappa, Fleiss' kappa

Cohen's kappa[4], which works for two raters, and Fleiss' kappa[5], an adaptation that works for any fixed number of raters, improve upon the joint probability in that they take into account the amount of agreement that could be expected to occur through chance. They suffer from the same problem as the joint-probability in that they treat the data as nominal and assume the ratings have no natural ordering. If the data do have an order, the information in the measurements is not fully taken advantage of.
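
The chance correction can be illustrated with a short sketch of Cohen's kappa for two raters, using the usual form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected if each rater assigned categories independently at their own observed rates. The function name and the sample data are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must rate the same non-empty set of items")
    # Observed agreement: share of items given the same category.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal category counts.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one category only")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical yes/no judgements on ten items.
a = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "n"]
b = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
# Observed agreement is 0.8 and chance agreement is 0.5, so kappa is about 0.6.
print(round(cohens_kappa(a, b), 3))
```

Here the raw agreement of 0.8 shrinks to roughly 0.6 once the 0.5 agreement expected by chance is removed.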

Correlation coefficients

Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient

Either Pearson's r or Spearman's ρ can be used to measure pairwise correlation among raters using a scale that is ordered. Pearson assumes the rating scale is continuous; Spearman assumes only that it is ordinal. If more than two raters are observed, an average level of agreement for the group can be calculated as the mean of the r (or ρ) values from each possible pair of raters.

Both the Pearson and Spearman coefficients consider only relative position. For example, (1, 2, 1, 3) is considered perfectly correlated with (2, 3, 2, 4), even though the two raters never assign the same score to any item.
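
The example above can be checked with a small sketch that computes Pearson's r by hand and compares it with the exact-match rate. The helper name pearson_r is an assumption for this illustration; only the standard library is used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


rater_a = [1, 2, 1, 3]
rater_b = [2, 3, 2, 4]  # always exactly one point higher than rater_a

r = pearson_r(rater_a, rater_b)
exact_matches = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(round(r, 6), exact_matches)  # correlation is 1, exact agreement is 0
```

The constant one-point offset leaves the correlation at 1 while the raters never agree exactly, which is why a correlation coefficient alone can overstate agreement.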

Intra-class correlation coefficient

Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC) [6]. There are several types of ICC; one is defined as "the proportion of variance of an observation due to between-subject variability in the true scores".[7] The ICC ranges between 0.0 and 1.0 (under an early definition it could range between −1 and +1). The ICC will be high when there is little variation between the scores given to each item by the raters, e.g. if all raters give the same or similar scores to each of the items. The ICC is an improvement over Pearson's r and Spearman's ρ, as it takes into account the differences in ratings for individual segments, along with the correlation between raters.
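
As a rough sketch of one of the several ICC types, the code below computes the one-way random-effects, single-rater form, ICC = (MSB - MSW) / (MSB + (k - 1) MSW), where MSB and MSW are the between-item and within-item mean squares and k is the number of raters. Choosing this particular form is an assumption made for illustration; other ICC types use different variance components. The function name and the data are also illustrative.

```python
def icc_oneway(scores):
    """One-way random-effects ICC for single ratings.

    scores: list of rows, one row per rated item, one column per rater.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 items, each scored by the same k >= 2 raters")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Between-item and within-item sums of squares from a one-way ANOVA.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(scores, row_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denominator


# Two raters, four items; the second rater is always one point higher.
offset = [[1, 2], [2, 3], [1, 2], [3, 4]]
# Same items with both raters giving identical scores.
identical = [[1, 1], [2, 2], [1, 1], [3, 3]]

print(round(icc_oneway(offset), 4))     # about 0.5714 (4/7)
print(round(icc_oneway(identical), 4))  # 1.0
```

With the offset data Pearson's r would be 1, but this ICC drops to about 0.57 because the systematic one-point difference between raters counts as within-item variation.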

Limits of agreement

Bland–Altman plot

Another approach to agreement (useful when there are only two raters) is to calculate the mean of the differences between the two raters. The confidence limits around the mean provide insight into how much random variation may be influencing the ratings. If the raters tend to agree, the mean will be near zero. If one rater is usually higher than the other by a consistent amount, the mean will be far from zero, but the confidence interval will be narrow. If the raters tend to disagree, but without a consistent pattern of one rating higher than the other, the mean will be near zero but the confidence interval will be wide.

Bland and Altman [8] have expanded on this idea by graphing the difference of each point, the mean difference, and the confidence limits on the vertical against the average of the two ratings on the horizontal. The resulting Bland–Altman plot demonstrates not only the overall degree of agreement, but also whether the agreement is related to the underlying value of the item. For instance, two raters might agree closely in estimating the size of small items, but disagree about larger items.
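
A minimal sketch of the quantities behind such a plot follows. It assumes the limits are taken as the mean difference plus or minus 1.96 sample standard deviations of the differences (the conventional "95% limits of agreement"); the function names and the measurements are hypothetical, and drawing the plot itself is left out.

```python
from statistics import mean, stdev


def limits_of_agreement(rater_a, rater_b, z=1.96):
    """Return (mean difference, lower limit, upper limit) for two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need paired ratings for at least 2 items")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)          # consistent offset between the raters
    spread = z * stdev(diffs)   # random variation around that offset
    return bias, bias - spread, bias + spread


def bland_altman_points(rater_a, rater_b):
    """(average of the two ratings, difference) pairs: x and y of the plot."""
    return [((a + b) / 2, a - b) for a, b in zip(rater_a, rater_b)]


# Hypothetical size estimates of five items by two raters.
a = [10, 12, 15, 20, 22]
b = [9, 13, 14, 18, 23]

bias, lower, upper = limits_of_agreement(a, b)
print(round(bias, 2), round(lower, 2), round(upper, 2))
print(bland_altman_points(a, b))
```

A mean difference near zero with narrow limits suggests close agreement; plotting the points from bland_altman_points shows whether the differences grow with the size of the item.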

Krippendorff’s Alpha

Krippendorff's alpha[9] is a versatile and general statistical measure for assessing the agreement achieved when multiple raters describe a set of objects of analysis in terms of the values of a variable. Alpha emerged in content analysis where textual units are categorized by trained coders and is used in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychometrics where individual attributes are tested by multiple methods, or in observational studies where unstructured happenings are recorded for subsequent analysis.


  1. ^  Saal, F. E., Downey, R. G. and Lahey, M. A. (1980) "Rating the Ratings: Assessing the Psychometric Quality of Rating Data" in Psychological Bulletin. Vol. 88, No. 2, pp. 413–428
  2. ^  Page, E. B. and Petersen, N. S. (1995) "The Computer Moves into Essay Grading: Updating the Ancient Test" in Phi Delta Kappan. Vol. 76, No. 7, pp. 561–565
  3. ^  Uebersax, J. S. (1987) "Diversity of decision making models and the measurement of interrater agreement" in Psychological Bulletin. Vol. 101, pp. 140–146
  4. ^  Cohen, J. (1960) "A coefficient of agreement for nominal scales" in Educational and Psychological Measurement. Vol. 20, pp. 37–46
  5. ^  Fleiss, J. L. (1971) "Measuring nominal scale agreement among many raters" in Psychological Bulletin. Vol. 76, No. 5, pp. 378–382
  6. ^  Shrout, P. and Fleiss, J. L. (1979) "Intraclass correlation: uses in assessing rater reliability" in Psychological Bulletin. Vol. 86, No. 2, pp. 420–428
  7. ^  Everitt, B. (1996) Making Sense of Statistics in Psychology (Oxford : Oxford University Press) ISBN 0-19-852366-1
  8. ^  Bland, J. M., and Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet i, pp. 307–310.
  9. ^  Krippendorff, K. (2004). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage. pp. 219-250.
  10. ^  Hayes, A. F. & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77-89.

Further reading

  • Gwet, K. (2001) Handbook of Inter-Rater Reliability, (Gaithersburg : StatAxis Publishing) ISBN 0-9708062-0-5
