Suppose I would like to calculate the agreement between 2 different diagnostic tests to detect a disease using 2 experienced raters. The following assumptions hold: 1) The rating of a disease is based on a categorical scale like grade I, II and so on. Therefore, it is a process of categorising the disease into different grades. 2) Non of the tests is considered as a reference. 3) The raters are independents. My goal is to calculate the agreement between the 2 testes after controlling for the variation between the raters opinions. In other words, I seek to know whether there is a real difference between the 2 tests in categorising the disease without polluting the result because of the expected variantion in the raters opinions.