Hello Madeleine,
There really isn't a universal standard that must be reached, because comparison coding can be carried out for many different reasons, and those reasons may have different requirements. There are also many different ways comparison coding can be carried out. Some review organisations have their own standard for the level of agreement required.
The answer to your question comes down to what level of agreement (or disagreement) you are comfortable with, given the parameters of your inclusion/exclusion criteria. If you are double screening a random sample to check that all coders are interpreting the screening tool the same way, you might require 100% agreement before moving forward. If the screening criteria require quite a bit of interpretation, you might be happy with less agreement.
You might also have lots of disagreement on the reason for exclusion but only be concerned with disagreements on inclusion vs exclusion. In that case you might want 100% agreement, but only on the inclusion vs exclusion comparison.
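To make that concrete, here is a small illustrative sketch (the decisions and labels are made up) showing how agreement can look very different depending on whether you compare exact exclusion reasons or collapse everything to include vs exclude:

```python
# Hypothetical screening decisions from two coders: "include", or an
# exclusion reason such as "exclude-population" / "exclude-design".
coder_a = ["include", "exclude-population", "exclude-design", "include", "exclude-design"]
coder_b = ["include", "exclude-design", "exclude-design", "include", "exclude-population"]

def percent_agreement(a, b):
    """Fraction of items where the two coders gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def collapse(label):
    """Reduce any exclusion reason to a single 'exclude' category."""
    return "include" if label == "include" else "exclude"

full = percent_agreement(coder_a, coder_b)                      # 3/5 = 60%
collapsed = percent_agreement([collapse(x) for x in coder_a],
                              [collapse(x) for x in coder_b])   # 5/5 = 100%
print(f"agreement on exact labels: {full:.0%}")
print(f"agreement on include vs exclude only: {collapsed:.0%}")
```

Here the coders disagree on exclusion reasons but agree completely on the decision that actually determines which studies move forward.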
I think the important issue is that you understand the reason for any disagreements, as they might indicate confusion about how the criteria work with the studies in your review.
The kappa statistic itself is often a source of misunderstanding. A good paper to read is 'Fleiss J, Cohen J, Everitt B (1969) "Large Sample Standard Errors of Kappa and Weighted Kappa", Psychological Bulletin, 72(5), pp. 323-327.' There are two statistics, kappa and weighted kappa, and the choice between them depends on the relative seriousness of the possible disagreements (e.g. include vs exclude compared with one exclusion reason vs another).
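In case it helps, here is a minimal sketch of how kappa (and weighted kappa) can be computed from two coders' decision lists; the function, labels, and data are purely illustrative, not from any particular library:

```python
from collections import Counter

def cohens_kappa(a, b, weights=None):
    """Cohen's kappa for two coders' decision lists.
    Pass a weights dict {(label_i, label_j): seriousness} for weighted
    kappa; by default every disagreement counts equally (seriousness 1)."""
    n = len(a)
    cats = sorted(set(a) | set(b))
    if weights is None:
        weights = {(i, j): 0.0 if i == j else 1.0 for i in cats for j in cats}
    observed = Counter(zip(a, b))            # joint counts of (coder_a, coder_b) pairs
    marg_a, marg_b = Counter(a), Counter(b)  # each coder's marginal counts
    # Observed vs chance-expected weighted disagreement.
    d_obs = sum(weights[(i, j)] * observed[(i, j)] / n
                for i in cats for j in cats)
    d_exp = sum(weights[(i, j)] * (marg_a[i] / n) * (marg_b[j] / n)
                for i in cats for j in cats)
    return 1.0 - d_obs / d_exp

# Illustrative screening decisions (made-up data):
coder_a = ["include", "include", "exclude", "exclude", "include", "exclude"]
coder_b = ["include", "exclude", "exclude", "exclude", "include", "include"]
print(f"kappa = {cohens_kappa(coder_a, coder_b):.3f}")  # 0.333 for this data
```

For a weighted kappa you would supply, say, a smaller weight for a disagreement between two exclusion reasons than for an include-vs-exclude disagreement, reflecting the point above about the relative seriousness of disagreements.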
If you are calculating a kappa statistic, I have found this paper useful for interpreting the value: 'Viera A, Garrett J (2005) "Understanding Interobserver Agreement: The Kappa Statistic", Family Medicine, 37(5), pp. 360-363.'
Best regards,
Jeff