A systematic review of the evidence of reliability and validity of assessment by teachers used for summative purposes. Summary

Background

The proposal for this review resulted from the work of the Assessment Reform Group (ARG) over several years and the more recent reviews conducted by the Assessment and Learning Research Synthesis Group (ALRSG), whose members include all the members of ARG. The review of classroom assessment initiated by ARG, and carried out by Black and Wiliam (1998), indicated that assessment used for formative purposes benefits teaching and learning, and raises standards of student performance. However, the ALRSG review, A systematic review of the impact of summative assessment and tests on students' motivation for learning, showed that high-stakes tests can have a negative impact on students' motivation for learning and on the curriculum and pedagogy. Nevertheless, summative assessment is necessary and serves important purposes in providing information that summarises students' achievement and progress for their teachers, parents, the students themselves and others who need this information. To serve these purposes effectively, summative assessment should interfere as little as possible with teaching methods and the curriculum and, importantly, should reflect the full range of learning outcomes, particularly those needed for continued learning and for learning how to learn.

Assessment by teachers has the potential to provide summative information about students' achievement, since teachers can build up a picture of students' attainments across the full range of activities and goals. Although assessment by teachers is used as the main source of such information in some national and state assessment systems, in other countries it has a reputation for being unreliable and subject to bias. This review was undertaken to provide research evidence about the dependability of summative assessment by teachers and the conditions that affect it.

Definition of terms

Assessment is a term that covers any activity in which evidence of learning is collected in a planned and systematic way to draw inferences about learning. The purpose of the assessment determines how the information is used. Thus assessment by teachers for summative purposes means:

any activity in which teachers gather evidence in a planned and systematic way about their students' learning to draw inferences based on their professional judgement to report achievement at a particular time.

The phrase 'about their students' learning' excludes from this definition the role of teachers as markers or examiners in the context of external examination, where they do not mark their own students' work. It includes teachers' assessments of their own students as part of an examination for external certification. The phrase 'based on their professional judgement' excludes assessment where information is gathered by teachers but marked externally, but would include students' self-assessment managed by teachers.

Reliability refers to how accurate the assessment is (as a measurement); that is, if repeated, how far the second result would agree with the first.

Validity refers to how well what is assessed matches what it is intended to assess. Different forms of validity derive from different ways of estimating it. Construct validity is a useful overarching concept.

Since reliability and validity are not independent of each other - and increasing one tends to decrease the other - it is useful in some contexts to refer to dependability as a combination of the two. The approach to summative assessment by teachers giving the most dependable result would protect construct validity, while optimising reliability.

Aims of the review

The aims of this review were as follows:

  • To conduct a systematic review of research evidence to identify and summarise evidence relating to the reliability and validity of the use of teachers' assessment for summative purposes
  • To determine the conditions that affect the reliability and validity of teachers' summative assessment
  • To map the characteristics of studies reporting on the reliability and validity of teachers' assessment
  • In consultation with potential users of the review, to draw from this evidence implications of the findings for different user groups, including practitioners, policy-makers, those involved in teacher education and professional development, employers, parents and pupils
  • To identify further research that is needed in this area and the focus of subsequent reviews that might be undertaken by the ALRSG
  • To publish the full report and short summaries for different user groups in the Research Evidence in Education Library (REEL).

Review questions

Thus the review was designed to address the main question:

  • What is the research evidence of the reliability and validity of assessment by teachers for the purposes of summative assessment?

and the subsidiary question:

  • What conditions affect the reliability and validity of teachers' summative assessment?

In order to achieve all the aims of the review, it was necessary to address the further question:

  • What are the implications of the findings for policy and practice in summative assessment?

Methods

The review methodology followed the procedures devised by the Evidence for Policy and Practice Information and Co-ordinating Centre (EPPI Centre), and the Review Group received the technical support of the EPPI Centre. Criteria were defined to guide a wide-ranging search for studies that dealt with some form of summative assessment conducted by teachers, involving students in school in the age range 4 to 18, and reporting on the validity and/or reliability of the methods used. Bibliographic databases and registers of educational research were searched online, as were relevant online journals; other journals, and back numbers of those only recently put online, were searched by hand. Other studies were found by scanning the reference lists of already-identified reports, making requests to members of relevant associations and other review groups, and using personal contacts.

All studies identified in these ways were screened using inclusion and exclusion criteria, and the included studies were then keyworded using the Core Keywording Strategy (EPPI Centre, 2002a) and additional keywords specific to the context of the review. The keywords were used to produce a map of the selected studies. Detailed data extraction was carried out online in EPPI-Reviewer, independently by two reviewers who then worked together to reach a consensus, following the Review Guidelines for Extracting Data and Quality Assessing Primary Studies in Educational Research (EPPI Centre, 2002b). Review-specific questions relating to the weight of evidence of each study in the context of the review were used in addition to those of EPPI-Reviewer. Judgements were made as to the weight of evidence that each study provided for the review, in relation to methodological soundness, appropriateness of the study type and relevance of the focus to the review questions.

The structure for the synthesis of evidence from the in-depth review was based on the extent to which the studies were concerned with reliability or validity of the assessment. Despite the difficulty in making a clear distinction between these concepts, and their inevitable interdependence, it was possible to designate each one as providing evidence primarily in relation to reliability or primarily in relation to validity. Evidence in relation to the conditions affecting reliability or validity was drawn together separately. In the synthesis and discussion, reference was made to the weight of evidence provided by each study.

Potential users of the review were involved in several ways: providing advice as members of the review group; providing information about studies through personal contact; participating in keywording and in data extraction; and through a consultation seminar on implications of the draft findings of the review attended by a number of policy and practitioner users.

Results

Identification of studies

The search found a total of 431 papers. Of these, 369 were excluded using the exclusion criteria, leaving 62 papers. Full texts were obtained for 48 of these, from which a further 15 were excluded, and two sets of papers (three in one case and two in the other) were linked because each set reported on the same study. This left 30 studies after keywording, all of which were included in the in-depth review.

Systematic map

The 30 studies included in the in-depth review were mapped in terms of the EPPI Centre and review-specific keywords. All were written in the English language: 15 were conducted in England, 12 in the United States and one each in Australia, Greece and Israel. All studies were concerned with students between the ages of 4 and 18. Of the 30, 11 involved primary school or nursery students (aged 10 or below) only, 13 involved secondary students (aged 11 or above) only, and six were concerned with both primary and secondary students. There was no variation across educational settings in terms of whether the study focus was on reliability or validity, but there were slightly more evaluations of naturally-occurring situations in primary schools. Almost all studies set in primary and nursery schools involved assessment of mathematics and a high proportion related to reading. At the secondary level, studies of assessment of mathematics and ‘other’ subjects (variously concerned with foreign languages, history, geography, Latin and bible studies) predominated.

Eighteen studies were classified as involving assessment of work as part of, or embedded in, regular activities. Three were classified as portfolios, two as projects and nine were either set externally or set by the teacher to external criteria. The vast majority were assessed by teachers, using external criteria. The most common purpose of the assessment in the studies was for national or state-wide assessment programmes, with six studies related to certification and another six to informing parents (in combination with other purposes). As might be expected in the context of summative assessment, most research related to the use of external criteria by teachers, with little research on student self-assessment or teachers using their own criteria.

In-depth review

Findings from studies of reliability of assessment based on teachers’ judgements

There was evidence of high weight for the following:

  • The reliability of portfolio assessment where tasks were not closely specified was low (Koretz et al., 1994; Shapley and Bush, 1999); this finding has been used as an argument for increasing the match between task and assessment criteria by closer specification of tasks.
  • The finer specification of criteria, describing progressive levels of competency, has been shown to be capable of supporting reliable teachers' assessment (TA) while allowing evidence to be used from the full range of classroom work (Rowe and Hill, 1996).
  • Studies of the National Curriculum Assessment (NCA) for students aged 6 and 7 in England and Wales in the early 1990s found considerable error and evidence of bias in relation to different groups of students (Shorrocks et al., 1993; Thomas et al., 1998).
  • A study of the NCA for 11-year-olds in England and Wales in the later 1990s showed that the results of TA and standard tasks agree to an extent consistent with the recognition that they assess similar but not identical achievements (Reeves et al., 2001).
  • The clearer teachers are about the goals of students’ work, the more consistently they apply assessment criteria (Hargreaves et al., 1996).
  • When rating students' oral proficiency in a foreign language, teachers are consistently more lenient than moderators, but are able to place students in the same rank order as experienced examiners (Good, 1988a; Levine et al., 1987).

There was evidence of medium weight for the following:

  • Interpretation of correlations of TA and standard task results for seven-year-olds should take into account the variability in the administration of standard tasks (Abbott et al., 1994).
  • Teachers who have participated in developing criteria are able to use them reliably in rating students’ work (Hargreaves et al., 1996; Frederiksen and White, 2004).
  • Teachers are able to score hands-on science investigations and projects with high reliability using detailed scoring criteria (Frederiksen and White, 2004; Shavelson et al., 1992).

Findings from studies reporting the validity of assessments based on teachers’ judgement

There was evidence of high weight for the following:

  • Teachers' judgements of the academic performance of young children are influenced by the teachers' assessment of their behaviour; this adversely affects the assessment of boys compared with girls (Bennett et al., 1993).
  • The introduction of TA as part of the national curriculum assessment initially had a beneficial effect on teachers' planning and was integrated into teaching (Hall et al., 1997); subsequently, however, in the later 1990s, there was a decline in earlier collaboration among teachers and sharing interpretations of criteria, as support for TA declined and the focus changed to other initiatives (Hall and Harding, 2002).
  • The validity of a science project as part of 'A' level examinations for assessing skills different from those used in regular laboratory work was reduced when the project assessment was changed from external to internal by teachers (Brown, 1998).
  • Teachers' judgements guided by checklists and other materials in the Work Sampling System were found to have high concurrent validity for assessment of kindergarten (Kg) to Grade 3 students (Meisels et al., 2001).
  • Teachers' judgements of students' performance are likely to be more accurate in aspects more thoroughly covered in their teaching (Coladarci, 1986).

There was evidence of medium weight for the following:

  • There is variation of practice among teachers in their approaches to TA, type of information used and application of national criteria (Gipps et al., 1996; Radnor, 1995).
  • There is conflicting evidence as to the relationship between teachers' ratings of students' achievement and standardised test scores of the same achievement when the ratings are not based on specific criteria (Hopkins et al., 1985; Sharpley and Edgar, 1986).
  • The rate at which young children can read aloud is a valid curriculum-based measure of reading progress as measured by a standardised reading test (Crawford et al., 2001).
  • Tentative estimates of construct validity of portfolio assessment, derived from evidence of correlations of portfolios and tests, were low (Koretz et al., 1994; Shapley and Bush, 1999).
  • Teacher assessment of practical skills in science makes a valid contribution to assessment at 'A' level within each science subject, but there is little evidence of generalisability of skills across subjects (Brown et al., 1996).
  • Teachers' perceptions of students' ability and probability of success on a test are moderately valid predictors of performance on the test, as are student self-assessments of their performance on a test after they have taken it (Wilson and Wright, 1993).

Evidence in relation to the conditions that affect the reliability and validity of teachers’ summative assessment

Both high- and medium-weight evidence indicated the following:

  • There is bias in teachers' assessment (TA) relating to student characteristics, including behaviour (for young children), gender and special educational needs; overall academic achievement and verbal ability may influence judgement when assessing specific skills.
  • There is variation in the level of TA and in the difference between TA and standard tests or tasks that is related to the school. The evidence is conflicting as to whether this is increasing or decreasing over time. There are differences among schools and teachers in approaches to conducting TA.
  • There is no clear view of how the reliability and validity of TA varies with the subject assessed. Differences between subjects in how TA compares with standard tasks or examinations results have been found, but there is no consistent pattern suggesting that assessment in one subject is more or less reliable than in another.
  • It is important for teachers to follow agreed procedures if TA is to be sufficiently dependable to serve summative purposes. In increasing reliability, there is a tension between closer specification of the task and of the conditions under which it is carried out, on the one hand, and closer specification of the criteria for judging performance, on the other.
  • The training required for teachers to improve the reliability of their assessment should involve teachers as far as possible in the process of identifying criteria so as to develop ownership of them and understanding of the language used. Training should also focus on the sources of potential bias that have been revealed by research.
  • Teachers can predict with some accuracy their students' success on specific test items and on examinations (for 16-year-olds), given specimen questions. There is less accuracy in predicting 'A' level grades (for 18-year-olds).
  • Detailed criteria describing levels of progress in various aspects of achievement enable teachers to assess students reliably on the basis of regular classroom work.
  • Moderation through professional collaboration is of benefit to teaching and learning as well as to assessment. Reliable assessment needs protected time for teachers to meet and to take advantage of the support that others, including assessment advisers, can give.

Conclusions

The implications of the findings of the review were explored through consultation with invited teachers, head teachers and researchers, as well as representatives of teachers' organisations, of the Association for Achievement and Improvement through Assessment (AAIA), and of UK government agencies involved in national assessment programmes. Some points went beyond the review findings and are listed separately after those directly arising from the research evidence.

Implications for policy

  • When deciding the method, or combination of methods, of assessment for summative assessment, the shortcomings of external examinations and national tests need to be borne in mind.
  • The essential and important differences between TA and tests should be recognised by ceasing to judge TA in terms of how well it agrees with test scores.
  • There is a need for resources to be put into identifying detailed criteria that are linked to learning goals rather than to specially devised assessment tasks. This will support teachers' understanding of the learning goals and may make it possible to equate the curriculum with assessment tasks.
  • It is important to provide professional development for teachers in undertaking assessment for different purposes that addresses the known shortcomings of TA.
  • The process of moderation should be seen as an important means of developing teachers' understanding of learning goals and related assessment criteria.

Implications for practice

  • Teachers should not judge the accuracy of their assessments by how far they correspond with test results, but by how far they reflect the learning goals.
  • There should be wider recognition that clarity about learning goals is needed for dependable assessment by teachers.
  • Teachers should be made aware of the sources of bias in their assessments, including the ‘halo’ effect, and school assessment procedures should include steps that guard against such unfairness.
  • Schools should take action to ensure that the benefits of improving the dependability of assessment by teachers are sustained: for example, by protecting time for planning assessment, in-school moderation, etc.
  • Schools should develop an 'assessment culture' in which assessment is discussed constructively and positively, and not seen as a necessary chore (or evil).

Implications for research

  • There should be more studies of how teachers go about assessment for different purposes, what evidence they use, how they interpret it, etc.
  • The reasons for teachers' overestimation of performance, compared with moderators' judgements of the same performance, need to be investigated to find out, for instance, whether a wider range of evidence is used by the students' own teachers, or whether criteria are differently interpreted.
  • More needs to be known about how differences between schools influence the practice and dependability of individual teachers.
  • Since evaluating TA by correlation with test results is based on the false premise that they assess the same things, other ways need to be found for evaluating the dependability of TA.
  • There needs to be research into the effectiveness of different approaches to improving the dependability of TA, including moderation procedures.
  • Research should bring together knowledge of curriculum planners, learning psychologists, assessment specialists and practitioners to produce more detailed criteria that can guide TA.

Additional points related to the review identified in consultation with users

  • It is important to consider the purpose of assessment in deciding the strengths and weaknesses of using teachers' assessment in a particular case. For instance, when assessment is fully under the control of the school and is used for informing pupils and parents of progress ('internal purposes'), the need to combine TA with other evidence (e.g. tests) may be less than when the assessment results are used for ‘external’ purposes, such as accountability of the school or selection or certification of students.
  • There needs to be greater recognition of the difference between purposes of summative assessment and of how to match the way it is conducted with its purpose. For instance, the 'internal' assessment that is under the control of the school should not emulate the 'external' assessment which has different purposes.
  • If tests are used, they should be reported separately from TA, which should be independent of the test scores.
  • There is evidence that a change in national assessment policy is due. The current system is not achieving its purpose. The recent report on comparability of national tests over time (Massey et al., 2003) concludes that TAs have shown less change in standards than the national tests. The authors state, 'National testing in its current form is expensive, primarily because of the external marking of the tests, and the time may soon come when it is thought that these resources may make a better contribution elsewhere' (Massey et al., 2003, p 239).
  • Improving teachers' formative assessment would also improve their summative assessment and so should be a part of a programme of professional development aimed at enabling teachers' judgements to be used for summative purposes.
  • The role that pupils can take in their own summative assessment needs to be investigated and developed.
  • Any change towards greater use of TA in current systems where summative assessment is dominated by tests requires a major switch in resources from test development to supporting teacher-led assessment.
  • Change towards greater use of TA for summative purposes requires a long-term strategy, with strong 'bottom-up' elements and provision for local transformations.

References

Abbott D, Broadfoot P, Croll P, Osborn M, Pollard A (1994) Some sink, some float: national curriculum assessment and accountability. British Educational Research Journal 20: 155-174.

Bennett RE, Gottesman RL, Rock DA, Cerullo F (1993) Influence of behaviour perceptions and gender on teachers' judgements of students' academic skill. Journal of Educational Psychology 85: 347-356.

Black P, Wiliam D (1998) Assessment and classroom learning. Assessment in Education 5: 7-74.

Brown CR, Moor JL, Silkstone BE, Botton C (1996) The construct validity and context dependency of teacher assessment of practical skills in some pre-university level science examinations. Assessment in Education 3: 377-391.

Brown CR (1998) An evaluation of two different methods of assessing independent investigations in an operational pre-university level examination in biology in England. Studies in Educational Evaluation 24: 87-98.

Coladarci T (1986) Accuracy of teachers' judgements of students' responses to standardised test items. Journal of Educational Psychology 78: 141-146.

Crawford L, Tindal G, Steiber S (2001) Using oral reading rate to predict student performance on statewide achievement tests. Educational Assessment 7: 303-323.

Evidence for Policy and Practice Information and Co-ordinating Centre (EPPI Centre) (2002a) Core Keywording Strategy: Data collection for a register of educational research. Version 0.9.7. London: EPPI Centre, Social Science Research Unit.

EPPI Centre (2002b) Review Guidelines for Extracting Data and Quality Assessing Primary Studies in Educational Research. Version 0.9.7. London: EPPI Centre, Social Science Research Unit.

Frederiksen J, White B (2004) Designing assessment for instruction and accountability: an application of validity theory to assessing scientific inquiry. In: Wilson M (ed.) Towards Coherence between Classroom Assessment and Accountability, 103rd Yearbook of the National Society for the Study of Education, Part II. Chicago, IL, USA: National Society for the Study of Education.

Gipps C, McCallum B, Brown M (1996) Models of teacher assessment among primary school teachers in England. The Curriculum Journal 7: 167-183.

Good FJ (1988a) Differences in marks awarded as a result of moderation: some findings from a teacher assessed oral examination in French. Educational Review 40: 319-331.

Hall K, Webber B, Varley S, Young V, Dorman P (1997) A study of teacher assessment at Key Stage 1. Cambridge Journal of Education 27: 107-122.

Hall K, Harding A (2002) Level descriptions and teacher assessment in England: towards a community of assessment practice. Educational Research 44: 1-15.

Hargreaves DJ, Galton MJ, Robinson S (1996) Teachers' assessments of primary children's classroom work in the creative arts. Educational Research 38: 199-211.

Hopkins KD, George CA, Williams DD (1985) The concurrent validity of standardised achievement tests by content area using teachers' ratings as criteria. Journal of Educational Measurement 22: 177-182.

Koretz D, Stecher BM, Klein SP, McCaffrey D (1994) The Vermont Portfolio Assessment Program: findings and implications. Educational Measurement: Issues and Practice 13: 5-16.

Levine MG, Haus GJ, Cort D (1987) The accuracy of teacher judgement of the oral proficiency of high school foreign language students. Foreign Language Annals 20: 45-50.

Massey A, Green S, Dexter T, Hamnett L (2003) Comparability of National Tests over Time: Key stage test standards between 1996 and 2001. London: Qualifications and Curriculum Authority (QCA).

Meisels SJ, Bickel DD, Nicholson J, Xue Y, Atkins-Burnett S (2001) Trusting teachers' judgements: a validity study of a curriculum-embedded performance assessment in kindergarten to Grade 3. American Educational Research Journal 38: 73-95.

Radnor HA (1995) Evaluation of Key Stage 3 Assessment Arrangements for 1995: Final report. Exeter: University of Exeter.

Reeves DJ, Boyle WF, Christie T (2001) The relationship between teacher assessment and pupil attainments in standard test/tasks at Key Stage 2, 1996-8. British Educational Research Journal 27: 141-160.

Rowe KJ, Hill PW (1996) Assessing, recording and reporting students' educational progress: the case for 'subject profiles'. Assessment in Education 3: 309-352.

Shapley KS, Bush MJ (1999) Developing a valid and reliable portfolio assessment in the primary grades: building on practical experience. Applied Measurement in Education 12: 11-32.

Sharpley CF, Edgar E (1986) Teachers' ratings vs standardised tests: an empirical investigation of agreement between two indices of achievement. Psychology in the Schools 23: 106-111.

Shavelson RJ, Baxter GP, Pine J (1992) Performance assessments: political rhetoric and measurement reality. Educational Researcher 21: 22-27.

Shorrocks D, Daniels S, Staintone R, Ring K (1993) Testing and Assessing 6 and 7 Year-Olds: The evaluation of the 1992 Key Stage 1 National Curriculum Assessment. UK: National Union of Teachers and Leeds University School of Education.

Thomas S, Madaus GF, Raczek AE, Smees R (1998) Comparing teacher assessment and the standard task results in England: the relationship between pupil characteristics and attainment. Assessment in Education 5: 213-246.

Wilson J, Wright CR (1993) The predictive validity of student self-evaluations, teachers' assessments, and grades for performance on the verbal reasoning and numerical ability scales of the differential aptitude test for a sample of secondary school students attending rural Appalachia schools. Educational and Psychological Measurement 53: 259-270.

This report should be cited as: Harlen W (2004) A systematic review of the evidence of reliability and validity of assessment by teachers used for summative purposes. In: Research Evidence in Education Library. London: EPPI Centre, Social Science Research Unit, Institute of Education, University of London. 

  