Effective primary school tests

What elements make for an effective and reliable assessment? Test developer Catherine Kirkup outlines the key criteria that primary schools should consider when selecting their approaches to pupil assessment.

As a test developer in NFER's Centre for Assessment, I am proud of the tests we develop for schools. As a researcher working to improve the learning and the lives of pupils, my primary concern is that schools choose valid and reliable tests. So what makes a good test? In this article, I will attempt to explain what questions schools should ask when they are considering purchasing any new tests.

Do these tests assess what we want to measure (i.e. are they valid tests)?

The validity of a test is important because teachers draw conclusions about a pupil's attainment or progress based on those test results (alongside other classroom evidence such as homework).

You may think it goes without saying that a mathematics test will measure mathematical ability, but unfortunately not all tests have good validity.

For example, a maths test that contains only questions about number will not provide a valid assessment of a pupil's understanding of shape. So how do you know if a test you are thinking of using is a valid one?

First, the content should include adequate coverage of the underlying curriculum, the attainment of which you want to measure. There will always be some aspects of the curriculum that cannot be assessed within a paper and pencil assessment, but a valid test should include a comprehensive range of questions addressing the assessable domain, particularly focusing on any key skills or learning milestones that indicate progress within the subject for the particular year group.

Examine any sample materials and read any information about how and when the tests were developed. Look for evidence that the content:

Has been mapped to the latest version of the national curriculum.
Assesses essential aspects of the curriculum (key skills).
Is age-appropriate.
Has been written by authors with curriculum expertise and experience in assessment.
Has undergone a rigorous development process, e.g. reviewed by curriculum experts and teachers.

Second, the purpose for which the test was developed should ideally match how you intend to use it. Some tests are designed to provide diagnostic information on particular topics whereas others have been written to provide summative information. If you wish to prepare pupils for end of key stage tests, choose ones that have a similar look and feel so that you are giving them realistic test-taking practice.

Pupils do not always perform in a similar way if the format of the test is different, so online tests may not accurately indicate pupil performance on a paper and pencil test. That doesn't mean you should not use both paper and online tests. Just be aware of the potential mode effects and be clear about why you are using each format.

Will they provide accurate and reliable results?

Perhaps the most important feature of a good test is that it should provide reliable outcomes. To do this, a test needs to be technically sound. The reliability of a test will be adversely affected by problems such as ambiguously worded questions, biased questions, poor administration guidance and poor mark schemes.

The length of the test should also be appropriate – short tests can be unreliable because they provide insufficient evidence, whereas very long tests can result in pupil errors due to fatigue. Effective quality-assurance and extensive trialling enable test developers to minimise sources of error such as these and to establish a test's reliability, i.e. the extent to which the test provides consistent results.

Test developers may report reliability in a number of different ways (correlation coefficients, standard errors of measurement, etc), but you do not need to be a technical expert. Simply make sure that information about the test's reliability is available. If it is not, look at other sources of tests.
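For readers curious about what sits behind such figures, the sketch below computes two commonly reported statistics: Cronbach's alpha (one widely used internal-consistency coefficient) and the standard error of measurement. This is an illustration only. The article does not say which statistics any particular publisher reports, and the pupil marks and function names here are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Internal-consistency reliability.

    item_scores: one list per pupil, holding that pupil's mark on each question.
    """
    n_items = len(item_scores[0])
    # Variance of the marks on each question, taken across pupils.
    item_vars = [
        pvariance([pupil[i] for pupil in item_scores]) for i in range(n_items)
    ]
    # Variance of pupils' total marks.
    total_var = pvariance([sum(pupil) for pupil in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def standard_error_of_measurement(score_sd, reliability):
    """Typical size of the error in one pupil's score, in score units."""
    return score_sd * sqrt(1 - reliability)


# Hypothetical marks (1 = correct, 0 = incorrect) for five pupils on four questions.
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(marks), 2))
print(round(standard_error_of_measurement(15, 0.91), 1))
```

A reliability coefficient closer to 1 and a smaller standard error of measurement both indicate more consistent results. A real trial would use far more pupils and questions than this toy example.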

The size of the pupil sample is also important if you wish to benchmark your pupils against pupil attainment nationally. If you intend to use tests for this purpose then you should buy tests that have been trialled on a nationally representative sample of pupils. Usually this requires a minimum of around 1,200 pupils per test across a range of schools representing different regions, different types of schools, different school performance bands, etc.
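To give a rough sense of why sample size matters, the sketch below shows how the uncertainty in an estimated national average shrinks as the sample grows. It assumes a simple random sample and a score spread (standard deviation) of 15, both of which are illustrative assumptions. Real trials sample whole schools, so the true uncertainty would be somewhat larger than these figures.

```python
from math import sqrt


def standard_error_of_mean(score_sd, sample_size):
    """Uncertainty in the estimated average score for a simple random sample."""
    return score_sd / sqrt(sample_size)


# Assumed spread of scores (hypothetical): standard deviation of 15.
for n in (300, 1200, 2400):
    print(n, round(standard_error_of_mean(15, n), 2))
# Prints:
# 300 0.87
# 1200 0.43
# 2400 0.31
```

Note that doubling the sample from 1,200 to 2,400 pupils reduces the uncertainty, but by less than half. Gains in accuracy become smaller as samples grow.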

Slightly larger samples will give more accurate standardised scores because these will have been calculated on larger numbers of pupils. It is also important to use tests that have been standardised at an appropriate time, e.g. after any significant change to what is taught in a particular year group.

Because there have been substantial changes to the mathematics curriculum in England, tests purporting to give outcomes relating to attainment of the 2014 national curriculum should have been trialled with pupils who have been taught the new content, e.g. at the end of 2014/15 or later. So ask questions or look for information on the following:

Have the tests been trialled in schools?
If yes, what was the size of the sample? Was it nationally representative?
When was the standardisation carried out?

How useful will the outcomes be?

If a test has been standardised you will usually be provided with look-up tables or online software to convert raw scores to scaled scores or standardised scores. This allows you to do a number of things:

It makes it easier to compare the performance of pupils (a score of 35 out of 50 is not very useful unless you know the average score and spread of scores for all pupils). Usually tests are standardised by converting raw scores to a scale (e.g. 70 to 130) with the average set at 100. This allows you to easily see whether a pupil is above or below the average of all the pupils that took the test.
Assuming a sufficiently large and representative sample, scaled scores will also allow you to compare individual pupil performance and the cohort performance against the national average.
Some test publishers will offer both standardised scores (no adjustment for age) and age-standardised scores. In calculating age-standardised scores pupils are only compared with other pupils of a similar age. These scores can be useful particularly for putting the performance of younger pupils into context (e.g. in discussions with parents).
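As a simplified illustration of the conversion described above, the sketch below turns a raw score into a standardised score with the average set at 100 and the scale limited to 70 to 130. The standard deviation of 15, the linear method and the trial scores are all assumptions made for this example. In practice you would use the publisher's look-up tables or software, which may apply a different method.

```python
from statistics import mean, pstdev


def standardised_score(raw, sample_raw_scores,
                       scale_mean=100, scale_sd=15, lowest=70, highest=130):
    """Convert a raw score to a standardised score (simplified linear method)."""
    # How far the raw score is from the sample average, in standard deviations.
    z = (raw - mean(sample_raw_scores)) / pstdev(sample_raw_scores)
    score = round(scale_mean + scale_sd * z)
    # Keep the result within the reported scale.
    return max(lowest, min(highest, score))


# Hypothetical raw scores (out of 50) from a standardisation trial.
trial_scores = [20, 25, 30, 35, 40]

print(standardised_score(35, trial_scores))  # above the trial average of 30
print(standardised_score(30, trial_scores))  # exactly average
print(standardised_score(5, trial_scores))   # very low score, held at the scale minimum
```

An age-standardised score could be sketched in the same way by passing in only the trial scores of pupils of a similar age.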

The above scores are often referred to as "norm-referenced" because performance is compared with the average or "norm". However, some test publishers may also provide some criterion-referenced outcomes that describe how a pupil's performance compares against a set of predetermined criteria or learning standards.

Criterion-referenced outcomes are generally based on expert opinion about what a pupil might be expected to know and do at a particular age in a particular subject. Using rigorous standard-setting methods that evaluate each item in the test, cut scores (the marks indicative of each grade boundary) are determined that best differentiate pupils who have met the required standard from those working below or above.
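Once cut scores have been set, applying them is straightforward, as the sketch below shows. The cut scores of 18 and 34 marks are invented for illustration. Real cut scores come from the publisher's standard-setting exercise for each individual test.

```python
from bisect import bisect_right

# Hypothetical cut scores: the lowest raw mark needed for each higher band.
CUT_SCORES = [18, 34]
OUTCOMES = [
    "working below age-related expectations",
    "working at age-related expectations",
    "working above age-related expectations",
]


def outcome(raw_mark):
    """Return the band a raw mark falls into, given the cut scores."""
    # bisect_right counts how many cut scores the mark has reached or passed.
    return OUTCOMES[bisect_right(CUT_SCORES, raw_mark)]


for mark in (17, 18, 33, 34):
    print(mark, outcome(mark))
```

A pupil scoring exactly on a cut score is placed in the higher band, which is why the precision of those boundaries matters.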

So ask questions or look for information on the following:

Are scaled scores, standardised scores and/or age-standardised scores included?
Is information provided that explains how you can track pupil progress?
Are additional outcomes available, e.g. whether pupils are working below, at or above age-related expectations? Were these criterion-referenced outcomes developed in an appropriate way?

Finally, consider what optional extras (performance analyses, marking packages, or advice about identifying strengths and weaknesses) the test publisher provides and how important they are to your school.

It might surprise you to learn that some published tests have never been trialled in schools. Check sample materials or the test publisher's website for information before buying, or ask questions by email or telephone. If the publisher is unable to give you satisfactory responses, look elsewhere.

Catherine Kirkup is a research director at NFER's Centre for Assessment. For information on NFER tests for assessment, visit www.nfer.ac.uk/nt7
