Holistic vs analytic approach

ABSTRACT

In Sweden, grades are used for selection to upper-secondary school and higher education, even though agreement in teachers’ grading is low and the selection therefore potentially unfair. Furthermore, measures taken to increase the agreement have not been successful. This study has explored how to increase agreement in teachers’ grading by comparing analytic and holistic grading. Teachers [n = 74] were randomly assigned to one of two conditions [i.e. analytic or holistic grading] in either English as a foreign language [EFL] or mathematics. Findings suggest that analytic grading is preferable to holistic grading in terms of agreement among teachers. The effect was stronger, and statistically significant, in EFL, but could be observed in mathematics as well. However, teachers in the holistic conditions provided more references to criteria in their justifications, whereas teachers in the analytic conditions, to a much larger extent, made references to grade levels without specifying criteria.

In Sweden, where this study is situated, grades are high stakes for students. Grades are the only criterion used for selecting students as they leave compulsory school and apply for upper-secondary school. When students apply for higher education, selection is also made based on the ‘Swedish Scholastic Aptitude Test’, but a minimum of one third of the seats [often more] is allocated based on grades. Given that it is the individual teacher who synthesises students’ performances into a grade, and that the reliability of teachers’ grades has been questioned [e.g. Swedish National Agency of Education, 2019], this is a problematic situation which can potentially have a significant influence on the lives of thousands of students each year.

Measures have been taken to increase the agreement in teachers’ grading, most recently by legally requiring that teachers in primary and secondary education take results from national tests into account when grading. This relatively loose strategy of ‘taking test results into account’ has not yet yielded any observable change in teachers’ grades, however, and there is still a large discrepancy between teacher-assigned grades and test results for most subjects [Swedish National Agency of Education, 2019]. The current situation may potentially lead to political demands for tying grades even more closely to test results, which in turn may have other unwanted consequences. For example, research suggests that the most widespread consequence of high-stakes testing is a narrowing of the curriculum, where teachers pay less attention to [or even exclude] subject matter that is not tested [Au, 2007; Pedulla et al., 2003]. Furthermore, in their review of research on the impact of high-stakes testing on student motivation, Harlen and Deakin Crick [2003] conclude that results from such tests have been found to have a ‘particularly strong and devastating impact’ [p. 196] on low-achieving students.

The question addressed by this study is therefore whether it is possible to increase the agreement in teachers’ grading by other means than national tests, such as specifying how teachers synthesise data on student performance into a grade.

The term ‘grading’, as used here, means making a decision about students’ overall performance according to [implicit or explicit] grading criteria. It follows from this definition that grades [as a product] are composite ‘measures’ [i.e. expressed along an ordinal scale] of student proficiency, based on a more or less heterogeneous collection of data on student performance. It also follows from the definition that grading, as a process, involves human judgement.

In Sweden, it is the responsibility of the individual teacher to synthesise students’ performances into a grade, and the teacher’s decision cannot be appealed. However, in those subjects where there are national tests, the results from these tests have to be taken into account when grading. Exactly how the test results should be taken into account, or their weight in relation to the student’s other performances, is not specified. Obviously, a great deal of trust has been vested in teachers’ capacity to grade consistently and fairly.

As shown in the review by Brookhart et al. [2016], covering more than 100 years of research about assessment and grading, research on teachers’ grading has a long history. In this research, two findings are particularly consistent: [a] Although student achievement is the factor that above all others determines a student’s grade, grades commonly include other factors as well, most notably effort and behaviour, and [b] There is great variability among teachers regarding both the process and product of grading [Brookhart, 2013].

Regarding the inclusion of non-achievement factors when grading, this seems primarily to be an effect of teachers wanting the grading to be fair to the students, which means that teachers find it hard to give low grades to students who have invested a lot of effort [Brookhart, 2013; Brookhart et al., 2016]. Including non-achievement factors is therefore a way for teachers to balance an ethical dilemma in cases where low grades are anticipated to have a negative influence on students.

The variation in scores, marks, and grades between different teachers, and also for individual teachers on different occasions, has been extensively investigated. Several of the recent reviews of research about the reliability of teachers’ assessment and grading make reference to the early studies by Starch and Elliott [e.g. Brookhart et al., 2016; Parkes, 2013], who compared teachers’ marking of student performance in English, mathematics, and history [Starch & Elliott, 1912, 1913a, 1913b]. These authors used a 100-point scale, and teachers’ marks in English [n = 142], for example, covered approximately half of that scale [60–97 and 50–97 points respectively for the two tasks]. In history, the variability was even greater than it was for English and mathematics. They therefore concluded that the variation is a result of the examiner and the grading process rather than the subject [for an overview, see Brookhart et al., 2016].

Interestingly, almost a hundred years later, Brimi [2011] used a design similar to that used in the Starch and Elliott study in English, but with a sample of teachers specifically trained in assessing writing. The results, however, were the same [50–93 points on a 100-point scale].

In his review on the reliability of classroom assessments, Parkes [2013] also turns his attention to the intra-rater reliability of teachers’ assessment. As an example, Eells [1930] compared the marking of 61 teachers in history and geography on two occasions, 11 weeks apart. The share of teachers making the same assessment on both occasions varied from 16 to 90 percent across the different assignments. The 90 percent agreement was an extreme outlier, however, and the other values clustered around 25 percent. None of the teachers made the same assessment for all assignments, and the estimated reliability ranged from .25 to .51. The author concludes that ‘It is unnecessary to state that reliability coefficients as low as these are little better than sheer guesses’ [p. 52].

A number of objections can be made in relation to the conclusions above, due to limitations of the studies. For example, as pointed out by Brookhart [2013], the tasks used by Starch and Elliott would not be considered high-quality items according to current standards. Rather, they would be anticipated to be difficult to assess and likely to lead to large variations in marking. Another limitation is that most studies are ‘one-shot’ assessments, where teachers are asked to assess or grade performances from unknown or fictitious students. While it may be argued that such assessments are more objective, they miss the idea that teachers’ assessments can become more accurate over time as evidence of student proficiency accumulates. Further, their assessments can potentially become more valid as teachers come to understand what their students mean, even if expressed poorly. Lastly, teachers do not always have access to assessment criteria, which also means that their assessments could be anticipated to vary greatly. Still, teachers’ assessments are not always sufficiently reliable even with very detailed scoring protocols, such as rubrics. In a review of research on the use of rubrics, Jonsson and Svingby [2007] report that most assessments were below the threshold for acceptable reliability. Brookhart and Chen [2015], in a more recent review, claim that the use of rubrics can yield reliable results, but then criteria and performance-level descriptions need to be clear and focused, and raters need to be adequately trained. Taken together, even taking into account the limitations of some individual studies, the majority of studies on this topic point in the same direction with regard to the variability of teachers’ assessments and grading.

The underlying reasons for the observed variability in teachers’ assessments and grading are complex and involve both idiosyncratic beliefs and external pressure, as well as a number of other factors, such as subject taught, school level, and student characteristics [e.g. Duncan & Noonan, 2007; Isnawati & Saukah, 2017; Kunnath, 2017; Malouff & Thorsteinsson, 2016; McMillan, 2003; Randall & Engelhard, 2008]. Due to this complexity, teachers’ grading is sometimes portrayed in almost mystical terms, such as when Bloxham et al. [2016] write that assessment decisions, at least at the higher education level, are so ‘complex, intuitive and tacit’ [p. 466] that any attempts to achieve consistency in grading are likely to be more or less futile.

One possible reason for the observed variability in teachers’ grading, which is the focus of this study, is that teachers use different models [or strategies] for grading. Korp [2006] has, in a Swedish context, described three different models, which will be called holistic, arithmetic, and intuitive.

In the arithmetic model, the grade is calculated as a sum or a mean based primarily on test results. The arithmetic model therefore requires that the teachers document student performance quantitatively [cf. Sadler, 2005]. According to Korp, the teachers who used this model did not mention either the national curriculum or the grading criteria when talking about their grading practice.

In the intuitive model, students’ grades are influenced by a mixture of factors, such as test results, attendance, attitudes, and lesson activity. From these factors, the teacher may have a general impression of the student’s proficiency in the subject, and that, rather than the specific performance in relation to the grading criteria, will determine the grade.

In contrast to the arithmetic and intuitive models, in the holistic model the teacher compares all available evidence about student proficiency to the grading criteria and makes a decision based on this holistic evaluation. Since it is only in the holistic model that the teachers base their decision on explicit grading criteria, this is the only model in line with the intentions of the Swedish grading system.

The fact that the holistic model works in line with the intentions of the grading system does not mean that this model is easier for teachers to apply. On the contrary, the teachers in the study by Korp [2006] expressed dissatisfaction with the idea that they were expected to integrate different aspects of student performance. Not surprisingly, it is considerably easier to arrive at a composite measure of student proficiency when using a homogeneous set of data, such as points from written tests, as compared to the heterogeneous material in a portfolio [Nijveldt, 2007]. This tension between a unidimensional and a multidimensional basis for grading is therefore yet another instance of the reliability-versus-validity trade-off. While unidimensional data may result in more coherent and reliable grading, such data only represent a fraction of student proficiency in a subject. Multidimensional data, on the other hand, may provide a fuller and more valid picture of student proficiency but are more difficult to interpret and evaluate in a reliable manner.

In addition to the models presented by Korp [2006], Jönsson and Balan [2018] introduced yet another model, which is called ‘analytic’ and can be described as a combination of the holistic and arithmetic models. In this model, teachers compare student performance on individual tasks with the grading criteria, producing what is referred to as ‘assignment grades’, which are then used – more or less arithmetically – when assigning an overall grade. The analytic model thereby resembles the holistic model by making reference to explicit grading criteria [i.e. not only test scores or a general impression], but reduces the complexity of the final synthesis by already having transformed the heterogeneous data into a common scale.
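
To make the final step of the analytic model concrete, the following is a minimal sketch assuming the assignment grades are combined with a simple median rule. The model itself does not prescribe a particular algorithm, so the median rule, the grade-to-number mapping, and the example grades below are illustrative assumptions only.

```python
# Minimal sketch of the synthesis step in the analytic model (illustrative only):
# assignment grades on the shared A-F scale are combined arithmetically into an
# overall grade. The median rule below is an assumed example, not a prescribed one.
import statistics

# Swedish grading scale mapped to ordinal numbers (A = highest, F = fail).
GRADE_TO_NUM = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}
NUM_TO_GRADE = {v: k for k, v in GRADE_TO_NUM.items()}

def overall_grade(assignment_grades):
    """Combine assignment grades into an overall grade (hypothetical median rule)."""
    numeric = [GRADE_TO_NUM[g] for g in assignment_grades]
    # median_high resolves an even number of grades upwards (benefit of the doubt).
    return NUM_TO_GRADE[statistics.median_high(numeric)]

print(overall_grade(["C", "B", "D", "C"]))  # -> 'C'
```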

As suggested by Jönsson and Balan [2018], the reduction of complexity in the analytic model may contribute to increasing the consistency in teachers’ grading while preserving the connection to shared criteria. A disadvantage, however, is that each individual decision is based on a much smaller dataset, as compared to a holistic judgement, which takes all available evidence about student proficiency into account. Consequently, there is a risk that the analytic model may influence the validity of the grades negatively in terms of construct underrepresentation.

Analytic assessment involves assessing different aspects of student performance separately, such as mechanics, grammar, style, organisation, and voice in student writing. Holistic assessment, by contrast, means making an overall judgement of the performance, considering all criteria simultaneously.

The most apparent advantage of using analytic assessments is that they provide a nuanced and detailed image of student performance by taking different aspects into consideration. By judging the quality of, for instance, different dimensions of student writing, both strengths and areas in need of improvement may be identified, which, in turn, facilitates formative assessment and feedback.

In contrast, holistic assessments focus on the performance as a whole, which also has advantages. For example, holistic assessments are likely to be less time-consuming, and there are indications that teachers prefer them [e.g. Bloxham et al., 2011]. Furthermore, focusing on the whole prevents teachers from giving too much weight to individual parts. This is an important argument, since there is a widespread fear that easy-to-assess surface features receive more attention in analytic assessment than more complex features that are harder to assess [Panadero & Jonsson, 2020]. Sadler [2009], for instance, argues that teachers’ holistic and analytic assessments rarely coincide, and implies that analytic assessments may therefore not be valid. Focusing on different aspects of student performance can therefore be seen as both a strength and a weakness of analytic assessments: even if surface features do not receive more attention than other features, the assessments still need to be partitioned into separate criteria, with the risk that they become fragmented and miss the whole.

However, just as analytic assessments risk missing the whole when focusing on the parts, holistic assessments risk missing the parts when focusing on the whole. There is also, as with the intuitive model of grading, the risk of unintentionally including factors other than student performance, or of attaching weight to criteria other than those assumed [Bouwer et al., 2017]. For example, Tomas et al. [2019] used analytic rubrics as a supplement to holistic assessments in order to elicit the weightings of the criteria used. Their analyses indicated a ranking of the importance and contribution of different criteria to the overall marks that differed from the assessors’ expectations and assumptions. While the assessors were assumed to give priority to complex skills such as critical thinking, descriptions and explanations of concepts contributed more to the overall mark in their assessments. Since analytic assessments specify the criteria to consider, this situation is probably more common in holistic assessments. The authors therefore argue that analytic and holistic assessments should not be seen as opposites, but rather as complementary.

Still, there are researchers who are quite sceptical of analytic assessments [e.g. Bloxham et al., 2011; Sadler, 2009]. In their defence, it should be noted that these sceptics often work within systems that require teachers to numerically score student performance according to analytic criteria, and where the individual scores are arithmetically aggregated into a composite score, mark, or grade. Under such circumstances, where qualitative and quantitative aspects of assessment are mixed, the sum of the parts may very well depart from the teacher’s holistic judgement, depending on how the scoring logic or algorithm is designed, which may lead to frustration and dissatisfaction. It could be argued, however, that such specific applications should be kept apart from the basic principles of analytic and holistic assessments.

This study started from the observation that although grades are high stakes for students in Sweden, agreement in teachers’ grading is low, potentially making the selection to upper-secondary school and higher education unfair. Even though measures have been taken to increase the agreement in grading, these measures have not been very effective. Furthermore, tying teachers’ grading even more closely to results from national tests may have other negative consequences. This study therefore aims to explore other routes towards increased agreement in teachers’ grading by comparing the advantages and disadvantages of analytic and holistic grading models in terms of [a] agreement between teachers and [b] how teachers justify their decisions.

The overall design of this study is experimental: teachers were randomly assigned to one of two conditions [i.e. analytic or holistic grading] in either English as a foreign language [EFL] or mathematics [Table 1]. The study builds on a previous pilot study [Jönsson & Balan, 2018] and uses a similar methodology.

Information about the project was distributed in two different regions of the country, where teachers volunteered to participate in the study. The teachers chose whether to participate in EFL or mathematics and were then randomly assigned to one of the two conditions. The study was performed entirely online and no personal data has been collected, only grades and written justifications from the teachers.

In the analytic condition, the teachers received written responses to the same assignment from four students on four occasions during one semester [i.e. a total of 16 responses]. The assignments all addressed writing in EFL or mathematics, but otherwise had different foci [Table 2]. All responses were from students aged 12, but with different levels of proficiency in either subject. The responses were authentic student responses that had been anonymised.

The teachers were asked to grade the student responses within a week after receiving them and report their decisions through an internet-based form. At the end of the semester, teachers were asked to provide an overall grade for each of the four students, accompanied by a justification. The final grades and written justifications were used as data in the study.

In the holistic condition, participants were given all of the material on a single occasion so that they were not influenced by any prior assessments of the students’ responses. Similar to the analytic condition, they were asked to provide a grade for each of the four students and a justification for each grade.

A common method for estimating the agreement between different assessors [i.e. inter-rater agreement] is by using correlation analysis. Depending on whether scores [a continuous variable] or grades [a discrete variable] are given, either Pearson’s or Spearman’s correlation may be used. Spearman’s correlation [ρ] is a nonparametric measure of rank correlation, which is suitable for ordinal scales, such as grades. Naturally, if letter grades are used, they need to be converted to numbers in order to perform a correlation analysis. In this study, the grade A [i.e. the highest grade] has been converted to 6 and F [fail] to 1. Since only rank correlation has been used, no assumptions regarding equal distance between numbers are needed.
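
As an illustration of this step, the sketch below converts letter grades to the ordinal numbers described above and computes Spearman’s ρ for a pair of raters using scipy. The two grade lists are invented examples, not data from the study.

```python
# Minimal sketch: map letter grades to ordinal numbers (A = 6 ... F = 1) and
# compute Spearman's rank correlation for two raters. Grade lists are invented.
from scipy.stats import spearmanr

GRADE_TO_NUM = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}

rater_1 = ["A", "C", "E", "B"]   # hypothetical grades for four students
rater_2 = ["B", "C", "F", "B"]

rho, p_value = spearmanr([GRADE_TO_NUM[g] for g in rater_1],
                         [GRADE_TO_NUM[g] for g in rater_2])
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```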

A disadvantage of using correlation analysis is that the assessments of two assessors may be highly correlated even if they do not agree on the exact grade, only the internal ranking. Therefore, in this study Spearman’s correlation has been combined with measures of absolute agreement. Pair-wise comparison is a common measure of agreement [Jönsson & Balan, 2018], but it has some notable disadvantages. First, it does not take the possibility of the agreement occurring by chance into account. Second, all agreements are given equal weight, which means that a couple of assessors who agree, but whose assessments deviate strongly from the consensus of other assessors, still contribute to the overall agreement. Third, since the observed agreements are divided by the number of all possible comparisons between assessors, the estimation tends to be low and it could be argued to underestimate the agreement between teachers.

In this study, we have therefore used Fleiss’ κ and what we call ‘consensus agreement’ as measures of absolute agreement [for an in-depth discussion of different measures, see Stemler, 2004]. Fleiss’ κ provides an estimate similar to pair-wise agreement but takes the possibility of the agreement occurring by chance into account. We have also used ‘consensus agreement’, which is based on the idea that teachers as a group, or a ‘community of practice’, should ideally share a common interpretation of the grading criteria and standards, which the individual teacher either agrees or disagrees with. By using the most frequent grade for each student [i.e. the typical value], agreement among teachers has been estimated by calculating the proportion of grades that agree with the central tendency. Although this estimate of agreement only takes into consideration the proportion of grades that agree with the majority, it is generally higher than estimates based on pair-wise comparisons.
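
The sketch below illustrates, on an invented grade matrix, the three absolute-agreement measures discussed in this and the previous paragraph: pair-wise agreement, the ‘consensus agreement’ used in this study, and Fleiss’ κ computed with the standard formula. It is a simplified illustration under assumed data, not the exact procedure used in the analysis.

```python
# Minimal sketch of three absolute-agreement measures on invented data
# (rows = students, columns = raters).
from collections import Counter

grades = [
    ["C", "C", "B", "C", "C"],
    ["E", "E", "E", "D", "E"],
    ["A", "B", "A", "A", "B"],
    ["F", "E", "F", "F", "F"],
]

def consensus_agreement(rows):
    """Share of grades that coincide with the modal grade for each student."""
    hits = total = 0
    for row in rows:
        _, modal_count = Counter(row).most_common(1)[0]
        hits += modal_count
        total += len(row)
    return hits / total

def pairwise_agreement(rows):
    """Share of rater pairs assigning exactly the same grade (no chance correction)."""
    agree = pairs = 0
    for row in rows:
        n = len(row)
        pairs += n * (n - 1) // 2
        agree += sum(c * (c - 1) // 2 for c in Counter(row).values())
    return agree / pairs

def fleiss_kappa(rows):
    """Fleiss' kappa: chance-corrected agreement among multiple raters."""
    n = len(rows[0])                                   # raters per student
    counts = [Counter(row) for row in rows]
    # Observed agreement per student, then averaged.
    p_i = [(sum(c * c for c in cnt.values()) - n) / (n * (n - 1)) for cnt in counts]
    p_bar = sum(p_i) / len(rows)
    # Agreement expected by chance, from the overall grade distribution.
    totals = Counter(g for row in rows for g in row)
    p_e = sum((t / (n * len(rows))) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)

print(f"pairwise = {pairwise_agreement(grades):.2f}, "
      f"consensus = {consensus_agreement(grades):.2f}, "
      f"kappa = {fleiss_kappa(grades):.2f}")
```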

Finally, the median absolute deviation [MAD] has been calculated as a measure of distribution, and the Mann-Whitney U test has been used to test for statistical significance.
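
For completeness, here is a small sketch of these two statistics on invented numeric grades [A = 6 ... F = 1], using scipy’s mannwhitneyu for the U test; the grade lists are assumptions for illustration.

```python
# Minimal sketch: median absolute deviation (MAD) and a Mann-Whitney U test on
# invented numeric grades (A = 6 ... F = 1).
import statistics
from scipy.stats import mannwhitneyu

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

analytic_grades = [4, 4, 5, 4, 3, 4]   # hypothetical grades from one condition
holistic_grades = [2, 4, 5, 3, 6, 3]   # hypothetical grades from the other

u_stat, p_value = mannwhitneyu(analytic_grades, holistic_grades, alternative="two-sided")
print(f"MAD analytic = {mad(analytic_grades):.1f}, MAD holistic = {mad(holistic_grades):.1f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```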

The justifications for the grades were subjected to both qualitative and quantitative content analysis. First, all words the teachers used to describe the quality [either positively or negatively] of students’ performance were identified. All words were coded as different nodes in the data, even if they referred to the same quality. This was done in order to recognise the full spectrum of teachers’ language describing quality.

Second, all nodes were grouped in relation to commonly used criteria for assessing either writing [i.e. mechanics, grammar, organisation, content, style, and voice] or mathematics [i.e. methods, concepts, problem solving, communication, and reasoning]. In addition, some teachers in EFL made references to comprehensibility and whether students followed instructions and finished the task. Some [in both EFL and mathematics] also made inferences about students’ abilities or referred only to grade levels [i.e. not describing quality]. Four additional criteria, called ‘Comprehensibility’, ‘Rigour’, ‘Student’, and ‘Only grades’, were therefore added to the categorisation framework. For an example of the categorisation procedure, see Figure 1.

In the quantitative phase, the frequency of teachers’ references to the different criteria was used to summarise the findings and to enable a comparison between the two conditions.
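
The counting itself is straightforward; the sketch below shows one possible way to tally references per criterion from coded nodes. The node-to-criterion mapping and the coded nodes are invented for illustration and do not reproduce the study’s actual coding scheme.

```python
# Minimal sketch: tally references to criteria from coded nodes. The mapping and
# the coded nodes are hypothetical examples, not the study's actual coding scheme.
from collections import Counter

NODE_TO_CRITERION = {
    "varied vocabulary": "Content",
    "spelling errors": "Mechanics",
    "clear paragraphing": "Organisation",
    "hard to follow": "Comprehensibility",
    "a solid E": "Only grades",
}

coded_nodes = ["spelling errors", "varied vocabulary", "a solid E",
               "clear paragraphing", "spelling errors"]

criterion_counts = Counter(NODE_TO_CRITERION[node] for node in coded_nodes)
print(criterion_counts)  # e.g. Counter({'Mechanics': 2, 'Content': 1, ...})
```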

The estimations of agreement between and among teachers are summarised in Table 3. In EFL, the teachers in the analytic condition agreed on the exact same grade in two thirds of the cases, while teachers in the holistic condition agreed in three out of five cases. The kappa estimations suggest fair [holistic condition] to moderate [analytic condition] agreement. The mean rank correlation was high in both groups. The distribution estimate [MAD] was clearly greater in the holistic condition as compared to the analytic condition.

In mathematics, the agreement was also higher in the analytic condition as compared to the holistic condition, and the kappa estimations suggest slight [holistic condition] to fair [analytic condition] agreement. The mean rank correlation was high in both groups, and the distribution estimate was greater in the holistic condition as compared to the analytic condition.

Taken together, the agreement for the analytic condition is higher as compared to the holistic condition for both subjects, and for all estimates, although the difference is only statistically significant for EFL [p 
