New “instant replay” rule in effect for Fall 2013 contests
A “scoring variance” occurs when at least one judge awards a song a score that is significantly higher or lower than the other scores from the panel. Judges have long discussed these variances in the days and weeks following a contest; since early 2012, the Society’s Contest & Judging (C&J) Committee has been looking at methods to address significant scoring differences (variances) before they become official.
One C&J committee studied the data and decided that significant variances should be reviewed; a subsequent committee looked at how that could be done. The judging panels in each of the Spring 2013 contests simulated the process to gather data on how often variances occurred and whether judges would change their scores once a variance was noted. C&J reviewed all of the information in Toronto this July, approved the concept and forwarded the information to the Society Board of Directors (BoD) for approval.
The BoD has approved the program and rule changes. In anticipation, all judges and contest administrators were formally trained on the process at Category School and are prepared to go “live” in the Fall 2013 season. If a statistical variance occurs in the set of scores given for a performance of a song, the judges within the category will have the opportunity to review their sheets and potentially change their scores, much like instant replay.
The following detailed explanation comes from Kevin Keller, chair of the Society Contest & Judging Committee:
Probably all of us have experienced or heard “war” stories about the effect that one or two “outlier” scores may have had on the outcome of a contest — I know I’ve questioned some scores that some of my groups have received!
As a former Category Specialist (CS) and now as C&J Chair, a top priority for me has been the improvement of scoring consistency. No matter what other feedback gems a judge may share, you’ll have a hard time hearing or accepting anything he says when a questionable scoring inconsistency has been left unresolved. More importantly, there are cases in which quartets and choruses have missed cut-offs from Division to District, and from Prelims to International based on one score that is inconsistent with the rest of the judging panel. To get a “wrong” score and have no recourse is painful. The ideal would be for judges to simply get the score “right” and let the chips fall where they may.
Judges strive to be 100% accurate in real time
Judging a live performance and assigning a score is not a precise science. Judges go through training to identify skills, sights, sounds, delivery, etc. that are characteristic of a certain scoring level. From this, they home in on a particular score. Following each contest, every judge receives a scoring analysis showing how he (individually and as part of his category) did against the rest of the panel. Each CS reviews his judges’ performance that past weekend and asks his judges for details regarding variances. In addition, for the past three years, statistical analyses have been performed to assess the scoring performance of each judge. Thus, there is constant feedback for each judge regarding each contest.
Judges quickly become accurate and precise at scoring. That isn’t to say there is not variability between scores within and between categories; however, this isn’t always a bad thing. We’re human and we take in performances with our own filters of experience and expertise. Different judges bring different insights. Looking at overall scoring over time, judges have gotten more precise at defining your level, which is good!
However, there are unfortunate moments in which a judge awards a score that, upon reflection and more information/insight from other judges, causes him to admit that it was not the right score. (Not often, but it does happen.) Sometimes, the judge actually admits to the contestant in the post-contest “eval” that he got the score wrong. And the high judge isn’t always the right judge!
How does this happen? Usually, the judge focused on one element of the performance and did not properly weigh others. Physical or mental distractions can play a role. Multiply that by 3, 6, 9, 12, or 15 individual judges, and there is a real likelihood that some quartet or chorus gets an outlier score during a weekend. Based upon historical judging data, we see a variance flagged about once in every 20 songs. Because many of the variances concern the same competitor, we would expect to see one or two variances in a typical district quartet semifinals and perhaps one in a typical district chorus contest.
Details on the new scoring review system
As stated at the top of this post, all of the Spring 2013 judging panels experimented with a proposed tweak to the scoring system. No scores were changed officially, but the judges were allowed to walk through the process and give feedback. Based upon all of the data and feedback, the C&J Committee has approved the program and submitted rule changes to the Society BoD, which reviewed these and approved the program. At our recent Category School, all categories trained and prepared for implementation in the Fall 2013 season.
What will change? Here is the basic process:
- When the Contest Administrator (CA) enters the scores for the entire panel, a formula will determine whether the most extreme score (high or low) for a given song is a statistical variance. When it is, a flag comes up. The CA will print a report with all scores and the score that triggered the variance will be highlighted.
- At the end of the contest round, the category in which the variance occurred will review this report. Those judges will together review their notes for that performance. All judges in that category then have the option of changing their scores (or not) for that song, now knowing how the rest of the panel scored that performance and why.
- If any score is changed within that category, the CA will make that adjustment and then the final results are issued. There will be no indications on the Official Scoring Summary or the Contestant Scoring Analysis that any change was made. After this process, results are final and official.
Why are we doing this? Three major reasons:
- If a judge makes a mistake, until now there has been no recourse. “Sorry, that’s the way the chips fell” should not be acceptable. I would hope that we all welcome a process whereby a judge ultimately gets it right.
- C&J does not want a judging error to be the reason a group fails to advance to the next round of competition. If 5 out of 6 judges say “yes” and one judge says “no,” that doesn’t mean the one judge was necessarily wrong; but if the score is way off, we can take the time to ensure it was the right decision.
- Even though human scoring/assessment is not a perfect science, C&J continually gets better. You have a high expectation that scores will be consistent. When they aren’t, it becomes a distraction and casts doubt over the process. The evaluation session is often clouded by attempts to explain wide variances in the scoring, especially when a variance cost the group advancement or placement.
What if there is variance among categories? Competitors and judges alike strongly believe that our three categories should be able to score independently of the others. That has not changed. If one category views a performance differently from another category, there is no variance as long as the judges’ scores are aligned within their own category.
How are variances determined? There are many statistical tests available. We have chosen the simplest of all, “Dixon’s Q Test,” for two reasons: (1) it is simple enough for everyone to understand; (2) if you got a 77, 68, 78, 77, 76, 77, the 68 truly stands out and creates concern, and this test catches exactly that. Other types of tests can be more obscure and we didn’t want this to be complicated.
What is the formula? Calculate the range (R) as the highest score minus the lowest score. Then calculate the distance (D) from the most extreme score (high or low) to the score nearest to it; if both ends stand apart, use the larger of the two gaps. Finally, calculate the ratio Q = D/R. If that ratio is “really large,” then the extreme score is a statistical outlier that creates a variance.
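For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the Q calculation described above. The function name `dixon_q` is made up for illustration and is not part of any Society scoring software. It assumes D is the larger of the two end gaps (lowest score to its neighbor, highest score to its neighbor).

```python
def dixon_q(scores):
    """Return (Q, extreme_score) for one song's panel scores.

    Q = D / R, where R is the range of the scores and D is the gap
    between the most extreme score and the score nearest to it.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    ordered = sorted(scores)
    score_range = ordered[-1] - ordered[0]       # R
    if score_range == 0:
        return 0.0, ordered[0]                   # all scores equal, nothing stands out
    low_gap = ordered[1] - ordered[0]            # lowest score to its neighbor
    high_gap = ordered[-1] - ordered[-2]         # highest score to its neighbor
    # Assumption: when the two gaps tie, this sketch reports the low score.
    if low_gap >= high_gap:
        return low_gap / score_range, ordered[0]
    return high_gap / score_range, ordered[-1]


# The six scores quoted earlier in this post.
q, extreme = dixon_q([77, 68, 78, 77, 76, 77])
print(q, extreme)  # prints 0.8 68
```

Whether that Q counts as “really large” depends on the critical value for the panel size, which the sketch deliberately leaves out.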
What is “really large”? It depends upon how many judges are on the panel and how confident you want to be that the score is truly an outlier and not the result of chance alone. After reviewing data and looking at the sensitivity of what we wanted to flag, we decided upon a 90% confidence level.
Total Panel Q (90%)
Then we added one more level. It is possible that 5 out of the 6 judges were extremely close (let’s say 71, 70, 71, 71, 70). A 73 would flag as an outlier in this example, but we would accept this sort of variability in scores. We implemented a rule that the difference between category judges had to be five (5) or more points before a variance would be generated.
The process in action
Let’s look at the data I provided earlier. Since I’m a Music judge, I’ll let Music have the variance and I’ll be that judge who gave the low score.
MUS = 77, 68
PRS = 78, 77
SNG = 76, 77
The total range (R) is 78-68 = 10. The largest distance (D) is 76-68 = 8. Q = 8/10 = 0.800. For a double panel (6 judges), the critical value is 0.560. Since Q = 0.800 is greater than our critical value of 0.560, we would conclude that my 68 is statistically low compared to the panel. The difference between my score and the other Music judge’s score is 5 or more, so this song would flag and a variance in the Music category would be created.
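The whole check, Q test plus the five-point rule, can be sketched in a few lines of Python. This is an illustration only, not the CA’s actual software. The names `dixon_q` and `find_variance` are hypothetical, and the critical value is passed in by the caller (0.560 is the double-panel value quoted above). The sketch also assumes at least two judges per category and measures the five-point rule against the nearest other score in the same category. With two judges per category that is simply the other judge’s score; for larger panels it is a guess.

```python
def dixon_q(scores):
    """Return (Q, extreme_score), with Q = D / R for one song's scores."""
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    ordered = sorted(scores)
    score_range = ordered[-1] - ordered[0]       # R
    if score_range == 0:
        return 0.0, ordered[0]
    low_gap = ordered[1] - ordered[0]
    high_gap = ordered[-1] - ordered[-2]
    if low_gap >= high_gap:
        return low_gap / score_range, ordered[0]
    return high_gap / score_range, ordered[-1]


def find_variance(by_category, critical_q, min_gap=5):
    """Return (category, score) if the song flags a variance, else None."""
    panel = [s for scores in by_category.values() for s in scores]
    q, extreme = dixon_q(panel)
    if q <= critical_q:
        return None                              # not a statistical outlier
    # Q > 0 means the extreme score is unique, so it sits in one category.
    for category, scores in by_category.items():
        if extreme in scores:
            others = list(scores)
            others.remove(extreme)
            if not others:
                raise ValueError("sketch needs two or more judges per category")
            # Assumption: compare against the nearest same-category score.
            gap = min(abs(extreme - s) for s in others)
            return (category, extreme) if gap >= min_gap else None
    return None


song = {"MUS": [77, 68], "PRS": [78, 77], "SNG": [76, 77]}
print(find_variance(song, critical_q=0.560))   # prints ('MUS', 68)

# The tight panel from earlier; the category assignment here is invented.
close = {"MUS": [71, 73], "PRS": [70, 71], "SNG": [71, 70]}
print(find_variance(close, critical_q=0.560))  # prints None
```

In the second case the 73 is a statistical outlier (Q is about 0.667), but it sits only two points from its category partner, so no variance is generated.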
At the end of the contest round, the CA would give the Music category a sheet with the competitor’s information and these scores. I would find my judging sheet for them as would my fellow Music judge. After reviewing my notes and discussing it with the other judge, I could stand by my original score, I could modify my score, the other Music judge could come down, or we could both move closer.
In many cases this spring, the judge who caused the variance moved his score closer to the other judge (and thus the panel). In a few cases, both category judges moved closer together to reflect a more consistent scoring assessment. For example, I might move to a 72 and the other Music judge could go to a 74. And in other cases, all judges stayed with their original scores.