The reliability of tests for sport-specific skill amongst elite youth rugby league players

Abstract
In rugby league, tests of sport-specific skill often involve subjective assessments of performance by observers of varying qualification. However, the reliability of such subjective assessments has yet to be investigated via appropriate statistical techniques. Therefore, the aims of the current study were to investigate: (1) the intra-observer reliability of a non-qualified observer (‘novice’) and (2) the inter-observer reliability of the three observers (two qualified ‘experts’ and one novice observer) in the assessment of catching, passing and tackling (stages 1 and 2) ability in elite adolescent rugby league players (age: 14.7±0.5 years). Players performed each skill element within a simulated practice drill and were assessed in ‘real time’ by the observers according to pre-defined criteria. An overall bias (P<0.05) was revealed between the observers in stage 1 of catching and stage 1 of passing, the differences being higher for the novice compared to both expert coaches for each stage of catching and the first stage of passing, and between expert 2 and the novice for stage 2 of tackling. No comparisons met the pre-determined analytical goal of ‘perfect agreement’, for any of the skill components. Comparisons between the expert observers did not reach perfect agreement, with the lowest values occurring for both tackling skill stages (60–65%). None of the tests employed were sufficiently reliable to potentially discern between players of differing ability, which may mean up to 56% of players' skill being misinterpreted. The credibility of such assessments should be questioned and alternative tests considered.


Introduction
In rugby league, tests of sport-specific skill have been used to differentiate between higher and lower playing standards in both adult (Gabbett, Jenkins & Abernethy, 2011; Gabbett, Kelly & Pezet, 2007) and junior players (Gabbett, Jenkins & Abernethy, 2010). Such tests are typically technique (process) rather than outcome based, involving subjective assessments relating to the quality of the performed skill, performed within simulated playing scenarios.
Such tests meet the 'open' nature of skill within the context of team sport, requiring players to execute the correct technique in a realistic playing environment (Ali, 2011). For example, studies have employed the use of highly qualified coaching staff (Australian Rugby League Level 3 coaching accreditation) to devise standard criteria for the assessment of fundamental game-related skills such as tackling, catching and receiving the ball during sport-specific conditioning practices or rugby league matches (Gabbett, 2008; Gabbett et al., 2007). The proficiency of players has been subsequently based upon a Likert scale rating provided by an observer (Gabbett et al., 2010; Gabbett, 2008; Gabbett et al., 2007). Such tests have grown in popularity since the assessment of technical skills performed within an open environment may offer a more realistic playing scenario in comparison to the closed skill testing often utilised in skill test batteries (Ali, 2011).
While process-driven measures ostensibly reveal a deeper dimension of technical ability (Williams & Reilly, 2000), such tests remain scientifically questionable owing to the subjective nature of their scoring or assessment, even amongst experienced and appropriately qualified coaches (i.e. Level 3 Rugby League Coaching Qualification). Although Gabbett et al. (2007) demonstrated a seemingly reliable testing procedure for the assessment of tackling an opponent and passing or receiving of the ball (ICC = 0.85 to 0.98 and CV = 5.1 to 5.3%), the degree of attainable reliability for observers without qualification or experience in the sport has been reported less favourably (CV = 4.7 to 8.7%; Gabbett, 2008). Indeed, assessments of this type lack procedural consistency between studies and often overlook the potential for perceptual (systematic) differences between experienced or non-experienced observers. The higher degree of experience and level of coaching qualification is broadly considered to support the credibility of the coach to discern between correctly or incorrectly executed sport-specific skills (Ste-Marie, 1999). Therefore, the recognition of, and differentiation between, systematic bias and random error are important for research of this type (see Atkinson & Nevill, 1998). However, assessment of relative reliability (instead of absolute reliability) or statistics that fail to quantify systematic bias and random error have been commonly applied to skill tests (Gabbett, 2008). Furthermore, the use of certain statistical procedures, such as the CV or, indeed, traditional parametric analyses such as 95% limits of agreement (LoA), to test for agreement between ordinal data sets (Likert scales), is also questionable (Cooper, Hughes, O'Donoghue & Nevill, 2007).
Parametric statistical tests are carried out on the assumption that the dependent variables follow a normal distribution (Atkinson & Nevill, 1998). However, ordinal data often follow a non-normal distribution and, accordingly, should be treated with non-parametric analyses (Cooper et al., 2007; Bland & Altman, 1999). Further considerations, such as the tolerable degree of error when using a 1 to 5 Likert scale (Gabbett et al., 2007), should be made in the context of previous findings. For example, previous research using Likert scales to discern between lower and higher ability players in the skills of catching, passing and tackling in rugby league players has demonstrated 'significant' differences equating to 0.5 and 0.6 on the Likert scale in adults (Gabbett et al., 2007) or 0.75 (mean difference over various skills) in junior players (Gabbett et al., 2010). Using a non-parametric reliability analysis on such ordinal data sets will quantify the repeatability of assessments ranging from one to five (without decimals). Therefore, notwithstanding the erroneous presentation of unattainable mean scores in previous studies, it is clear that to recognize such minor differences below the score of 1, the observer must achieve a 'perfect' agreement (i.e. a difference of less than 1). As a result, any error in subjective assessment would be intolerable. In this context, the a priori analytical goal of any researcher attempting to administer such tests in order to discern between playing standards in rugby league players should be to achieve a high proportion of perfect agreement. For this reason, and those highlighted above, the credibility of subjectively scored tests in rugby league motor skill tests remains to be established, thus limiting the application of such tests for talent identification purposes.
Accordingly, the aims of the current study were to investigate: (i) the intra-observer reliability of a non-qualified observer ('novice'), and (ii) the inter-observer reliability of the three observers (two qualified 'experts' and one novice observer) in the assessment of catching, passing and tackling (stages 1 and 2) ability in elite adolescent rugby league players.

Participants
Twenty elite youth male rugby league players (8 forwards, 6 backs & 6 adjustables; King, Hume & Clark, 2010) contracted to a professional club in the North West of England volunteered to participate in the study (age: 14.7 ± 0.5 years; body mass: 72.8 ± 10.7 kg; stature: 176.5 ± 6.5 cm). All participants were asked not to exercise on the day of testing and to follow their normal dietary guidelines. Each player and the coaches were familiar with the testing protocols (see Procedures section) via their usual training practices and had 7.2 ± 1.2 years of formal playing experience, defined as a minimum of one training session and weekend match with a rugby league club. Consent was obtained from the players and their parents/guardians and approval for the study was granted by the Faculty of Applied Sciences Ethics Committee.

Skill simulation
All testing procedures took place outdoors on a grass training pitch under dry, mild weather conditions, over a period of one-to-two hours on the same day. Using examples from previous research (Gabbett et al., 2010; Gabbett, 2008), a simulated sport-specific match scenario was devised and implemented (as shown in Figure 1). The skills of passing, tackling and catching were selected since they represent fundamental game skills in rugby league that are performed by all players (see Sirotic, Coutts, Knowles & Catterick, 2009). The players performed a 'warm-up' (in groups of three) led by the club coach, consisting of moderate intensity running and upper and lower body dynamic stretching exercises, immediately prior to the skills tests.
The players were randomly selected to complete the test as either one of two attacking players who retained possession of the ball or a defensive player. Set within a 10 x 10 m grid, attacking players (ball carriers) were required to advance from one side of the grid to the other and complete one pass each before being tackled by the defending player. After one cycle of this protocol, the players were instructed to wait for a brief recovery period (remaining on their feet) at the opposite end of the grid before repeating the drill in the opposite direction. The test was designed to obligate catching, passing and tackling from both the player's left and right hand sides. If an action was performed that was deemed to be outside of the skills being assessed, such as an incorrect sequence of passing, the players were allowed to re-start the trial. To avoid such issues, demonstrations from qualified coaches (Level 3 Rugby League Coaching Qualification, UK) were performed prior to the testing procedures in order to enhance players' understanding of the test and to provide them with a reference for the required match-like intensity. The practice was continued until the coaches notified the researcher that they had completed their assessment (see criteria below), which lasted between four and six repetitions for each trial (~ 2 min). Once the observer had provided a score out of five for each of the three skill components, the players were required to exchange roles, with one player per drill under assessment. Once the first set of players had completed their rotation as the tackler, the next group of three commenced an identical testing procedure. A camera (Canon MV 700i, 50 Hz, Japan) was set up approximately at eye level of the coaches in a static position at a distance of 15 m from one end of the grid and used to film all proceedings (Figure 1). This was later used by the novice observer for technical skill assessments (see following sections). 
***************************** Figure 1 here ********************************

Skill assessment
The skills of the players were assessed by two expert coaches with 10 and 15 years of coaching experience, respectively, using set criteria (Table 1) previously established by Gabbett et al. (2007). The aim for the observer was to rate the players (in real time) on their overall proficiency in each skill using a Likert scale ranging from one to five, with five representing an optimal score and one representing the lowest score possible. The expert and novice observers were provided with the assessment criteria one week prior to the testing, and subsequently given explicit instruction to refer to the criteria during the testing procedures.
For consistency, the expert observers were positioned equidistant either side of the camera, enabling a similar perspective of the players. Neither observer was made aware of the other's scores. The inclusion of a novice observer (who had watched rugby league for the previous two seasons but held no coaching qualification) enabled a comparison with the expert assessors. To be consistent with the analyses of the experts, the novice observer was required to analyze, continuously, the video footage (without slowing or re-watching the footage) of the players' performances using the set criteria, albeit post-event. In order to evaluate the consistency of his subjective assessments (intra-observer reliability), he was required to repeat this task a week later. Following the recommendation of Gabbett et al. (2007), two stages of assessment (approach, Stage 1; execution, Stage 2) were included for each skill, yielding two scores per skill performed.

******************************* Table 1 here ********************************

Statistical analyses
The distributions of the six skill elements (approach and execution of catching, passing and tackling) were initially checked for normality using the Shapiro-Wilk test and, where violations were observed (P < 0.05), non-parametric Kruskal-Wallis tests were applied to test for differences between observers (expert 1, expert 2 and the novice). Post-hoc Mann-Whitney U tests were used for pairwise comparisons between each of the observers. The presence of bias between the test and re-test trials of the novice observer was checked via a median sign test. Owing to the multiple (six) comparisons made between each pair of observers (novice, expert 1 and expert 2), the Benjamini-Hochberg False Discovery Rate (FDR) technique was applied to control for the potential increase in the type I error rate. The technique involves, firstly, ranking the P-values obtained from a series of multiple comparison tests performed under a shared hypothesis from smallest to largest, p(1) ≤ p(2) ≤ … ≤ p(n) (six comparisons between each observing pair in the current case). The FDR threshold for each P-value is derived from the formula (k/n) × α, where k = rank, α = alpha level (0.05) and n = number of tests. Beginning with the largest P-value (step-up), each original P-value is compared to its threshold (i.e. p(k) is compared to (k/n) × α). At the first point at which p(k) ≤ (k/n) × α, the null hypothesis is rejected for that comparison and for every comparison with a smaller P-value (Benjamini & Hochberg, 1995). The degree of random variation between or within observers was evaluated using the non-parametric technique advocated by Cooper et al. (2007). This technique involved calculating the percentage of agreement and associated 95% confidence intervals (CIs) between or within observers inside a 'practically important' reference value (Nevill, Lane, Kilgour, Bowes & Whyte, 2001). As established above, a reference value of perfect agreement (zero difference between observations) was deemed as 'practically important' for each type of skill assessed.
A secondary reference value of ± 1 (a difference of one in either direction) was also set in order to demonstrate the portion of agreement between observers in the presence of the smallest possible error that can be made on the 1-5 Likert scale. Additionally, the coefficient of variation (CV) was calculated to enable comparisons with the findings of previous research.
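The step-up FDR procedure described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used in the study: the function name `benjamini_hochberg` and the six example P-values are hypothetical and were chosen simply to show how a comparison can be rejected even when smaller P-values miss their own thresholds.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per P-value: True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k with p(k) <= (k / n) * alpha,
    then reject the hypotheses holding ranks 1..k.
    """
    n = len(p_values)
    # Indices of the P-values, ordered from smallest to largest P-value.
    order = sorted(range(n), key=lambda i: p_values[i])

    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * alpha:
            largest_k = rank  # keep the largest rank that meets its threshold

    rejected = [False] * n
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical P-values for six comparisons between one pair of observers.
example_p_values = [0.010, 0.020, 0.030, 0.031, 0.500, 0.900]
print(benjamini_hochberg(example_p_values))
```

In this hypothetical example the three smallest P-values each exceed their own thresholds, but the fourth (0.031) falls below (4/6) × 0.05, so the first four comparisons are all rejected. This is the behaviour that distinguishes the step-up rule from a simple rank-by-rank check.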

Results
Systematic bias was not present between expert observers and, whilst there were no instances of 100% perfect agreement between them, agreement ranged from 75% to 90% in all passing and catching skills. However, for the tackling skills the agreement was notably lower (60% to 65%). Nonetheless, all the CVs for the three skills were below 10% (1.6% to 8.1%). For the novice, intra-observer analysis revealed no overall difference (P > 0.05) between any scores, and the levels of agreement were in the range 70% to 85%. CVs were below the 10% threshold for all scores, ranging from 3.4% to 6.0%.
Based upon the less stringent analytical goal of plus or minus '1' on the Likert scale, better agreement was achieved for all comparisons. For example, Table 3 shows that, between expert observers and for the intra-observer reliability of the novice observer, agreement was 100% in all but one comparison (tackling stage two for expert 1 to expert 2). Expert versus novice agreement remained sub-optimal, though it was as high as 95% for most of the scores.
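The agreement percentages reported here rest on counting the paired ratings that fall within a chosen reference value (0 for perfect agreement, 1 for plus or minus one). A minimal Python sketch of that calculation follows. The ratings are hypothetical, the function names are illustrative, and the Wilson score interval is an assumption: the interval method of Cooper et al. (2007) is not described in this paper, so the sketch should not be expected to reproduce the confidence intervals reported in the study.

```python
import math


def agreement_within(ratings_a, ratings_b, tolerance=0):
    """Proportion of paired ratings that differ by no more than `tolerance`."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(ratings_a, ratings_b))
    return hits / len(ratings_a)


def wilson_interval(proportion, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (assumed method)."""
    denom = 1 + z ** 2 / n
    centre = (proportion + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(proportion * (1 - proportion) / n
                         + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical 1-5 Likert ratings from two observers for 20 players.
expert_1 = [4, 3, 5, 4, 3, 4, 2, 5, 4, 3, 4, 4, 3, 5, 4, 3, 2, 4, 3, 4]
expert_2 = [4, 3, 4, 4, 3, 4, 3, 5, 4, 3, 4, 5, 3, 5, 4, 3, 2, 3, 3, 5]

for tolerance in (0, 1):
    p = agreement_within(expert_1, expert_2, tolerance)
    low, high = wilson_interval(p, len(expert_1))
    print(f"within ±{tolerance}: {p:.0%} agreement, "
          f"95% CI {low:.1%} to {high:.1%}")
```

In these hypothetical data, 15 of the 20 pairs match exactly (75% perfect agreement), while every pair falls within plus or minus one (100%). The example illustrates how relaxing the reference value raises the apparent agreement without changing the underlying ratings.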

Discussion
It was the analytical goal of the present study to obtain 'perfect agreement' between expert observers in order to meet the requirements outlined in previous research (i.e. a difference of less than '1' on the Likert scale). As it emerged, in no case was 100% perfect agreement obtained between the expert observers and, given the width of the 95% confidence intervals (approximately 44% to 100% for catching and passing skills), it is likely that some talented players could be incorrectly appraised using such tests, which may contribute to the coaches' misinterpretation of their playing ability. That is, in the skills of passing and catching, the 'population' agreement between experts could be as high as 100% or as low as 44%, rendering the potential for disagreement and performance misinterpretation to be as high as 56%.
It is noteworthy that, for the same data, the CV ranged from 2.8% to 8.1%, which is less than the magnitude often deemed 'reliable' (< 10%; Atkinson & Nevill, 1998) and similar to previous research in rugby league (Gabbett et al., 2007). Given that, in the context of talent identification, it is typically expert coaches who are responsible for discerning between players showing signs of higher or lower ability, and that previous reports in rugby league have failed to establish the inter-observer reliability between expert observers via the correct statistical approach, the general application of subjective rating systems across different expert users has to be questioned.
The current results should be interpreted on behalf of the broader rugby league community in accordance with the tolerable degree of error. That is, those charged with identifying talented players based, in part, upon the construct of sport-specific skill measured in such a way are required to consider what degree of error is acceptable. For example, if a tolerance of plus or minus one on a scale of one to five is deemed satisfactory, then the current data would indicate a much better level of agreement between expert observers than if zero difference reference value was adopted. However, in the context of talent identification, this parity between observers does not support the worthiness of the test for correctly interpreting skilled performance in higher ability players. Rather, the probability of misinterpreting (falling within plus or minus one) the quality of sport-specific skill is reinforced.
The limited agreement between experts was also exacerbated within both stages of tackling. The reliability of the assessment of tackling was the poorest between experts, with a perfect agreement as low as 50%. Such poor agreement may relate to the open nature of the skill in which a simulated collision between two participants induces a less predictable environment in which to base judgements of technical performance. Indeed, previous analyses have assessed such skills within the open match environment (Gabbett et al., 2007), in which a stability of the set criteria, such as the upper and lower body position, cannot be expected.
Moreover, it could be argued that the set criteria will vary according to the context in which the tackle is performed, such as side-on and chasing tackles. In addition, research has shown that only 17% of tackles in rugby league are performed in a one-on-one scenario, with players often tackling in conjunction with other team-mates (King et al., 2010). Such findings support our previous assertion regarding the situational inconsistencies during match time, adding further complication to the assessment of tackling technique. Whilst these suggestions detract from the potential reliability of tackling analysis, the intention of previous researchers to enhance the ecological validity of skill testing should be recognized. Given the current findings and the general disparity between both experts, it remains unclear exactly what criteria expert observers are basing their judgements on. Indeed, it would be useful to evaluate the intra-observer reliability of expert observers' ratings, with and without the use of the set criteria.
The ratings of the experts were found to be systematically higher (P < 0.05) than those of the novice observer in the skills of catching (all stages) and passing stage 1. Such results fundamentally question the validity of the rugby league tests for motor skill ability in the hands of an inexperienced observer and suggest that it would be inappropriate to use the assessments of novice or expert observers interchangeably. Indeed, Gabbett (2008) has discussed the results of previous studies that have used either a novice or an expert observer without consideration of the potential differences in interpretation. In relation to the analytical goal of perfect agreement, the level of agreement between the scores of the expert coaches and the novice was as low as 30%, with associated CIs ranging from 19.5% to 46.8%. Furthermore, the largest perfect agreement was 65% and in no case did the comparisons between the novice and expert observers indicate the potential (via CIs) for 100% agreement. If it is the intention of future research to compare findings between different studies, then an a priori evaluation similar in nature to the current study should be undertaken in order to establish the reliability of the observer.
The differences found in the present study between novice and expert observers may be owing to the inconsistent use of the set criteria for skill assessment. It has been suggested that inexperienced observers over-rely on operational definitions whilst assessing technical actions during match play (O'Donoghue, 2007). In contrast, an expert observer may choose to underpin interpretations of performance with previously acquired tacit coaching knowledge, using definitions as a vague guide rather than to strictly inform assessment, even when instructed otherwise (O'Donoghue, 2007). Although such reasoning may partly explain the disparity between expert and novice coaches, it is reasonable to question the necessity of 'set criteria', particularly for the expert coaches, if it fails to inform the resultant assessment.
However, in the present study the novice observer demonstrated no systematic bias and perfect agreement ranging from 70% to 85% between repeated trials, which may support the utility of set criteria since this alone guided the interpretation of skill in the absence of sport-specific knowledge. It is therefore apparent that the set criteria may be used differently depending upon the user's prior experience of the sport. Consequently, it can only be assumed that the exact construct of skill being assessed will vary between users with more or less experience.
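The check for systematic bias between the novice's repeated trials was made with a sign test. A minimal sketch of a two-sided exact sign test on paired ordinal ratings is given below. It assumes the exact binomial form of the test with tied pairs discarded, which may differ from the implementation used in the study, and the function name `sign_test` and the example ratings are hypothetical.

```python
from math import comb


def sign_test(trial_1, trial_2):
    """Two-sided exact sign test for paired ratings; tied pairs are dropped."""
    diffs = [b - a for a, b in zip(trial_1, trial_2) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0  # no non-tied pairs, so no evidence of bias
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical test and re-test ratings from one observer for ten players.
week_1 = [4, 3, 5, 4, 3, 4, 2, 5, 4, 3]
week_2 = [4, 4, 5, 3, 3, 4, 3, 5, 4, 2]
print(sign_test(week_1, week_2))
```

Because tied pairs carry no information about the direction of any bias, only the ratings that changed between trials enter the test. With few non-tied pairs the test therefore has little power, which is one reason to report agreement percentages alongside it.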

Conclusion
The current analysis has raised a general concern over the use of subjective ratings of rugby league skill in their current form and highlighted potential issues with the application of set skill criteria in relation to the 1 to 5 Likert scale ratings. Collectively, the inter-observer trials have shown that the application of a Likert scale cannot be used reliably to obtain a perfect agreement, most likely reflecting the subjectivity of the observers. This finding was supported by the novice's higher level of reliability demonstrated over the two repeated trials.
Furthermore, it is clear that some skills, such as tackling, are inherently more difficult to assess reliably than others, perhaps owing to the open nature of the assessment method. If sport-specific skill is an underlying facet of talented performance, capable of discerning between the elite or sub-elite players, then a test based upon an objective outcome may provide a more suitable measure. However, whilst such tests offer greater control over the performed skill, a sacrifice in ecological validity is inevitable.

Table 3. Comparisons of the inter- and intra-observer reliability of expert and novice rugby league practitioners.
Note: * = significantly larger for the expert observer based on pairwise comparisons (n = 20). Benjamini-Hochberg adjusted alpha levels.