23  Evaluating Reliability and Validity of Selection Models

23.1 Why Reliability and Validity Matter

A selection score that is unreliable cannot be valid; a selection score that is reliable can still be invalid. Both have to be tested before either can be trusted.

The previous chapter introduced the catalogue of selection methods and their evidential strength. This chapter goes one level deeper into the two technical disciplines that determine whether any specific selection model in any specific organisation actually delivers what the catalogue promises: reliability and validity. The two concepts are often conflated in HR conversations, and the conflation matters. Reliability is the consistency with which a selection model produces a score for the same candidate under similar conditions. Validity is the degree to which the score supports the inferences and decisions the firm wants to take from it. A model that is unreliable cannot be valid; a model that is reliable can still be invalid for the use the firm has in mind.

The standards that govern this work are rigorous and well documented. As the American Educational Research Association et al. (2014) Standards for Educational and Psychological Testing set out across multiple editions, the credible evaluation of a selection model is a programme rather than a single study. It accumulates evidence across reliability coefficients, validation studies, fairness analyses, and operational monitoring, and it documents that evidence in a form that can be defended internally and externally. The discipline is exacting, but it is also the only discipline that lets the firm distinguish a model that works from a model that simply produces numbers.

The conceptual frame for validity is the unitary view set out by Samuel Messick (1995) in his foundational work. Validity is not a property of the test; it is the degree to which the inferences drawn from the test scores can be defended for the use the firm wants to make of them. A general-mental-ability test can be validly used to predict job performance and invalidly used to assess interpersonal skill, even though the score is the same in both cases. The dashboard’s job is to render the supported uses visibly and to constrain the model’s use to those that the evidence covers.

The visualisation lens is what makes the evidence audience-readable. Reliability is rendered as a coefficient with a confidence interval. Validity is rendered as a study-level chart with comparison built in, stratified by job family and by demographic group, with sample sizes disclosed. Fairness analyses are rendered as subgroup comparisons. Operational monitoring is rendered as a longitudinal trend. A page that surfaces all four for each selection model is a page that lets the function defend the model in a meeting where every participant is sceptical, which is the only meeting that ultimately matters.

Tip: The reliability-and-validity contract
  1. Every selection model on the dashboard is paired with its reliability coefficient and a defended validation study, surfaced as part of the page rather than buried in a methodology document.
  2. Validity claims are constrained to the inferences the evidence supports. A model validated for one job family does not earn a place on the dashboard for another without further work.
  3. Reliability and validity are monitored longitudinally. A coefficient computed once at deployment is not enough; the dashboard tracks drift over time and prompts re-validation when the drift exceeds an agreed threshold.

23.2 Reliability

Reliability is the consistency with which a selection model produces a score. A perfectly reliable score will be the same on Monday and Tuesday, with one rater and another, on one form of the test and a parallel form. Reliability is necessary but not sufficient for validity, and a function that has not measured reliability cannot defensibly claim that the model works.

Tip: Four Forms of Reliability Evidence

| Form | What it captures | Typical estimate |
|---|---|---|
| Internal consistency | Items in the test agree with one another | Cronbach’s alpha or McDonald’s omega |
| Test-retest | Same candidate produces a similar score across occasions | Correlation between scores at two time points |
| Inter-rater | Different raters give similar ratings to the same candidate | Intraclass correlation or Cohen’s kappa |
| Parallel-forms | Different forms of the test produce similar scores | Correlation between forms |
Tip: Reading reliability coefficients

Reliability coefficients run from zero to one, and the interpretation depends on the use of the model. For high-stakes selection decisions in standardised tests, coefficients above 0.85 are conventionally expected; for structured-interview ratings, 0.70 is more typical and acceptable; for novel multi-rater instruments, even lower coefficients can support useful decisions if the evidence is honestly disclosed. The dashboard names the coefficient, the threshold the firm has chosen, and the sample size the coefficient was computed on, so that the audience reads the reliability at the strength the data supports.
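The most common internal-consistency estimate, Cronbach’s alpha, is straightforward to compute once responses are arranged as a candidates-by-items matrix. A minimal sketch with simulated data (the item matrix below is illustrative, not from any real instrument):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability for an item-response matrix
    (rows = candidates, columns = test items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: 200 candidates, 10 items all driven by one latent trait,
# so the items should agree with one another and alpha should be high.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(scale=1.0, size=(200, 10))
alpha = cronbach_alpha(items)   # close to the theoretical value of about 0.91
```

The same matrix layout feeds McDonald’s omega if a factor model is fitted, but alpha is the conventional starting point that the table above names.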

23.3 Validity: The Unitary View

Validity, in the modern unitary view, is a single concept supported by multiple lines of evidence. A function that talks about content validity, construct validity, and criterion validity as separate properties has not yet adopted the unitary frame. Validity is one question — can the inferences from this score be defended for this use — and the evidence that supports it comes from several methods.

Tip: Five Lines of Validity Evidence

| Line of evidence | What it shows | When it is most needed |
|---|---|---|
| Content evidence | The test items represent the job-relevant domain | When the inference is about job-related knowledge or behaviour |
| Construct evidence | The score correlates with theoretically related and unrelated constructs | When the score is used to infer a psychological construct |
| Criterion evidence | The score correlates with later job-performance outcomes | When the score is used to predict performance |
| Consequential evidence | The use of the score has acceptable individual and group consequences | When the score has high-stakes consequences |
| Generalisation evidence | The validity holds across job families, settings, and time | When the model is used beyond its original validation context |
Tip: The validity argument

A defended validity claim is an argument supported by multiple lines of evidence, each appropriate to the inference being defended. As Samuel Messick (1995) argued, the validity argument is not a single coefficient. It is a structured chain — premise, evidence, conclusion — that the firm can audit and update. The dashboard surfaces the argument as a small visual: the inference being defended, the lines of evidence that support it, and the strength of each line. The audience reads the validity at the strength the evidence supports, rather than at the strength the marketing of the test suggested.

23.4 Validation Study Design

Three study designs are used to test the criterion validity of selection models. They differ in their evidential strength, in their feasibility, and in the conditions under which they apply. A function that knows the differences can choose the design that fits the model and the role.

Tip: Three Validation Designs

flowchart LR
  A[Predictive<br/>score now,<br/>outcome later] --> Z[Validity Coefficient<br/>with confidence interval]
  B[Concurrent<br/>score and outcome<br/>at the same time] --> Z
  C[Synthetic<br/>component validities<br/>combined for a new role] --> Z
  Z --> Y[Decision<br/>at the strength<br/>the evidence supports]
  style A fill:#E8F0FE,stroke:#1A73E8
  style B fill:#FEF7E0,stroke:#F9AB00
  style C fill:#E6F4EA,stroke:#137333
  style Z fill:#FCE8E6,stroke:#C5221F
  style Y fill:#F3E8FD,stroke:#8430CE

The predictive design administers the selection model to candidates, hires regardless of the score, and waits to compare scores with later performance — the most rigorous design, the slowest to produce evidence, and the costliest. The concurrent design administers the model to current incumbents and compares scores with current performance — faster, cheaper, and weaker because incumbents are not the same as candidates. The synthetic design combines the validities of components that have been studied separately to support a new role’s selection model — useful when bespoke validation is infeasible. The dashboard names the design that produced each validity coefficient.

Tip: Range restriction and other corrections

Validity coefficients are usually attenuated by features of the data: range restriction (only hired candidates contribute outcomes), criterion unreliability (the performance measure is itself imperfect), and measurement error in the predictor. Standard corrections exist and are documented in the Standards. The dashboard surfaces both the uncorrected coefficient and the corrected one, with the corrections labelled, so that the audience can see the full picture rather than only the headline number. Honest correction strengthens credibility; hidden correction undermines it.
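The standard correction for direct range restriction on the predictor (Thorndike’s Case II) is compact enough to sketch. The numbers in the example are illustrative, not results from any real study:

```python
import math

def correct_range_restriction(r_obs: float, sd_unrestricted: float,
                              sd_restricted: float) -> float:
    """Thorndike Case II correction for direct range restriction on the
    predictor: estimate the correlation in the unrestricted (applicant)
    population from the correlation observed among hires."""
    u = sd_unrestricted / sd_restricted
    return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

# Example: observed validity of .25 among hires, where the applicant-pool
# standard deviation is twice the hired-pool standard deviation.
r_corrected = correct_range_restriction(0.25, sd_unrestricted=10.0,
                                        sd_restricted=5.0)   # roughly 0.46
```

Criterion unreliability is corrected separately (dividing by the square root of the criterion’s reliability); the dashboard labels each correction, consistent with the disclosure principle above.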

23.5 Operational Monitoring

A validation study is a snapshot. Reliability and validity drift as candidate populations change, role requirements evolve, and the test itself ages. Operational monitoring is the discipline that catches drift before it becomes a credibility crisis. The dashboard’s monitoring view is not optional; it is what converts a one-time validation study into an ongoing programme.

Tip: What Monitoring Tracks

| Signal | What it shows | When it triggers action |
|---|---|---|
| Reliability drift | Internal consistency or inter-rater agreement falling | When the coefficient falls below the agreed threshold |
| Score-distribution shift | The distribution of candidate scores changing | When the shift is large enough to question the cut-score |
| Criterion-validity decay | The relationship to performance weakening | When the coefficient falls below the validation threshold |
| Subgroup-difference change | Adverse impact ratios moving | When the ratio falls outside the agreed band |
| Use-case expansion | The model used for inferences beyond the validated scope | Always; the dashboard prevents scope creep |
Tip: Triggered re-validation

A monitoring system is only useful if it triggers re-validation when the data warrants. The dashboard names the threshold for each signal, surfaces the signal cycle by cycle, and highlights when the threshold is crossed. Triggered re-validation is more efficient than scheduled re-validation: it responds to evidence rather than to the calendar, and it concentrates the function’s analytical effort where the model is actually drifting.
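The trigger logic behind such a view is simple to implement. A minimal sketch, with illustrative signal names and thresholds standing in for the firm’s agreed values:

```python
from dataclasses import dataclass

@dataclass
class MonitoringSignal:
    name: str
    value: float        # latest observed value of the signal
    threshold: float    # agreed re-validation threshold
    below_is_bad: bool  # True if falling below the threshold triggers action

    def triggered(self) -> bool:
        """Return True when the signal has crossed its threshold."""
        if self.below_is_bad:
            return self.value < self.threshold
        return self.value > self.threshold

# Illustrative cycle: one signal has drifted past its threshold.
signals = [
    MonitoringSignal("inter-rater reliability", 0.66, 0.70, below_is_bad=True),
    MonitoringSignal("criterion validity",      0.31, 0.25, below_is_bad=True),
    MonitoringSignal("adverse impact ratio",    0.83, 0.80, below_is_bad=True),
]
flagged = [s.name for s in signals if s.triggered()]
# flagged contains only "inter-rater reliability"
```

In practice the same check runs cycle by cycle against the dashboard’s data model; the point of the sketch is that triggered re-validation is a comparison against a named threshold, not a judgment call made in the meeting.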

23.6 Visualising Reliability and Validity

The dashboard that surfaces reliability and validity for the firm’s selection models has to do four things at once: name the coefficient, label the design, surface the validation study, and track the drift. Five design choices, applied consistently, hold the page together for an audience that is not made up of psychometricians.

Tip: Five Design Choices for the Reliability-and-Validity Dashboard

| Choice | What it does on the page |
|---|---|
| Coefficient with interval | Every coefficient is rendered with its confidence interval |
| Design label | Every validation result names the design that produced it |
| Subgroup panel | Each model has a subgroup panel for fairness evidence |
| Drift trend | The page shows the coefficient over multiple cycles |
| Use-case scope | The validated scope of each model is named on the page |
Tip: The dashboard as a defence document

A reliability-and-validity dashboard is, in practice, a living defence document. When a regulator audits the selection programme, when a candidate challenges a decision, when an executive committee asks whether a tool is doing what it claims, the dashboard is what answers. As the American Educational Research Association et al. (2014) Standards emphasise, the credibility of a selection programme rests on the cumulative evidence the firm can produce on demand, and the dashboard is the surface on which that evidence is most readable. Build the surface for the moment of audit, and it serves the daily work as well.

23.7 Hands-On Exercise: Computing Reliability and Validity for a Selection Model

Note: Aim, Scenario, Dataset, Deliverable

Aim. Compute the reliability of a structured selection assessment and the criterion validity of the selection scores against later performance, and render the evidence on a Power BI page that satisfies the reliability-and-validity contract.

Scenario. You are evaluating a structured-interview selection model used by Yuvijen Telecom for hiring frontline service-engineering staff. The model has been used for two years; you now have selection scores from two raters, plus performance ratings six months after each hire, for a sample of two hundred hires.

Dataset. A synthetic dataset you will build in Excel using the structure below. Generate values in a workbook named Yuvijen-Selection-Validation.xlsx with the following columns and the formulas indicated.

| Column | Type | Generation formula |
|---|---|---|
| CandidateID | Integer | Sequence 1 to 200 |
| Rater1Score | Integer (40 to 90) | =RANDBETWEEN(40, 90) |
| Rater2Score | Integer | =Rater1Score + RANDBETWEEN(-8, 8) (correlated by design) |
| AverageSelectionScore | Decimal | =AVERAGE(Rater1Score, Rater2Score) |
| Hired | Yes/No | =IF(AverageSelectionScore>=65, "Yes", "No") |
| PerformanceRating | Integer (1 to 5) | =IF(Hired="Yes", MAX(1, MIN(5, ROUND(AverageSelectionScore/20 + RANDBETWEEN(-1,1), 0))), "") |
| Gender | Female/Male | =IF(RAND()<0.45, "Female", "Male") |

The correlated noise on Rater2Score and the dependency of PerformanceRating on AverageSelectionScore generate a defensible reliability and a moderate criterion validity for the lab.

Deliverable. The Yuvijen-Selection-Validation.xlsx workbook with reliability and validity calculations, plus a Selection-Validity.pbix Power BI file with the evidence page described below.

23.7.1 Step 1 — Generate the synthetic dataset

Open a new workbook, create the seven columns above, and fill the first row with the formulas. Drag down to two hundred rows. Paste the values back over the formulas (Copy > Paste Special > Values) so the dataset is fixed and your subsequent computations remain stable. Convert the range to a Table named Selection.
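If you prefer to script the lab data rather than drag formulas, the same specification can be mirrored outside Excel. A convenience sketch in Python (it assumes numpy and pandas are installed; the Excel workbook remains the deliverable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

rater1 = rng.integers(40, 91, size=n)            # RANDBETWEEN(40, 90)
rater2 = rater1 + rng.integers(-8, 9, size=n)    # correlated by design
avg = (rater1 + rater2) / 2                      # AverageSelectionScore
hired = avg >= 65                                # Hired cut-score
noise = rng.integers(-1, 2, size=n)              # RANDBETWEEN(-1, 1)
# PerformanceRating only for hires, clamped to the 1-to-5 scale;
# non-hires get NaN, mirroring the blank cell in the workbook.
performance = np.where(hired, np.clip(np.round(avg / 20 + noise), 1, 5), np.nan)
gender = np.where(rng.random(n) < 0.45, "Female", "Male")

selection = pd.DataFrame({
    "CandidateID": np.arange(1, n + 1),
    "Rater1Score": rater1,
    "Rater2Score": rater2,
    "AverageSelectionScore": avg,
    "Hired": np.where(hired, "Yes", "No"),
    "PerformanceRating": performance,
    "Gender": gender,
})
```

Fixing the generator seed plays the same role as pasting values over the formulas: the dataset stays stable across the later computations.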

23.7.2 Step 2 — Compute inter-rater reliability

Use the Pearson correlation between Rater1Score and Rater2Score as a working inter-rater agreement measure.

Excel formula:
Inter-Rater r = CORREL(Selection[Rater1Score], Selection[Rater2Score])

With the generation formulas in Step 1 the expected value is high, around 0.95, because the ±8 rater noise is small relative to the 40-to-90 base range; widen the noise term if you want to rehearse a borderline coefficient. Document the threshold the firm has chosen (for example, 0.70 for structured interviews) on a Definition sheet.

23.7.3 Step 3 — Compute criterion validity

Use the Pearson correlation between AverageSelectionScore and PerformanceRating, restricted to hired candidates only.

Excel formula:
Validity Coefficient = CORREL(
    FILTER(Selection[AverageSelectionScore], Selection[Hired]="Yes"),
    FILTER(Selection[PerformanceRating], Selection[Hired]="Yes")
)

Compute the ninety-five per cent confidence interval using Fisher’s z-transformation on a Validity sheet.
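The Fisher interval is worth setting up once as a helper. A minimal Python sketch (the r and n values in the example are placeholders, not results from the lab dataset):

```python
import math

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a correlation via Fisher's z-transformation."""
    z = math.atanh(r)               # Fisher z = 0.5 * ln((1+r)/(1-r))
    se = 1 / math.sqrt(n - 3)       # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)   # transform back to the r scale

lo, hi = fisher_ci(0.30, n=120)   # e.g. a validity of .30 computed on 120 hires
```

The same arithmetic works on the Validity sheet with Excel’s ATANH, TANH, and SQRT functions.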

23.7.4 Step 4 — Apply range-restriction correction

Compute the unrestricted standard deviation of AverageSelectionScore (across all candidates) and the restricted standard deviation (across hired candidates only). Apply the standard correction.

Excel formula:
SD_Unrestricted = STDEV(Selection[AverageSelectionScore])
SD_Restricted   = STDEV(FILTER(Selection[AverageSelectionScore], Selection[Hired]="Yes"))
Corrected r     = r_obs * (SD_Unrestricted / SD_Restricted)
                / SQRT(1 + r_obs^2 * ((SD_Unrestricted/SD_Restricted)^2 - 1))

(FILTER, already used in Step 3, restricts the standard deviation to hired candidates; Excel has no conditional STDEV function.)

Document the correction openly on the Validity sheet so both the corrected and uncorrected coefficients are visible.

23.7.5 Step 5 — Compute subgroup validity

Compute the validity coefficient separately for the Female and Male subgroups. The two coefficients should be similar; differences greater than 0.1 raise a fairness flag.
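The subgroup computation generalises to any grouping column. A minimal Python sketch with the 0.1 flag gap wired in as a default (the arrays in the example are toy values chosen to make the flag behaviour visible):

```python
import numpy as np

def subgroup_validity(scores, outcomes, groups, flag_gap=0.1):
    """Validity coefficient per subgroup, plus a fairness flag raised when
    the gap between the highest and lowest coefficients exceeds flag_gap."""
    scores, outcomes, groups = map(np.asarray, (scores, outcomes, groups))
    coeffs = {}
    for g in np.unique(groups):
        mask = groups == g
        coeffs[str(g)] = float(np.corrcoef(scores[mask], outcomes[mask])[0, 1])
    values = list(coeffs.values())
    flagged = (max(values) - min(values)) > flag_gap
    return coeffs, flagged

# Toy example: the score-outcome relationship is identical in both
# subgroups, so the coefficients match and no flag is raised.
scores   = [1, 2, 3, 4, 1, 2, 3, 4]
outcomes = [1, 2, 3, 4, 1, 2, 3, 4]
groups   = ["Female"] * 4 + ["Male"] * 4
coeffs, flagged = subgroup_validity(scores, outcomes, groups)
```

Reversing the relationship in one subgroup widens the gap past 0.1 and raises the flag, which is exactly the condition the fairness panel surfaces.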

23.7.6 Step 6 — Promote to Power BI

Open Power BI Desktop and load the Selection table. Build the inter-rater agreement, validity coefficient, corrected validity, and subgroup validity as DAX measures.

23.7.7 Step 7 — Build the validity-argument page

Lay out the page with five regions:

  • A coefficient-and-interval row showing inter-rater agreement, validity, and corrected validity, each with its confidence interval.
  • A design label row naming the validation design (concurrent for this lab; the dataset is cross-sectional).
  • A subgroup panel showing the Female and Male validity coefficients side by side.
  • A scope text box naming the role family the model has been validated for.
  • A drift-trend placeholder for cycle-over-cycle re-validation.

23.7.8 Step 8 — Add the standards-aligned tooltip

Add a Description to each measure that names the standards it is calibrated against (the AERA, APA, and NCME Standards threshold for the use case). The tooltip is the audit-grade documentation that the page becomes when a regulator opens it.

23.7.9 Step 9 — Publish

Publish the report and tag it as the validity-evidence file for the structured-interview model. Confirm that the page is opened during every selection-programme review.

Tip: Connect to the Visualisation Layer

The validity-evidence page sits beside the recruitment funnel from Chapter 22 and the bias-and-prediction page from Chapter 24. The three pages together let the audience read selection volume, selection quality, and selection fairness in one coherent module-level dashboard.

Tip: Files and Screen Recordings

Yuvijen-Selection-Validation.xlsx, Selection-Validity.pbix, and ch23-validity-walkthrough.mp4 will be attached at this point in the published edition. The screen recording walks through Steps 1 to 9 with the Excel reliability and validity calculations and the Power BI evidence page shown side by side.

Summary

Why Reliability and Validity Matter

| Concept | Description |
|---|---|
| Reliability versus validity | Reliability is consistency; validity is the defensibility of inferences from the score |
| Unitary validity view | Validity is one concept supported by multiple lines of evidence |
| Standards-based programme | Credible evaluation accumulates evidence across studies, fairness, and monitoring |
| Rendering supports defence | The dashboard surfaces evidence so the model can be defended in audit |
| Longitudinal monitoring | Reliability and validity drift; monitoring catches the drift in time |

Reliability

| Concept | Description |
|---|---|
| Internal consistency | Items in the test agree with one another; alpha or omega |
| Test-retest reliability | Same candidate produces similar scores across occasions |
| Inter-rater reliability | Different raters give similar ratings to the same candidate |
| Parallel-forms reliability | Different forms of the test produce similar scores |
| Coefficient threshold by use | The threshold for an acceptable coefficient depends on the use of the model |

Validity

| Concept | Description |
|---|---|
| Content evidence | The test items represent the job-relevant domain |
| Construct evidence | The score correlates with theoretically related and unrelated constructs |
| Criterion evidence | The score correlates with later job-performance outcomes |
| Consequential evidence | The use of the score has acceptable individual and group consequences |
| Generalisation evidence | The validity holds across job families, settings, and time |
| Validity argument | The validity argument is a structured chain of premise, evidence, conclusion |

Validation Study Design

| Concept | Description |
|---|---|
| Predictive design | Score now and outcome later; rigorous, slow, costly |
| Concurrent design | Score and outcome at the same time on incumbents; faster, weaker |
| Synthetic design | Component validities combined for a new role; useful when bespoke is infeasible |
| Range restriction correction | Adjustment for the fact that only hired candidates contribute outcomes |
| Criterion unreliability correction | Adjustment for the imperfection of the performance measure itself |

Operational Monitoring

| Concept | Description |
|---|---|
| Reliability drift signal | Internal consistency or inter-rater agreement falling below threshold |
| Score-distribution shift signal | The distribution of candidate scores changes enough to question the cut-score |
| Criterion-validity decay signal | The relationship between score and performance weakens over time |
| Subgroup-difference signal | Adverse impact ratios moving outside the agreed band |
| Use-case expansion signal | The model is used for inferences beyond the validated scope |
| Triggered re-validation | Re-validation triggered by evidence rather than by the calendar |

Visualising the Evidence

| Concept | Description |
|---|---|
| Coefficient with interval | Every coefficient is rendered with its confidence interval |
| Design label on results | Every validation result names the design that produced it |
| Use-case scope on the page | The validated scope of each model is named on the page |