23  Evaluating Reliability and Validity of Selection Models

23.1 Why Reliability and Validity Matter

A selection score that is unreliable cannot be valid; a selection score that is reliable can still be invalid. Both have to be tested before either can be trusted.

The previous chapter introduced the catalogue of selection methods and their evidential strength. This chapter goes one level deeper into the two technical disciplines that determine whether any specific selection model in any specific organisation actually delivers what the catalogue promises: reliability and validity. The two concepts are often conflated in HR conversations, and the conflation matters. Reliability is the consistency with which a selection model produces a score for the same candidate under similar conditions. Validity is the degree to which the score supports the inferences and decisions the firm wants to take from it. A model that is unreliable cannot be valid; a model that is reliable can still be invalid for the use the firm has in mind.

The standards that govern this work are rigorous and well documented. As the American Educational Research Association et al. (2014) Standards for Educational and Psychological Testing set out across multiple editions, the credible evaluation of a selection model is a programme rather than a single study. It accumulates evidence across reliability coefficients, validation studies, fairness analyses, and operational monitoring, and it documents that evidence in a form that can be defended internally and externally. The discipline is exacting, but it is also the only discipline that lets the firm distinguish a model that works from a model that simply produces numbers.

The conceptual frame for validity is the unitary view set out by Samuel Messick (1995) in his foundational work. Validity is not a property of the test; it is the degree to which the inferences drawn from the test scores can be defended for the use the firm wants to make of them. A general-mental-ability test can be validly used to predict job performance and invalidly used to assess interpersonal skill, even though the score is the same in both cases. The dashboard’s job is to render the supported uses visibly and to constrain the model’s use to those that the evidence covers.

The visualisation lens is what makes the evidence audience-readable. Reliability is rendered as a coefficient with a confidence interval. Validity is rendered as a study-level chart with comparison built in, stratified by job family and by demographic group, with sample sizes disclosed. Fairness analyses are rendered as subgroup comparisons. Operational monitoring is rendered as a longitudinal trend. A page that surfaces all four for each selection model is a page that lets the function defend the model in a meeting where every participant is sceptical, which is the only meeting that ultimately matters.

Tip: The reliability-and-validity contract
  1. Every selection model on the dashboard is paired with its reliability coefficient and a defended validation study, surfaced as part of the page rather than buried in a methodology document.
  2. Validity claims are constrained to the inferences the evidence supports. A model validated for one job family does not earn a place on the dashboard for another without further work.
  3. Reliability and validity are monitored longitudinally. A coefficient computed once at deployment is not enough; the dashboard tracks drift over time and prompts re-validation when the drift exceeds an agreed threshold.

23.2 Reliability

Reliability is the consistency with which a selection model produces a score. A perfectly reliable score will be the same on Monday and Tuesday, with one rater and another, on one form of the test and a parallel form. Reliability is necessary but not sufficient for validity, and a function that has not measured reliability cannot defensibly claim that the model works.

Tip: Four Forms of Reliability Evidence

| Form | What it captures | Typical estimate |
|---|---|---|
| Internal consistency | Items in the test agree with one another | Cronbach’s alpha or McDonald’s omega |
| Test-retest | Same candidate produces a similar score across occasions | Correlation between scores at two time points |
| Inter-rater | Different raters give similar ratings to the same candidate | Intraclass correlation or Cohen’s kappa |
| Parallel-forms | Different forms of the test produce similar scores | Correlation between forms |
Tip: Reading reliability coefficients

Reliability coefficients run from zero to one, and the interpretation depends on the use of the model. For high-stakes selection decisions in standardised tests, coefficients above 0.85 are conventionally expected; for structured-interview ratings, 0.70 is more typical and acceptable; for novel multi-rater instruments, even lower coefficients can support useful decisions if the evidence is honestly disclosed. The dashboard names the coefficient, the threshold the firm has chosen, and the sample size the coefficient was computed on, so that the audience reads the reliability at the strength the data supports.
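The most common internal-consistency estimate, Cronbach’s alpha, is straightforward to compute once responses are arranged as a candidates-by-items matrix. A minimal sketch with simulated data (the item matrix below is illustrative, not from any real instrument):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability for an item-response matrix
    (rows = candidates, columns = test items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: 200 candidates, 10 items all driven by one latent trait,
# so the items should agree with one another and alpha should be high.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(scale=1.0, size=(200, 10))
alpha = cronbach_alpha(items)   # close to the theoretical value of about 0.91
```

The same matrix layout feeds McDonald’s omega if a factor model is fitted, but alpha is the conventional starting point that the table above names.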

23.3 Validity: The Unitary View

Validity, in the modern unitary view, is a single concept supported by multiple lines of evidence. A function that talks about content validity, construct validity, and criterion validity as separate properties has not yet adopted the unitary frame. Validity is one question — can the inferences from this score be defended for this use — and the evidence that supports it comes from several methods.

Tip: Five Lines of Validity Evidence

| Line of evidence | What it shows | When it is most needed |
|---|---|---|
| Content evidence | The test items represent the job-relevant domain | When the inference is about job-related knowledge or behaviour |
| Construct evidence | The score correlates with theoretically related and unrelated constructs | When the score is used to infer a psychological construct |
| Criterion evidence | The score correlates with later job-performance outcomes | When the score is used to predict performance |
| Consequential evidence | The use of the score has acceptable individual and group consequences | When the score has high-stakes consequences |
| Generalisation evidence | The validity holds across job families, settings, and time | When the model is used beyond its original validation context |
Tip: The validity argument

A defended validity claim is an argument supported by multiple lines of evidence, each appropriate to the inference being defended. As Samuel Messick (1995) argued, the validity argument is not a single coefficient. It is a structured chain — premise, evidence, conclusion — that the firm can audit and update. The dashboard surfaces the argument as a small visual: the inference being defended, the lines of evidence that support it, and the strength of each line. The audience reads the validity at the strength the evidence supports, rather than at the strength the marketing of the test suggested.

23.4 Validation Study Design

Three study designs are used to test the criterion validity of selection models. They differ in their evidential strength, in their feasibility, and in the conditions under which they apply. A function that knows the differences can choose the design that fits the model and the role.

Tip: Three Validation Designs

flowchart LR
  A[Predictive<br/>score now,<br/>outcome later] --> Z[Validity Coefficient<br/>with confidence interval]
  B[Concurrent<br/>score and outcome<br/>at the same time] --> Z
  C[Synthetic<br/>component validities<br/>combined for a new role] --> Z
  Z --> Y[Decision<br/>at the strength<br/>the evidence supports]
  style A fill:#E8F0FE,stroke:#1A73E8
  style B fill:#FEF7E0,stroke:#F9AB00
  style C fill:#E6F4EA,stroke:#137333
  style Z fill:#FCE8E6,stroke:#C5221F
  style Y fill:#F3E8FD,stroke:#8430CE

The predictive design administers the selection model to candidates, hires regardless of the score, and waits to compare scores with later performance — the most rigorous design, the slowest to produce evidence, and the costliest. The concurrent design administers the model to current incumbents and compares scores with current performance — faster, cheaper, and weaker because incumbents are not the same as candidates. The synthetic design combines the validities of components that have been studied separately to support a new role’s selection model — useful when bespoke validation is infeasible. The dashboard names the design that produced each validity coefficient.

Tip: Range restriction and other corrections

Validity coefficients are usually attenuated by features of the data: range restriction (only hired candidates contribute outcomes), criterion unreliability (the performance measure is itself imperfect), and measurement error in the predictor. Standard corrections exist and are documented in the Standards. The dashboard surfaces both the uncorrected coefficient and the corrected one, with the corrections labelled, so that the audience can see the full picture rather than only the headline number. Honest correction strengthens credibility; hidden correction undermines it.
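The standard correction for direct range restriction on the predictor (Thorndike’s Case II) is compact enough to sketch. The numbers in the example are illustrative, not results from any real study:

```python
import math

def correct_range_restriction(r_obs: float, sd_unrestricted: float,
                              sd_restricted: float) -> float:
    """Thorndike Case II correction for direct range restriction on the
    predictor: estimate the correlation in the unrestricted (applicant)
    population from the correlation observed among hires."""
    u = sd_unrestricted / sd_restricted
    return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

# Example: observed validity of .25 among hires, where the applicant-pool
# standard deviation is twice the hired-pool standard deviation.
r_corrected = correct_range_restriction(0.25, sd_unrestricted=10.0,
                                        sd_restricted=5.0)   # roughly 0.46
```

Criterion unreliability is corrected separately (dividing by the square root of the criterion’s reliability); the dashboard labels each correction, consistent with the disclosure principle above.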

23.5 Operational Monitoring

A validation study is a snapshot. Reliability and validity drift as candidate populations change, role requirements evolve, and the test itself ages. Operational monitoring is the discipline that catches drift before it becomes a credibility crisis. The dashboard’s monitoring view is not optional; it is what converts a one-time validation study into an ongoing programme.

Tip: What Monitoring Tracks

| Signal | What it shows | When it triggers action |
|---|---|---|
| Reliability drift | Internal consistency or inter-rater agreement falling | When the coefficient falls below the agreed threshold |
| Score-distribution shift | The distribution of candidate scores changing | When the shift is large enough to question the cut-score |
| Criterion-validity decay | The relationship to performance weakening | When the coefficient falls below the validation threshold |
| Subgroup-difference change | Adverse impact ratios moving | When the ratio falls outside the agreed band |
| Use-case expansion | The model used for inferences beyond the validated scope | Always; the dashboard prevents scope creep |
Tip: Triggered re-validation

A monitoring system is only useful if it triggers re-validation when the data warrants. The dashboard names the threshold for each signal, surfaces the signal cycle by cycle, and highlights when the threshold is crossed. Triggered re-validation is more efficient than scheduled re-validation: it responds to evidence rather than to the calendar, and it concentrates the function’s analytical effort where the model is actually drifting.
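The trigger logic behind such a view is simple to implement. A minimal sketch, with illustrative signal names and thresholds standing in for the firm’s agreed values:

```python
from dataclasses import dataclass

@dataclass
class MonitoringSignal:
    name: str
    value: float        # latest observed value of the signal
    threshold: float    # agreed re-validation threshold
    below_is_bad: bool  # True if falling below the threshold triggers action

    def triggered(self) -> bool:
        """Return True when the signal has crossed its threshold."""
        if self.below_is_bad:
            return self.value < self.threshold
        return self.value > self.threshold

# Illustrative cycle: one signal has drifted past its threshold.
signals = [
    MonitoringSignal("inter-rater reliability", 0.66, 0.70, below_is_bad=True),
    MonitoringSignal("criterion validity",      0.31, 0.25, below_is_bad=True),
    MonitoringSignal("adverse impact ratio",    0.83, 0.80, below_is_bad=True),
]
flagged = [s.name for s in signals if s.triggered()]
# flagged contains only "inter-rater reliability"
```

In practice the same check runs cycle by cycle against the dashboard’s data model; the point of the sketch is that triggered re-validation is a comparison against a named threshold, not a judgment call made in the meeting.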

23.6 Visualising Reliability and Validity

The dashboard that surfaces reliability and validity for the firm’s selection models has to do four things at once: name the coefficient, label the design, surface the validation study, and track the drift. Five design choices, applied consistently, hold the page together for an audience that is not made up of psychometricians.

Tip: Five Design Choices for the Reliability-and-Validity Dashboard

| Choice | What it does on the page |
|---|---|
| Coefficient with interval | Every coefficient is rendered with its confidence interval |
| Design label | Every validation result names the design that produced it |
| Subgroup panel | Each model has a subgroup panel for fairness evidence |
| Drift trend | The page shows the coefficient over multiple cycles |
| Use-case scope | The validated scope of each model is named on the page |
Tip: The dashboard as a defence document

A reliability-and-validity dashboard is, in practice, a living defence document. When a regulator audits the selection programme, when a candidate challenges a decision, when an executive committee asks whether a tool is doing what it claims, the dashboard is what answers. As the American Educational Research Association et al. (2014) Standards emphasise, the credibility of a selection programme rests on the cumulative evidence the firm can produce on demand, and the dashboard is the surface on which that evidence is most readable. Build the surface for the moment of audit, and it serves the daily work as well.

23.7 Hands-On Exercise: Computing Reliability and Validity for a Selection Model

Note: Aim, Scenario, Dataset, Deliverable

Aim. Compute the reliability of a structured selection assessment and the criterion validity of the selection scores against later performance, and render the evidence on a Power BI page that satisfies the reliability-and-validity contract.

Scenario. You are evaluating a structured-interview selection model used by Yuvijen Telecom for hiring frontline service-engineering staff. The model has been used for two years; you now have selection scores from two raters, plus performance ratings six months after each hire, for a sample of two hundred hires.

Dataset. A synthetic dataset you will build in Excel using the structure below. Generate values in a workbook named Yuvijen-Selection-Validation.xlsx with the following columns and the formulas indicated.

| Column | Type | Generation formula |
|---|---|---|
| CandidateID | Integer | Sequence 1 to 200 |
| Rater1Score | Integer (40 to 90) | =RANDBETWEEN(40, 90) |
| Rater2Score | Integer | =Rater1Score + RANDBETWEEN(-8, 8) (correlated by design) |
| AverageSelectionScore | Decimal | =AVERAGE(Rater1Score, Rater2Score) |
| Hired | Yes/No | =IF(AverageSelectionScore>=65, "Yes", "No") |
| PerformanceRating | Integer (1 to 5) | =IF(Hired="Yes", MAX(1, MIN(5, ROUND(AverageSelectionScore/20 + RANDBETWEEN(-1,1), 0))), "") |
| Gender | Female/Male | =IF(RAND()<0.45, "Female", "Male") |

The correlated noise on Rater2Score and the dependency of PerformanceRating on AverageSelectionScore generate a defensible reliability and a moderate criterion validity for the lab.

Deliverable. The Yuvijen-Selection-Validation.xlsx workbook with reliability and validity calculations, plus a Selection-Validity.pbix Power BI file with the evidence page described below.

23.7.1 Step 1 — Generate the synthetic dataset

Open a new workbook, create the seven columns above, and fill the first row with the formulas. Drag down to two hundred rows. Paste the values back over the formulas (Copy > Paste Special > Values) so the dataset is fixed and your subsequent computations remain stable. Convert the range to a Table named Selection.
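If you prefer to script the lab data rather than drag formulas, the same specification can be mirrored outside Excel. A convenience sketch in Python (it assumes numpy and pandas are installed; the Excel workbook remains the deliverable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

rater1 = rng.integers(40, 91, size=n)            # RANDBETWEEN(40, 90)
rater2 = rater1 + rng.integers(-8, 9, size=n)    # correlated by design
avg = (rater1 + rater2) / 2                      # AverageSelectionScore
hired = avg >= 65                                # Hired cut-score
noise = rng.integers(-1, 2, size=n)              # RANDBETWEEN(-1, 1)
# PerformanceRating only for hires, clamped to the 1-to-5 scale;
# non-hires get NaN, mirroring the blank cell in the workbook.
performance = np.where(hired, np.clip(np.round(avg / 20 + noise), 1, 5), np.nan)
gender = np.where(rng.random(n) < 0.45, "Female", "Male")

selection = pd.DataFrame({
    "CandidateID": np.arange(1, n + 1),
    "Rater1Score": rater1,
    "Rater2Score": rater2,
    "AverageSelectionScore": avg,
    "Hired": np.where(hired, "Yes", "No"),
    "PerformanceRating": performance,
    "Gender": gender,
})
```

Fixing the generator seed plays the same role as pasting values over the formulas: the dataset stays stable across the later computations.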

23.7.2 Step 2 — Compute inter-rater reliability

Use the Pearson correlation between Rater1Score and Rater2Score as a working inter-rater agreement measure.

Excel formula:
Inter-Rater r = CORREL(Selection[Rater1Score], Selection[Rater2Score])

With the generation formulas in Step 1 the expected value is high, around 0.95, because the ±8 rater noise is small relative to the 40-to-90 base range; widen the noise term if you want to rehearse a borderline coefficient. Document the threshold the firm has chosen (for example, 0.70 for structured interviews) on a Definition sheet.

23.7.3 Step 3 — Compute criterion validity

Use the Pearson correlation between AverageSelectionScore and PerformanceRating, restricted to hired candidates only.

Excel formula:
Validity Coefficient = CORREL(
    FILTER(Selection[AverageSelectionScore], Selection[Hired]="Yes"),
    FILTER(Selection[PerformanceRating], Selection[Hired]="Yes")
)

Compute the ninety-five per cent confidence interval using Fisher’s z-transformation on a Validity sheet.
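The Fisher interval is worth setting up once as a helper. A minimal Python sketch (the r and n values in the example are placeholders, not results from the lab dataset):

```python
import math

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a correlation via Fisher's z-transformation."""
    z = math.atanh(r)               # Fisher z = 0.5 * ln((1+r)/(1-r))
    se = 1 / math.sqrt(n - 3)       # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)   # transform back to the r scale

lo, hi = fisher_ci(0.30, n=120)   # e.g. a validity of .30 computed on 120 hires
```

The same arithmetic works on the Validity sheet with Excel’s ATANH, TANH, and SQRT functions.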

23.7.4 Step 4 — Apply range-restriction correction

Compute the unrestricted standard deviation of AverageSelectionScore (across all candidates) and the restricted standard deviation (across hired candidates only). Apply the standard correction.

Excel formula:
SD_Unrestricted = STDEV(Selection[AverageSelectionScore])
SD_Restricted   = STDEV(FILTER(Selection[AverageSelectionScore], Selection[Hired]="Yes"))
Corrected r     = r_obs * (SD_Unrestricted / SD_Restricted)
                / SQRT(1 + r_obs^2 * ((SD_Unrestricted/SD_Restricted)^2 - 1))

(FILTER, already used in Step 3, restricts the standard deviation to hired candidates; Excel has no conditional STDEV function.)

Document the correction openly on the Validity sheet so both the corrected and uncorrected coefficients are visible.

23.7.5 Step 5 — Compute subgroup validity

Compute the validity coefficient separately for the Female and Male subgroups. The two coefficients should be similar; differences greater than 0.1 raise a fairness flag.
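The subgroup computation generalises to any grouping column. A minimal Python sketch with the 0.1 flag gap wired in as a default (the arrays in the example are toy values chosen to make the flag behaviour visible):

```python
import numpy as np

def subgroup_validity(scores, outcomes, groups, flag_gap=0.1):
    """Validity coefficient per subgroup, plus a fairness flag raised when
    the gap between the highest and lowest coefficients exceeds flag_gap."""
    scores, outcomes, groups = map(np.asarray, (scores, outcomes, groups))
    coeffs = {}
    for g in np.unique(groups):
        mask = groups == g
        coeffs[str(g)] = float(np.corrcoef(scores[mask], outcomes[mask])[0, 1])
    values = list(coeffs.values())
    flagged = (max(values) - min(values)) > flag_gap
    return coeffs, flagged

# Toy example: the score-outcome relationship is identical in both
# subgroups, so the coefficients match and no flag is raised.
scores   = [1, 2, 3, 4, 1, 2, 3, 4]
outcomes = [1, 2, 3, 4, 1, 2, 3, 4]
groups   = ["Female"] * 4 + ["Male"] * 4
coeffs, flagged = subgroup_validity(scores, outcomes, groups)
```

Reversing the relationship in one subgroup widens the gap past 0.1 and raises the flag, which is exactly the condition the fairness panel surfaces.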

23.7.6 Step 6 — Promote to Power BI

Open Power BI Desktop and load the Selection table. Build the inter-rater agreement, validity coefficient, corrected validity, and subgroup validity as DAX measures.

23.7.7 Step 7 — Build the validity-argument page

Lay out the page with five regions:

  • A coefficient-and-interval row showing inter-rater agreement, validity, and corrected validity, each with its confidence interval.
  • A design label row naming the validation design (concurrent for this lab; the dataset is cross-sectional).
  • A subgroup panel showing the Female and Male validity coefficients side by side.
  • A scope text box naming the role family the model has been validated for.
  • A drift-trend placeholder for cycle-over-cycle re-validation.

23.7.8 Step 8 — Add the standards-aligned tooltip

Add a Description to each measure that names the standards it is calibrated against (the AERA, APA, and NCME Standards threshold for the use case). The tooltip is the audit-grade documentation that the page becomes when a regulator opens it.

23.7.9 Step 9 — Publish

Publish the report and tag it as the validity-evidence file for the structured-interview model. Confirm that the page is opened during every selection-programme review.

Tip: Connect to the Visualisation Layer

The validity-evidence page sits beside the recruitment funnel from Chapter 22 and the bias-and-prediction page from Chapter 24. The three pages together let the audience read selection volume, selection quality, and selection fairness in one coherent module-level dashboard.

Tip: Files and Screen Recordings

Yuvijen-Selection-Validation.xlsx, Selection-Validity.pbix, and ch23-validity-walkthrough.mp4 will be attached at this point in the published edition. The screen recording walks through Steps 1 to 9 with the Excel reliability and validity calculations and the Power BI evidence page shown side by side.

Summary

Why Reliability and Validity Matter

| Concept | Description |
|---|---|
| Reliability versus validity | Reliability is consistency; validity is the defensibility of inferences from the score |
| Unitary validity view | Validity is one concept supported by multiple lines of evidence |
| Standards-based programme | Credible evaluation accumulates evidence across studies, fairness, and monitoring |
| Rendering supports defence | The dashboard surfaces evidence so the model can be defended in audit |
| Longitudinal monitoring | Reliability and validity drift; monitoring catches the drift in time |

Reliability

| Concept | Description |
|---|---|
| Internal consistency | Items in the test agree with one another; alpha or omega |
| Test-retest reliability | Same candidate produces similar scores across occasions |
| Inter-rater reliability | Different raters give similar ratings to the same candidate |
| Parallel-forms reliability | Different forms of the test produce similar scores |
| Coefficient threshold by use | The threshold for an acceptable coefficient depends on the use of the model |

Validity

| Concept | Description |
|---|---|
| Content evidence | The test items represent the job-relevant domain |
| Construct evidence | The score correlates with theoretically related and unrelated constructs |
| Criterion evidence | The score correlates with later job-performance outcomes |
| Consequential evidence | The use of the score has acceptable individual and group consequences |
| Generalisation evidence | The validity holds across job families, settings, and time |
| Validity argument | The validity argument is a structured chain of premise, evidence, conclusion |

Validation Study Design

| Concept | Description |
|---|---|
| Predictive design | Score now and outcome later; rigorous, slow, costly |
| Concurrent design | Score and outcome at the same time on incumbents; faster, weaker |
| Synthetic design | Component validities combined for a new role; useful when bespoke is infeasible |
| Range restriction correction | Adjustment for the fact that only hired candidates contribute outcomes |
| Criterion unreliability correction | Adjustment for the imperfection of the performance measure itself |

Operational Monitoring

| Concept | Description |
|---|---|
| Reliability drift signal | Internal consistency or inter-rater agreement falling below threshold |
| Score-distribution shift signal | The distribution of candidate scores changes enough to question the cut-score |
| Criterion-validity decay signal | The relationship between score and performance weakens over time |
| Subgroup-difference signal | Adverse impact ratios moving outside the agreed band |
| Use-case expansion signal | The model is used for inferences beyond the validated scope |
| Triggered re-validation | Re-validation triggered by evidence rather than by the calendar |

Visualising the Evidence

| Concept | Description |
|---|---|
| Coefficient with interval | Every coefficient is rendered with its confidence interval |
| Design label on results | Every validation result names the design that produced it |
| Use-case scope on the page | The validated scope of each model is named on the page |