```mermaid
flowchart LR
    A[Predictive<br/>score now,<br/>outcome later] --> Z[Validity Coefficient<br/>with confidence interval]
    B[Concurrent<br/>score and outcome<br/>at the same time] --> Z
    C[Synthetic<br/>component validities<br/>combined for a new role] --> Z
    Z --> Y[Decision<br/>at the strength<br/>the evidence supports]
    style A fill:#E8F0FE,stroke:#1A73E8
    style B fill:#FEF7E0,stroke:#F9AB00
    style C fill:#E6F4EA,stroke:#137333
    style Z fill:#FCE8E6,stroke:#C5221F
    style Y fill:#F3E8FD,stroke:#8430CE
```
23 Evaluating Reliability and Validity of Selection Models
23.1 Why Reliability and Validity Matter
A selection score that is unreliable cannot be valid; a selection score that is reliable can still be invalid. Both have to be tested before either can be trusted.
The previous chapter introduced the catalogue of selection methods and their evidential strength. This chapter goes one level deeper into the two technical disciplines that determine whether any specific selection model in any specific organisation actually delivers what the catalogue promises: reliability and validity. The two concepts are often conflated in HR conversations, and the conflation matters. Reliability is the consistency with which a selection model produces a score for the same candidate under similar conditions. Validity is the degree to which the score supports the inferences and decisions the firm wants to take from it. A model that is unreliable cannot be valid; a model that is reliable can still be invalid for the use the firm has in mind.
The standards that govern this work are rigorous and well documented. As the American Educational Research Association et al. (2014) Standards for Educational and Psychological Testing set out across multiple editions, the credible evaluation of a selection model is a programme rather than a single study. It accumulates evidence across reliability coefficients, validation studies, fairness analyses, and operational monitoring, and it documents that evidence in a form that can be defended internally and externally. The discipline is exacting, but it is also the only discipline that lets the firm distinguish a model that works from a model that simply produces numbers.
The conceptual frame for validity is the unitary view set out by Samuel Messick (1995) in his foundational work. Validity is not a property of the test; it is the degree to which the inferences drawn from the test scores can be defended for the use the firm wants to make of them. A general-mental-ability test can be validly used to predict job performance and invalidly used to assess interpersonal skill, even though the score is the same in both cases. The dashboard’s job is to render the supported uses visibly and to constrain the model’s use to those that the evidence covers.
The visualisation lens is what makes the evidence audience-readable. Reliability is rendered as a coefficient with a confidence interval. Validity is rendered as a study-level chart with comparison built in, stratified by job family and by demographic group, with sample sizes disclosed. Fairness analyses are rendered as subgroup comparisons. Operational monitoring is rendered as a longitudinal trend. A page that surfaces all four for each selection model is a page that lets the function defend the model in a meeting where every participant is sceptical, which is the only meeting that ultimately matters.
- Every selection model on the dashboard is paired with its reliability coefficient and a defended validation study, surfaced as part of the page rather than buried in a methodology document.
- Validity claims are constrained to the inferences the evidence supports. A model validated for one job family does not earn a place on the dashboard for another job family without further validation work.
- Reliability and validity are monitored longitudinally. A coefficient computed once at deployment is not enough; the dashboard tracks drift over time and prompts re-validation when the drift exceeds an agreed threshold.
23.2 Reliability
Reliability is the consistency with which a selection model produces a score. A perfectly reliable score will be the same on Monday and Tuesday, with one rater and another, on one form of the test and a parallel form. Reliability is necessary but not sufficient for validity, and a function that has not measured reliability cannot defensibly claim that the model works.
| Form | What it captures | Typical estimator |
|---|---|---|
| Internal consistency | Items in the test agree with one another | Cronbach’s alpha or McDonald’s omega |
| Test-retest | Same candidate produces a similar score across occasions | Correlation between scores at two time points |
| Inter-rater | Different raters give similar ratings to the same candidate | Intraclass correlation or Cohen’s kappa |
| Parallel-forms | Different forms of the test produce similar scores | Correlation between forms |
Reliability coefficients run from zero to one, and the interpretation depends on the use of the model. For high-stakes selection decisions in standardised tests, coefficients above 0.85 are conventionally expected; for structured-interview ratings, 0.70 is more typical and acceptable; for novel multi-rater instruments, even lower coefficients can support useful decisions if the evidence is honestly disclosed. The dashboard names the coefficient, the threshold the firm has chosen, and the sample size the coefficient was computed on, so that the audience reads the reliability at the strength the data supports.
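For internal consistency specifically, the coefficient can be computed directly in a worksheet. The sketch below is illustrative only: it assumes a hypothetical five-item instrument stored in a table named Assessment, with one row per candidate and a helper column TotalScore that sums the five items; none of these names come from the lab dataset later in the chapter.

Excel Formula

```
Sum of item variances = VAR.S(Assessment[Item1]) + VAR.S(Assessment[Item2])
                      + VAR.S(Assessment[Item3]) + VAR.S(Assessment[Item4])
                      + VAR.S(Assessment[Item5])
Variance of total     = VAR.S(Assessment[TotalScore])
Cronbach's alpha      = (5/4) * (1 - Sum of item variances / Variance of total)
```

The 5/4 factor is k/(k-1) for k = 5 items; with a different item count, extend the sum of item variances and adjust the factor accordingly.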
23.3 Validity: The Unitary View
Validity, in the modern unitary view, is a single concept supported by multiple lines of evidence. A function that talks about content validity, construct validity, and criterion validity as separate properties has not yet adopted the unitary frame. Validity is one question — can the inferences from this score be defended for this use — and the evidence that supports it comes from several methods.
| Line of evidence | What it shows | When it is most needed |
|---|---|---|
| Content evidence | The test items represent the job-relevant domain | When the inference is about job-related knowledge or behaviour |
| Construct evidence | The score converges with theoretically related constructs and diverges from unrelated ones | When the score is used to infer a psychological construct |
| Criterion evidence | The score correlates with later job-performance outcomes | When the score is used to predict performance |
| Consequential evidence | The use of the score has acceptable individual and group consequences | When the score has high-stakes consequences |
| Generalisation evidence | The validity holds across job families, settings, and time | When the model is used beyond its original validation context |
A defended validity claim is an argument supported by multiple lines of evidence, each appropriate to the inference being defended. As Samuel Messick (1995) argued, the validity argument is not a single coefficient. It is a structured chain — premise, evidence, conclusion — that the firm can audit and update. The dashboard surfaces the argument as a small visual: the inference being defended, the lines of evidence that support it, and the strength of each line. The audience reads the validity at the strength the evidence supports, rather than at the strength the marketing of the test suggested.
23.4 Validation Study Design
Three study designs are used to test the criterion validity of selection models. They differ in their evidential strength, in their feasibility, and in the conditions under which they apply. A function that knows the differences can choose the design that fits the model and the role.
The predictive design administers the selection model to candidates, hires regardless of the score, and waits to compare scores with later performance — the most rigorous design, the slowest to produce evidence, and the costliest. The concurrent design administers the model to current incumbents and compares scores with current performance — faster, cheaper, and weaker because incumbents are not the same as candidates. The synthetic design combines the validities of components that have been studied separately to support a new role’s selection model — useful when bespoke validation is infeasible. The dashboard names the design that produced each validity coefficient.
Validity coefficients are usually attenuated by features of the data: range restriction (only hired candidates contribute outcomes), criterion unreliability (the performance measure is itself imperfect), and measurement error in the predictor. Standard corrections exist and are documented in the Standards. The dashboard surfaces both the uncorrected coefficient and the corrected one, with the corrections labelled, so that the audience can see the full picture rather than only the headline number. Honest correction strengthens credibility; hidden correction undermines it.
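For reference, the two most common corrections take the following textbook form, where $r$ is the observed validity coefficient, $S$ and $s$ are the unrestricted and restricted standard deviations of the predictor, and $r_{yy}$ is the reliability of the criterion measure (the notation here is illustrative, not quoted from the Standards):

$$
r_{\text{range corrected}} = \frac{r\,(S/s)}{\sqrt{1 + r^{2}\left[(S/s)^{2} - 1\right]}},
\qquad
r_{\text{criterion corrected}} = \frac{r}{\sqrt{r_{yy}}}
$$

The first is the correction applied in the hands-on exercise in Section 23.7 (Step 4); the second requires an estimate of the performance measure's own reliability.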
23.5 Operational Monitoring
A validation study is a snapshot. Reliability and validity drift as candidate populations change, role requirements evolve, and the test itself ages. Operational monitoring is the discipline that catches drift before it becomes a credibility crisis. The dashboard’s monitoring view is not optional; it is what converts a one-time validation study into an ongoing programme.
| Signal | What it shows | When it triggers action |
|---|---|---|
| Reliability drift | Internal consistency or inter-rater agreement falling | When coefficient falls below the agreed threshold |
| Score-distribution shift | The distribution of candidate scores changing | When the shift is large enough to question the cut-score |
| Criterion-validity decay | The relationship to performance weakening | When the coefficient falls below the validation threshold |
| Subgroup-difference change | Adverse impact ratios moving | When the ratio falls outside the agreed band |
| Use-case expansion | The model used for inferences beyond the validated scope | Always; the dashboard prevents scope creep |
A monitoring system is only useful if it triggers re-validation when the data warrants. The dashboard names the threshold for each signal, surfaces the signal cycle by cycle, and highlights when the threshold is crossed. Triggered re-validation is more efficient than scheduled re-validation: it responds to evidence rather than to the calendar, and it concentrates the function’s analytical effort where the model is actually drifting.
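The subgroup-difference signal is the one most often reduced to a single number. As an illustration of the mechanics rather than a prescription, the sketch below computes a two-group adverse impact ratio against the common four-fifths (0.80) convention; the Candidates table and its Gender and Hired columns are hypothetical placeholders, the first two results are assumed to be stored in cells named FemaleRate and MaleRate, and the band itself should be whatever the firm has agreed.

Excel Formula

```
FemaleRate           = COUNTIFS(Candidates[Gender], "Female", Candidates[Hired], "Yes")
                       / COUNTIF(Candidates[Gender], "Female")
MaleRate             = COUNTIFS(Candidates[Gender], "Male", Candidates[Hired], "Yes")
                       / COUNTIF(Candidates[Gender], "Male")
Adverse impact ratio = MIN(FemaleRate, MaleRate) / MAX(FemaleRate, MaleRate)
Flag                 = IF(MIN(FemaleRate, MaleRate) / MAX(FemaleRate, MaleRate) < 0.8,
                          "Outside band - trigger review", "Within band")
```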
23.6 Visualising Reliability and Validity
The dashboard that surfaces reliability and validity for the firm’s selection models has to do four things at once: name the coefficient, label the design that produced it, surface the subgroup evidence, and track the drift. Five design choices, applied consistently, hold the page together for an audience that is not made up of psychometricians.
| Choice | What it does on the page |
|---|---|
| Coefficient with interval | Every coefficient is rendered with its confidence interval |
| Design label | Every validation result names the design that produced it |
| Subgroup panel | Each model has a subgroup panel for fairness evidence |
| Drift trend | The page shows the coefficient over multiple cycles |
| Use-case scope | The validated scope of each model is named on the page |
A reliability-and-validity dashboard is, in practice, a living defence document. When a regulator audits the selection programme, when a candidate challenges a decision, when an executive committee asks whether a tool is doing what it claims, the dashboard is what answers. As the American Educational Research Association et al. (2014) Standards emphasise, the credibility of a selection programme rests on the cumulative evidence the firm can produce on demand, and the dashboard is the surface on which that evidence is most readable. Build the surface for the moment of audit, and it serves the daily work as well.
23.7 Hands-On Exercise: Computing Reliability and Validity for a Selection Model
Aim. Compute the reliability of a structured selection assessment and the criterion validity of the selection scores against later performance, and render the evidence on a Power BI page that satisfies the reliability-and-validity contract.
Scenario. You are evaluating a structured-interview selection model used by Yuvijen Telecom for hiring frontline service-engineering staff. The model has been used for two years; you now have selection scores from two raters at the point of selection, plus performance ratings six months after each hire, for a sample of two hundred candidates (performance ratings exist only for those who were hired).
Dataset. A synthetic dataset you will build in Excel using the structure below. Generate values in a workbook named Yuvijen-Selection-Validation.xlsx with the following columns and the formulas indicated. Put the headers in row 1, in the order listed (column A = CandidateID through column G = Gender); the formulas refer to row 2 and are dragged down from there.
| Column | Type | Generation formula |
|---|---|---|
| CandidateID | Integer | Sequence 1 to 200 |
| Rater1Score | Integer (40 to 90) | =RANDBETWEEN(40, 90) |
| Rater2Score | Integer (correlated with Rater1Score by design) | =B2 + RANDBETWEEN(-8, 8) |
| AverageSelectionScore | Decimal | =AVERAGE(B2, C2) |
| Hired | Yes/No | =IF(D2>=65, "Yes", "No") |
| PerformanceRating | Integer (1 to 5), blank if not hired | =IF(E2="Yes", MAX(1, MIN(5, ROUND(D2/20 + RANDBETWEEN(-1,1), 0))), "") |
| Gender | Female/Male | =IF(RAND()<0.45, "Female", "Male") |
The correlated noise on Rater2Score and the dependency of PerformanceRating on AverageSelectionScore generate a defensible reliability and a moderate criterion validity for the lab.
Deliverable. The Yuvijen-Selection-Validation.xlsx workbook with reliability and validity calculations, plus a Selection-Validity.pbix Power BI file with the evidence page described below.
23.7.1 Step 1 — Generate the synthetic dataset
Open a new workbook, create the seven column headers in row 1, and enter the formulas in row 2. Drag down to row 201 so that you have two hundred candidates. Paste the values back over the formulas (Copy > Paste Special > Values) so the dataset is fixed and your subsequent computations remain stable. Convert the range to a Table named Selection.
23.7.2 Step 2 — Compute inter-rater reliability
Use the Pearson correlation between Rater1Score and Rater2Score as a working inter-rater agreement measure.
Excel Formula

```
Inter-Rater r = CORREL(Selection[Rater1Score], Selection[Rater2Score])
```

With the modest ±8 rater noise from Step 1, the expected value is high, around 0.95; real structured-interview panels rarely agree that well. Document the threshold the firm has chosen (for example, 0.70 for structured interviews) on a Definition sheet.
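The inter-rater correlation estimates the reliability of a single rater's score. Because the hiring decision uses the two-rater average, you can also report the reliability of that average via the Spearman-Brown formula; the sketch below assumes the Step 2 result is stored in a cell named InterRaterR.

Excel Formula

```
Reliability of two-rater average = (2 * InterRaterR) / (1 + InterRaterR)
```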
23.7.3 Step 3 — Compute criterion validity
Use the Pearson correlation between AverageSelectionScore and PerformanceRating, restricted to hired candidates only.
Excel Formula

```
Validity Coefficient = CORREL(
    FILTER(Selection[AverageSelectionScore], Selection[Hired]="Yes"),
    FILTER(Selection[PerformanceRating], Selection[Hired]="Yes")
)
```

Compute the ninety-five per cent confidence interval using Fisher’s z-transformation on a Validity sheet.
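One way to produce the interval the step asks for uses Excel's FISHER and FISHERINV functions. The sketch below assumes the observed coefficient from this step is stored in a cell named ValidityR, uses the number of hired candidates as the sample size, and treats each line as a named cell so the later lines can refer to the earlier ones.

Excel Formula

```
nHired  = COUNTIF(Selection[Hired], "Yes")
z       = FISHER(ValidityR)
SEz     = 1 / SQRT(nHired - 3)
LowerR  = FISHERINV(z - 1.96 * SEz)
UpperR  = FISHERINV(z + 1.96 * SEz)
```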
23.7.4 Step 4 — Apply range-restriction correction
Compute the unrestricted standard deviation of AverageSelectionScore (across all candidates) and the restricted standard deviation (across hired candidates only). Apply the standard correction.
Excel Formula

```
SD_Unrestricted = STDEV.S(Selection[AverageSelectionScore])
SD_Restricted   = STDEV.S(FILTER(Selection[AverageSelectionScore], Selection[Hired]="Yes"))
Corrected r     = r_obs * (SD_Unrestricted / SD_Restricted)
                  / SQRT(1 + r_obs^2 * ((SD_Unrestricted/SD_Restricted)^2 - 1))
```

Here r_obs stands for the observed validity coefficient from Step 3. Document the correction openly on the Validity sheet so both the corrected and uncorrected coefficients are visible.
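If the firm also holds a reliability estimate for the performance rating itself, the criterion-unreliability correction from Section 23.4 is a one-line extension. The lab dataset does not supply such an estimate, so CriterionReliability below is a hypothetical named cell, and CorrectedR is assumed to hold the Step 4 result.

Excel Formula

```
Fully corrected r = CorrectedR / SQRT(CriterionReliability)
```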
23.7.5 Step 5 — Compute subgroup validity
Compute the validity coefficient separately for the Female and Male subgroups. The two coefficients should be similar; differences greater than 0.1 raise a fairness flag.
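One way to compute the two coefficients reuses the FILTER pattern from Step 3 (Excel 365 dynamic arrays); multiplying the two logical tests combines them into a single condition.

Excel Formula

```
Female validity = CORREL(
    FILTER(Selection[AverageSelectionScore], (Selection[Hired]="Yes") * (Selection[Gender]="Female")),
    FILTER(Selection[PerformanceRating],     (Selection[Hired]="Yes") * (Selection[Gender]="Female"))
)
Male validity = CORREL(
    FILTER(Selection[AverageSelectionScore], (Selection[Hired]="Yes") * (Selection[Gender]="Male")),
    FILTER(Selection[PerformanceRating],     (Selection[Hired]="Yes") * (Selection[Gender]="Male"))
)
```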
23.7.6 Step 6 — Promote to Power BI
Open Power BI Desktop and load the Selection table. Build the inter-rater agreement, validity coefficient, corrected validity, and subgroup validity as DAX measures.
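DAX has no built-in CORREL, so the validity coefficient has to be assembled from sums. The measure below is a minimal sketch of one way to do it (the measure name and structure are ours, not a prescribed pattern); the other measures follow the same shape. Before loading, make sure PerformanceRating is typed as a whole number, with genuinely blank cells (not empty text) for the non-hired rows.

DAX

```
Validity Coefficient =
VAR HiredRows = FILTER ( Selection, Selection[Hired] = "Yes" )
VAR Cnt = COUNTROWS ( HiredRows )
VAR SX  = SUMX ( HiredRows, Selection[AverageSelectionScore] )
VAR SY  = SUMX ( HiredRows, Selection[PerformanceRating] )
VAR SXY = SUMX ( HiredRows, Selection[AverageSelectionScore] * Selection[PerformanceRating] )
VAR SXX = SUMX ( HiredRows, Selection[AverageSelectionScore] ^ 2 )
VAR SYY = SUMX ( HiredRows, Selection[PerformanceRating] ^ 2 )
RETURN
    // Pearson correlation computed from raw sums over the hired subset
    DIVIDE ( Cnt * SXY - SX * SY, SQRT ( ( Cnt * SXX - SX ^ 2 ) * ( Cnt * SYY - SY ^ 2 ) ) )
```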
23.7.7 Step 7 — Build the validity-argument page
Lay out the page with five regions:
- A coefficient-and-interval row showing inter-rater agreement, validity, and corrected validity, each with its confidence interval.
- A design label row naming the validation design (predictive in this lab: scores were captured at selection and performance six months later, with range restriction because only hired candidates have outcomes).
- A subgroup panel showing the Female and Male validity coefficients side by side.
- A scope text box naming the role family the model has been validated for.
- A drift-trend placeholder for cycle-over-cycle re-validation.
23.7.8 Step 8 — Add the standards-aligned tooltip
Add a Description to each measure that names the standards it is calibrated against (the AERA, APA, and NCME Standards threshold for the use case). The tooltip is the audit-grade documentation that the page becomes when a regulator opens it.
23.7.9 Step 9 — Publish
Publish the report and tag it as the validity-evidence file for the structured-interview model. Confirm that the page is opened during every selection-programme review.
The validity-evidence page sits beside the recruitment funnel from Chapter 22 and the bias-and-prediction page from Chapter 24. The three pages together let the audience read selection volume, selection quality, and selection fairness in one coherent module-level dashboard.
Yuvijen-Selection-Validation.xlsx, Selection-Validity.pbix, and ch23-validity-walkthrough.mp4 will be attached at this point in the published edition. The screen recording walks through Steps 1 to 9 with the Excel reliability and validity calculations and the Power BI evidence page shown side by side.
Summary
| Concept | Description |
|---|---|
| Why Reliability and Validity Matter | |
| Reliability versus validity | Reliability is consistency; validity is the defensibility of inferences from the score |
| Unitary validity view | Validity is one concept supported by multiple lines of evidence |
| Standards-based programme | Credible evaluation accumulates evidence across studies, fairness, and monitoring |
| Rendering supports defence | The dashboard surfaces evidence so the model can be defended in audit |
| Longitudinal monitoring | Reliability and validity drift; monitoring catches the drift in time |
| Reliability | |
| Internal consistency | Items in the test agree with one another; alpha or omega |
| Test-retest reliability | Same candidate produces similar scores across occasions |
| Inter-rater reliability | Different raters give similar ratings to the same candidate |
| Parallel-forms reliability | Different forms of the test produce similar scores |
| Coefficient threshold by use | The threshold for an acceptable coefficient depends on the use of the model |
| Validity | |
| Content evidence | The test items represent the job-relevant domain |
| Construct evidence | The score converges with theoretically related constructs and diverges from unrelated ones |
| Criterion evidence | The score correlates with later job-performance outcomes |
| Consequential evidence | The use of the score has acceptable individual and group consequences |
| Generalisation evidence | The validity holds across job families, settings, and time |
| Validity argument | The validity argument is a structured chain of premise, evidence, conclusion |
| Validation Study Design | |
| Predictive design | Score now and outcome later; rigorous, slow, costly |
| Concurrent design | Score and outcome at the same time on incumbents; faster, weaker |
| Synthetic design | Component validities combined for a new role; useful when bespoke is infeasible |
| Range restriction correction | Adjustment for the fact that only hired candidates contribute outcomes |
| Criterion unreliability correction | Adjustment for the imperfection of the performance measure itself |
| Operational Monitoring | |
| Reliability drift signal | Internal consistency or inter-rater agreement falling below threshold |
| Score-distribution shift signal | The distribution of candidate scores changes enough to question the cut-score |
| Criterion-validity decay signal | The relationship between score and performance weakens over time |
| Subgroup-difference signal | Adverse impact ratios moving outside the agreed band |
| Use-case expansion signal | The model is used for inferences beyond the validated scope |
| Triggered re-validation | Re-validation triggered by evidence rather than by the calendar |
| Visualising the Evidence | |
| Coefficient with interval | Every coefficient is rendered with its confidence interval |
| Design label on results | Every validation result names the design that produced it |
| Use-case scope on the page | The validated scope of each model is named on the page |