27  Evaluating Training and Development Effectiveness

27.1 Why Evaluation Matters

A training programme that cannot be evaluated honestly will be evaluated dishonestly, and the dishonest evaluation will be the one used to defend the next budget round.

The previous chapter described how the learning function decides what to teach. This chapter is about how the function evaluates whether the teaching worked. Evaluation is the discipline that converts a budget line into a defensible investment. Without it, the function is reduced to defending its programmes with completion rates, satisfaction scores, and stories that will not survive an executive challenge. With it, the function can show, programme by programme, whether the workforce is now able to do what the strategy required and whether the business outcomes the programme was supposed to influence have actually moved.

The framework that has shaped the field for half a century remains the four-level model articulated by Donald L. Kirkpatrick & James D. Kirkpatrick (2006) and refined across multiple editions: reaction, learning, behaviour, and results. The four levels map directly onto the efficiency-effectiveness-impact lens of this book and onto the evidence pyramid that an executive committee implicitly applies when it reviews a training investment. A function that lives only at level one is reporting on the catering; a function that reaches level four is making the kind of impact claim the firm can defend in a budget conversation.

The evidence base on what makes training effective has accumulated steadily across the literature. As Bradford S. Bell et al. (2017) set out in the centennial review of training and development research in the Journal of Applied Psychology, the methodological maturity of the field now supports specific claims about what works, in what conditions, with what designs, and at what time horizons. The function that engages with that evidence — rather than relying on training-vendor marketing or on annual conference fashions — produces evaluation work whose conclusions outlast the programme they describe.

The visualisation lens is what carries the discipline into the executive review. A reaction chart is rendered as a distribution gauge with response rate disclosed. A learning chart is rendered as a pre-and-post comparison with confidence intervals. A behaviour chart is rendered as a manager-rated heat map with time-to-application visible. A results chart is rendered as a cohort comparison with the business KPI overlaid and the comparison group named. The page that surfaces all four for a programme is the page that lets the audience read the evaluation as evidence rather than as advocacy.

Tip: The training-evaluation contract
  1. Every learning programme that earns the dashboard is evaluated at all four Kirkpatrick levels, even when the higher-level evidence is partial or proxied for the current cycle.
  2. The level reported on the page is matched to the audience: reaction for programme owners, learning for instructional designers, behaviour for managers, results for the executive committee.
  3. Each level chart shows its comparison — control group, target, prior cohort, or counterfactual — so that the audience reads the evaluation at the strength the design supports.

27.2 The Four Levels of Training Evaluation

The four-level Kirkpatrick model — reaction, learning, behaviour, results — is the most influential framework in training evaluation. Each level answers a different question, requires different data, and supports a different visualisation. A scorecard that lives at one level is reporting on a quarter of the programme; a scorecard that reaches all four levels is making the case the programme deserves.

Tip: The Four Levels at a Glance

| Level | Question it answers | Data sources | Typical visualisation |
|---|---|---|---|
| Reaction | Did the learners enjoy and value the experience? | LMS post-course survey, NPS | Distribution gauge with response rate |
| Learning | Did the learners acquire the knowledge or skill? | Pre-and-post tests, mastery checks | Before-and-after distribution chart |
| Behaviour | Did the learners apply it on the job? | Manager survey, performance system, observation | Heat map of behaviour change at ninety days |
| Results | Did the business outcome change? | Operational data, finance, customer system | Cohort comparison chart with control |

Tip: The four-level arc

flowchart LR
  A[Reaction<br/>did learners value the experience] --> B[Learning<br/>did they acquire the knowledge]
  B --> C[Behaviour<br/>are they applying it on the job]
  C --> D[Results<br/>did the business outcome change]
  style A fill:#FEF7E0,stroke:#F9AB00
  style B fill:#E6F4EA,stroke:#137333
  style C fill:#FCE8E6,stroke:#C5221F
  style D fill:#F3E8FD,stroke:#8430CE

The arc has a natural ordering. Reaction is fast, cheap, and almost always positive. Learning is the test of whether the programme taught what it claimed to teach. Behaviour is the test of whether the learning crossed into the workplace. Results is the test of whether the workforce change moved the business outcome the programme was meant to influence. A function that climbs the arc deliberately for each major programme is a function that has built a defensible evaluation discipline.

27.3 Evaluation Designs

The choice of evaluation design determines the strength of the claim the dashboard can make. Four designs recur across credible training-evaluation studies, in increasing order of evidential strength.

Tip: Four Designs for Training Evaluation

| Design | What it does | What it can claim | What it cannot claim |
|---|---|---|---|
| Post-only | Measures outcomes after the programme | Description of the post-state | Causation; there is no comparison |
| Pre-and-post | Measures outcomes before and after | Within-subject change | Causation; selection and time effects remain |
| Control-group comparison | Compares trained and untrained groups | Effect under matched conditions | Causation where the groups differ on unobserved variables |
| Randomised pilot | Randomly assigns the training within a population | Causal effect within the studied population | Generalisation beyond the studied population |

Tip: The chart that matches the design

The visual that carries the result has to match the design that produced it. A post-only design renders as a single distribution; a pre-and-post design renders as a paired-difference chart; a control-group design renders as a side-by-side comparison; a randomised pilot renders as a treatment-versus-control plot with the randomisation labelled. As Bradford S. Bell et al. (2017) emphasise, the strongest training-evaluation claims have always come from designs that include a comparison, even an imperfect one, because the comparison is what lets the audience separate what the programme caused from what would have happened anyway.
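Where a pre-and-post measure and a control group are both available, the two middle designs can be combined into a difference-in-differences estimate, which strips out the change the untrained group experienced anyway. A minimal sketch in the workbook's formula style, assuming four named cells (Mean_Trained_Pre, Mean_Trained_Post, Mean_Control_Pre, Mean_Control_Post) that hold the group means — the cell names are illustrative, not fields from the dataset:

Code
Excel Formula
DiD Effect = (Mean_Trained_Post - Mean_Trained_Pre)
           - (Mean_Control_Post - Mean_Control_Pre)

The control-group term is the comparison the paragraph above asks the chart to carry: it subtracts what would have happened without the programme.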

27.4 Time Horizons and the Transfer Question

Training evaluation has to take time horizons seriously. Reaction can be measured at the close of the session; learning can be measured days or weeks later; behaviour change usually takes months to appear and stabilise; results changes can take a full performance cycle or longer. A function that evaluates a training programme three days after delivery and concludes that it failed has measured the wrong thing at the wrong time.

Tip: The Time-Horizon Map for Training Evaluation

| Level | Typical time horizon | Why the horizon matters |
|---|---|---|
| Reaction | Immediately after the session | Captures perception while it is vivid |
| Learning | One to four weeks after | Captures retention beyond immediate recall |
| Behaviour | Three to six months after | Allows new behaviours to stabilise on the job |
| Results | Six to eighteen months after | Allows business outcomes to register the workforce change |

Tip: Transfer of training as the binding constraint

Most training programmes fail at the transfer step — the move from learning in the classroom to behaviour on the job. The evaluation function’s most useful single contribution is to surface the transfer question as a measurable variable, with named factors that influence whether transfer happens: manager support, opportunity to apply, organisational climate for the new behaviour, and time elapsed since the programme. As Donald L. Kirkpatrick & James D. Kirkpatrick (2006) emphasised across the model’s editions, the level-three behaviour measurement is what separates training programmes that produce capability from those that produce only credentials.

27.5 ROI and Beyond

Return on investment is the most-requested fifth level of training evaluation, sometimes added to the four-level model as a programme-level economic measure. The discipline is to compute ROI honestly, with the cost side complete, the benefit side connected to a named business outcome, and the comparison group present. ROI calculated without those properties is marketing rather than analysis.
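In formula terms the calculation is simple; the discipline lies in what the inputs include. A minimal sketch, assuming two named cells (Programme_Benefit and Programme_Cost, both illustrative) that already satisfy the completeness rules in the table below:

Code
Excel Formula
ROI (Pct) = (Programme_Benefit - Programme_Cost) / Programme_Cost * 100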

Tip: The Cost and Benefit Sides of Training ROI

| Element | What it includes | Common omission |
|---|---|---|
| Cost | Programme design, delivery, materials, technology, learner time | Learner time and opportunity cost |
| Benefit | Business-outcome change attributable to the programme | The comparison-group counterfactual |
| Time horizon | The period over which costs and benefits are accrued | Truncating the benefit window before the effect emerges |
| Confidence | The uncertainty in the estimate | Reporting a point estimate without confidence |

Tip: Beyond ROI, utility analysis

Utility analysis goes beyond simple ROI by combining the validity of the programme, the criticality of the role, the duration of the effect, and the cost of the alternative course of action. As Bradford S. Bell et al. (2017) review, utility analysis remains under-used despite its strong evidence base, because it requires more inputs than a simple ROI and because the inputs themselves require honest measurement. The dashboard’s value is to make the inputs visible — validity, criticality, duration, and cost — so that the audience reads the utility estimate as the synthesis it is rather than as a single confident number.
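One widely cited formulation is the Brogden-Cronbach-Gleser utility model, sketched below in the workbook's formula style. The named cells are illustrative: N_Trained is the number of learners, Duration_Years how long the effect is expected to last, Effect_Size the standardised performance gain attributable to the programme, SD_Value the monetary value of one standard deviation of job performance, and Cost_Per_Learner the all-in programme cost per head.

Code
Excel Formula
Utility Gain = N_Trained * Duration_Years * Effect_Size * SD_Value
             - N_Trained * Cost_Per_Learner

Each input maps to one of the four visible elements the paragraph above names, which is why the dashboard should surface them individually rather than only the final figure.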

27.6 Visualising Training Effectiveness

The training-effectiveness dashboard is the single page on which the function defends its largest investments. Five design choices, applied consistently, hold the four levels, the design strength, and the time horizon together so that the audience reads the page as evidence.

Tip: Five Design Choices for the Effectiveness Dashboard

| Choice | What it does on the page |
|---|---|
| Four-level summary by programme | Each programme has a four-level row showing reaction, learning, behaviour, results |
| Design label | Each level chart names the design that produced the result |
| Time-horizon indicator | The page declares when each level was measured relative to delivery |
| Comparison built in | Every level chart carries its target, control, or counterfactual |
| ROI and utility footer | Programme economics are surfaced with cost, benefit, horizon, and confidence |

Tip: The dashboard as the budget defence

A training-effectiveness dashboard built with the disciplines of this chapter is, in operation, the budget-defence file for the learning function. The page does not aim to defend every programme as a success. It aims to render every programme honestly. Programmes that do not transfer to behaviour or to results are visible as such, and the function uses that visibility to redesign or retire them rather than to defend them. The credibility that follows from honest evaluation is what earns the budget for the programmes that do work.

27.7 Hands-On Exercise: Implementing the Kirkpatrick Four-Level Evaluation

Note: Aim, Scenario, Dataset, Deliverable

Aim. Implement the Kirkpatrick four-level evaluation for one training programme: reaction distribution, pre-and-post learning chart, behaviour heat map at ninety days, and results cohort comparison. Render the four levels on a single Power BI page that summarises the programme.

Scenario. You are evaluating a learning programme for an organisation. The chief people officer wants the function to defend the programme using the four-level framework rather than completion rates alone, and the page is the defence document.

Dataset. Learning and Development Metrics (Excel) from the HRMD library. The workbook includes EmployeeID, Training Programme, Reaction Score, Pre-Test, Post-Test, Behaviour Rating at 90 Days, Productivity After Training, and related fields.

Deliverable. A Training-Effectiveness.xlsx workbook with the four levels computed for one programme, plus a Training-Effectiveness.pbix Power BI file with the four-level summary page described below.

27.7.1 Step 1 — Compute level one (reaction)

Filter the workbook to a single training programme. Compute the distribution of Reaction Score and the response rate.

Code
Excel Formula
Reaction Mean       = AVERAGE(Learning[Reaction Score])
Reaction NPS        = (COUNTIF(Learning[Reaction Score], ">=9") - COUNTIF(Learning[Reaction Score], "<=6"))
                    / COUNTA(Learning[Reaction Score]) * 100
Response Rate       = COUNTA(Learning[Reaction Score]) / COUNTA(Learning[EmployeeID]) * 100

Render reaction as a distribution gauge with the response rate disclosed.

27.7.2 Step 2 — Compute level two (learning)

Compute the per-learner gain from Pre-Test to Post-Test, and the cohort-level mean gain.

Code
Excel Formula
Learner Gain     = Learning[Post-Test] - Learning[Pre-Test]
Mean Gain        = AVERAGE(Learning[Learner Gain])
Mean Gain (Pct)  = Mean Gain / AVERAGE(Learning[Pre-Test]) * 100

Render the level as a paired-difference chart with confidence interval.
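For the interval itself, a normal-approximation half-width can be computed directly in Excel. A minimal sketch, assuming the per-learner gains sit in a helper column named Learner Gain on the Learning table (the helper column is an assumption of this sketch):

Code
Excel Formula
Gain CI Half-Width = CONFIDENCE.NORM(0.05, STDEV.S(Learning[Learner Gain]), COUNT(Learning[Learner Gain]))
Gain CI Lower      = Mean Gain - Gain CI Half-Width
Gain CI Upper      = Mean Gain + Gain CI Half-Width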

27.7.3 Step 3 — Compute level three (behaviour)

Use the manager-rated Behaviour Rating at 90 Days to compute the percentage of learners whose behaviour was rated as changed.

Code
Excel Formula
Behaviour Change Rate = COUNTIF(Learning[Behaviour Rating at 90 Days], ">=4")
                      / COUNTA(Learning[Behaviour Rating at 90 Days]) * 100

Compute the rate by department and by manager so the heat map at the page level can surface the variation.
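A COUNTIFS pair gives the per-department rate. The sketch below assumes a Department column on the Learning table and uses "Sales" as a placeholder value; the "<>" criterion counts only learners who have a ninety-day rating at all, so unrated learners do not deflate the rate.

Code
Excel Formula
Dept Behaviour Rate = COUNTIFS(Learning[Department], "Sales", Learning[Behaviour Rating at 90 Days], ">=4")
                    / COUNTIFS(Learning[Department], "Sales", Learning[Behaviour Rating at 90 Days], "<>") * 100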

27.7.4 Step 4 — Compute level four (results)

Compare the productivity of trained learners with an untrained control group from the same role family. Add a Group column that flags each row as Trained or Control — the flag is how the formulas below tell the groups apart — then use the Analysis ToolPak's two-sample t-test.

Code
Excel Formula
Mean_Trained            = AVERAGEIF(Learning[Group], "Trained", Learning[Productivity After Training])
Mean_Control            = AVERAGEIF(Learning[Group], "Control", Learning[Productivity After Training])
Productivity Difference = Mean_Trained - Mean_Control
T-Statistic             = (Mean_Trained - Mean_Control) / SQRT(Var_Trained/N_Trained + Var_Control/N_Control)

Render the comparison as a side-by-side bar chart with confidence intervals.
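To report significance alongside the chart, the t-statistic can be converted to a two-tailed p-value with T.DIST.2T. A minimal sketch, assuming T_Statistic, N_Trained, and N_Control are the named cells from the step above, and using the conservative smaller-group degrees of freedom rather than the full Welch formula:

Code
Excel Formula
Deg_Freedom = MIN(N_Trained, N_Control) - 1
P Value     = T.DIST.2T(ABS(T_Statistic), Deg_Freedom)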

27.7.5 Step 5 — Document the design and time horizon

On a Design sheet, name the design behind each level (post-only for reaction, pre-and-post for learning, manager-rated comparison for behaviour, control-group comparison for results) and the time horizon at which each level was measured. The Design sheet becomes the audit trail.

27.7.6 Step 6 — Promote to Power BI

Load the workbook into Power BI. Build the four levels as DAX measures. Add a Programmes table so the page can be filtered to one programme at a time.
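A minimal sketch of the measures in DAX, assuming the workbook loads as a single table named Learning with the column names used above, including the Group flag assumed in Step 4; measure names mirror the Excel workbench so the two artefacts stay reconcilable:

Code
DAX
// Level one: mean reaction score for the filtered programme
Reaction Mean = AVERAGE ( Learning[Reaction Score] )

// Level two: mean per-learner gain, computed row by row
Mean Gain = AVERAGEX ( Learning, Learning[Post-Test] - Learning[Pre-Test] )

// Level three: share of rated learners whose ninety-day rating is 4 or above
Behaviour Change Rate =
DIVIDE (
    CALCULATE ( COUNTROWS ( Learning ), Learning[Behaviour Rating at 90 Days] >= 4 ),
    CALCULATE ( COUNTROWS ( Learning ), NOT ISBLANK ( Learning[Behaviour Rating at 90 Days] ) )
) * 100

// Level four: trained-versus-control productivity difference
Productivity Difference =
CALCULATE ( AVERAGE ( Learning[Productivity After Training] ), Learning[Group] = "Trained" )
    - CALCULATE ( AVERAGE ( Learning[Productivity After Training] ), Learning[Group] = "Control" )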

27.7.7 Step 7 — Build the four-level summary page

Lay out the page using the design choices from Section 27.6 of this chapter.

  • Each programme has a four-level row showing reaction, learning, behaviour, and results visuals side by side.
  • Each level chart names the design that produced the result (use a small label above each visual).
  • The page declares the time horizon at which each level was measured.
  • Every level chart carries its target, control, prior cohort, or counterfactual.
  • A small ROI-and-utility footer surfaces the programme economics with cost, benefit, horizon, and confidence visible.

27.7.8 Step 8 — Add the multi-programme view

Below the single-programme page, add a portfolio view that shows all major programmes as four-level rows. The portfolio surfaces which programmes have reached which level and which deserve next-cycle redesign.

27.7.9 Step 9 — Publish

Publish the report and add it to the annual learning-budget review. Confirm that programmes without level-three or level-four evidence are flagged for re-design rather than for renewed investment.

Tip: Connect to the Visualisation Layer

The training-effectiveness page sits downstream of the training-requirements dashboard from Chapter 26. The capability gaps surfaced in Chapter 26 are the gaps this chapter’s level-three and level-four evidence measures the firm’s progress against. The two pages together form the learning-function block of Module 3.

Tip: Files and Screen Recordings

Training-Effectiveness.xlsx, Training-Effectiveness.pbix, and ch27-effectiveness-walkthrough.mp4 will be attached at this point in the published edition. The screen recording walks through Steps 1 to 9 with the Excel four-level workbench and the Power BI evaluation page shown side by side.

Summary

| Concept | Description |
|---|---|
| **Why Evaluation Matters** | |
| Evaluation as budget defence | Programmes that cannot be evaluated honestly will be evaluated dishonestly |
| Four-level model | Reaction, learning, behaviour, and results, mapped to efficiency-effectiveness-impact |
| Methodological maturity of the field | Specific claims about what works, in what conditions, at what horizons |
| Comparison built in | Every level chart carries its target, control, or counterfactual |
| Honest rendering over advocacy | The page renders programmes honestly rather than defending all of them |
| **The Four Levels** | |
| Reaction level | Did the learners enjoy and value the experience? |
| Learning level | Did the learners acquire the knowledge or skill? |
| Behaviour level | Did the learners apply the new behaviour on the job? |
| Results level | Did the business outcome the programme targeted change? |
| The four-level arc | The four levels are climbed in order, each a stronger test than the last |
| **Evaluation Designs** | |
| Post-only design | Outcomes measured after the programme; description only |
| Pre-and-post design | Outcomes measured before and after; within-subject change |
| Control-group comparison | Trained and untrained groups compared under matched conditions |
| Randomised pilot | Random assignment within a population for a causal claim |
| Chart matches the design | The chart that carries the result matches the design that produced it |
| **Time Horizons and Transfer** | |
| Time horizon for reaction | Reaction is measured immediately after the session |
| Time horizon for learning | Learning is measured one to four weeks after the session |
| Time horizon for behaviour | Behaviour is measured three to six months after the programme |
| Time horizon for results | Results are measured six to eighteen months after delivery |
| Transfer of training | The move from learning in the classroom to behaviour on the job |
| Manager support for transfer | Manager support is among the strongest predictors of transfer |
| Climate for transfer | Organisational climate for the new behaviour shapes whether it is applied |
| **ROI and Utility** | |
| Cost side of ROI | Programme design, delivery, materials, technology, and learner time |
| Benefit side of ROI | Business-outcome change attributable to the programme, with a comparison |
| Time horizon for ROI | The period over which costs and benefits are accrued, not truncated |
| Confidence in ROI | Uncertainty in the estimate is rendered, not hidden behind a point estimate |
| Utility analysis | Combines validity, criticality, duration, and cost beyond simple ROI |
| **Visualising Effectiveness** | |
| Four-level summary by programme | Each programme has a four-level row showing all four Kirkpatrick levels |
| Design label on every chart | Every level chart names the design that produced the result |
| Time-horizon indicator on the page | The page declares when each level was measured relative to delivery |
| Comparison built into every level chart | Every chart shows target, control, prior cohort, or counterfactual |
| ROI and utility footer | Programme economics surfaced with cost, benefit, horizon, and confidence |
| Honest evaluation as credibility | Honest evaluation earns the function the budget for the programmes that work |