27  Evaluating Training and Development Effectiveness

27.1 Why Evaluation Matters

A training programme that cannot be evaluated honestly will be evaluated dishonestly, and the dishonest evaluation will be the one used to defend the next budget round.

The previous chapter described how the learning function decides what to teach. This chapter is about how the function evaluates whether the teaching worked. Evaluation is the discipline that converts a budget line into a defensible investment. Without it, the function is reduced to defending its programmes with completion rates, satisfaction scores, and stories that will not survive an executive challenge. With it, the function can show, programme by programme, whether the workforce is now able to do what the strategy required and whether the business outcomes the programme was supposed to influence have actually moved.

The framework that has shaped the field for half a century remains the four-level model articulated by Donald L. Kirkpatrick & James D. Kirkpatrick (2006) and refined across multiple editions: reaction, learning, behaviour, and results. The four levels map directly onto the efficiency-effectiveness-impact lens of this book and onto the evidence pyramid that an executive committee implicitly applies when it reviews a training investment. A function that lives only at level one is reporting on the catering; a function that reaches level four is making the kind of impact claim the firm can defend in a budget conversation.

The evidence base on what makes training effective has accumulated steadily across the literature. As Bradford S. Bell et al. (2017) set out in the centennial review of training and development research in the Journal of Applied Psychology, the methodological maturity of the field now supports specific claims about what works, in what conditions, with what designs, and at what time horizons. The function that engages with that evidence — rather than relying on training-vendor marketing or on annual conference fashions — produces evaluation work whose conclusions outlast the programme they describe.

The visualisation lens is what carries the discipline into the executive review. A reaction chart is rendered as a distribution gauge with response rate disclosed. A learning chart is rendered as a pre-and-post comparison with confidence intervals. A behaviour chart is rendered as a manager-rated heat map with time-to-application visible. A results chart is rendered as a cohort comparison with the business KPI overlaid and the comparison group named. The page that surfaces all four for a programme is the page that lets the audience read the evaluation as evidence rather than as advocacy.

Tip: The training-evaluation contract
  1. Every learning programme that earns the dashboard is evaluated at all four Kirkpatrick levels, even when the higher-level evidence is partial or proxied for the current cycle.
  2. The level reported on the page is matched to the audience: reaction for programme owners, learning for instructional designers, behaviour for managers, results for the executive committee.
  3. Each level chart shows its comparison — control group, target, prior cohort, or counterfactual — so that the audience reads the evaluation at the strength the design supports.

27.2 The Four Levels of Training Evaluation

The four-level Kirkpatrick model — reaction, learning, behaviour, results — is the most influential framework in training evaluation. Each level answers a different question, requires different data, and supports a different visualisation. A scorecard that lives at one level is reporting on a quarter of the programme; a scorecard that reaches all four levels is making the case the programme deserves.

Tip: The Four Levels at a Glance

| Level | Question it answers | Data sources | Typical visualisation |
|---|---|---|---|
| Reaction | Did the learners enjoy and value the experience? | LMS post-course survey, NPS | Distribution gauge with response rate |
| Learning | Did the learners acquire the knowledge or skill? | Pre-and-post tests, mastery checks | Before-and-after distribution chart |
| Behaviour | Did the learners apply it on the job? | Manager survey, performance system, observation | Heat map of behaviour change at ninety days |
| Results | Did the business outcome change? | Operational data, finance, customer system | Cohort comparison chart with control |

Tip: The four-level arc

flowchart LR
  A[Reaction<br/>did learners value the experience] --> B[Learning<br/>did they acquire the knowledge]
  B --> C[Behaviour<br/>are they applying it on the job]
  C --> D[Results<br/>did the business outcome change]
  style A fill:#FEF7E0,stroke:#F9AB00
  style B fill:#E6F4EA,stroke:#137333
  style C fill:#FCE8E6,stroke:#C5221F
  style D fill:#F3E8FD,stroke:#8430CE

The arc has a natural ordering. Reaction is fast, cheap, and almost always positive. Learning is the test of whether the programme taught what it claimed to teach. Behaviour is the test of whether the learning crossed into the workplace. Results is the test of whether the workforce change moved the business outcome the programme was meant to influence. A function that climbs the arc deliberately for each major programme is a function that has built a defensible evaluation discipline.

27.3 Evaluation Designs

The choice of evaluation design determines the strength of the claim the dashboard can make. Four designs recur across credible training-evaluation studies, in increasing order of evidential strength.

Tip: Four Designs for Training Evaluation

| Design | What it does | What it can claim | What it cannot claim |
|---|---|---|---|
| Post-only | Measures outcomes after the programme | Description of the post-state | Causation; there is no comparison |
| Pre-and-post | Measures outcomes before and after | Within-subject change | Causation; selection and time effects remain |
| Control-group comparison | Compares trained and untrained groups | Effect under matched conditions | Causation where the groups differ on unobserved variables |
| Randomised pilot | Randomly assigns the training within a population | Causal effect within the studied population | Generalisation beyond the studied population |

Tip: The chart that matches the design

The visual that carries the result has to match the design that produced it. A post-only design renders as a single distribution; a pre-and-post design renders as a paired-difference chart; a control-group design renders as a side-by-side comparison; a randomised pilot renders as a treatment-versus-control plot with the randomisation labelled. As Bradford S. Bell et al. (2017) emphasise, the strongest training-evaluation claims have always come from designs that include a comparison, even an imperfect one, because the comparison is what lets the audience separate what the programme caused from what would have happened anyway.
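Where a pre-and-post measure and a control group are both available, the two middle designs can be combined into a difference-in-differences estimate, which strips out the change the untrained group experienced anyway. A minimal sketch in the workbook's formula style, assuming four named cells (Mean_Trained_Pre, Mean_Trained_Post, Mean_Control_Pre, Mean_Control_Post) that hold the group means — the cell names are illustrative, not fields from the dataset:

Code
Excel Formula
DiD Effect = (Mean_Trained_Post - Mean_Trained_Pre)
           - (Mean_Control_Post - Mean_Control_Pre)

The control-group term is the comparison the paragraph above asks the chart to carry: it subtracts what would have happened without the programme.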

27.4 Time Horizons and the Transfer Question

Training evaluation has to take time horizons seriously. Reaction can be measured at the close of the session; learning can be measured days or weeks later; behaviour change usually takes months to appear and stabilise; results changes can take a full performance cycle or longer. A function that evaluates a training programme three days after delivery and concludes that it failed has measured the wrong thing at the wrong time.

Tip: The Time-Horizon Map for Training Evaluation

| Level | Typical time horizon | Why the horizon matters |
|---|---|---|
| Reaction | Immediately after the session | Captures perception while it is vivid |
| Learning | One to four weeks after | Captures retention beyond immediate recall |
| Behaviour | Three to six months after | Allows new behaviours to stabilise on the job |
| Results | Six to eighteen months after | Allows business outcomes to register the workforce change |

Tip: Transfer of training as the binding constraint

Most training programmes fail at the transfer step — the move from learning in the classroom to behaviour on the job. The evaluation function’s most useful single contribution is to surface the transfer question as a measurable variable, with named factors that influence whether transfer happens: manager support, opportunity to apply, organisational climate for the new behaviour, and time elapsed since the programme. As Donald L. Kirkpatrick & James D. Kirkpatrick (2006) emphasised across the model’s editions, the level-three behaviour measurement is what separates training programmes that produce capability from those that produce only credentials.

27.5 ROI and Beyond

Return on investment is the most-requested fifth level of training evaluation, sometimes added to the four-level model as a programme-level economic measure. The discipline is to compute ROI honestly, with the cost side complete, the benefit side connected to a named business outcome, and the comparison group present. ROI calculated without those properties is marketing rather than analysis.
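In formula terms the calculation is simple; the discipline lies in what the inputs include. A minimal sketch, assuming two named cells (Programme_Benefit and Programme_Cost, both illustrative) that already satisfy the completeness rules in the table below:

Code
Excel Formula
ROI (Pct) = (Programme_Benefit - Programme_Cost) / Programme_Cost * 100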

Tip: The Cost and Benefit Sides of Training ROI

| Element | What it includes | Common omission |
|---|---|---|
| Cost | Programme design, delivery, materials, technology, learner time | Learner time and opportunity cost |
| Benefit | Business-outcome change attributable to the programme | The comparison-group counterfactual |
| Time horizon | The period over which costs and benefits are accrued | Truncating the benefit window before the effect emerges |
| Confidence | The uncertainty in the estimate | Reporting a point estimate without confidence |

Tip: Beyond ROI, utility analysis

Utility analysis goes beyond simple ROI by combining the validity of the programme, the criticality of the role, the duration of the effect, and the cost of the alternative course of action. As Bradford S. Bell et al. (2017) review, utility analysis remains under-used despite its strong evidence base, because it requires more inputs than a simple ROI and because the inputs themselves require honest measurement. The dashboard’s value is to make the inputs visible — validity, criticality, duration, and cost — so that the audience reads the utility estimate as the synthesis it is rather than as a single confident number.
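One widely cited formulation is the Brogden-Cronbach-Gleser utility model, sketched below in the workbook's formula style. The named cells are illustrative: N_Trained is the number of learners, Duration_Years how long the effect is expected to last, Effect_Size the standardised performance gain attributable to the programme, SD_Value the monetary value of one standard deviation of job performance, and Cost_Per_Learner the all-in programme cost per head.

Code
Excel Formula
Utility Gain = N_Trained * Duration_Years * Effect_Size * SD_Value
             - N_Trained * Cost_Per_Learner

Each input maps to one of the four visible elements the paragraph above names, which is why the dashboard should surface them individually rather than only the final figure.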

27.6 Visualising Training Effectiveness

The training-effectiveness dashboard is the single page on which the function defends its largest investments. Five design choices, applied consistently, hold the four levels, the design strength, and the time horizon together so that the audience reads the page as evidence.

Tip: Five Design Choices for the Effectiveness Dashboard

| Choice | What it does on the page |
|---|---|
| Four-level summary by programme | Each programme has a four-level row showing reaction, learning, behaviour, results |
| Design label | Each level chart names the design that produced the result |
| Time-horizon indicator | The page declares when each level was measured relative to delivery |
| Comparison built in | Every level chart carries its target, control, or counterfactual |
| ROI and utility footer | Programme economics are surfaced with cost, benefit, horizon, and confidence |

Tip: The dashboard as the budget defence

A training-effectiveness dashboard built with the disciplines of this chapter is, in operation, the budget-defence file for the learning function. The page does not aim to defend every programme as a success. It aims to render every programme honestly. Programmes that do not transfer to behaviour or to results are visible as such, and the function uses that visibility to redesign or retire them rather than to defend them. The credibility that follows from honest evaluation is what earns the budget for the programmes that do work.

27.7 Hands-On Exercise: Implementing the Kirkpatrick Four-Level Evaluation

Note: Aim, Scenario, Dataset, Deliverable

Aim. Implement the Kirkpatrick four-level evaluation for one training programme: reaction distribution, pre-and-post learning chart, behaviour heat map at ninety days, and results cohort comparison. Render the four levels on a single Power BI page that summarises the programme.

Scenario. You are evaluating a learning programme for an organisation. The chief people officer wants the function to defend the programme using the four-level framework rather than completion rates alone, and the page is the defence document.

Dataset. Learning and Development Metrics (Excel) from the HRMD library. The workbook includes EmployeeID, Training Programme, Reaction Score, Pre-Test, Post-Test, Behaviour Rating at 90 Days, Productivity After Training, and related fields.

Deliverable. A Training-Effectiveness.xlsx workbook with the four levels computed for one programme, plus a Training-Effectiveness.pbix Power BI file with the four-level summary page described below.

27.7.1 Step 1 — Compute level one (reaction)

Filter the workbook to a single training programme. Compute the distribution of Reaction Score and the response rate.

Code
Excel Formula
Reaction Mean       = AVERAGE(Learning[Reaction Score])
Reaction NPS        = (COUNTIF(Learning[Reaction Score], ">=9") - COUNTIF(Learning[Reaction Score], "<=6"))
                    / COUNTA(Learning[Reaction Score]) * 100
Response Rate       = COUNTA(Learning[Reaction Score]) / COUNTA(Learning[EmployeeID]) * 100

Render reaction as a distribution gauge with the response rate disclosed.

27.7.2 Step 2 — Compute level two (learning)

Compute the per-learner gain from Pre-Test to Post-Test, and the cohort-level mean gain.

Code
Excel Formula
Learner Gain     = Learning[Post-Test] - Learning[Pre-Test]
Mean Gain        = AVERAGE(Learning[Learner Gain])
Mean Gain (Pct)  = Mean Gain / AVERAGE(Learning[Pre-Test]) * 100

Render the level as a paired-difference chart with confidence interval.
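For the interval itself, a normal-approximation half-width can be computed directly in Excel. A minimal sketch, assuming the per-learner gains sit in a helper column named Learner Gain on the Learning table (the helper column is an assumption of this sketch):

Code
Excel Formula
Gain CI Half-Width = CONFIDENCE.NORM(0.05, STDEV.S(Learning[Learner Gain]), COUNT(Learning[Learner Gain]))
Gain CI Lower      = Mean Gain - Gain CI Half-Width
Gain CI Upper      = Mean Gain + Gain CI Half-Width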

27.7.3 Step 3 — Compute level three (behaviour)

Use the manager-rated Behaviour Rating at 90 Days to compute the percentage of learners whose behaviour was rated as changed.

Code
Excel Formula
Behaviour Change Rate = COUNTIF(Learning[Behaviour Rating at 90 Days], ">=4")
                      / COUNTA(Learning[Behaviour Rating at 90 Days]) * 100

Compute the rate by department and by manager so the heat map at the page level can surface the variation.
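A COUNTIFS pair gives the per-department rate. The sketch below assumes a Department column on the Learning table and uses "Sales" as a placeholder value; the "<>" criterion counts only learners who have a ninety-day rating at all, so unrated learners do not deflate the rate.

Code
Excel Formula
Dept Behaviour Rate = COUNTIFS(Learning[Department], "Sales", Learning[Behaviour Rating at 90 Days], ">=4")
                    / COUNTIFS(Learning[Department], "Sales", Learning[Behaviour Rating at 90 Days], "<>") * 100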

27.7.4 Step 4 — Compute level four (results)

Compare the productivity of trained learners with an untrained control group from the same role family. Add a Group column that flags each row as Trained or Control — the flag is how the formulas below tell the groups apart — then use the Analysis ToolPak's two-sample t-test.

Code
Excel Formula
Mean_Trained            = AVERAGEIF(Learning[Group], "Trained", Learning[Productivity After Training])
Mean_Control            = AVERAGEIF(Learning[Group], "Control", Learning[Productivity After Training])
Productivity Difference = Mean_Trained - Mean_Control
T-Statistic             = (Mean_Trained - Mean_Control) / SQRT(Var_Trained/N_Trained + Var_Control/N_Control)

Render the comparison as a side-by-side bar chart with confidence intervals.
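To report significance alongside the chart, the t-statistic can be converted to a two-tailed p-value with T.DIST.2T. A minimal sketch, assuming T_Statistic, N_Trained, and N_Control are the named cells from the step above, and using the conservative smaller-group degrees of freedom rather than the full Welch formula:

Code
Excel Formula
Deg_Freedom = MIN(N_Trained, N_Control) - 1
P Value     = T.DIST.2T(ABS(T_Statistic), Deg_Freedom)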

27.7.5 Step 5 — Document the design and time horizon

On a Design sheet, name the design behind each level (post-only for reaction, pre-and-post for learning, manager-rated comparison for behaviour, control-group comparison for results) and the time horizon at which each level was measured. The Design sheet becomes the audit trail.

27.7.6 Step 6 — Promote to Power BI

Load the workbook into Power BI. Build the four levels as DAX measures. Add a Programmes table so the page can be filtered to one programme at a time.
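A minimal sketch of the measures in DAX, assuming the workbook loads as a single table named Learning with the column names used above, including the Group flag assumed in Step 4; measure names mirror the Excel workbench so the two artefacts stay reconcilable:

Code
DAX
// Level one: mean reaction score for the filtered programme
Reaction Mean = AVERAGE ( Learning[Reaction Score] )

// Level two: mean per-learner gain, computed row by row
Mean Gain = AVERAGEX ( Learning, Learning[Post-Test] - Learning[Pre-Test] )

// Level three: share of rated learners whose ninety-day rating is 4 or above
Behaviour Change Rate =
DIVIDE (
    CALCULATE ( COUNTROWS ( Learning ), Learning[Behaviour Rating at 90 Days] >= 4 ),
    CALCULATE ( COUNTROWS ( Learning ), NOT ISBLANK ( Learning[Behaviour Rating at 90 Days] ) )
) * 100

// Level four: trained-versus-control productivity difference
Productivity Difference =
CALCULATE ( AVERAGE ( Learning[Productivity After Training] ), Learning[Group] = "Trained" )
    - CALCULATE ( AVERAGE ( Learning[Productivity After Training] ), Learning[Group] = "Control" )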

27.7.7 Step 7 — Build the four-level summary page

Lay out the page using the design choices from Section 27.6 of this chapter.

  • Each programme has a four-level row showing reaction, learning, behaviour, and results visuals side by side.
  • Each level chart names the design that produced the result (use a small label above each visual).
  • The page declares the time horizon at which each level was measured.
  • Every level chart carries its target, control, prior cohort, or counterfactual.
  • A small ROI-and-utility footer surfaces the programme economics with cost, benefit, horizon, and confidence visible.

27.7.8 Step 8 — Add the multi-programme view

Below the single-programme page, add a portfolio view that shows all major programmes as four-level rows. The portfolio surfaces which programmes have reached which level and which deserve next-cycle redesign.

27.7.9 Step 9 — Publish

Publish the report and add it to the annual learning-budget review. Confirm that programmes without level-three or level-four evidence are flagged for re-design rather than for renewed investment.

Tip: Connect to the Visualisation Layer

The training-effectiveness page sits downstream of the training-requirements dashboard from Chapter 26. The capability gaps surfaced in Chapter 26 are the gaps this chapter’s level-three and level-four evidence measures the firm’s progress against. The two pages together form the learning-function block of Module 3.

Tip: Files and Screen Recordings

Training-Effectiveness.xlsx, Training-Effectiveness.pbix, and ch27-effectiveness-walkthrough.mp4 will be attached at this point in the published edition. The screen recording walks through Steps 1 to 9 with the Excel four-level workbench and the Power BI evaluation page shown side by side.

Summary

| Concept | Description |
|---|---|
| **Why Evaluation Matters** | |
| Evaluation as budget defence | Programmes that cannot be evaluated honestly will be evaluated dishonestly |
| Four-level model | Reaction, learning, behaviour, and results, mapped to efficiency-effectiveness-impact |
| Methodological maturity of the field | Specific claims about what works, in what conditions, at what horizons |
| Comparison built in | Every level chart carries its target, control, or counterfactual |
| Honest rendering over advocacy | The page renders programmes honestly rather than defending all of them |
| **The Four Levels** | |
| Reaction level | Did the learners enjoy and value the experience? |
| Learning level | Did the learners acquire the knowledge or skill? |
| Behaviour level | Did the learners apply the new behaviour on the job? |
| Results level | Did the business outcome the programme targeted change? |
| The four-level arc | The four levels are climbed in order, each a stronger test than the last |
| **Evaluation Designs** | |
| Post-only design | Outcomes measured after the programme; description only |
| Pre-and-post design | Outcomes measured before and after; within-subject change |
| Control-group comparison | Trained and untrained groups compared under matched conditions |
| Randomised pilot | Random assignment within a population for a causal claim |
| Chart matches the design | The chart that carries the result matches the design that produced it |
| **Time Horizons and Transfer** | |
| Time horizon for reaction | Reaction is measured immediately after the session |
| Time horizon for learning | Learning is measured one to four weeks after the session |
| Time horizon for behaviour | Behaviour is measured three to six months after the programme |
| Time horizon for results | Results are measured six to eighteen months after delivery |
| Transfer of training | The move from learning in the classroom to behaviour on the job |
| Manager support for transfer | Manager support is among the strongest predictors of transfer |
| Climate for transfer | Organisational climate for the new behaviour shapes whether it is applied |
| **ROI and Utility** | |
| Cost side of ROI | Programme design, delivery, materials, technology, and learner time |
| Benefit side of ROI | Business-outcome change attributable to the programme, with a comparison |
| Time horizon for ROI | The period over which costs and benefits are accrued, not truncated |
| Confidence in ROI | Uncertainty in the estimate is rendered, not hidden behind a point estimate |
| Utility analysis | Combines validity, criticality, duration, and cost beyond simple ROI |
| **Visualising Effectiveness** | |
| Four-level summary by programme | Each programme has a four-level row showing all four Kirkpatrick levels |
| Design label on every chart | Every level chart names the design that produced the result |
| Time-horizon indicator on the page | The page declares when each level was measured relative to delivery |
| Comparison built into every level chart | Every chart shows target, control, prior cohort, or counterfactual |
| ROI and utility footer | Programme economics surfaced with cost, benefit, horizon, and confidence |
| Honest evaluation as credibility | Honest evaluation earns the function the budget for the programmes that work |