Statistics 1 begins with organising, displaying, and summarising data. This topic covers the core graphical and numerical tools you need: stem-and-leaf diagrams, box-and-whisker plots, histograms with unequal class widths, and the key measures of centre and spread. You will also learn how to compare two distributions meaningfully — a skill examined in almost every S1 paper.
Mean = Σfx / Σf | σ² = Σfx²/Σf − x̄² | Freq. Density = Frequency / Class Width
Learning Objectives
Construct and read back-to-back stem-and-leaf diagrams
Draw and interpret box-and-whisker plots including outliers
Draw histograms correctly using frequency density on the y-axis
Calculate mean, median, and mode from frequency tables and grouped data
Use the median interpolation formula for grouped data
Apply coded data transformations for mean and variance
Calculate range, IQR, variance, and standard deviation
Identify positive and negative skew from diagrams and averages
Compare two distributions using correct statistical language
Stem-and-Leaf
Back-to-back diagrams for two datasets side by side
Box Plots
Min, Q1, median, Q3, max and outlier detection
Histograms
Frequency density — essential for unequal class widths
Mean & Median
From tables, grouped data, and coded values
Variance & SD
σ² = Σfx²/Σf − x̄² and coded variance rules
Skewness
Positive, negative, symmetric — from plots and averages
A stem-and-leaf diagram displays raw data while preserving individual values. Each data value is split into a stem (leading digits) and a leaf (final digit). Leaves are written in ascending order away from the stem.
Key: 4|2 means 42 (Group B); 5|4 means 45 (Group A)
Always write a key. The examiner will not award marks for a diagram without a key. For back-to-back, make clear which side is which.
Reading From a Stem-and-Leaf
From the diagram above (Group B only):
Values: 42, 46, 50, 53, 55, 58, 61, 64, 67, 69, 72, 73
n = 12
Median = average of 6th and 7th values = (58 + 61)/2 = 59.5
Q1 = median of lower half (first 6: 42, 46, 50, 53, 55, 58) = (50 + 53)/2 = 51.5
Q3 = median of upper half (last 6: 61, 64, 67, 69, 72, 73) = (67 + 69)/2 = 68
IQR = 68 − 51.5 = 16.5
Box-and-Whisker Plots
A box plot displays five key summary statistics on a number line:
Min Q1 Median (Q2) Q3 Max
The box spans from Q1 to Q3 (the interquartile range).
A vertical line inside the box marks the median. Whiskers extend from the box to the minimum and maximum values (or to outlier fences if outliers exist). Outliers are plotted individually as crosses or dots beyond the whiskers.
Interquartile Range (IQR)
IQR = Q3 − Q1
The IQR measures the spread of the middle 50% of the data. It is resistant to outliers — unlike the range.
Detecting Outliers
A value is an outlier if it lies beyond the outlier fences, where lower fence = Q1 − 1.5 × IQR and upper fence = Q3 + 1.5 × IQR:
Example: Q1 = 54, Q3 = 68, IQR = 14
Lower fence = 54 − 1.5 × 14 = 54 − 21 = 33
Upper fence = 68 + 1.5 × 14 = 68 + 21 = 89
Any value below 33 or above 89 is an outlier.
When a box plot has outliers, the whisker only extends to the last non-outlier value. Outliers are then marked separately (usually with a cross ×).
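The fence rule and the whisker rule can be checked with a short Python sketch. The function names and the small data list are illustrative assumptions, not part of the syllabus; the quartiles are the ones from the example above.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) using the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, q1, q3):
    """Separate values into (inside, outliers) relative to the fences."""
    lower, upper = outlier_fences(q1, q3)
    inside = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inside, outliers


lower, upper = outlier_fences(54, 68)
print(lower, upper)  # 33.0 89.0

# Hypothetical data, chosen only to show where the whiskers stop.
data = [30, 42, 50, 58, 61, 69, 73, 95]
inside, outliers = split_outliers(data, 54, 68)
print(min(inside), max(inside), outliers)  # 42 73 [30, 95]
```

Here the whiskers would end at 42 and 73 (the last non-outlier values), with 30 and 95 plotted separately.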
Quartile Calculation Method (Cambridge)
For n data values in ascending order:
• Q1: lower quartile — median of the lower half (exclude median if n is odd)
• Q2: median — middle value (n+1)/2 position
• Q3: upper quartile — median of the upper half
Cambridge S1 uses the (n+1)/4, (n+1)/2, 3(n+1)/4 position method for larger datasets.
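As a rough illustration of the median-of-halves method listed above, the sketch below computes Q1, Q2 and Q3 for the two bus routes used in the worked example later in these notes. The function name `quartiles` is an assumption for illustration; it does not implement the (n+1)/4 position method.

```python
from statistics import median


def quartiles(values):
    """Q1, median, Q3 by the median-of-halves method.

    For odd n the middle value is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


route_a = [4, 7, 9, 12, 15, 16, 18, 21, 23, 27, 36]      # n = 11 (odd)
route_b = [3, 5, 8, 11, 14, 16, 19, 20, 22, 25, 31, 34]  # n = 12 (even)

print(quartiles(route_a))  # (9, 16, 23)
print(quartiles(route_b))  # (9.5, 17.5, 23.5)
```

Both routes give IQR = 14, matching the values quoted in the worked example.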
Learn 2 — Histograms
Why Frequency Density?
When class widths are unequal, you cannot use raw frequency on the y-axis — taller bars would appear to have more data simply because they are wider. Instead, the y-axis shows frequency density, and the area of each bar represents the frequency.
Frequency Density = Frequency ÷ Class Width
Frequency = Frequency Density × Class Width
Drawing a Histogram — Step by Step
Step 1: Calculate the class width for each group.
Step 2: Calculate frequency density = frequency / class width.
Step 3: Draw bars with heights equal to frequency density.
Step 4: Bars must be adjacent (no gaps); this is NOT a bar chart!
Step 5: Label the y-axis as "Frequency Density", never "Frequency".
Notice the 10–15 class has the tallest bar (fd = 2.4) even though the 0–10 class has more area because it is wider.
Finding Frequency From a Histogram
If a bar has height (freq. density) = 3.5 and class width = 4:
Frequency = 3.5 × 4 = 14
Total frequency = sum of all (freq. density × class width) = sum of all areas.
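The two directions of the calculation (frequency to bar height, and bar area back to frequency) can be sketched in Python as follows. The function names and the three-class table are hypothetical, chosen only for illustration.

```python
def frequency_density(frequency, lower, upper):
    """Bar height for a class running from lower to upper."""
    return frequency / (upper - lower)


def frequency_from_bar(density, width):
    """Recover the frequency as the area of a bar."""
    return density * width


print(frequency_density(16, 10, 20))  # 1.6
print(frequency_from_bar(3.5, 4))     # 14.0

# Hypothetical table with unequal widths: (lower, upper, frequency).
classes = [(0, 10, 18), (10, 15, 12), (15, 25, 10)]
heights = [frequency_density(f, lo, hi) for lo, hi, f in classes]
# Total frequency is the sum of the bar areas.
total = sum(h * (hi - lo) for h, (lo, hi, _) in zip(heights, classes))
print(heights, total)
```

In this made-up table the narrow 10 to 15 class has the tallest bar even though the 0 to 10 class holds more data, which is exactly why height alone must not be read as frequency.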
Finding a Missing Frequency
Problem: Total n = 50. Three bars give frequencies 8, 12, 10. Find the missing frequency for the last bar (class width 5, fd = 3.0).
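Missing frequency = 3.0 × 5 = 15. Check: 8 + 12 + 10 + 15 = 45, which is 5 short of the stated total of 50. The remaining 5 must belong to another class, so look for a further bar in the question before finalising your answer.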
Common trap: Some exam questions give a histogram and ask you to estimate a percentage or proportion. Always find the frequencies from areas first, then calculate proportions.
Reading the Median and Quartiles From a Histogram
To find the median from a histogram:
1. Find total frequency n.
2. Count up cumulative frequency until you reach n/2.
3. Use interpolation within the class that contains the median.
If the question says "equal class widths", frequency and frequency density are proportional — you can use frequency directly on the y-axis. Only use frequency density when class widths differ.
For grouped data, the median is estimated using linear interpolation:
Median = L + [ (n/2 − F) / f ] × w
Where:
L = lower class boundary of the median class
n = total frequency
F = cumulative frequency before the median class
f = frequency of the median class
w = class width
Example (using table above):
n = 20, so find the 10th value. Cumulative: 4, 11, 16, 20.
Median class: 20–30 (cumulative goes past 10 at this class).
L = 20, F = 4, f = 7, w = 10
Median = 20 + [(10 − 4)/7] × 10 = 20 + 60/7 ≈ 28.57
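A minimal Python sketch of the interpolation formula is given below. It assumes classes are supplied as (lower boundary, upper boundary, frequency) in ascending order with no gaps; the function name is illustrative. The data is the age table from the exam-style questions in these notes.

```python
def grouped_median(classes):
    """Estimate the median of grouped data by linear interpolation.

    classes: list of (lower_boundary, upper_boundary, frequency),
    in ascending order with no gaps.
    """
    n = sum(f for _, _, f in classes)
    target = n / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            # L + [(n/2 - F) / f] * w
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("no data")


ages = [(0, 10, 5), (10, 20, 14), (20, 30, 21), (30, 50, 10)]
print(round(grouped_median(ages), 2))  # 22.86
```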
Mode
The mode is the value (or class) that occurs most often. For grouped data, it is the modal class (highest frequency or highest frequency density for unequal widths). There is no single modal value for grouped data.
Coded Data — Transforming the Mean
Coding simplifies calculations. If y = x − a (shift only):
ȳ = x̄ − a so x̄ = ȳ + a
If y = (x − a) / b (shift and scale):
ȳ = (x̄ − a) / b so x̄ = bȳ + a
Example: Data coded as y = (x − 25)/5. From the coded data, ȳ = 1.4.
x̄ = 5 × 1.4 + 25 = 7 + 25 = 32
Coding never changes the shape of the distribution, only its location and scale. Adding a constant shifts the mean but not the spread. Multiplying by a constant scales both the mean and the spread.
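The decoding step can be checked numerically. In the sketch below, `decode_mean` is an illustrative helper and the raw values are hypothetical, chosen so that the coded mean comes out as 1.4.

```python
def decode_mean(mean_y, a, b=1):
    """Undo the coding y = (x - a) / b for the mean."""
    return b * mean_y + a


print(round(decode_mean(1.4, a=25, b=5), 2))  # 32.0

# Check against raw data (hypothetical values, chosen to give a coded mean of 1.4).
xs = [25, 30, 30, 35, 40]
ys = [(x - 25) / 5 for x in xs]
mean_y = sum(ys) / len(ys)
mean_x = sum(xs) / len(xs)
print(mean_y, mean_x)
```

The mean of the raw values agrees with the decoded mean, which is the point of the rule x̄ = bȳ + a.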
Learn 4 — Measures of Spread
Range and Interquartile Range
Range = Max − Min
IQR = Q3 − Q1
The range uses only two values and is badly affected by outliers. The IQR describes the middle 50% and is much more robust.
Variance and Standard Deviation
σ² = Σfx² / Σf − (Σfx / Σf)² = Σfx²/n − x̄²
σ = √σ²
Method — direct from frequency table:
1. Compute x̄ = Σfx/Σf
2. Compute Σfx² = Σ(f × x²)
3. σ² = Σfx²/Σf − x̄²
4. σ = √σ²
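The four steps can be sketched in Python. The helper name is an assumption; the data is the weights table from the exam-style questions in these notes (50–60, 60–70, 70–80, 80–90 kg with f = 4, 9, 11, 6), with each class represented by its midpoint.

```python
from math import sqrt


def mean_and_sd(midpoints, freqs):
    """Mean and standard deviation from a frequency table (divide by n)."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    mean = sum_fx / n
    variance = sum_fx2 / n - mean ** 2
    return mean, sqrt(variance)


# Class midpoints stand in for every value in the class, so both results are estimates.
midpoints = [55, 65, 75, 85]
freqs = [4, 9, 11, 6]
mean, sd = mean_and_sd(midpoints, freqs)
print(round(mean, 2), round(sd, 2))  # 71.33 9.48
```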
The definition σ² = Σf(x − x̄)²/Σf is algebraically equivalent to the formula above, so use whichever form your working gives more cleanly. Both are on the formula sheet as Sxx/n where Sxx = Σfx² − (Σfx)²/n.
Note: adding a constant a does NOT change the variance or SD — only multiplying by b does.
Calculator Method vs Formula Method
Formula method: Build the Σfx and Σfx² columns, then apply σ² = Σfx²/n − x̄².
Calculator method: Enter data into statistics mode (1-var stats). The calculator gives x̄ and σ directly. Always verify your table gives the same Σfx before trusting the calculator.
Variance has squared units (e.g., cm²), while standard deviation has the same units as the data (e.g., cm). Always state units when interpreting σ in context.
Learn 5 — Skewness & Comparing Distributions
What Is Skewness?
Skewness describes the asymmetry of a distribution — which tail is longer. Comparing mean, median, and mode reveals skew without a diagram.
Positive Skew
Positive skew: Mode < Median < Mean
The right (upper) tail is longer. A few very high values pull the mean upward above the median. The mode is the lowest of the three averages. On a box plot: The right whisker is longer than the left whisker, and the median line is closer to Q1 than to Q3.
Negative Skew
Negative skew: Mean < Median < Mode
The left (lower) tail is longer. A few very low values pull the mean downward below the median. The mode is the highest of the three averages. On a box plot: The left whisker is longer, and the median line is closer to Q3.
Symmetry
Symmetric: Mean = Median = Mode
A perfectly symmetric distribution has equal tails and all three averages coincide. In practice, approximate equality suggests near-symmetry.
Comparing Two Distributions
In an exam, you must make a comparison of central tendency AND a comparison of spread, each stated in context. A single sentence comparing only one aspect will lose marks.
Template comparison statements:
"Group A has a higher median (42) than Group B (35), suggesting Group A generally scored higher."
"Group A has a larger IQR (18 vs 11), indicating more variability in Group A's scores."
"Both distributions are positively skewed, but Group B is more symmetric."
Exam warning: Never just quote numbers. Always interpret in context. "Group A's mean is 42" earns no marks — you must say what this means about the data.
Identifying Skew From a Box Plot
1. Compare the position of the median line within the box:
• Closer to Q1 → positive skew
• Closer to Q3 → negative skew
• Central → symmetric
2. Compare whisker lengths:
• Longer right whisker → positive skew
• Longer left whisker → negative skew
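The two checks above can be written as a small Python sketch. This is a heuristic under stated assumptions: the function name is illustrative, the minimum and maximum are taken as the whisker ends, and exact equality is used for "symmetric" where in practice you would accept approximate equality. The five-number summary is the plant heights data from the past paper questions in these notes.

```python
def box_plot_skew(minimum, q1, med, q3, maximum):
    """Return (box_signal, whisker_signal): 'positive', 'negative' or 'symmetric'."""
    def compare(left, right):
        if right > left:
            return "positive"
        if left > right:
            return "negative"
        return "symmetric"

    box = compare(med - q1, q3 - med)               # median position in the box
    whiskers = compare(q1 - minimum, maximum - q3)  # whisker lengths
    return box, whiskers


print(box_plot_skew(12, 18, 25, 34, 52))  # ('positive', 'positive')
```

When the two signals disagree, say so in your answer and quote both pieces of evidence rather than forcing a single label.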
In a comparison question, use comparative language: "greater", "higher", "more spread out", "less variable". Avoid vague terms like "better" unless the context makes clear what "better" means.
Worked Examples
Example 1 — Reading a Back-to-Back Stem-and-Leaf
The stem-and-leaf below shows waiting times (minutes) for two bus routes A and B.
Route A |Stem| Route B
9 7 4 | 0 | 3 5 8
8 6 5 2 | 1 | 1 4 6 9
7 3 1 | 2 | 0 2 5
6 | 3 | 1 4
Key: 1|4 means 14 min (Route B); 4|1 means 14 min (Route A)
Find the median and IQR for Route B.
Step 1 — List Route B in order: 3, 5, 8, 11, 14, 16, 19, 20, 22, 25, 31, 34. n = 12.
Step 2 — Median: Average of 6th and 7th values = (16 + 19)/2 = 17.5 minutes. B1
Step 3: IQR. Q1 = median of the lower six values = (8 + 11)/2 = 9.5. Q3 = median of the upper six values = (22 + 25)/2 = 23.5. IQR = 23.5 − 9.5 = 14 minutes.
Route A: median = 16 min, IQR = 14 min, positively skewed. Route B: median = 17.5 min, IQR = 14 min, slightly positive skew. Write a comparison.
Central tendency: "Route B has a slightly higher median waiting time (17.5 min vs 16 min), so on average Route B passengers wait slightly longer." B1
Spread: "Both routes have the same IQR (14 min), indicating equal consistency in the middle 50% of waiting times." B1
Skewness: "Both distributions are positively skewed, meaning there are occasional very long waits on each route." B1
Common Mistakes
Mistake 1 — Using Frequency Instead of Frequency Density on a Histogram
WRONG: Drawing a bar at height = frequency (e.g., height = 16 for class 10–20 with width 10).
CORRECT: Height = frequency density = 16/10 = 1.6. Area of bar = 1.6 × 10 = 16 = frequency. ✓
Mistake 2 — Wrong Outlier Formula
WRONG: Outlier if value > Q3 + 1.5 × Q3 (using Q3 instead of IQR).
CORRECT: Outlier if value < Q1 − 1.5 × IQR or value > Q3 + 1.5 × IQR. The 1.5 multiplies the IQR, NOT Q1 or Q3.
Mistake 3 — Using Class Boundaries Instead of Midpoints for the Mean
WRONG: For class 20–30, using x = 20 or x = 30 in Σfx.
CORRECT: Use the midpoint x = (20+30)/2 = 25. The midpoint is the best estimate for all values in the class.
Mistake 4 — Variance Formula Sign Error
WRONG: σ² = Σfx²/n + x̄² (adding instead of subtracting).
CORRECT: σ² = Σfx²/n − x̄². The mean squared is subtracted. This always gives σ² ≥ 0.
Mistake 5 — Forgetting the Class Width in Median Interpolation
WRONG: Median = L + (n/2 − F)/f (missing the × w at the end).
CORRECT: Median = L + [(n/2 − F)/f] × w. The formula gives a fraction of the class width added to the lower boundary.
Mistake 6 — Coded Variance: Forgetting to Square b
WRONG: σx² = b × σy² (using b, not b²).
CORRECT: σx² = b² × σy². When y = (x−a)/b, the variance is divided by b², so σx² = b² σy². But σx = b × σy for b > 0 (the square root cancels the square); in general σx = |b| × σy.
Mistake 7 — Skewness: Confusing Positive and Negative
WRONG: "A longer left tail means positive skew" — mixing up which direction means which.
CORRECT: Positive skew = longer RIGHT tail (mean pulled UP by high values, so mean > median > mode). Negative skew = longer LEFT tail (mean < median < mode).
Mistake 8 — Comparison Without Context
WRONG: "Group A has mean 42, Group B has mean 35." (No comparison, no context.)
CORRECT: "Group A has a higher mean (42 vs 35), suggesting Group A students generally performed better on the test." State the comparison AND interpret it.
Key Formulas
Formula | Description
x̄ = Σfx / Σf | Mean from frequency table (use midpoints for grouped data)
Median = L + [(n/2 − F)/f] × w | Median by interpolation in grouped data; L = lower boundary, F = cumulative freq before, f = freq in class, w = width
IQR = Q3 − Q1 | Interquartile range: spread of middle 50%
Outlier: x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR | Cambridge outlier criterion
Freq. Density = Frequency / Class Width | y-axis value for histograms with unequal class widths
Why n−1? When you compute x̄ from the sample, you use one "degree of freedom" estimating the mean. The deviations (x−x̄) are constrained to sum to zero, so only n−1 of them are free to vary. Dividing by n−1 corrects the bias — it makes s² an unbiased estimator of the true population variance σ².
In Cambridge S1, you are always given data for a specific dataset (not sampling from an infinite population), so you divide by n. The formula σ² = Σfx²/n − x̄² is correct for all S1 questions.
Proof 3 — Coded Variance: Why σx = b·σy when y = (x−a)/b
Adding the constant a has no effect on variance — it merely shifts all values by the same amount, so all deviations from the mean are unchanged.
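The scaling rule can be checked numerically with Python's `statistics` module, whose `pvariance` and `pstdev` divide by n as S1 does. The raw data and coding constants below are hypothetical, chosen only for illustration.

```python
from statistics import pvariance, pstdev

# Hypothetical raw data and coding constants, for illustration only.
xs = [52, 56, 60, 64, 70]
a, b = 60, 2
ys = [(x - a) / b for x in xs]  # coded values y = (x - a) / b

var_x, var_y = pvariance(xs), pvariance(ys)
print(var_x, b ** 2 * var_y)            # the two variances should agree
print(pstdev(xs), abs(b) * pstdev(ys))  # the two standard deviations should agree
```

Changing `a` leaves both printed pairs unchanged, while changing `b` rescales the coded variance by 1/b², consistent with the remark above.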
Proof 4 — Median Interpolation Formula
Suppose the median is in a class [L, L+w) with frequency f, and F is the cumulative frequency before this class.
We need the (n/2)th value. Within the class, we need the (n/2 − F)th value out of f values.
Assuming values are uniformly spread within the class, the position fraction is (n/2 − F)/f.
Multiplying this fraction by the class width w gives the distance into the class:
Median = L + [(n/2 − F)/f] × w ∎
This is a linear interpolation — it treats the data as uniformly distributed within each class. It is always an estimate for grouped data.
Exam Style Questions (8 Questions)
Cambridge S1 style questions. Attempt each before revealing the mark scheme.
Question 1 [6 marks]
The heights (cm) of 20 students are summarised: Q1 = 162, median = 168, Q3 = 174. The minimum is 150 and maximum is 195.
(i) Show that 195 is an outlier. (ii) Draw a box-and-whisker plot. (iii) State the type of skewness and give a reason.
(i) IQR = 174 − 162 = 12. Upper fence = 174 + 1.5×12 = 174 + 18 = 192. Since 195 > 192, it is an outlier. [M1 A1]
(ii) Box from 162 to 174, median at 168. Left whisker to 150. Right whisker to next non-outlier value (assume 188, given in full question). Outlier marked at 195 with ×. [B2]
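(iii) Positive skew: the right whisker/tail is longer. The upper tail reaches 195, which is 21 above Q3 (174), while the minimum (150) is only 12 below Q1 (162). Note that the median (168) is exactly midway between Q1 (162) and Q3 (174), so the position of the median in the box gives no evidence of skew here. [B1 reason B1]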
Question 2 [5 marks]
A frequency table for weights (kg): 50–60 (f=4), 60–70 (f=9), 70–80 (f=11), 80–90 (f=6). Find the mean and an estimate for the standard deviation.
Estimate the median from: Age 0–10 (f=5), 10–20 (f=14), 20–30 (f=21), 30–50 (f=10). Show your working clearly.
Σf = 50. Median at n/2 = 25th value. [B1]
Cumulative: 5, 19, 40. Median falls in class 20–30. [M1]
L = 20, F = 19, f = 21, w = 10. [B1]
Median = 20 + [(25−19)/21] × 10 = 20 + 60/21 = 20 + 2.857 = 22.86. [M1 A1]
Question 6 [4 marks]
For dataset A: mean = 45.2, SD = 8.3. For dataset B: mean = 45.2, SD = 14.7. Both are approximately symmetric. Compare the two distributions fully.
Central tendency: "Both datasets have the same mean (45.2), so on average their values are equal." [B1]
Spread: "Dataset B has a much larger standard deviation (14.7 vs 8.3), indicating B is considerably more spread out / variable." [B1]
Skewness: "Both are approximately symmetric." [B1]
Context (if given): Apply each point to the real-world meaning. [B1]
Question 7 [5 marks]
A back-to-back stem-and-leaf diagram shows test scores for classes X and Y. Class X leaves for stem 5: 8 6 3 1. Class Y leaves: 2 4 7 9. Stem 6: X leaves 7 5 2, Y leaves 0 3 5 8 8. Identify which class has the higher median and which has more spread (by IQR), giving values.
Class X values (from diagram): 51, 53, 56, 58, 62, 65, 67. n=7. Median = 58 (4th value). Q1=53, Q3=65. IQR=12. [M1 A1]
Class Y values: 52, 54, 57, 59, 60, 63, 65, 68, 68. n=9. Median = 60 (5th value). Q1=55.5, Q3=66.5. IQR=11. [M1 A1]
Class Y has higher median (60 vs 58). Class X has slightly larger IQR (12 vs 11). [A1]
Question 8 [6 marks]
The following values are given for a dataset: Σx = 360, Σx² = 14800, n = 12. (i) Calculate mean and variance. (ii) A value of 45 is found to be an error and replaced with 51. Find the new mean and new variance.
The times (seconds) taken by 40 athletes to complete a race are recorded. The coded variable y = (x − 60)/2 gives Σy = −24 and Σy² = 116. (i) Find the mean time x̄. (ii) Find the variance of the times. (iii) State the effect on x̄ and σ if all times increase by 3 seconds.
(iii) Mean increases by 3: new x̄ = 61.8 s. σ is unchanged (adding a constant does not affect spread). [B1 B1]
Past Paper 3 — June 2018 Style [5 marks]
A box-and-whisker plot for plant heights (cm) shows: min=12, Q1=18, Q2=25, Q3=34, max=52. A value is an outlier if it is more than 1.5×IQR beyond Q1 or Q3. (i) Identify any outliers. (ii) Describe the skewness of the distribution, giving two reasons.
(ii) Positive skew. Reason 1: The median (25) is closer to Q1 (18) than to Q3 (34) — the distance from Q1 to Q2 is 7, while Q2 to Q3 is 9. Reason 2: The right whisker (52−34=18) is longer than the left whisker (18−12=6). [B1 B1 B1]
Past Paper 4 — November 2017 Style [6 marks]
Two groups sat a test. Group P: n=15, Σx=675, Σx²=31500. Group Q: n=10, Σx=460, Σx²=22000. (i) Find the mean and variance for each group. (ii) Make two comparisons in context (the test was out of 60).
(ii) Central tendency: "Group Q has a slightly higher mean mark (46 vs 45), suggesting Group Q performed marginally better on average." [B1]
Spread: "Group Q has a larger variance (84 vs 75), indicating Group Q's marks were slightly more spread out / inconsistent." [B1]
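The means and variances quoted in this comparison follow directly from the summary totals in the question. A minimal Python sketch (the function name is an assumption) is:

```python
def mean_and_variance(n, sum_x, sum_x2):
    """Mean and variance (divide by n) from n, the sum of x and the sum of x squared."""
    mean = sum_x / n
    return mean, sum_x2 / n - mean ** 2


print(mean_and_variance(15, 675, 31500))  # Group P: (45.0, 75.0)
print(mean_and_variance(10, 460, 22000))  # Group Q: (46.0, 84.0)
```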
Past Paper 5 — June 2016 Style [7 marks]
A dataset of 50 values has Q1=23.5, Q3=41.5. The stem-and-leaf for the lower values shows a leaf of 8 on stem 0 (representing 8) and a leaf of 2 on stem 6 (representing 62). (i) Find the IQR. (ii) Determine whether 8 and 62 are outliers. (iii) The median is 31 and mean is 33.2. State the skewness and give a reason.
(i) IQR = 41.5 − 23.5 = 18. [B1]
(ii) Lower fence = 23.5 − 1.5×18 = 23.5 − 27 = −3.5. Since 8 > −3.5, 8 is NOT an outlier. [M1 A1]
Upper fence = 41.5 + 1.5×18 = 41.5 + 27 = 68.5. Since 62 < 68.5, 62 is NOT an outlier. [M1 A1]
(iii) Positive skew — the mean (33.2) is greater than the median (31), which occurs when the distribution has a longer upper tail pulling the mean upward. [B1 B1]