Representation of Data | FractionRush A-Level

Welcome to Representation of Data!

Statistics 1 begins with organising, displaying, and summarising data. This topic covers the core graphical and numerical tools you need: stem-and-leaf diagrams, box-and-whisker plots, histograms with unequal class widths, and the key measures of centre and spread. You will also learn how to compare two distributions meaningfully — a skill examined in almost every S1 paper.

Mean = Σfx / Σf | σ² = Σfx²/Σf − x̄² | Freq. Density = Frequency / Class Width

Learning Objectives

Construct and read back-to-back stem-and-leaf diagrams
Draw and interpret box-and-whisker plots including outliers
Draw histograms correctly using frequency density on the y-axis
Calculate mean, median, and mode from frequency tables and grouped data
Use the median interpolation formula for grouped data
Apply coded data transformations for mean and variance
Calculate range, IQR, variance, and standard deviation
Identify positive and negative skew from diagrams and averages
Compare two distributions using correct statistical language

Stem-and-Leaf

Back-to-back diagrams for two datasets side by side

Box Plots

Min, Q1, median, Q3, max and outlier detection

Histograms

Frequency density — essential for unequal class widths

Mean & Median

From tables, grouped data, and coded values

Variance & SD

σ² = Σfx²/Σf − x̄² and coded variance rules

Skewness

Positive, negative, symmetric — from plots and averages

Comparison

Comment on central tendency AND spread in context

Coded Data

y = (x−a)/b transforms mean and SD predictably

Learn 1 — Stem-and-Leaf Diagrams & Box-and-Whisker Plots

Stem-and-Leaf Diagrams

A stem-and-leaf diagram displays raw data while preserving individual values. Each data value is split into a stem (leading digits) and a leaf (final digit). Leaves are written in ascending order away from the stem.

Example — Exam scores (2-digit values):
Data: 42, 47, 53, 55, 58, 61, 63, 63, 67, 71, 74

4 | 2 7
5 | 3 5 8
6 | 1 3 3 7
7 | 1 4

Key: 4 | 2 means 42

Back-to-Back Stem-and-Leaf Diagrams

When comparing two datasets, put them on either side of a shared stem. Leaves for the left group are written right-to-left (smallest closest to stem).

Group A |Stem| Group B
9 8 5 | 4 | 2 6
7 4 3 1 | 5 | 0 3 5 8
8 6 2 | 6 | 1 4 7 9
5 1 | 7 | 2 3

Key: 4|2 means 42 (Group B); 5|4 means 45 (Group A)

Always write a key. The examiner will not award marks for a diagram without a key. For back-to-back, make clear which side is which.

Reading From a Stem-and-Leaf

From the diagram above (Group B only):
Values: 42, 46, 50, 53, 55, 58, 61, 64, 67, 69, 72, 73
n = 12
Median = average of 6th and 7th values = (58 + 61)/2 = 59.5
Q1 = median of lower half (first 6: 42, 46, 50, 53, 55, 58) = (50 + 53)/2 = 51.5
Q3 = median of upper half (last 6: 61, 64, 67, 69, 72, 73) = (67 + 69)/2 = 68
IQR = 68 − 51.5 = 16.5

Box-and-Whisker Plots

A box plot displays five key summary statistics on a number line:

Min, Q1, Median (Q2), Q3, Max

The box spans from Q1 to Q3 (the interquartile range).
A vertical line inside the box marks the median.
Whiskers extend from the box to the minimum and maximum values. If there are outliers, each whisker stops at the most extreme value that is not an outlier.
Outliers are plotted individually as crosses or dots beyond the whiskers.

Interquartile Range (IQR)

IQR = Q3 − Q1

The IQR measures the spread of the middle 50% of the data. It is resistant to outliers — unlike the range.

Detecting Outliers

A value is an outlier if it lies beyond the outlier fences:

Lower fence: Q1 − 1.5 × IQR
Upper fence: Q3 + 1.5 × IQR

Example: Q1 = 54, Q3 = 68, IQR = 14
Lower fence = 54 − 1.5 × 14 = 54 − 21 = 33
Upper fence = 68 + 1.5 × 14 = 68 + 21 = 89
Any value below 33 or above 89 is an outlier.

When a box plot has outliers, the whisker only extends to the last non-outlier value. Outliers are then marked separately (usually with a cross ×).
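The fence rule and the whisker rule can be sketched in a few lines of Python. This is a minimal illustration, not a required method: the function names and the `sample` list are made up for the example, and Q1 and Q3 are taken as given rather than computed from the sample.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def whiskers_and_outliers(values, q1, q3):
    """Return (left whisker end, right whisker end, outliers); q1 and q3 are supplied by the caller."""
    low, high = outlier_fences(q1, q3)
    inside = [v for v in values if low <= v <= high]
    outliers = sorted(v for v in values if v < low or v > high)
    return min(inside), max(inside), outliers


print(outlier_fences(54, 68))  # fences for Q1 = 54, Q3 = 68

# Hypothetical data, used only to show the whisker rule with Q1 = 54 and Q3 = 68 taken as given.
sample = [30, 41, 54, 60, 68, 75, 95]
print(whiskers_and_outliers(sample, 54, 68))
```

With the hypothetical sample, the whiskers end at 41 and 75, and the values 30 and 95 would be plotted separately as outliers.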

Quartile Calculation Method (Cambridge)

For n data values in ascending order:
• Q1: lower quartile — median of the lower half (exclude median if n is odd)
• Q2: median — middle value (n+1)/2 position
• Q3: upper quartile — median of the upper half

Cambridge S1 uses the (n+1)/4, (n+1)/2, 3(n+1)/4 position method for larger datasets.
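As a rough check on hand working, the median-of-halves method for small datasets can be sketched in Python. The function name `quartiles` is an assumption made for this illustration; the data are the exam scores from the first stem-and-leaf example.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]    # skip the median itself when n is odd
    return median(lower), median(data), median(upper)


scores = [42, 47, 53, 55, 58, 61, 63, 63, 67, 71, 74]   # exam scores from the first example
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, q3 - q1)
```

Here n = 11, so the 6th value (61) is the median and is left out of both halves, giving Q1 = 53, Q3 = 67 and IQR = 14.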

Learn 2 — Histograms

Why Frequency Density?

When class widths are unequal, you cannot use raw frequency on the y-axis: wider classes would appear to hold more data simply because their bars cover more area. Instead, the y-axis shows frequency density, and the area of each bar represents the frequency.

Frequency Density = Frequency ÷ Class Width
Frequency = Frequency Density × Class Width

Drawing a Histogram — Step by Step

Step 1: Calculate the class width for each group.
Step 2: Calculate frequency density = frequency / class width.
Step 3: Draw bars with heights equal to frequency density.
Step 4: Bars must be adjacent (no gaps) — this is NOT a bar chart!
Step 5: Label the y-axis as "Frequency Density" — never "Frequency".

Worked Example

| Class | Freq | Class Width | Freq. Density |
| 0–10 | 8 | 10 | 8/10 = 0.8 |
| 10–15 | 12 | 5 | 12/5 = 2.4 |
| 15–20 | 10 | 5 | 10/5 = 2.0 |
| 20–30 | 6 | 10 | 6/10 = 0.6 |
| 30–50 | 4 | 20 | 4/20 = 0.2 |

Notice the 10–15 class has the tallest bar (fd = 2.4) and the largest area (frequency 12). The 0–10 class is twice as wide, but its bar is only 0.8 high, so its area (frequency 8) is smaller.
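The frequency density column of the worked example can be reproduced with a short Python sketch. The tuple layout (lower bound, upper bound, frequency) and the function name are assumptions chosen for this illustration.

```python
# Each class is (lower bound, upper bound, frequency), taken from the worked example table.
classes = [(0, 10, 8), (10, 15, 12), (15, 20, 10), (20, 30, 6), (30, 50, 4)]


def frequency_densities(table):
    """Frequency density = frequency / class width, one value per class."""
    return [freq / (upper - lower) for lower, upper, freq in table]


densities = frequency_densities(classes)
for (lower, upper, freq), fd in zip(classes, densities):
    print(f"{lower}-{upper}: width {upper - lower}, frequency {freq}, fd {fd:.1f}")
```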

Finding Frequency From a Histogram

If a bar has height (freq. density) = 3.5 and class width = 4:
Frequency = 3.5 × 4 = 14

Total frequency = sum of all (freq. density × class width) = sum of all areas.
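Going the other way, from bar heights back to frequencies, is the same calculation in reverse. The sketch below assumes each bar is described by a (class width, frequency density) pair; the bar values are those of the worked example table, and the names are illustrative only.

```python
def frequencies_from_bars(bars):
    """bars is a list of (class_width, frequency_density); each area is a frequency."""
    return [width * fd for width, fd in bars]


# Bars from the worked example: widths 10, 5, 5, 10, 20.
bars = [(10, 0.8), (5, 2.4), (5, 2.0), (10, 0.6), (20, 0.2)]
freqs = frequencies_from_bars(bars)
total = sum(freqs)

print([round(f, 6) for f in freqs], round(total, 6))
print(round(freqs[0] / total, 3))   # proportion of the data in the first class
```

Proportions and percentages should be worked out from these areas, never from bar heights alone.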

Finding a Missing Frequency

Problem: Total n = 50. Three bars give frequencies 8, 12, 10. Find the missing frequency for the last bar (class width 5) and the height at which it should be drawn.
Missing frequency = 50 − (8 + 12 + 10) = 20. Frequency density = 20/5 = 4.0. Check: 8 + 12 + 10 + 20 = 50. ✓

Common trap: Some exam questions give a histogram and ask you to estimate a percentage or proportion. Always find the frequencies from areas first, then calculate proportions.

Reading the Median and Quartiles From a Histogram

To find the median from a histogram:
1. Find total frequency n.
2. Count up cumulative frequency until you reach n/2.
3. Use interpolation within the class that contains the median.

If the question says "equal class widths", frequency and frequency density are proportional — you can use frequency directly on the y-axis. Only use frequency density when class widths differ.

Learn 3 — Measures of Central Tendency

Mean

x̄ = Σx / n (ungrouped)
x̄ = Σfx / Σf (frequency table)

For grouped data, use the midpoint of each class as the x value. This gives an estimate since we do not know the exact values within each class.

Example — Grouped frequency table:
| Class | Mid (x) | Freq (f) | fx |
| 10–20 | 15 | 4 | 60 |
| 20–30 | 25 | 7 | 175|
| 30–40 | 35 | 5 | 175|
| 40–50 | 45 | 4 | 180|
Σf = 20, Σfx = 590
Mean = 590/20 = 29.5
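The midpoint method can be sketched in Python as a check on the table. The function name `grouped_mean` and the (lower, upper, frequency) layout are assumptions for this illustration.

```python
def grouped_mean(table):
    """Estimate the mean of grouped data using class midpoints."""
    sum_f = sum(f for _, _, f in table)
    sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in table)
    return sum_fx / sum_f


# (lower bound, upper bound, frequency) for the grouped table above.
table = [(10, 20, 4), (20, 30, 7), (30, 40, 5), (40, 50, 4)]
print(grouped_mean(table))
```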

Median — Interpolation Formula

For grouped data, the median is estimated using linear interpolation:

Median = L + [ (n/2 − F) / f ] × w

Where:
L = lower class boundary of the median class
n = total frequency
F = cumulative frequency before the median class
f = frequency of the median class
w = class width

Example (using table above):
n = 20, so find the 10th value. Cumulative: 4, 11, 16, 20.
Median class: 20–30 (cumulative goes past 10 at this class).
L = 20, F = 4, f = 7, w = 10
Median = 20 + [(10 − 4)/7] × 10 = 20 + 60/7 ≈ 28.57
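The interpolation formula translates directly into a short routine. This is a sketch under the assumption that classes are listed in ascending order as (lower boundary, upper boundary, frequency); the function name is made up for the example.

```python
def grouped_median(table):
    """Estimate the median by linear interpolation: L + ((n/2 - F) / f) * w."""
    n = sum(f for _, _, f in table)
    target = n / 2
    cumulative = 0                      # F: cumulative frequency before the current class
    for lower, upper, f in table:
        if cumulative + f >= target:    # this class contains the (n/2)th value
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("table has no data")


table = [(10, 20, 4), (20, 30, 7), (30, 40, 5), (40, 50, 4)]
print(round(grouped_median(table), 2))
```

For this table the loop stops at the 20–30 class with F = 4 and f = 7, matching the hand calculation.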

Mode

The mode is the value (or class) that occurs most often. For grouped data, it is the modal class (highest frequency or highest frequency density for unequal widths). There is no single modal value for grouped data.

Coded Data — Transforming the Mean

Coding simplifies calculations. If y = x − a (shift only):

ȳ = x̄ − a so x̄ = ȳ + a

If y = (x − a) / b (shift and scale):

ȳ = (x̄ − a) / b so x̄ = bȳ + a

Example: Data coded as y = (x − 25)/5. From the coded data, ȳ = 1.4.
x̄ = 5 × 1.4 + 25 = 7 + 25 = 32

Coding never changes the shape of the distribution, only its location and scale. Adding a constant shifts the mean but not the spread. Multiplying by a constant scales both the mean and the spread.

Learn 4 — Measures of Spread

Range and Interquartile Range

Range = Max − Min
IQR = Q3 − Q1

The range uses only two values and is badly affected by outliers. The IQR describes the middle 50% and is much more robust.

Variance and Standard Deviation

σ² = Σfx² / Σf − (Σfx / Σf)² = Σfx²/n − x̄²

σ = √σ²

Method — direct from frequency table:
1. Compute x̄ = Σfx/Σf
2. Compute Σfx² = Σ(f × x²)
3. σ² = Σfx²/Σf − x̄²
4. σ = √σ²

Example (using earlier table: Σf = 20, Σfx = 590):
Σfx²: 4×225 + 7×625 + 5×1225 + 4×2025
= 900 + 4375 + 6125 + 8100 = 19500
x̄ = 29.5
σ² = 19500/20 − 29.5² = 975 − 870.25 = 104.75
σ = √104.75 ≈ 10.23
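The four-step method can be sketched in Python as follows. The function name and table layout are assumptions for this illustration. Note that the sketch keeps the mean at full precision before squaring it, which is also the safest habit in hand calculations.

```python
from math import sqrt


def grouped_variance(table):
    """Estimate the variance of grouped data: sum(f x^2)/sum(f) minus the mean squared."""
    mids = [((lower + upper) / 2, f) for lower, upper, f in table]
    sum_f = sum(f for _, f in mids)
    sum_fx = sum(f * x for x, f in mids)
    sum_fx2 = sum(f * x * x for x, f in mids)
    mean = sum_fx / sum_f               # kept unrounded
    return sum_fx2 / sum_f - mean ** 2


table = [(10, 20, 4), (20, 30, 7), (30, 40, 5), (40, 50, 4)]
variance = grouped_variance(table)
print(round(variance, 2), round(sqrt(variance), 2))
```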

Alternative Variance Formula (Sxx form)

σ² = [Σfx² − (Σfx)²/n] / n

This is algebraically equivalent: use whichever form your working gives more cleanly. It can also be written as σ² = Sxx/n, where Sxx = Σfx² − (Σfx)²/n.

Coded Variance

If y = (x − a) / b, then:

σy² = σx² / b² so σx = b × σy

Example: Data coded y = (x − 30)/4. From coded data, σy = 2.5.
σx = 4 × 2.5 = 10
σx² = 4² × 2.5² = 16 × 6.25 = 100

Note: adding a constant a does NOT change the variance or SD — only multiplying by b does.

Calculator Method vs Formula Method

Formula method: Build the Σfx and Σfx² columns, then apply σ² = Σfx²/n − x̄².

Calculator method: Enter data into statistics mode (1-var stats). The calculator gives x̄ and σ directly. Always verify your table gives the same Σfx before trusting the calculator.

Variance has squared units (e.g., cm²), while standard deviation has the same units as the data (e.g., cm). Always state units when interpreting σ in context.

Learn 5 — Skewness & Comparing Distributions

What Is Skewness?

Skewness describes the asymmetry of a distribution — which tail is longer. Comparing mean, median, and mode reveals skew without a diagram.

Positive Skew

Positive skew: Mode < Median < Mean

The right (upper) tail is longer. A few very high values pull the mean upward above the median. The mode is the lowest of the three averages.
On a box plot: The right whisker is longer than the left whisker, and the median line is closer to Q1 than to Q3.

Negative Skew

Negative skew: Mean < Median < Mode

The left (lower) tail is longer. A few very low values pull the mean downward below the median. The mode is the highest of the three averages.
On a box plot: The left whisker is longer, and the median line is closer to Q3.

Symmetry

Symmetric: Mean = Median = Mode

A perfectly symmetric distribution has equal tails and all three averages coincide. In practice, approximate equality suggests near-symmetry.

Comparing Two Distributions

In an exam, you must make a comparison of central tendency AND a comparison of spread, each stated in context. A single sentence comparing only one aspect will lose marks.

Template comparison statements:

"Group A has a higher median (42) than Group B (35), suggesting Group A generally scored higher."

"Group A has a larger IQR (18 vs 11), indicating more variability in Group A's scores."

"Both distributions are positively skewed, but Group B is more symmetric."

Exam warning: Never just quote numbers. Always interpret in context. "Group A's mean is 42" earns no marks — you must say what this means about the data.

Identifying Skew From a Box Plot

1. Compare the position of the median line within the box:
  • Closer to Q1 → positive skew
  • Closer to Q3 → negative skew
  • Central → symmetric

2. Compare whisker lengths:
  • Longer right whisker → positive skew
  • Longer left whisker → negative skew
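The two checks can be combined into a small Python sketch. The function name, the return labels and the "mixed" case (used when the box and the whiskers point in different directions) are choices made for this illustration, not an exam rule; the five-number summary in the example is illustrative.

```python
def box_plot_skew(minimum, q1, med, q3, maximum):
    """Classify skew from a five-number summary using the box and whisker checks."""
    box = (q3 - med) - (med - q1)                 # > 0 when the median is closer to Q1
    whiskers = (maximum - q3) - (q1 - minimum)    # > 0 when the right whisker is longer
    if box == 0 and whiskers == 0:
        return "symmetric"
    if box >= 0 and whiskers >= 0:
        return "positive"
    if box <= 0 and whiskers <= 0:
        return "negative"
    return "mixed"                                # the two indicators disagree


print(box_plot_skew(12, 18, 25, 34, 52))
```

In the example the median is 7 above Q1 and 9 below Q3, and the right whisker (18) is longer than the left (6), so both indicators point to positive skew.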

In a comparison question, use comparative language: "greater", "higher", "more spread out", "less variable". Avoid vague terms like "better" unless the context makes clear what "better" means.

Worked Examples

Example 1 — Reading a Back-to-Back Stem-and-Leaf

The stem-and-leaf below shows waiting times (minutes) for two bus routes A and B.

Route A |Stem| Route B
9 7 4 | 0 | 3 5 8
8 6 5 2 | 1 | 1 4 6 9
7 3 1 | 2 | 0 2 5
6 | 3 | 1 4
Key: 1|4 means 14 min (Route B); 4|1 means 14 min (Route A)

Find the median and IQR for Route B.

Step 1 — List Route B in order: 3, 5, 8, 11, 14, 16, 19, 20, 22, 25, 31, 34. n = 12.

Step 2 — Median: Average of 6th and 7th values = (16 + 19)/2 = 17.5 minutes. B1

Step 3 — Q1: Lower half: 3, 5, 8, 11, 14, 16. Median = (8+11)/2 = 9.5. M1

Step 4 — Q3: Upper half: 19, 20, 22, 25, 31, 34. Median = (22+25)/2 = 23.5. M1

Step 5 — IQR: 23.5 − 9.5 = 14 minutes. A1

Example 2 — Constructing a Box Plot & Finding Outliers

For Route A from Example 1: Values are 4, 7, 9, 12, 15, 16, 18, 21, 23, 27, 36. Find any outliers and draw a box plot.

Step 1 — Five-number summary: n = 11, so the median is the 6th value. Ordered data: 4, 7, 9, 12, 15, 16, 18, 21, 23, 27, 36.
Min = 4. Median = 16 (6th value). Lower half: 4, 7, 9, 12, 15, so Q1 = 9. Upper half: 18, 21, 23, 27, 36, so Q3 = 23. Max = 36. B2

Step 2 — IQR: 23 − 9 = 14. M1

Step 3 — Outlier fences:
Lower: 9 − 1.5×14 = 9 − 21 = −12 (no values below this)
Upper: 23 + 1.5×14 = 23 + 21 = 44. No values above 44, so no outliers. M1 A1

Step 4 — Box plot: Draw a box from 9 to 23, median line at 16. Left whisker to 4. Right whisker to 36. A1

Example 3 — Histogram with Unequal Class Widths

Draw a histogram: Class 0–5 (f=10), 5–10 (f=15), 10–20 (f=16), 20–40 (f=12), 40–80 (f=8).

Step 1 — Class widths: 5, 5, 10, 20, 40. M1

Step 2 — Frequency densities:
0–5: 10/5 = 2.0
5–10: 15/5 = 3.0
10–20: 16/10 = 1.6
20–40: 12/20 = 0.6
40–80: 8/40 = 0.2. M1 A3

Step 3 — Plot: Draw adjacent bars with heights equal to freq. densities. Label y-axis "Frequency Density". B1

Check: Total area = 10+15+16+12+8 = 61. ✓ Each bar area = freq. density × class width. A1

Example 4 — Finding Frequency From a Histogram

A histogram bar for the class 30–45 has frequency density 2.8. Find the frequency.

Class width = 45 − 30 = 15. M1

Frequency = 2.8 × 15 = 42. A1

Example 5 — Mean and Median from Grouped Data

Heights (cm): 150–155 (f=3), 155–160 (f=8), 160–165 (f=12), 165–170 (f=7). Find the mean and estimate the median.

Midpoints: 152.5, 157.5, 162.5, 167.5. Σf = 30. M1

Σfx: 3×152.5 + 8×157.5 + 12×162.5 + 7×167.5 = 457.5 + 1260 + 1950 + 1172.5 = 4840. M1

Mean = 4840/30 = 161.33 cm. A1

Median: n/2 = 15th value. Cumulative: 3, 11, 23. Median class: 160–165 (cum goes past 15). M1

L = 160, F = 11, f = 12, w = 5. Median = 160 + [(15−11)/12]×5 = 160 + 1.667 = 161.67 cm. A1

Example 6 — Variance and Standard Deviation

For the heights table above, calculate variance and SD.

Σfx²: 3×152.5² + 8×157.5² + 12×162.5² + 7×167.5²
= 3×23256.25 + 8×24806.25 + 12×26406.25 + 7×28056.25
= 69768.75 + 198450 + 316875 + 196393.75 = 781487.5. M1

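σ² = 781487.5/30 − (4840/30)² = 26049.58 − 26028.44 = 21.14 cm². M1 A1

σ = √21.14 ≈ 4.60 cm. A1

Use the unrounded mean (4840/30) when squaring: rounding x̄ to 161.33 first changes the variance by about 1 cm².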

Example 7 — Coded Data

Times x are coded as y = (x − 50)/10. From the coded data: Σy = 24, Σy² = 150, n = 12. Find x̄ and σx.

ȳ = 24/12 = 2. x̄ = 10×2 + 50 = 70. M1 A1

σy² = Σy²/n − ȳ² = 150/12 − 4 = 12.5 − 4 = 8.5. M1

σx = b × σy = 10 × √8.5 = 10 × 2.915 = 29.15. M1 A1
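The decoding steps can be checked with a short Python sketch. The function name and argument order are assumptions for this illustration; it takes the coding constants a and b together with the coded summary values Σy, Σy² and n.

```python
from math import sqrt


def decode_summary(a, b, sum_y, sum_y2, n):
    """Recover (mean, SD) of x from coded sums, where y = (x - a) / b."""
    mean_y = sum_y / n
    var_y = sum_y2 / n - mean_y ** 2
    mean_x = b * mean_y + a          # undo the scale, then the shift
    sd_x = abs(b) * sqrt(var_y)      # the shift a does not affect spread
    return mean_x, sd_x


mean_x, sd_x = decode_summary(a=50, b=10, sum_y=24, sum_y2=150, n=12)
print(mean_x, round(sd_x, 2))
```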

Example 8 — Comparing Two Distributions

Route A: median = 16 min, IQR = 14 min, positively skewed. Route B: median = 17.5 min, IQR = 14 min, slightly positive skew. Write a comparison.

Central tendency: "Route B has a slightly higher median waiting time (17.5 min vs 16 min), so on average Route B passengers wait slightly longer." B1

Spread: "Both routes have the same IQR (14 min), indicating equal consistency in the middle 50% of waiting times." B1

Skewness: "Both distributions are positively skewed, meaning there are occasional very long waits on each route." B1

Common Mistakes

Mistake 1 — Using Frequency Instead of Frequency Density on a Histogram

WRONG: Drawing a bar at height = frequency (e.g., height = 16 for class 10–20 with width 10).

CORRECT: Height = frequency density = 16/10 = 1.6. Area of bar = 1.6 × 10 = 16 = frequency. ✓

Mistake 2 — Wrong Outlier Formula

WRONG: Outlier if value > Q3 + 1.5 × Q3 (using Q3 instead of IQR).

CORRECT: Outlier if value < Q1 − 1.5 × IQR or value > Q3 + 1.5 × IQR. The 1.5 multiplies the IQR, NOT Q1 or Q3.

Mistake 3 — Using Class Boundaries Instead of Midpoints for the Mean

WRONG: For class 20–30, using x = 20 or x = 30 in Σfx.

CORRECT: Use the midpoint x = (20+30)/2 = 25. The midpoint is the best estimate for all values in the class.

Mistake 4 — Variance Formula Sign Error

WRONG: σ² = Σfx²/n + x̄² (adding instead of subtracting).

CORRECT: σ² = Σfx²/n − x̄². The mean squared is subtracted. This always gives σ² ≥ 0.

Mistake 5 — Forgetting the Class Width in Median Interpolation

WRONG: Median = L + (n/2 − F)/f (missing the × w at the end).

CORRECT: Median = L + [(n/2 − F)/f] × w. The formula gives a fraction of the class width added to the lower boundary.

Mistake 6 — Coded Variance: Forgetting to Square b

WRONG: σx² = b × σy² (using b, not b²).

CORRECT: σx² = b² × σy². When y = (x−a)/b, the variance is divided by b², so σx² = b² σy². But σx = b × σy (square root cancels the square).

Mistake 7 — Skewness: Confusing Positive and Negative

WRONG: "A longer left tail means positive skew" — mixing up which direction means which.

CORRECT: Positive skew = longer RIGHT tail (mean pulled UP by high values, so mean > median > mode). Negative skew = longer LEFT tail (mean < median < mode).

Mistake 8 — Comparison Without Context

WRONG: "Group A has mean 42, Group B has mean 35." (No comparison, no context.)

CORRECT: "Group A has a higher mean (42 vs 35), suggesting Group A students generally performed better on the test." State the comparison AND interpret it.

Key Formulas

Formula	Description
x̄ = Σfx / Σf	Mean from frequency table (use midpoints for grouped data)
Median = L + [(n/2 − F)/f] × w	Median by interpolation in grouped data; L=lower boundary, F=cumulative freq before, f=freq in class, w=width
IQR = Q3 − Q1	Interquartile range — spread of middle 50%
Outlier: x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR	Cambridge outlier criterion
Freq. Density = Frequency / Class Width	y-axis value for histograms with unequal class widths
Frequency = Freq. Density × Class Width	Recover frequency from histogram bar area
σ² = Σfx²/Σf − (Σfx/Σf)²	Variance (population) from frequency table
σ = √σ²	Standard deviation
Sxx = Σfx² − (Σfx)²/n	Sum of squares (Cambridge notation); σ² = Sxx/n
Coded mean: x̄ = bȳ + a	Recover mean when y = (x−a)/b
Coded SD: σx = b × σy	Recover SD when y = (x−a)/b (b > 0)
Coded variance: σx² = b² × σy²	Recover variance when y = (x−a)/b
Positive skew: Mode < Median < Mean	Right tail longer; mean pulled up by high values
Negative skew: Mean < Median < Mode	Left tail longer; mean pulled down by low values

Proof Bank

Proof 1 — Variance Alternative Form: σ² = Σfx²/n − x̄²

Starting from the definition of variance as mean squared deviation:

σ² = (1/n) Σf(x − x̄)²

Expand (x − x̄)² = x² − 2x·x̄ + x̄²:

σ² = (1/n) Σf(x² − 2x·x̄ + x̄²)

Distribute the sum (using linearity):

σ² = (1/n)[Σfx² − 2x̄·Σfx + x̄²·Σf]

Now use Σfx = n·x̄ and Σf = n:

σ² = (1/n)[Σfx² − 2x̄·(n·x̄) + x̄²·n]

= (1/n)[Σfx² − 2n·x̄² + n·x̄²]

= (1/n)[Σfx² − n·x̄²]

σ² = Σfx²/n − x̄² ∎

This is the "computational formula" — it avoids having to compute each (x − x̄) individually, which is error-prone and slow.
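A quick numerical check of the identity, using the midpoints and frequencies of the grouped table from the Measures of Central Tendency section, might look like this in Python. The variable names are illustrative only.

```python
from math import isclose

# Midpoints and frequencies from the grouped table (classes 10–20, 20–30, 30–40, 40–50).
xs = [15, 25, 35, 45]
fs = [4, 7, 5, 4]

n = sum(fs)
mean = sum(f * x for f, x in zip(fs, xs)) / n

# Definition: mean of the squared deviations from the mean.
by_definition = sum(f * (x - mean) ** 2 for f, x in zip(fs, xs)) / n

# Computational form: mean of the squares minus the square of the mean.
by_shortcut = sum(f * x * x for f, x in zip(fs, xs)) / n - mean ** 2

print(by_definition, by_shortcut)
assert isclose(by_definition, by_shortcut)
```

Both routes give the same variance; the second needs only the column totals.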

Proof 2 — Why Divide by n (Population) not n−1 (Sample)

The variance formula σ² = Σ(x−x̄)²/n is the population variance, used when you have data for the entire group.

In inferential statistics, when you only have a sample from a larger population and want to estimate the population variance, you use:

s² = Σ(x−x̄)²/(n−1) (sample variance — Bessel's correction)

Why n−1? When you compute x̄ from the sample, you use one "degree of freedom" estimating the mean. The deviations (x−x̄) are constrained to sum to zero, so only n−1 of them are free to vary. Dividing by n−1 corrects the bias — it makes s² an unbiased estimator of the true population variance σ².

In Cambridge S1, you are always given data for a specific dataset (not sampling from an infinite population), so you divide by n. The formula σ² = Σfx²/n − x̄² is correct for all S1 questions.
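The difference between the two divisors can be seen with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1. The data below are the Route A waiting times from Example 2; the variable names are illustrative.

```python
from statistics import pvariance, variance

waits = [4, 7, 9, 12, 15, 16, 18, 21, 23, 27, 36]   # Route A waiting times (minutes)
n = len(waits)

pop_var = pvariance(waits)      # divides by n: the form used in S1
sample_var = variance(waits)    # divides by n - 1: Bessel's correction

print(pop_var, sample_var)
print(sample_var * (n - 1) / n)  # rescaling the sample variance recovers the population form
```

The sample variance is always the larger of the two, by a factor of n/(n − 1).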

Proof 3 — Coded Variance: Why σx = b·σy when y = (x−a)/b

Let y = (x − a)/b. Then x = by + a.

Variance of x: σx² = E[(x − x̄)²]

Since x = by + a and x̄ = bȳ + a:

x − x̄ = (by + a) − (bȳ + a) = b(y − ȳ)

So: σx² = E[(b(y − ȳ))²] = b² · E[(y − ȳ)²] = b² · σy²

σx² = b²σy² and σx = |b|σy ∎

Adding the constant a has no effect on variance — it merely shifts all values by the same amount, so all deviations from the mean are unchanged.
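The result can be checked numerically on a small dataset. The raw values and the coding constants below are made up for this illustration; `mean` and `pstdev` (population standard deviation) come from Python's standard `statistics` module.

```python
from math import isclose
from statistics import mean, pstdev

a, b = 45, 5                      # hypothetical coding constants
x = [35, 40, 45, 50, 55, 60]      # made-up raw data
y = [(v - a) / b for v in x]      # coded values: -2, -1, 0, 1, 2, 3

print(mean(x), b * mean(y) + a)         # the mean is recovered with b and a
print(pstdev(x), abs(b) * pstdev(y))    # the SD is recovered with |b| only
```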

Proof 4 — Median Interpolation Formula

Suppose the median is in a class [L, L+w) with frequency f, and F is the cumulative frequency before this class.

We need the (n/2)th value. Within the class, we need the (n/2 − F)th value out of f values.

Assuming values are uniformly spread within the class, the position fraction is (n/2 − F)/f.

Multiplying this fraction by the class width w gives the distance into the class:

Median = L + [(n/2 − F)/f] × w ∎

This is a linear interpolation — it treats the data as uniformly distributed within each class. It is always an estimate for grouped data.

Exercise 1 — Stem-and-Leaf & Box Plots (10 Questions)

Exercise 2 — Histograms (10 Questions)

Exercise 3 — Mean & Median from Frequency Tables (10 Questions)

Exercise 4 — Variance & Standard Deviation (10 Questions)

Exercise 5 — Coded Data & Comparison (10 Questions)

Practice — 30 Mixed Questions

Challenge — 15 Harder Questions

Exam Style Questions (8 Questions)

Cambridge S1 style questions. Attempt each before revealing the mark scheme.

Question 1 [6 marks]

The heights (cm) of 20 students are summarised: Q1 = 162, median = 168, Q3 = 174. The minimum is 150 and maximum is 195.

(i) Show that 195 is an outlier. (ii) Draw a box-and-whisker plot. (iii) State the type of skewness and give a reason.

(i) IQR = 174 − 162 = 12. Upper fence = 174 + 1.5×12 = 174 + 18 = 192. Since 195 > 192, it is an outlier. [M1 A1]

(ii) Box from 162 to 174, median at 168. Left whisker to 150. Right whisker to the largest value that is not an outlier (the summary above does not state it; 188 is assumed here for illustration). Outlier marked at 195 with ×. [B2]

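(iii) Positive skew: the upper tail is longer (195 − 174 = 21 compared with 162 − 150 = 12) and contains an outlier. The median (168) is exactly central in the box, since 168 − 162 = 174 − 168 = 6, so the box alone does not show the skew. [B1 reason B1]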

Question 2 [5 marks]

A frequency table for weights (kg): 50–60 (f=4), 60–70 (f=9), 70–80 (f=11), 80–90 (f=6). Find the mean and an estimate for the standard deviation.

Midpoints: 55, 65, 75, 85. Σf = 30. [B1]
Σfx = 4×55 + 9×65 + 11×75 + 6×85 = 220 + 585 + 825 + 510 = 2140. [M1]
Mean = 2140/30 = 71.33 kg. [A1]
Σfx² = 4×3025 + 9×4225 + 11×5625 + 6×7225 = 12100 + 38025 + 61875 + 43350 = 155350. [M1]
σ² = 155350/30 − (2140/30)² = 5178.33 − 5088.44 = 89.89. σ = √89.89 ≈ 9.48 kg. [A1]

Question 3 [4 marks]

A histogram has bars: 0–4 (fd=3.5), 4–6 (fd=7), 6–10 (fd=4.5), 10–20 (fd=1.2). Find the total frequency and the modal class.

Frequencies: 3.5×4=14, 7×2=14, 4.5×4=18, 1.2×10=12. [M1 A2]
Total = 14+14+18+12 = 58. [A1]
Modal class: 4–6 (highest frequency density = 7). [B1]

Question 4 [6 marks]

Data coded using y = (x − 40)/5 gives Σy = 30, Σy² = 220, n = 15.
(i) Find x̄. (ii) Find σy and hence σx.

(i) ȳ = 30/15 = 2. x̄ = 5×2 + 40 = 50. [M1 A1]

(ii) σy² = 220/15 − 2² = 14.667 − 4 = 10.667. σy = √10.667 ≈ 3.266. [M1 A1]
σx = 5 × 3.266 = 16.33. [M1 A1]

Question 5 [5 marks]

Estimate the median from: Age 0–10 (f=5), 10–20 (f=14), 20–30 (f=21), 30–50 (f=10). Show your working clearly.

Σf = 50. Median at n/2 = 25th value. [B1]
Cumulative: 5, 19, 40. Median falls in class 20–30. [M1]
L = 20, F = 19, f = 21, w = 10. [B1]
Median = 20 + [(25−19)/21] × 10 = 20 + 60/21 = 20 + 2.857 = 22.86. [M1 A1]

Question 6 [4 marks]

For dataset A: mean = 45.2, SD = 8.3. For dataset B: mean = 45.2, SD = 14.7. Both are approximately symmetric. Compare the two distributions fully.

Central tendency: "Both datasets have the same mean (45.2), so on average their values are equal." [B1]
Spread: "Dataset B has a much larger standard deviation (14.7 vs 8.3), indicating B is considerably more spread out / variable." [B1]
Skewness: "Both are approximately symmetric." [B1]
Context (if given): Apply each point to the real-world meaning. [B1]

Question 7 [5 marks]

A back-to-back stem-and-leaf diagram shows test scores for classes X and Y. Class X leaves for stem 5: 8 6 3 1. Class Y leaves: 2 4 7 9. Stem 6: X leaves 7 5 2, Y leaves 0 3 5 8 8. Identify which class has the higher median and which has more spread (by IQR), giving values.

Class X values (from diagram): 51, 53, 56, 58, 62, 65, 67. n=7. Median = 58 (4th value). Q1=53, Q3=65. IQR=12. [M1 A1]
Class Y values: 52, 54, 57, 59, 60, 63, 65, 68, 68. n=9. Median = 60 (5th value). Q1=55.5, Q3=66.5. IQR=11. [M1 A1]
Class Y has higher median (60 vs 58). Class X has slightly larger IQR (12 vs 11). [A1]

Question 8 [6 marks]

The following values are given for a dataset: Σx = 360, Σx² = 14800, n = 12.
(i) Calculate mean and variance. (ii) A value of 45 is found to be an error and replaced with 51. Find the new mean and new variance.

(i) x̄ = 360/12 = 30. [B1] σ² = 14800/12 − 30² = 1233.33 − 900 = 333.33. [M1 A1]

(ii) New Σx = 360 − 45 + 51 = 366. New x̄ = 366/12 = 30.5. [M1 A1]
New Σx² = 14800 − 45² + 51² = 14800 − 2025 + 2601 = 15376.
New σ² = 15376/12 − 30.5² = 1281.33 − 930.25 = 351.08. [M1 A1]
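The correction in part (ii) only needs the two running totals to be adjusted, which can be sketched in Python as below. The function name and argument order are assumptions for this illustration.

```python
def replace_value(n, sum_x, sum_x2, old, new):
    """Return (mean, variance) after swapping one data value, using only the totals."""
    sum_x = sum_x - old + new
    sum_x2 = sum_x2 - old ** 2 + new ** 2
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2
    return mean, variance


new_mean, new_var = replace_value(n=12, sum_x=360, sum_x2=14800, old=45, new=51)
print(new_mean, round(new_var, 2))
```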

Past Paper Questions

Cambridge 9709 S1 style questions, each followed by full working.

Past Paper 1 — June 2019 Style [7 marks]

The lengths (mm) of 80 nails are summarised in the table below:

25–30: f=6, 30–35: f=19, 35–40: f=28, 40–45: f=17, 45–55: f=10

(i) Draw a histogram. (ii) Estimate the mean length. (iii) Estimate the standard deviation.

(i) Class widths: 5,5,5,5,10. Frequency densities: 1.2, 3.8, 5.6, 3.4, 1.0. Draw adjacent bars with these heights. Label y-axis "Frequency density". [B1 M1 A2]

(ii) Midpoints: 27.5, 32.5, 37.5, 42.5, 50. Σfx = 6×27.5 + 19×32.5 + 28×37.5 + 17×42.5 + 10×50 = 165 + 617.5 + 1050 + 722.5 + 500 = 3055. Mean = 3055/80 = 38.2 mm. [M1 A1]

(iii) Σfx² = 6×756.25 + 19×1056.25 + 28×1406.25 + 17×1806.25 + 10×2500 = 4537.5 + 20068.75 + 39375 + 30706.25 + 25000 = 119687.5. σ² = 119687.5/80 − (3055/80)² = 1496.094 − 1458.285 = 37.81. σ = √37.81 ≈ 6.15 mm. [M1 A1]

Past Paper 2 — November 2020 Style [6 marks]

The times (seconds) taken by 40 athletes to complete a race are recorded. The coded variable y = (x − 60)/2 gives Σy = −24 and Σy² = 116.
(i) Find the mean time x̄. (ii) Find the variance of the times. (iii) State the effect on x̄ and σ if all times increase by 3 seconds.

(i) ȳ = −24/40 = −0.6. x̄ = 2×(−0.6) + 60 = −1.2 + 60 = 58.8 seconds. [M1 A1]

(ii) σy² = 116/40 − (−0.6)² = 2.9 − 0.36 = 2.54. σx² = 2² × 2.54 = 4 × 2.54 = 10.16 s². [M1 A1]

(iii) Mean increases by 3: new x̄ = 61.8 s. σ is unchanged (adding a constant does not affect spread). [B1 B1]

Past Paper 3 — June 2018 Style [5 marks]

A box-and-whisker plot for plant heights (cm) shows: min=12, Q1=18, Q2=25, Q3=34, max=52. A value is an outlier if it is more than 1.5×IQR beyond Q1 or Q3.
(i) Identify any outliers. (ii) Describe the skewness of the distribution, giving two reasons.

(i) IQR = 34 − 18 = 16. Lower fence = 18 − 24 = −6. Upper fence = 34 + 24 = 58. No values outside [−6, 58], so no outliers. [M1 A1]

(ii) Positive skew. Reason 1: The median (25) is closer to Q1 (18) than to Q3 (34) — the distance from Q1 to Q2 is 7, while Q2 to Q3 is 9. Reason 2: The right whisker (52−34=18) is longer than the left whisker (18−12=6). [B1 B1 B1]

Past Paper 4 — November 2017 Style [6 marks]

Two groups sat a test. Group P: n=15, Σx=675, Σx²=31500. Group Q: n=10, Σx=460, Σx²=22000.
(i) Find the mean and variance for each group. (ii) Make two comparisons in context (the test was out of 60).

(i) Group P: x̄ = 675/15 = 45. σP² = 31500/15 − 45² = 2100 − 2025 = 75. σP = 8.66. [M1 A1]
Group Q: x̄ = 460/10 = 46. σQ² = 22000/10 − 46² = 2200 − 2116 = 84. σQ = 9.17. [M1 A1]

(ii) Central tendency: "Group Q has a slightly higher mean mark (46 vs 45), suggesting Group Q performed marginally better on average." [B1]
Spread: "Group Q has a larger variance (84 vs 75), indicating Group Q's marks were slightly more spread out / inconsistent." [B1]

Past Paper 5 — June 2016 Style [7 marks]

A dataset of 50 values has Q1=23.5, Q3=41.5. Its stem-and-leaf diagram includes a leaf of 8 on stem 0 (representing 8) and a leaf of 2 on stem 6 (representing 62).
(i) Find the IQR. (ii) Determine whether 8 and 62 are outliers. (iii) The median is 31 and mean is 33.2. State the skewness and give a reason.

(i) IQR = 41.5 − 23.5 = 18. [B1]

(ii) Lower fence = 23.5 − 1.5×18 = 23.5 − 27 = −3.5. Since 8 > −3.5, 8 is NOT an outlier. [M1 A1]
Upper fence = 41.5 + 1.5×18 = 41.5 + 27 = 68.5. Since 62 < 68.5, 62 is NOT an outlier. [M1 A1]

(iii) Positive skew — the mean (33.2) is greater than the median (31), which occurs when the distribution has a longer upper tail pulling the mean upward. [B1 B1]
