Explorations in Statistics

(or why you should never extrapolate from tiny sets of data)

I was wrong today. That was pretty cool.

It all started when my fellow second-year computer science students began getting results back for an assignment written in Haskell.

I got my result back. 65%. I was shocked. I normally average 85%+, and here I was, with what I considered to be an awful mark. There wasn’t even any feedback on where I’d gone right or wrong. Just a single, cold mark.
Naturally I reached out and asked my friends what they’d got.


They’d all received feedback, too. This made no sense: these were people who had come to me for help during the assignment. Why were they scoring higher? I asked another friend, who I was sure would have scored close to 90%, what he’d got. His response? 64%. What’s more, he hadn’t received feedback either. Curious.

A second friend revealed that she, too, had got a much lower mark than expected (61%), and no feedback. Something was up here. People with feedback had higher marks than those without. We reasoned that there must have been two markers: one who had supplied feedback and marked nicely, and one who was a grinch.

So, I made a post on the class Facebook page, asking people to send me their marks and whether or not they received feedback, and I kept track of the numbers. Recently in our Data Driven Computing module, we’d been learning about Bayesian classifiers and normal probability curves. This seemed a great chance to apply what I’d learnt to a real situation!

I put the data into a CSV file, loaded it up in MATLAB, and plotted a bar graph. Nothing complex, just short and sweet.

Naive bar chart
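In Python terms (the original analysis was done in MATLAB, and the marks below are illustrative placeholders rather than the real class data), that first step looked roughly like this:

```python
import csv
import io

# Hypothetical stand-in for the real CSV: each row is a mark and whether
# that student received feedback. These numbers are placeholders.
CSV_DATA = """mark,feedback
65,no
64,no
61,no
82,yes
78,yes
88,yes
"""

# Group the marks by whether feedback was given -- one group per marker.
marks = {"yes": [], "no": []}
for row in csv.DictReader(io.StringIO(CSV_DATA)):
    marks[row["feedback"]].append(int(row["mark"]))

print(marks["yes"])  # marks from the marker who gave feedback
print(marks["no"])   # marks from the marker who didn't
```

A bar chart of those two groups is exactly what the first graph showed.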

There was a definite, noticeable trend. Marker A (the blue bars, who gave feedback) was clearly giving higher scores than Marker B.
This was just not on, so I gathered more data and refined the graph until I had the following graphic.

Naive bar chart, but with more info

I’d generated two normal distribution curves using the mean and standard deviation of the data I had. Quite clearly, there was a skew! There was a 16-point difference between the means of the two markers. We had irrefutable proof: it was so obvious that some of us had been shafted whereas others had an easier ride, mark-wise.
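The curve-fitting step itself is simple: compute each group's mean and standard deviation, then draw a normal density with those parameters. A minimal Python sketch (the marks here are hypothetical placeholders, not the real data):

```python
import math

def fit_normal(marks):
    """Return (mean, sd) of a list of marks, using the sample sd (ddof=1)."""
    n = len(marks)
    mean = sum(marks) / n
    var = sum((m - mean) ** 2 for m in marks) / (n - 1)
    return mean, math.sqrt(var)

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) at x -- one point on the fitted curve."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Hypothetical marks for the two markers (placeholders only).
marker_a = [82, 78, 88, 74, 91]
marker_b = [65, 64, 61, 70, 58]

mean_a, sd_a = fit_normal(marker_a)
mean_b, sd_b = fit_normal(marker_b)
print(mean_a - mean_b)  # the gap that looked so damning on the graph
```

Evaluating `normal_pdf` over a range of x values gives the two bell curves overlaid on the bar chart.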

So, off I shot an email to our professor, showing him the injustice, asking what could be done, and if we could see the true means and standard deviations of the markers.

My professor’s response:

Marker A has an average mark of 71 with standard deviation 23.
Marker B has an average mark of 64 with standard deviation of 18.

My stats are rusty but I don’t think that’s a statistically significant
difference: B’s average is well within 1 SD of A’s average.

Wait, what? That’s not what our data suggested! We’d shown an obvious bias, yet if you plotted the new data…

Correct normal distribution graphs

…suddenly there’s not so much of a difference. They’re a lot closer together.
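Just how close? Using the professor's figures (A: mean 71, SD 23; B: mean 64, SD 18), a quick effect-size check in Python (Cohen's d wasn't in the original exchange; it's a standard measure I'm adding here for scale):

```python
import math

mean_a, sd_a = 71, 23  # Marker A, from the professor's email
mean_b, sd_b = 64, 18  # Marker B

diff = mean_a - mean_b                         # 7 marks
in_a_sds = diff / sd_a                         # gap measured in A's SDs
pooled_sd = math.sqrt((sd_a**2 + sd_b**2) / 2)
cohens_d = diff / pooled_sd                    # standardised effect size

print(round(in_a_sds, 2))   # ~0.3: well within one SD, as the professor said
print(round(cohens_d, 2))   # ~0.34: conventionally a "small" effect
```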

I was wrong.

How could this have happened though? Let’s take a moment to look at the data again.

See that little n = 19 on my graph? 19 seemed a reasonable number of participants. Turns out, when your entire population is only 100, a sample of 19 is nowhere near enough. You WILL get skewed and inaccurate results if you try to extrapolate from such a small sample. For some reason, this had slipped my mind.
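You can see why by simulation: repeatedly draw 19 marks from a normal distribution with Marker A's true parameters (mean 71, SD 23) and watch how much the sample mean bounces around. A quick sketch using only Python's standard library:

```python
import random
import statistics

random.seed(0)

TRUE_MEAN, TRUE_SD = 71, 23   # Marker A's true figures, per the professor
N = 19                        # my actual sample size

# Simulate 10,000 hypothetical samples of 19 marks; record each sample mean.
sample_means = [
    statistics.mean(random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N))
    for _ in range(10_000)
]

# Theory says the sample mean has standard error SD / sqrt(N) ~= 5.3,
# so a sample of 19 routinely lands several marks away from the truth.
print(round(statistics.pstdev(sample_means), 1))
```

With sample means wandering by roughly ±5 marks per group, and differences between two small groups wandering even more, an apparent 16-point gap is far less damning than it looks.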

Consider sailors in past ages who, out at sea, encountered a deadly storm. Many would have prayed for safety and help. Those who returned home would spread the word that praying had kept them safe. But few consider the fates of those who prayed and were not saved; nobody is alive to tell you that they died. This is called silent evidence, and it can be incredibly tricky to work around.
Also check out Abraham Wald’s work on survivorship bias.
We also have to remember there’s a self-selection bias here: those with poor marks and no feedback were, of course, going to be upset, and more likely to vocalise it. People who were perfectly okay with their results, or who had no “interesting” data to add, may simply have said nothing. Silent evidence can sometimes be the strongest.

In my disappointment at having scored only 65%, I’d overlooked details so that I could be right. Of COURSE it wasn’t fair that I’d got a low mark: it must have been the markers’ fault!

The maths of the true results don’t lie: there’s barely enough of a difference to say anything about the data. As my professor noted, the mean of B is less than a third of a standard deviation away from the mean of A.

So, there we have it. A valuable life lesson learned, one that can be applied to almost any situation: “never trust conclusions drawn from a small sample without a robust measure of confidence”.

For the record, the rest of us did receive feedback, just later on. Unfortunately it wasn’t very clear or useful, so maybe the marker WAS a grinch after all.