Section 1: Data Concepts and Analysis
Uses of Statistical Analysis
Description and analysis
Inference
Assessing risk and probability
Identifying important relationships
Descriptive Statistics
Measures of central tendency
Mean
o Sum of scores divided by number of scores
Median
o Order scores by size, then take the middle datapoint (with an even number of scores, take the average of the two middle values)
Mode
o Most common score or category
When summarising data, use descriptive statistics, as they reduce the amount of
information in order to increase its clarity
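A minimal Python sketch of the three measures (the scores list is made up purely for illustration; the standard-library statistics module provides all three):

    # Mean, median and mode of a small, made-up set of scores.
    from statistics import mean, median, mode

    scores = [2, 3, 3, 5, 7, 8, 9]

    print(mean(scores))    # sum of scores / number of scores = 37/7 ≈ 5.29
    print(median(scores))  # middle score once sorted = 5
    print(mode(scores))    # most common score = 3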
Deceptive Description
Things to look out for:
Precision versus accuracy
o Precision refers to the exactitude with which something can be stated; accuracy is how well the information fits the truth
The question being asked
o The way in which the question is asked affects the answer
Unit of analysis
o The unit on which the analysis focuses can affect how accurately the result represents the whole population
Mean versus median
o The two convey different information; the typical value can appear larger or smaller depending on which measure of central tendency is chosen
Units of measurement
o The unit of measurement chosen can change the impression the output gives
Percent versus absolute
o e.g. a 10% raise sounds fair until one person earns $50,000 and another earns $500,000: the same percentage is $5,000 for the first and $50,000 for the second
Aggregation
o How the data is aggregated or subdivided creates different
impressions
Measuring the right thing
o What is measured can alter incentives
Correlation
The correlation coefficient can be used to summarise the nature of the relationship
between two variables. r is called the Pearson correlation coefficient; because it is
standardised, the two variables do not need to be measured in the same units.
Pearson's correlation coefficient is a number between -1.0 and 1.0 and conveys both
direction (positive or negative) and strength (the magnitude of the number). It
measures only the linear relationship, so it is not suitable for non-linear
relationships. Correlation is NOT causation
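A minimal sketch of r computed from its definition (the height and weight numbers are made up purely for illustration; note the two variables are in different units):

    # Pearson's r: covariance divided by the product of the standard deviations.
    from math import sqrt

    heights = [150, 160, 165, 172, 180, 188]  # cm
    weights = [52, 58, 63, 70, 79, 85]        # kg

    def pearson_r(xs, ys):
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        ss_x = sum((x - mean_x) ** 2 for x in xs)
        ss_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / sqrt(ss_x * ss_y)

    print(pearson_r(heights, weights))  # close to +1: a strong positive linear relationship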
Probability
Probability is the study of events and outcomes involving an element of
uncertainty. Probabilities all lie between 0.0 and 1.0; some events have inherent
probabilities, while others are inferred from past data.
Cumulative probabilities have important implications. If many variables are
measured in a study and there are no true relationships, the chance of seeing a
large correlation for one specific variable is fairly low, but the chance of finding
a large correlation for at least one of the variables may be high even though it is a fluke.
Therefore it is very important not to assume that correlation means causation.
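A small simulation sketch of this point (made-up data; statistics.correlation requires Python 3.10+): with 20 predictors that have no real relationship with the outcome, the largest correlation found is often sizeable purely by chance.

    # Multiple comparisons: the biggest of many chance correlations can look impressive.
    import random
    from statistics import correlation  # Python 3.10+

    random.seed(1)
    n, k = 30, 20  # 30 observations, 20 unrelated predictors
    outcome = [random.gauss(0, 1) for _ in range(n)]
    predictors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

    largest = max(abs(correlation(p, outcome)) for p in predictors)
    print(largest)  # typically around 0.3-0.5 even though every true correlation is 0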
There are two types of multiple-event probabilities: joint events (the probability
of one event AND another event occurring) and disjoint events (the probability of
one event OR another event occurring, where the events are mutually exclusive).
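A tiny worked example using a fair six-sided die (assumed purely for illustration): joint probabilities of independent events multiply, while probabilities of disjoint events add.

    # Joint (AND) vs disjoint (OR) probabilities for a fair die.
    p_six = 1 / 6

    # AND: two sixes on two independent rolls -> multiply.
    p_two_sixes = p_six * p_six        # 1/36 ≈ 0.028

    # OR: a one or a two on a single roll (mutually exclusive) -> add.
    p_one_or_two = (1 / 6) + (1 / 6)   # 2/6 ≈ 0.333

    print(p_two_sixes, p_one_or_two)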
Expected value is the sum, over all possible outcomes, of each outcome's payoff
multiplied by its probability.
Law of large numbers: as the number of independent trials increases, the mean
of the outcomes will get closer to the expected value
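A minimal sketch tying the two ideas together, using a made-up game (win $10 with probability 0.2, otherwise $0, so the expected value is 0.2 × 10 + 0.8 × 0 = $2): as the number of plays grows, the average payoff drifts towards $2.

    # Law of large numbers: the sample mean approaches the expected value.
    import random

    random.seed(0)

    def play():
        # Made-up game: $10 payoff with probability 0.2, otherwise $0.
        return 10 if random.random() < 0.2 else 0

    for n in (10, 1_000, 100_000):
        mean_payoff = sum(play() for _ in range(n)) / n
        print(n, mean_payoff)  # approaches the expected value of 2.0 as n grows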
Problems with Probability
Probabilities allow us to quantify future events and are thus important aids to
good decision making, but we have to be aware of the errors that may arise in
calculating and interpreting probabilities:
Assuming events are independent when they are not
Not understanding when events ARE independent
Clusters happen
The prosecutor’s fallacy
o One conditional probability cannot simply be inferred from the other; the probability of the evidence given innocence must be compared with the probability of innocence given the evidence (see the sketch after this list)
Regression to the mean
Statistical discrimination
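A minimal Bayes' theorem sketch of the prosecutor's fallacy (all numbers are made up for illustration): evidence that is very unlikely to match an innocent person does not by itself make guilt very likely, because the pool of innocent people is huge.

    # Prosecutor's fallacy: P(match | innocent) is not P(innocent | match).
    p_match_given_innocent = 1 / 10_000   # chance the evidence matches an innocent person
    p_match_given_guilty = 1.0            # the guilty person always matches
    p_guilty = 1 / 1_000_000              # one guilty person among a million possible suspects

    # Bayes' theorem: P(guilty | match).
    p_match = (p_match_given_guilty * p_guilty
               + p_match_given_innocent * (1 - p_guilty))
    p_guilty_given_match = p_match_given_guilty * p_guilty / p_match

    print(p_guilty_given_match)  # ≈ 0.01, nowhere near the 0.9999 the fallacy suggests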
Three different types of probability
Classical probability is deduced from the properties of well-defined objects
Frequentist probability is derived from the frequency of events
Subjective probability expresses a belief about the likelihood of an event