Lessons

- Research process
- Types of data analysis
- Generating and testing theories
- Levels of measurement
- Validity and reliability
- Central tendency
- Types of hypotheses

The research process

Types of data analysis

Quantitative Methods
- Testing theories using numbers
Qualitative Methods
- Testing theories using language
  - Magazine articles/Interviews
  - Conversations
  - Newspapers
  - Media broadcasts

Initial observation

Find something that needs explaining
- Observe the real world
- Read other research
Test the concept: collect data
- Collect data to see whether your hunch is correct
- To do this you need to define variables
  - Anything that can be measured and can differ across entities or time.

Generating and testing theories

- Theory
  - A hypothesized general principle or set of principles that explains known findings about a topic and from which new hypotheses can be generated.
- Hypothesis
  - A prediction from a theory.
  - E.g. the proportion of people turning up for a Big Brother audition who have narcissistic personality disorder will be higher than the general level (1%) in the population.
- Falsification
  - The act of disproving a theory or hypothesis.

Generating and testing theories

A table of the number of people at the Big Brother audition split by whether they had narcissistic personality disorder and whether they were selected as contestants by the producers.

Why is reading other research important in the early stages of inquiry?

- To replicate existing models without modification.
- To avoid collecting new data.
- To understand existing theories and refine your hypothesis.
- To finalize your conclusions.

What is a theory in the context of statistical research?

- A general principle explaining known findings and generating new hypotheses.
- A method for collecting data.
- A proven fact about a population.
- A random guess about data.

Which of the following best defines a hypothesis?

- A prediction derived from a theory.
- A summary of previous research.
- A statistical model.
- A method for data collection.

Data collection 1: What to measure?

Hypothesis: Coca-Cola kills sperm.

- Independent variable
  - The proposed cause
  - A predictor variable
  - A manipulated variable (in experiments)
  - Coca-Cola in the hypothesis above
- Dependent variable
  - The proposed effect
  - An outcome variable
  - Measured, not manipulated (in experiments)
  - Sperm in the hypothesis above

Levels of measurement

- Categorical
  - Binary variable
    - There are only two categories, e.g. dead or alive.
  - Nominal variable
    - There are more than two categories, e.g. whether someone is an omnivore, vegetarian, vegan, or fruitarian.
  - Ordinal variable
    - The same as a nominal variable, but the categories have a logical order, e.g. whether people got a fail, a pass, a merit or a distinction in their exam.
- Continuous
  - Interval variable
    - Equal intervals on the variable represent equal differences in the property being measured, e.g. the difference between 6 and 8 is equivalent to the difference between 13 and 15.
  - Ratio variable
    - The same as an interval variable, but the ratios of scores on the scale must also make sense, e.g. a score of 16 on an anxiety scale means that the person is, in reality, twice as anxious as someone scoring 8.

Measurement error

- The discrepancy between the actual value we’re trying to measure, and the number we use to represent that value.
- Example:
  - You (in reality) weigh 80 kg.
  - You stand on your bathroom scales and they say 83 kg.
  - The measurement error is 3 kg.

Validity

- Whether an instrument measures what it set out to measure.
- Content validity
  - Evidence that the content of a test corresponds to the content of the construct it was designed to cover
- Ecological validity
  - Evidence that the results of a study, experiment or test can be applied, and allow inferences, to real-world conditions.

Reliability

- Reliability
  - The ability of the measure to produce the same results under the same conditions.
- Test–Retest Reliability
  - The ability of a measure to produce consistent results when the same entities are tested at two different points in time.

Data collection 2: How to measure

- Correlational research
  - Observing what naturally goes on in the world without directly interfering with it.
- Cross-sectional research
  - This term implies that data come from people at different age points, with different people representing each age point.
- Experimental research
  - One or more variables are systematically manipulated to see their effect (alone or in combination) on an outcome variable.
  - Statements can be made about cause and effect.

Fitting models to real-world data

Population and samples

- Population
  - The collection of units (be they people, plankton, plants, cities, suicidal authors, etc.) to which we want to generalize a set of findings or a statistical model
- Sample
  - A smaller (but hopefully representative) collection of units from a population used to determine truths about that population

A simple statistical model

\[ \text{outcome}_i = (\text{model}) + \text{error}_i \]

- In statistics we fit models to our data (i.e. we use a statistical model to represent what is happening in the real world).
- The mean is a hypothetical value (i.e. it doesn’t have to be a value that actually exists in the data set).
- As such, the mean is a simple statistical model.

The Mean

\[ \text{mean} (\bar{X}) = \frac{\Sigma^n_{i=1} x_i}{n} \]

- The mean is the sum of all scores divided by the number of scores.
- The mean is also the value from which the (squared) scores deviate least (it has the least error).
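The "least error" property can be checked numerically. The sketch below is only an illustration: the five scores and the grid of candidate values are assumptions chosen for the demonstration, and `sum_of_squares` is a hypothetical helper name. It tries many candidate "typical values" and picks the one with the smallest sum of squared differences from the scores.

```python
# Illustrative scores (an assumption for this sketch).
scores = [1, 3, 4, 3, 2]


def sum_of_squares(data, centre):
    """Sum of squared differences between each score and a candidate value."""
    return sum((x - centre) ** 2 for x in data)


mean = sum(scores) / len(scores)

# Candidate "typical values" from 0.0 to 5.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 51)]

# The candidate with the smallest squared error.
best = min(candidates, key=lambda c: sum_of_squares(scores, c))

print(mean, best)  # both are 2.6 for these scores
```

Any other candidate on the grid gives a larger squared error than the mean does, which is what "least error" means here.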

The Mean: example

Collect some data

\[ 1, 3, 4, 3, 2 \]

Add them up

\[ \Sigma^n_{i=1}x_i = 1 + 3 + 4 + 3 + 2 = 13 \]

Divide by the number of scores, \(n\):

\[ \bar{x} = \frac{\Sigma^n_{i=1}x_i}{n} = \frac{13}{5} = 2.6 \]
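The same three steps can be written as a short Python sketch. The variable names are illustrative, and `statistics.mean` from the standard library is used only as a cross-check.

```python
from statistics import mean

scores = [1, 3, 4, 3, 2]   # collect some data

total = sum(scores)        # add them up: 13
n = len(scores)            # number of scores: 5
x_bar = total / n          # divide by n: 2.6

# The standard-library function gives the same value.
assert x_bar == mean(scores)
print(x_bar)
```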

The Mean as a model

\[ \text{outcome}_i = (\text{model}) + \text{error}_i \]

\[ \text{outcome}_{\text{lecture1}} = (\bar{x}) + \text{error}_{\text{lecture1}} \]

\[ 1 = 2.6 + \text{error}_{\text{lecture1}} \]

Measuring the ‘fit’ of a model

- The mean is a model of what happens in the real world: the typical score.
- It is not a perfect representation of the data.
- How can we assess how well the mean represents reality?

A Perfect Fit

Calculating error

- A deviation is the difference between the mean and an actual data point.
- Deviations can be calculated by taking each score and subtracting the mean from it:

\[ \text{deviation} = x_i - \bar{x} \]

Use the total error?

- We could just take the errors between the mean and each score and add them up, but the total is always zero:

\[ \Sigma(x_i - \bar{x}) = 0 \]
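As a quick check, the deviations for the five scores from the mean example (1, 3, 4, 3, 2) can be computed and summed. This is a minimal sketch; the rounding is there only because floating-point arithmetic leaves tiny residues.

```python
scores = [1, 3, 4, 3, 2]
x_bar = sum(scores) / len(scores)          # 2.6

# Deviation of each score from the mean: x_i - x_bar
deviations = [x - x_bar for x in scores]
print([round(d, 1) for d in deviations])   # [-1.6, 0.4, 1.4, 0.4, -0.6]

# Positive and negative deviations cancel, so the total error is
# zero (up to floating-point rounding).
total_error = sum(deviations)
print(abs(total_error) < 1e-9)             # True
```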

Sum of squared errors

- We could add the deviations to find out the total error.
- Deviations cancel out because some are positive and others negative.
- Therefore, we square each deviation.
- If we add these squared deviations we get the sum of squared errors (SS).

Sum of squared errors

\[ SS = \Sigma(x_i - \bar{x})^2 = 5.20 \]
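The value 5.20 can be reproduced for the five scores from the mean example. A minimal sketch with illustrative variable names:

```python
scores = [1, 3, 4, 3, 2]
x_bar = sum(scores) / len(scores)   # 2.6

# Square each deviation so that negative and positive errors no longer cancel.
squared_deviations = [(x - x_bar) ** 2 for x in scores]
print([round(d, 2) for d in squared_deviations])   # [2.56, 0.16, 1.96, 0.16, 0.36]

# Sum of squared errors (SS).
ss = sum(squared_deviations)
print(round(ss, 2))   # 5.2
```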

Variance

- The sum of squares is a good measure of overall variability, but is dependent on the number of scores.
- We calculate the average variability by dividing the sum of squares by the number of scores minus one (\(N - 1\), where \(N\) is the number of scores).
- This value is called the variance (\(s^2\)).

\[ \text{variance} (s^2) = \frac{SS}{N-1} = \frac{\Sigma(x_i - \bar{x})^2}{N-1} = 1.3 \]
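A short sketch of the same calculation, assuming the five scores from the mean example. It also shows what happens if the sum of squares is divided by \(N\) rather than \(N - 1\); `statistics.variance` and `statistics.pvariance` from the Python standard library are used as cross-checks.

```python
import statistics

scores = [1, 3, 4, 3, 2]
n = len(scores)
x_bar = sum(scores) / n
ss = sum((x - x_bar) ** 2 for x in scores)   # sum of squared errors, 5.20

# Variance as in the formula above: SS divided by N - 1.
s2 = ss / (n - 1)
print(round(s2, 2))                          # 1.3

# Dividing by N instead gives a smaller value.
print(round(ss / n, 2))                      # 1.04

# statistics.variance divides by N - 1; statistics.pvariance divides by N.
assert abs(s2 - statistics.variance(scores)) < 1e-12
assert abs(ss / n - statistics.pvariance(scores)) < 1e-12
```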

Degrees of freedom

Standard deviation

- The variance has one problem: it is measured in units squared.
- This isn’t a very meaningful metric, so we take the square root of the variance.
- This is the standard deviation (s).

\[ s = \sqrt{s^2} = \sqrt{\frac{\Sigma(x_i - \bar{x})^2}{N-1}} = \sqrt{\frac{5.20}{4}} = 1.14 \]
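For the five scores from the mean example, \(N - 1 = 4\), so the standard deviation is the square root of 5.20 / 4. A minimal sketch, with `statistics.stdev` from the standard library as a cross-check:

```python
import math
import statistics

scores = [1, 3, 4, 3, 2]
n = len(scores)
x_bar = sum(scores) / n
ss = sum((x - x_bar) ** 2 for x in scores)   # 5.20

# Standard deviation: square root of the variance SS / (N - 1),
# which puts the measure back in the original units of the scores.
s = math.sqrt(ss / (n - 1))
print(round(s, 2))                           # 1.14

# statistics.stdev also uses the N - 1 denominator.
assert abs(s - statistics.stdev(scores)) < 1e-12
```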

Important things to remember

- The sum of squares, variance, and standard deviation represent the same thing:
  - The ‘fit’ of the mean to the data
  - The variability in the data
  - How well the mean represents the observed data
  - Error

Same mean, different SD

SD and shape of distribution

Sample vs Population

- Sample
  - Mean and SD describe only the sample from which they were calculated.
- Population
  - Mean and SD are intended to describe the entire population (very rare in psychology).
- Sample to Population
  - Mean and SD are obtained from a sample, but are used to estimate the mean and SD of the population (very common in psychology).

Confidence interval (CI)

- Domjan et al. (1998)
  - ‘Conditioned’ sperm release in Japanese quail.
- True mean
  - 15 million sperm
- Sample mean
  - 17 million sperm
- Interval estimate
  - 12 to 22 million (contains true value)
  - 16 to 18 million (misses true value)
  - CIs are constructed such that 95% of them contain the true value.
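The idea that about 95% of intervals contain the true value can be illustrated with a small simulation. This is a sketch under stated assumptions, not the calculation from the quail study: only the true mean of 15 million comes from the example above, while the population SD, the sample size, the normal population, and the use of a known SD with the critical value 1.96 are all assumptions made for the demonstration.

```python
import random
import statistics

random.seed(1)      # fixed seed so the sketch is reproducible

TRUE_MEAN = 15.0    # true population mean (millions of sperm), from the example
POP_SD = 5.0        # assumed population SD (not given in the source)
N = 30              # assumed sample size
Z = 1.96            # critical value for a 95% interval under normality
REPS = 2000         # number of simulated samples


def interval(sample):
    """95% interval for the mean, treating the population SD as known."""
    m = statistics.mean(sample)
    half_width = Z * POP_SD / len(sample) ** 0.5
    return m - half_width, m + half_width


hits = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, POP_SD) for _ in range(N)]
    low, high = interval(sample)
    if low <= TRUE_MEAN <= high:
        hits += 1   # this interval contains the true value

coverage = hits / REPS
print(f"Proportion of intervals containing the true mean: {coverage:.3f}")
```

Some individual intervals miss the true mean, as in the "16 to 18 million" case, but across many samples the proportion that contain it should be close to 0.95.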

Test statistics

- A statistic for which the frequency of particular values is known.
- Observed values can be used to test hypotheses.

\[ \text{test statistic} = \frac{\text{variance explained by the model}}{\text{variance not explained by the model}} = \frac{\text{effect}}{\text{error}} \]