The Pearson correlation coefficient ($r$) measures the strength and direction of a linear relationship between two variables, producing a value between $-1$ and $+1$.
Quick Reference:
Pearson correlation formula: $$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$
Range: $-1 \leq r \leq 1$
Interpretation: $r = 1$ (perfect positive), $r = -1$ (perfect negative), $r = 0$ (no linear correlation)
Type: Statistical measure — bivariate analysis
Used in: Statistics, data science, economics, psychology, biology, machine learning
Definition
The Pearson correlation coefficient $r$ quantifies how closely two variables $x$ and $y$ vary together in a linear pattern. A positive $r$ means the two variables tend to increase together; a negative $r$ means one tends to decrease as the other increases; $r = 0$ means no linear association.
The formula normalises the joint variation (covariance) by the product of the individual standard deviations, ensuring the result always lies in $[-1, 1]$.
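The normalisation described above can be written out explicitly. The symbols $s_{xy}$, $s_x$ and $s_y$ (sample covariance and sample standard deviations) and the means $\bar{x}$, $\bar{y}$ are notation introduced here for illustration:

$$s_{xy} = \frac{1}{n-1}\sum(x - \bar{x})(y - \bar{y}), \qquad s_x = \sqrt{\frac{1}{n-1}\sum(x - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum(y - \bar{y})^2}$$

$$r = \frac{s_{xy}}{s_x \, s_y}$$

The factors of $\frac{1}{n-1}$ cancel between numerator and denominator, so the same value of $r$ results whether sample or population versions are used. The bound $|r| \leq 1$ follows from the Cauchy–Schwarz inequality applied to the deviations from the means.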
Variable Key
| Symbol | Meaning |
|---|---|
| $r$ | Pearson correlation coefficient |
| $n$ | Number of data pairs $(x_i, y_i)$ |
| $x$ | Values of the first variable |
| $y$ | Values of the second variable |
| $\sum xy$ | Sum of products of each $x$–$y$ pair |
| $\sum x^2$ | Sum of squares of $x$ values |
| $\sum y^2$ | Sum of squares of $y$ values |
| $\sum x$ | Sum of all $x$ values |
| $\sum y$ | Sum of all $y$ values |
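As a minimal sketch, the raw-sums formula can be translated into Python one symbol at a time. The function name `pearson_r_from_sums` and the choice to raise an error for constant data are assumptions made for this illustration, not part of the formula itself.

```python
import math


def pearson_r_from_sums(xs, ys):
    """Pearson's r computed from the raw sums in the variable key."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)                                   # number of data pairs
    sum_x = sum(xs)                               # sum of x
    sum_y = sum(ys)                               # sum of y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of xy
    sum_x2 = sum(x * x for x in xs)               # sum of x^2
    sum_y2 = sum(y * y for y in ys)               # sum of y^2

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # Assumption: treat r as undefined when either variable is constant.
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator
```

For perfectly linear data such as `xs = [1, 2, 3]` and `ys = [2, 4, 6]`, the function returns a value equal to $1$ up to floating-point rounding.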
Alternative Form Using Means
The correlation coefficient can also be written as:
$$r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x-\bar{x})^2 \cdot \sum(y-\bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively. This form makes the intuition clearer: $r$ measures how much $x$ and $y$ deviate from their means in the same direction at the same time.
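The mean-deviation form might be sketched in Python as follows. The function name `pearson_r_from_means` is hypothetical, and the sketch assumes the inputs have equal length and that neither variable is constant.

```python
import math


def pearson_r_from_means(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]   # deviations of x from its mean
    dy = [y - mean_y for y in ys]   # deviations of y from its mean

    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator
```

Both forms are algebraically equivalent. In floating-point arithmetic the mean-deviation form is generally the safer choice for values of large magnitude, because the raw-sums form subtracts two large, nearly equal quantities.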
Interpreting The Correlation Coefficient
| Value of $r$ | Interpretation |
|---|---|
| $0.9$ to $1.0$ | Very strong positive correlation |
| $0.7$ to $0.9$ | Strong positive correlation |
| $0.5$ to $0.7$ | Moderate positive correlation |
| $0.3$ to $0.5$ | Weak positive correlation |
| $0$ to $0.3$ | Very weak or no linear correlation |
| Negative values | Same scale, opposite direction |
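The table can be encoded as a small lookup function. This is only a sketch: the name `describe_correlation` is hypothetical, and because the ranges in the table share their endpoints, assigning each boundary value to the stronger category is an assumption made here.

```python
def describe_correlation(r):
    """Map a value of r to the verbal label used in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    size = abs(r)
    # Assumption: a boundary value (e.g. 0.7) goes to the stronger category.
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        return "very weak or no linear correlation"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"
```

For instance, `describe_correlation(-0.95)` gives `"very strong negative correlation"`. Such thresholds are conventions rather than fixed rules, and what counts as a strong correlation varies between fields.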
Origin of Correlation Coefficient Formula
Karl Pearson (1857–1936, UK) developed the correlation coefficient in 1895, building on earlier work by Francis Galton (1822–1911, UK). Galton had noticed that tall parents tend to have tall children, though on average not as tall as the parents. He called this phenomenon "regression towards mediocrity"; it is now known as regression to the mean. Pearson formalised this observation into a precise numerical measure. Their work helped lay the mathematical foundations of modern statistics.
Worked Example of Correlation Coefficient
Find the correlation coefficient for the following data: $(x, y)$: $(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)$.
| $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 4 |
| 2 | 4 | 8 | 4 | 16 |
| 3 | 5 | 15 | 9 | 25 |
| 4 | 4 | 16 | 16 | 16 |
| 5 | 5 | 25 | 25 | 25 |
| Σ = 15 | Σ = 20 | Σ = 66 | Σ = 55 | Σ = 86 |
$n = 5$. Applying the formula:
$$r = \frac{5(66) - (15)(20)}{\sqrt{[5(55) - 15^2][5(86) - 20^2]}}$$
$$= \frac{330 - 300}{\sqrt{[275 - 225][430 - 400]}} = \frac{30}{\sqrt{50 \times 30}} = \frac{30}{\sqrt{1500}} = \frac{30}{38.73} \approx 0.77$$
Final answer: $r \approx 0.77$ — strong positive correlation.
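The arithmetic above can be checked with a short script that rebuilds each column total and then applies the formula. The variable names are chosen for this illustration only.

```python
import math

pairs = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
n = len(pairs)

# Column totals from the table
sum_x = sum(x for x, _ in pairs)        # 15
sum_y = sum(y for _, y in pairs)        # 20
sum_xy = sum(x * y for x, y in pairs)   # 66
sum_x2 = sum(x * x for x, _ in pairs)   # 55
sum_y2 = sum(y * y for _, y in pairs)   # 86

numerator = n * sum_xy - sum_x * sum_y                    # 330 - 300 = 30
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)                                                         # sqrt(50 * 30)
r = numerator / denominator

print(round(r, 2))  # 0.77
```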
Common Confusions With The Correlation Coefficient Formula
Correlation does not imply causation. A high $r$ between two variables does not mean one causes the other. Ice cream sales and drowning rates are strongly correlated (both rise in summer) — the cause is a third variable (hot weather), not a direct link.
The correlation coefficient measures linear relationships only. Two variables can have a strong non-linear relationship (e.g. quadratic) with $r \approx 0$. Always inspect a scatter plot alongside $r$.
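A small sketch makes the point concrete. Here $y = x^2$ exactly, so $y$ is fully determined by $x$, yet the coefficient is zero because the data are symmetric about $x = 0$. The helper `pearson_r` and the sample values are assumptions chosen for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # perfect quadratic relationship: 4, 1, 0, 1, 4

r = pearson_r(xs, ys)
print(r)  # 0.0
```

The positive and negative products of deviations cancel exactly, which is why a scatter plot reveals a pattern that $r$ alone misses.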
$r$ is not a percentage. $r = 0.7$ does not mean "70% correlated." It is the coefficient of determination, $r^2$, that has a percentage reading: for $r = 0.7$, $r^2 = 0.49$ means that $49\%$ of the variation in $y$ is explained by its linear relationship with $x$.
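As an illustration, applying this to the worked example above, where the numerator was $30$ and the product under the square root was $1500$:

$$r^2 = \frac{30^2}{1500} = \frac{900}{1500} = 0.6$$

So although $r \approx 0.77$, only $60\%$ of the variation in $y$ is accounted for by the linear relationship with $x$ in that data set.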