Interlude: Correlated measures

Power analysis through simulation in R

Niklas Johannes

Takeaways

  • Understand difference between independent and correlated measures
  • Know how to simulate correlated measures
  • Get to know the while logic

So far

So far we’ve been drawing samples from two independent groups.

set.seed(42)

control <- rnorm(100)
treatment <- rnorm(100, 0.2)

t.test(control, treatment)

    Welch Two Sample t-test

data:  control and treatment
t = -0.58009, df = 194.18, p-value = 0.5625
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3519980  0.1919951
sample estimates:
 mean of x  mean of y 
0.03251482 0.11251629 

What about two measures from the same person?

  • Think of a typical pre/posttest design
  • Someone who has a tendency to score low will therefore score low on both the pre- and the post-test measure
  • The measures are correlated
  • We need to take into account that measures come from the same unit when simulating (see the quick check below)
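
If we simply draw pre and post independently, as we did above, the two measures won’t be correlated at all. A quick check (the numbers here are illustrative, not our study values):

set.seed(42)

# draw "pre" and "post" independently: no shared person-level tendency
pre_independent <- rnorm(100, 10, 1.5)
post_independent <- rnorm(100, 10.5, 2)

# should be close to zero: nothing ties the two draws together
cor(pre_independent, post_independent)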

So how do we get correlated measures?

  • We need to increase the dimensions
  • So far, we’ve worked with one dimension: our dependent variable only
  • But if a person has multiple measures, that means we don’t just have one normal distribution
  • We have two correlated normal distributions: a multivariate normal distribution

What does that look like?

For a univariate distribution, we pick a single value (left). For a bivariate distribution, we pick two values, or a point on the plane (right).

For the left, we need a mean and SD. What do we need for the right?
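
If you want to draw that picture yourself, here’s a rough sketch previewing the mvrnorm call we introduce below; the means, SDs, and correlation are made up for illustration.

library(MASS)

par(mfrow = c(1, 2))

# left: one dimension, a single normal distribution
hist(rnorm(1e4), main = "Univariate", xlab = "x")

# right: two correlated dimensions, each draw is a point on the plane
xy <- mvrnorm(
  1e4,
  mu = c(0, 0),
  Sigma = matrix(c(1, 0.5, 0.5, 1), ncol = 2)
)
plot(xy, main = "Bivariate", xlab = "x", ylab = "y")

par(mfrow = c(1, 1))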

What goes into a multivariate distribution

Everything’s double:

  • 2 means
  • 2 SDs
  • Correlation between variables
  • An SD for the entire “mountain”

SD = Variance-covariance matrix

The “SD” for the “mountain” is just the variances and covariances of the two variables collected in one place so that we can draw our data from them.

\[ \begin{bmatrix} var & cov \\ cov & var \\ \end{bmatrix} \]

This isn’t new

All of you have done correlation tables: they’re just standardized versions of the variance-covariance matrix. Standardizing sets each variance to 1, and the covariances become correlations.

\[ \begin{bmatrix} 1 & r \\ r & 1 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \\ \end{bmatrix} \]
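
Base R can do this standardization for us with cov2cor(). A quick illustration with made-up numbers (SDs of 2 and 3, covariance of 3):

# cov2cor() turns a variance-covariance matrix into a correlation matrix
vcov_example <- matrix(
  c(4, 3,
    3, 9),
  ncol = 2
)

cov2cor(vcov_example) # off-diagonal: 3 / (2 * 3) = 0.5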

So how do we make this?

\[ \begin{bmatrix} var & cov \\ cov & var \\ \end{bmatrix} = \begin{bmatrix} ? & ? \\ ? & ? \\ \end{bmatrix} \]

Our values

Say we have an experiment where people give us a baseline measure, then the treatment happens, and we get a post-treatment measure. The measures are normally distributed with means of 10 and 10.5 and SDs of 1.5 and 2. The pre- and post-measures are correlated with \(r\) = 0.4.

means <- c(pre = 10, post = 10.5)
pre_sd <- 1.5
post_sd <- 2
correlation <- 0.4

Getting variance and covariances

SD is just the square root of the variance. So we go \(var = sd^2\) and we get our variance.

Covariance is just the correlation times the two SDs. So we go \(cov_{pre,post} = r_{pre,post} \times sd_{pre} \times sd_{post}\).

var_pre <- pre_sd**2
var_post <- post_sd**2

covariance <- correlation * pre_sd * post_sd

Now let’s combine all that into a matrix

our_matrix <- matrix(
  c(var_pre, covariance, 
    covariance, var_post),
  ncol = 2
)

our_matrix
     [,1] [,2]
[1,] 2.25  1.2
[2,] 1.20  4.0

Ready to simulate now

We use the mvrnorm function for, well, multivariate normal distributions, from the MASS package. Let’s get a massive sample of 10,000 people.

library(MASS)

d <- 
  mvrnorm(
    10000,
    means,
    our_matrix
  )
d <- as.data.frame(d)
head(d)
        pre      post
1  9.750371  7.936974
2 10.001516 10.519466
3 11.279047 10.630203
4 11.081401  8.563613
5 10.086696  9.591592
6 11.432761 10.631188

Let’s check that

Let’s first check the sample to see whether we can recover our numbers.

mean(d$pre); mean(d$post)
[1] 10.00843
[1] 10.48749
sd(d$pre); sd(d$post)
[1] 1.496285
[1] 2.004229
cor(d$pre, d$post)
[1] 0.4119397

Run our test

t.test(
  d$pre,
  d$post,
  paired = TRUE
)

    Paired t-test

data:  d$pre and d$post
t = -24.623, df = 9999, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.5171917 -0.4409192
sample estimates:
mean difference 
     -0.4790554 
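
As a side note (not part of the original analysis): a paired t-test is just a one-sample t-test on the difference scores, so this call gives the same t, df, and p-value as above.

# equivalent: one-sample t-test on the pre-post differences
t.test(d$pre - d$post)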

The while operator

Logic

At this point, we’ve worked with for loops and went from a minimum to a maximum. If that maximum is large, that can take quite some time. You can also consider the while function to stop when you’ve reached the point you want to be at.

Easy example

i <- 1

while (i < 5) {
  print(i)
  
  i <- i + 1
}
[1] 1
[1] 2
[1] 3
[1] 4

How would that work for our purposes?

draws <- 1e3
n <- 60
effect_size <- 0.5

# an empty data frame to store the simulated power for each sample size
d <- data.frame(
  sample_size = NULL,
  power = NULL
)

power <- 0

# keep increasing n until simulated power reaches 95%
while (power < 0.95) {
  
  pvalues <- NULL
  for (i in 1:draws) {
    control <- rnorm(n)
    treatment <- rnorm(n, effect_size)
    t <- t.test(control, treatment, alternative = "less")
    
    pvalues[i] <- t$p.value
  }
  
  # power: proportion of significant draws at this sample size
  power <- sum(pvalues < 0.05) / length(pvalues)
  
  d <- rbind(
    d,
    data.frame(
      sample_size = n,
      power = power
    )
  )
  
  n <- n + 1
}

What did we just do?

head(d)
  sample_size power
1          60 0.845
2          61 0.856
3          62 0.872
4          63 0.854
5          64 0.885
6          65 0.874
plot(d$sample_size, d$power)
abline(h = 0.95)
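
The same while logic carries over to the correlated pre/post design: draw from the multivariate normal inside the loop and run a paired t-test. Below is a rough sketch, reusing the means and variance-covariance matrix from above; the starting sample size, step size, power target, and the names d_paired and sample_i are arbitrary choices for illustration.

library(MASS)

draws <- 1e3
n <- 100 # arbitrary starting point

means <- c(pre = 10, post = 10.5)
our_matrix <- matrix(
  c(2.25, 1.2,
    1.2, 4),
  ncol = 2
)

d_paired <- data.frame(
  sample_size = NULL,
  power = NULL
)

power <- 0

# keep increasing n until simulated power for the paired test reaches 95%
while (power < 0.95) {
  
  pvalues <- NULL
  for (i in 1:draws) {
    sample_i <- mvrnorm(n, means, our_matrix)
    t <- t.test(sample_i[, "pre"], sample_i[, "post"], paired = TRUE)
    
    pvalues[i] <- t$p.value
  }
  
  power <- sum(pvalues < 0.05) / length(pvalues)
  
  d_paired <- rbind(
    d_paired,
    data.frame(
      sample_size = n,
      power = power
    )
  )
  
  n <- n + 10 # bigger steps so the loop doesn't take too long
}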

Takeaways

  • Understand difference between independent and correlated measures
  • Know how to simulate correlated measures
  • Get to know the while logic

Let’s get back to simulating