Exercise I
Overview
This is the first set of exercises, where you get familiar with the basic commands you’ll need for data simulation. Having a good intuition for how these commands work will make it easier later on when you move to more complicated designs.
This part might be boring to you if you’re already a seasoned R user. In that case, I recommend you get a coffee, tea, and/or nap.
After some feedback from a previous iteration of this workshop, I’ll give quite a lot of tips. They’ll come in the form below, and I recommend you ignore them and try it out on your own first:
sample(
  x = ?,
  size = how many samples?
)
0.1 Block 1
0.1.1 Exercise
You have three groups. Name the groups, randomly sample ten cases (in total, so uneven numbers per group), and then create a simple data frame that contains a variable for the group called condition.
groups <- letters[1:3]
my_sample <- sample(groups, 10, TRUE)
d <- data.frame(condition = as.factor(my_sample))
0.1.2 Exercise
Same three groups. This time you want each case to have a 70% chance of being in the first group, a 20% chance of being in the second group, and a 10% chance of being in the third group. Get 100 participants (no need for a data frame). Use set.seed(1). How many are in the first group?
set.seed(1)
groups <- letters[1:3]
my_sample <- sample(
  groups,
  100,
  TRUE,
  prob = c(0.7, 0.2, 0.1)
)
sum(my_sample == groups[1])
[1] 68
0.1.3 Exercise
Show that sample with an assigned probability (the prob = argument) is the same as rbinom. Conduct 10 coin flips with an unfair coin that has a 60% chance of landing heads. Remember to set a seed (tip: twice).
set.seed(1)
sample(0:1, 10, replace = TRUE, prob = c(0.4, 0.6))
[1] 1 1 1 0 1 0 0 0 0 1
set.seed(1)
rbinom(10, 1, 0.6)
[1] 1 1 1 0 1 0 0 0 0 1
0.1.4 Exercise
Draw random letters from the alphabet until the alphabet is empty.
sample(letters, length(letters))
[1] "s" "a" "u" "x" "j" "n" "v" "g" "i" "o" "e" "p" "r" "w" "q" "l" "b" "m" "d"
[20] "z" "y" "h" "c" "t" "k" "f"
0.1.5 Exercise
Draw all letters from the alphabet and explicitly assign the same probability for each letter (tip: repeat the same probability).
probs <- rep(1/length(letters), length(letters))
sample(letters, prob = probs)
[1] "r" "v" "o" "n" "w" "b" "l" "t" "s" "k" "y" "i" "f" "c" "e" "j" "q" "u" "m"
[20] "a" "g" "p" "h" "x" "d" "z"
0.1.6 Exercise
Create a data set. In the data set, each participant has an identifier (called id), a group identifier (condition), and an identifier for which measurement number this is for the participant (trial). There are 3 participants in each of three groups, with 5 trials in each group.
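A minimal sketch of one way to build this data set, assuming the 5 trials are nested within each participant (so 3 groups × 3 participants × 5 trials = 45 rows); the helper names n_groups, n_per_group, and n_trials are just illustrative:
n_groups <- 3
n_per_group <- 3
n_trials <- 5

d <- data.frame(
  id = rep(1:(n_groups * n_per_group), each = n_trials),                # participants 1-9, 5 rows each
  condition = rep(letters[1:n_groups], each = n_per_group * n_trials),  # 15 rows per group
  trial = rep(1:n_trials, times = n_groups * n_per_group)               # trial counter 1-5 per participant
)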
0.1.7 Exercise
You have two groups, a control and a treatment group. In each group, there are 10 participants. Each participant flips a coin 10 times. The control group has a fair coin: 50% heads. The treatment group has an unfair coin: 70% heads. Create a data frame with a participant identifier (id), group membership (condition), and a total head count for that participant (heads). Check that the two groups indeed have different means for how often they get heads (roughly corresponding to the two probabilities).
<- c("control", "treatment")
groups <- 20
n <- 10
flips <- 0.5
fair <- 0.7
unfair <- rep(c(fair, unfair), each = n/2)
probs
<-
d data.frame(
id = 1:20,
condition = rep(groups, each = n/2),
heads = rbinom(n, flips, prob = probs)
)
aggregate(d$heads, by = list(d$condition), mean)
Group.1 x
1 control 5.4
2 treatment 6.5
0.1.8 Exercise
You have 100 participants. Each participant reports their age, which lies uniformly between 20 and 40. They also report grumpiness on a 100-point scale (with a 10-point SD). Each extra year predicts 0.5 points higher grumpiness. (Tip: Check the lecture slides; you’ll need to add some error.) Create the two variables (no need for a data frame) and compute a correlation with cor. What’s the correlation?
n <- 100
age <- runif(n, 20, 40)
grumpiness <- age * 0.5 + rnorm(n, 100, 10)
cor(age, grumpiness)
[1] 0.2127516
0.1.9 Exercise
We track how many calls you get during this exercise. Nobody calls anymore, so there’ll be very few. Create a data frame with 20 participants that contains a participant number and the number of calls per participant. Plot the calls and show that 0 is the most common value.
n <- 20

d <- data.frame(
  id = 1:n,
  calls = rpois(n, 0.5)
)
plot(density(d$calls))
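As an optional check (not required by the exercise), tabulating the counts shows the mode directly; 0 should have the highest frequency:
table(d$calls)            # frequency of each call count
barplot(table(d$calls))   # same information as bars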
0.1.10 Exercise
Professors get more calls. Add 20 more participants who have much higher call numbers. Also include a condition variable that marks whether participants are students (first 20 people) or professors (new 20 people). Conduct an independent-samples t-test (t.test) and also plot the two groups as a boxplot.
d <- rbind(
  d,
  data.frame(
    id = (n + 1):(2 * n),
    calls = rpois(n, 3)
  )
)

d$condition <- rep(c("students", "profs"), each = 20)

t.test(d$calls ~ d$condition)
Welch Two Sample t-test
data: d$calls by d$condition
t = 5.8023, df = 26.202, p-value = 3.989e-06
alternative hypothesis: true difference in means between group profs and group students is not equal to 0
95 percent confidence interval:
1.679272 3.520728
sample estimates:
mean in group profs mean in group students
3.25 0.65
boxplot(d$calls ~ d$condition)
0.2 Block 2
In this section, you’ll use the basics from above to perform your first power analysis. You’ll run repeated simulations over a range of values, then extract and store the results so you can summarize them.
0.2.1 Exercise
There are four groups. Each group comes from a different normal distribution. The means are c(100, 105, 107, 109). The SDs are c(9, 12, 10, 17). Each group should have 20 cases. Store the scores in a data frame and include a variable that indicates the group. Tip: Remember that R uses vectors, even for arguments in a function.
means <- c(100, 105, 107, 109)
sds <- c(9, 12, 10, 17)
n <- 20

d <-
  data.frame(
    group = rep(letters[1:4], times = n),
    score = rnorm(n * 4, means, sds)
  )
0.2.2 Exercise
You need 5 samples. Each sample contains 10 unique letters from the alphabet. (Use replicate.)
replicate(5, sample(letters, 10))
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "u" "p" "l" "e"
[2,] "p" "e" "g" "j" "w"
[3,] "q" "l" "z" "a" "f"
[4,] "s" "c" "q" "f" "s"
[5,] "n" "r" "k" "q" "t"
[6,] "f" "n" "i" "m" "l"
[7,] "k" "m" "h" "b" "c"
[8,] "h" "o" "m" "o" "d"
[9,] "e" "b" "f" "d" "o"
[10,] "b" "t" "x" "z" "n"
0.2.3 Exercise
Same as before, but this time you need 10 cases from a normal distribution with a mean of 10 and an SD of 2. Use replicate first, then a for loop.
replicate(5, rnorm(10, 10, 2))
[,1] [,2] [,3] [,4] [,5]
[1,] 10.629034 10.077856 10.916251 11.087316 8.782663
[2,] 10.444436 10.029294 7.095943 12.082121 9.397606
[3,] 8.312769 9.627366 10.154685 10.395012 11.952405
[4,] 10.887611 12.801182 11.119791 6.740843 10.912017
[5,] 10.111594 10.036971 9.850107 10.242080 12.588816
[6,] 10.135945 10.498392 11.565395 6.725156 7.733596
[7,] 9.596077 10.298441 9.654663 8.937914 8.261079
[8,] 7.683969 8.073534 7.897412 11.907360 8.490059
[9,] 8.814513 9.867066 11.458903 6.558699 9.740729
[10,] 11.532131 12.573844 10.525330 10.212641 7.996397
for (i in 1:5) {
print(rnorm(10, 10, 2))
}
[1] 8.360263 8.050895 11.208619 11.097575 11.832865 15.323133 9.639486
[8] 11.370030 16.532829 11.121201
[1] 9.861965 8.055114 8.906827 6.622615 6.855255 9.190026 10.638573
[8] 10.080855 9.219981 6.361556
[1] 11.318361 10.919243 13.233253 6.287619 9.426352 13.500644 10.232827
[8] 12.768506 11.148442 10.272982
[1] 11.828432 6.398347 9.320239 11.212529 12.682261 11.534575 10.387451
[8] 12.281133 10.027730 7.789388
[1] 9.949675 9.672653 10.740119 9.238351 11.305905 14.122684 6.406710
[8] 11.168154 8.554494 8.741671
0.2.4 Exercise
Assume we know the population mean height (168cm) and its standard deviation (20cm). Assume we draw 100,000 samples from this distribution. Each sample has 50 participants. The standard deviation of these 100,000 sample means is the standard error.
Simulate the standard error and compare it to the theoretical value: \(SE = \frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the standard deviation of the population (here, \(20/\sqrt{50} \approx 2.83\)).
n <- 50
mu <- 168
sd <- 20
se <- sd / sqrt(n)
means <- NULL
draws <- 1e5

for (i in 1:draws) {
  means[i] <- mean(rnorm(n, mu, sd))
}

cat("The real SE is:", round(se, digits = 2), ". The simulated SE is:", round(sd(means), digits = 2), ".")
The real SE is: 2.83 . The simulated SE is: 2.83 .
# with replicate
means <-
  replicate(
    draws,
    rnorm(n, mu, sd)
  )

sd(colMeans(means))
[1] 2.827195
0.2.5 Exercise
Same population. Draw 1,000 samples for each sample size between 20 and 100. Calculate the standard error for each sample size (like you did above, by taking the SD of the sample means) and plot it against the sample size. (Tip: You’ll need to iterate over two things.)
mu <- 168
sd <- 20
draws <- 1e3
min_sample <- 20
max_sample <- 100

results <-
  data.frame(
    sample_size = NULL,
    se = NULL
  )

for (i in min_sample:max_sample) {

  sample_means <- NULL

  for (j in 1:draws) {
    sample_means[j] <- mean(rnorm(i, mu, sd))
  }

  se <- sd(sample_means)

  results <- rbind(
    results,
    data.frame(
      sample_size = i,
      se = se
    )
  )
}
plot(results$sample_size, results$se)
0.2.6 Exercise
Turn the above into a function so that you can change the population mean, SD, number of simulations, and sample size range. The function should also return the plot from above.
sample_against_se <-
  function(
    mu = 168,
    sd = 20,
    draws = 1e3,
    min_sample = 20,
    max_sample = 100
  ) {

    results <-
      data.frame(
        sample_size = NULL,
        se = NULL
      )

    for (i in min_sample:max_sample) {

      sample_means <- NULL

      for (j in 1:draws) {
        sample_means[j] <- mean(rnorm(i, mu, sd))
      }

      se <- sd(sample_means)

      results <- rbind(
        results,
        data.frame(
          sample_size = i,
          se = se
        )
      )
    }

    return(plot(results$sample_size, results$se))
  }
0.2.7 Exercise
Try out the function with two plots: one where the population SD is 5, and one where you use only 10 draws. What changes?
sample_against_se(sd = 5)
sample_against_se(draws = 10)
0.2.8 Exercise
The average height of men in Spain is 173cm (says Wikipedia). The population standard deviation is probably around 7cm (source).
You draw a sample of men and want to test whether they’re significantly different from that mean (our H0). In fact, the men you have sampled are truly French (175.6cm, our true “effect size”). In other words, can we reject the null hypothesis that these men come from the Spanish distribution in favor of our alternative hypothesis that the true population value is greater than the Spanish population mean?
You calculate the z-statistic, which is defined as \(z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}\). This simply tells us how far our sample mean is from the population mean (well, the suspected population mean under the null hypothesis), in terms of standard errors. \(\bar{X}\) is the sample mean, \(\mu_0\) is the population mean under H0, \(\sigma\) is the population standard deviation, and \(N\) is the sample size.
Then we can look up the z-score to see what the probability is of scoring this high or higher (does that definition ring a bell?). In R, you can simply do that with a built-in function: pnorm(). For example, if we have a z-score of 1.645, our probability of obtaining such a value (or higher) is pnorm(1.645, lower.tail = FALSE) = 0.0499849 – our p-value for a one-sided test.
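To make that concrete, here is a small worked example; the sample mean of 176cm and the name x_bar are made up purely for illustration:
# illustration only: suppose a sample of 30 men had a mean of 176cm (made-up value)
x_bar <- 176
z <- (x_bar - 173) / (7 / sqrt(30))   # z-statistic against the Spanish mean
pnorm(z, lower.tail = FALSE)          # one-sided p-value, roughly 0.009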
We can simulate the power of our statistical test (in this case, the z-statistic). Take a sample of 30 people from the French population, calculate the z-statistic, its p-value, and store the p-value. Do this 1,000 times. Plot the distribution of p-values. What can we conclude about the sample size?
# sample size
n <- 30
h0 <- 173    # test against this null
h1 <- 175.6  # true population effect
sd <- 7
alpha <- 0.05
draws <- 1e3

pvalues <- NULL

for (i in 1:draws) {
  d <- rnorm(n, h1, sd)
  z <- (mean(d) - h0) / (sd / sqrt(n))
  pvalues[i] <- pnorm(z, lower.tail = FALSE)
}
plot(density(pvalues))
0.2.9 Exercise
Now calculate the proportion of p-values that are below 0.05. That’s your power: the proportion of tests that will detect that there’s a true effect. In our case, that effect is a difference of 2.6cm. (Tip: Count or sum how many p-values are below 0.05 and divide by the total number of p-values.)
sum(pvalues < 0.05)/length(pvalues)
[1] 0.668
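As a small aside, mean() on a logical vector gives the same proportion in one step, because each TRUE counts as 1:
mean(pvalues < 0.05)  # same result as sum(...)/length(...)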
0.2.10 Exercise
Now let’s do what we did before: put the loop from above inside another loop that iterates over different sample sizes. Then put all of that into a function that lets you set the parameters of interest (sample size range, h0, h1, etc.). Then simulate power (1,000 simulations each) for sample sizes between 30 and 100. Plot the sample size against power.
power_function <-
  function(
    h0 = 173,
    h1 = 175.6,
    sd = 7,
    alpha = 0.05,
    draws = 1e3,
    min_sample = 30,
    max_sample = 100
  ) {

    results <-
      data.frame(
        sample_size = NULL,
        power = NULL
      )

    for (i in min_sample:max_sample) {

      pvalues <- NULL

      for (j in 1:draws) {
        d <- rnorm(i, h1, sd)
        z <- (mean(d) - h0) / (sd / sqrt(i))
        pvalues[j] <- pnorm(z, lower.tail = FALSE)
      }

      results <- rbind(
        results,
        data.frame(
          sample_size = i,
          power = sum(pvalues < alpha) / length(pvalues)
        )
      )
    }

    return(plot(results$sample_size, results$power, type = "l"))
  }
power_function()
0.2.11 Exercise
Now do the same thing with a one-sample t-test. (Tip: You only need to replace the z-scores with a t.test from which you can extract the p-value.) (Another tip: Use the $ sign on the object where you stored the t-test results.)
power_function_t <-
  function(
    h0 = 173,
    h1 = 175.6,
    sd = 7,
    alpha = 0.05,
    draws = 1e3,
    min_sample = 30,
    max_sample = 100
  ) {

    results <-
      data.frame(
        sample_size = NULL,
        power = NULL
      )

    for (i in min_sample:max_sample) {

      pvalues <- NULL

      for (j in 1:draws) {
        d <- rnorm(i, h1, sd)
        t <- t.test(d, mu = h0, alternative = "greater")
        pvalues[j] <- t$p.value
      }

      results <- rbind(
        results,
        data.frame(
          sample_size = i,
          power = sum(pvalues < alpha) / length(pvalues)
        )
      )
    }

    return(plot(results$sample_size, results$power, type = "l"))
  }

power_function_t()
0.2.12 Exercise
Just for funsies (and for our next session), see what happens when the true effect is only 1cm in difference.
power_function_t(h1 = h0+1)