Exercise I
Overview
This is the first set of exercises, where you'll get familiar with the basic commands you'll need for data simulation. Building a good intuition for how these commands work now will make things easier later on, when you simulate more complicated designs.
This part might be boring to you if you’re already a seasoned R user. In that case, I recommend you get a coffee, tea, and/or nap.
After some feedback from a previous iteration of this workshop, I'll give quite a lot of tips. They'll come in the form below, and I recommend you ignore them at first and try things out on your own. Tips will appear here, such as:

sample(
  x = ?,
  size = how many samples?
)
0.1 Block 1
0.1.1 Exercise
You have three groups. Name the groups, randomly sample ten cases in total (so the numbers per group will be uneven), and then create a simple data frame that contains a variable for the group called condition.
groups <- letters[1:3]
my_sample <- sample(groups, 10, TRUE)
d <- data.frame(condition = as.factor(my_sample))
First create an object that contains three names (can be anything, but I recommend using letters). Then:
sample(
x = your groups,
size = ten cases,
replace = ?
)
Then use data.frame.
0.1.2 Exercise
Same three groups. This time you want each case to have a 70% chance of being in the first group, a 20% chance of being in the second group, and a 10% chance of being in the third group. Get 100 participants (no need for a data frame). Use set.seed(1). How many are in the first group?
set.seed(1)
groups <- letters[1:3]
my_sample <- sample(
  groups,
  100,
  TRUE,
  prob = c(0.7, 0.2, 0.1)
)
sum(my_sample == groups[1])
[1] 68
Check ?sample and have a look at the prob argument. You can count how many are in the first group by summing (sum) the group matches or by using a table (table).
0.1.3 Exercise
Show that sample with an assigned probability (the prob argument) is the same as rbinom. Conduct 10 coin flips with an unfair coin that has a 60% chance of landing heads. Remember to set a seed (tip: twice).
set.seed(1)
sample(0:1, 10, replace = TRUE, prob = c(0.4, 0.6))
[1] 1 1 1 0 1 0 0 0 0 1
set.seed(1)
rbinom(10, 1, 0.6)
[1] 1 1 1 0 1 0 0 0 0 1
Set the seed, then call sample:
sample(
x = head or tails,
size = how many samples?,
replace = ?,
prob = c(0.5, 0.5)
)
Then set the seed again and call rbinom. You need 10 "experiments" with only one outcome each. Check ?rbinom for the prob argument. 0.5 would give you a 50% chance of getting the outcome.
0.1.4 Exercise
Draw random letters from the alphabet until the alphabet is empty.
sample(letters, length(letters))
[1] "s" "a" "u" "x" "j" "n" "v" "g" "i" "o" "e" "p" "r" "w" "q" "l" "b" "m" "d"
[20] "z" "y" "h" "c" "t" "k" "f"
Tell sample to draw as many letters as there are in the alphabet. Use letters.
0.1.5 Exercise
Draw all letters from the alphabet and explicitly assign the same probability for each letter (tip: repeat the same probability).
probs <- rep(1/length(letters), length(letters))
sample(letters, prob = probs)
[1] "r" "v" "o" "n" "w" "b" "l" "t" "s" "k" "y" "i" "f" "c" "e" "j" "q" "u" "m"
[20] "a" "g" "p" "h" "x" "d" "z"
Use the prob argument. You need one probability for each letter. That probability is the same for each letter, so in effect you can just repeat the same probability. If only there were an R command to repeat things.
0.1.6 Exercise
Create a data set. In the data set, each participant has an identifier (called id), a group identifier (condition), and a counter of which measurement we have for this participant (trial). There are 3 participants in each of three groups, with 5 trials per participant.
Ultimately, you want to have something of the following form:
id condition trial
1 1 a 1
2 1 a 2
3 1 a 3
4 1 a 4
5 1 a 5
6 2 b 1
7 2 b 2
8 2 b 3
9 2 b 4
10 2 b 5
11 3 c 1
12 3 c 2
13 3 c 3
14 3 c 4
15 3 c 5
16 4 a 1
17 4 a 2
18 4 a 3
19 4 a 4
20 4 a 5
21 5 b 1
22 5 b 2
23 5 b 3
24 5 b 4
25 5 b 5
26 6 c 1
27 6 c 2
28 6 c 3
29 6 c 4
30 6 c 5
31 7 a 1
32 7 a 2
33 7 a 3
34 7 a 4
35 7 a 5
36 8 b 1
37 8 b 2
38 8 b 3
39 8 b 4
40 8 b 5
41 9 c 1
42 9 c 2
43 9 c 3
44 9 c 4
45 9 c 5
So first create the groups/conditions. Then decide how many participants you want and how many trials you want. Next, create a data frame:
data.frame(
  id = ?, # each id number repeated for as many trials as there are: 11111 22222 etc.
  condition = ?, # each id above has exactly one condition, repeated as many times as there are trials: aaaaa bbbbb ccccc aaaaa etc.
  trial = ? # repeat 12345 as many times as there are participants
)
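If the rep logic is giving you trouble, here's a minimal sketch of one possible solution (the variable names are just suggestions); rep with its each and times arguments covers all three columns:

n_participants <- 9
n_trials <- 5
groups <- letters[1:3]

data.frame(
  id = rep(1:n_participants, each = n_trials), # 11111 22222 ...
  condition = rep(rep(groups, each = n_trials), times = n_participants / length(groups)), # aaaaa bbbbb ccccc ...
  trial = rep(1:n_trials, times = n_participants) # 12345 12345 ...
)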
0.1.7 Exercise
You have two groups, a control and a treatment group. In each group, there are 10 participants. Each participant flips a coin 10 times. The control group has a fair coin: 50% heads. The treatment group has an unfair coin: 70% heads. Create a data frame with a participant identifier (id), group membership (condition), and a total head count for that participant (heads). Check that the two groups indeed have different means of how often they get heads (roughly corresponding to the two probabilities).
<- c("control", "treatment")
groups <- 20
n <- 10
flips <- 0.5
fair <- 0.7
unfair <- rep(c(fair, unfair), each = n/2)
probs
<-
d data.frame(
id = 1:20,
condition = rep(groups, each = n/2),
heads = rbinom(n, flips, prob = probs)
)
aggregate(d$heads, by = list(d$condition), mean)
Group.1 x
1 control 5.4
2 treatment 6.5
I recommend defining your parameters at the beginning (sometimes called declaring your variables, or an assignment block):

groups <- ?
n <- ?
flips <- ?
fair <- ?
unfair <- ?
probs <- ? # (Check how you assigned probabilities for the alphabet exercise above)
Then you can create a data frame like you did above, with IDs running from minimum to maximum, the first half having control assignment, the second half having treatment assignment, and a head count from rbinom (which takes the number of experiments, the number of flips per experiment, and the probability of heads for that experiment; here, each participant is one experiment). Then use aggregate or table.
0.1.8 Exercise
You have 100 participants. Each participant reports their age, which lies uniformly between 20 and 40. They also report grumpiness on a 100-point scale (with a 10-point SD). Each extra year predicts 0.5 points higher grumpiness. (Tip: Check the lecture slides; you'll need to add some error.) Create the two variables (no need for a data frame) and conduct a correlation with cor. What's the correlation?
n <- 100
age <- runif(n, 20, 40)
grumpiness <- age*0.5 + rnorm(n, 100, 10)

cor(age, grumpiness)
[1] 0.2127516
Go to slide 29 here and replace the error term (rnorm) with the one specified.
0.1.9 Exercise
We track how many calls you get during this exercise. Nobody calls anymore, so there’ll be very few. Create a data frame with 20 participants, a participant number and the number of calls per participant. Plot the calls and show that 0 is the most common value.
n <- 20

d <- data.frame(
  id = 1:n,
  calls = rpois(n, 0.5)
)
plot(density(d$calls))
Create a data frame with an id variable running from 1 to 20, then call rpois and play around with the lambda parameter until plot(density(dataframe$calls)) shows a spike at zero.
0.1.10 Exercise
Professors get more calls. Add 20 more participants who have much higher call numbers. Also include a condition variable that marks whether participants are students (the first 20 people) or professors (the new 20 people). Conduct an independent-samples t-test (t.test) and also plot the two groups as a boxplot.
d <- rbind(
  d,
  data.frame(
    id = (n+1):(2*n),
    calls = rpois(n, 3)
  )
)

d$condition <- rep(c("students", "profs"), each = 20)

t.test(d$calls ~ d$condition)
Welch Two Sample t-test
data: d$calls by d$condition
t = 5.8023, df = 26.202, p-value = 3.989e-06
alternative hypothesis: true difference in means between group profs and group students is not equal to 0
95 percent confidence interval:
1.679272 3.520728
sample estimates:
mean in group profs mean in group students
3.25 0.65
boxplot(d$calls ~ d$condition)
Create a second data frame that has an id variable running from 21 to 40, as well as a calls variable that indicates the number of calls, like you did before. Then use rbind to combine the students' data frame with the data frame you just created. Add a condition variable:

d$condition <- rep(c("students", "profs"), ?)

Now run a t-test and a boxplot (boxplot).
0.2 Block 2
In this section, you’ll use the basics from above to perform your first power analysis. You’ll apply repeated simulations over a range of values and extract and store results to summarize them.
0.2.1 Exercise
There are four groups. Each group comes from a different normal distribution. The means are c(100, 105, 107, 109). The SDs are c(9, 12, 10, 17). Each group should have 20 cases. Store the scores in a data frame and include a variable that indicates the group. Tip: Remember that R recycles vectors, even for arguments in a function.
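If you haven't seen vector recycling in action, here's a tiny illustration with made-up numbers: rnorm cycles through the means and SDs element by element, so consecutive draws come from different distributions:

rnorm(4, mean = c(0, 100), sd = c(1, 10))
# draw 1 ~ N(0, 1), draw 2 ~ N(100, 10), draw 3 ~ N(0, 1), draw 4 ~ N(100, 10)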
means <- c(100, 105, 107, 109)
sds <- c(9, 12, 10, 17)
n <- 20

d <- data.frame(
  group = rep(c(letters[1:4]), times = n),
  score = rnorm(n*4, means, sds)
)
means <- ?
sds <- ?
n <- ?

d <- data.frame(
  group = rep(c(letters[1:4]), ?), # make sure the condition matches how the score below is drawn
  score = rnorm(n*4, means, sds)
)
0.2.2 Exercise
You need 5 samples. Each sample contains 10 unique letters from the alphabet. (Use replicate.)
replicate(5, sample(letters, 10))
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "u" "p" "l" "e"
[2,] "p" "e" "g" "j" "w"
[3,] "q" "l" "z" "a" "f"
[4,] "s" "c" "q" "f" "s"
[5,] "n" "r" "k" "q" "t"
[6,] "f" "n" "i" "m" "l"
[7,] "k" "m" "h" "b" "c"
[8,] "h" "o" "m" "o" "d"
[9,] "e" "b" "f" "d" "o"
[10,] "b" "t" "x" "z" "n"
0.2.3 Exercise
Same as before, but this time you need 10 cases from a normal distribution with a mean of 10 and an SD of 2. Use replicate first, then a for loop.
replicate(5, rnorm(10, 10, 2))
[,1] [,2] [,3] [,4] [,5]
[1,] 10.629034 10.077856 10.916251 11.087316 8.782663
[2,] 10.444436 10.029294 7.095943 12.082121 9.397606
[3,] 8.312769 9.627366 10.154685 10.395012 11.952405
[4,] 10.887611 12.801182 11.119791 6.740843 10.912017
[5,] 10.111594 10.036971 9.850107 10.242080 12.588816
[6,] 10.135945 10.498392 11.565395 6.725156 7.733596
[7,] 9.596077 10.298441 9.654663 8.937914 8.261079
[8,] 7.683969 8.073534 7.897412 11.907360 8.490059
[9,] 8.814513 9.867066 11.458903 6.558699 9.740729
[10,] 11.532131 12.573844 10.525330 10.212641 7.996397
for (i in 1:5) {
print(rnorm(10, 10, 2))
}
[1] 8.360263 8.050895 11.208619 11.097575 11.832865 15.323133 9.639486
[8] 11.370030 16.532829 11.121201
[1] 9.861965 8.055114 8.906827 6.622615 6.855255 9.190026 10.638573
[8] 10.080855 9.219981 6.361556
[1] 11.318361 10.919243 13.233253 6.287619 9.426352 13.500644 10.232827
[8] 12.768506 11.148442 10.272982
[1] 11.828432 6.398347 9.320239 11.212529 12.682261 11.534575 10.387451
[8] 12.281133 10.027730 7.789388
[1] 9.949675 9.672653 10.740119 9.238351 11.305905 14.122684 6.406710
[8] 11.168154 8.554494 8.741671
replicate(?, rnorm(?, ?, ?))
for (i in 1:5) {
print(rnorm(?, ?, ?))
}
0.2.4 Exercise
Assume we know the population mean of height (168cm) and its standard deviation (20). Assume we draw 10,000 samples from this distribution. Each sample has 50 participants. The standard deviation of these 10,000 sample means is the standard error.
Simulate the standard error and compare it to the theoretical value: \(SE = \frac{\sigma}{\sqrt{n}}\). (\(\sigma\) is the standard deviation of the population.)
n <- 50
mu <- 168
sd <- 20
se <- sd / sqrt(n)
means <- NULL
draws <- 1e4

for (i in 1:draws) {
  means[i] <- mean(rnorm(n, mu, sd))
}

cat("The real SE is:", round(se, digits = 2), ". The simulated SE is:", round(sd(means), digits = 2), ".")
The real SE is: 2.83 . The simulated SE is: 2.83 .
# with replicate
means <- replicate(
  draws,
  rnorm(n, mu, sd)
)
sd(colMeans(means))
[1] 2.827195
Like we did earlier, first declare your variables:

n <- ?
mu <- ? # mu is the greek letter for the true mean in the population
sd <- ?
se <- sd / sqrt(n) # calculating the standard error
means <- NULL # somewhere to store the means
draws <- 1e4 # how many samples?
Then we iterate over the number of samples, generate that sample, and take its mean.
for (i in 1:draws) {
  current_sample <- rnorm(?, ?, ?)
  current_mean <- mean(current_sample)

  # store
  means[?] <- current_mean
}
Now you can compare the SD of means with the theoretical standard error.
0.2.5 Exercise
Same population. Draw 1,000 samples for each sample size between 20 and 100. Calculate the standard error for each sample size (like you did above, by taking the SD of the sample means) and plot it against the sample size. (Tip: You'll need to iterate over two things.)
mu <- 168
sd <- 20
draws <- 1e3
min_sample <- 20
max_sample <- 100

results <- data.frame(
  sample_size = NULL,
  se = NULL
)

for (i in min_sample:max_sample) {
  sample_means <- NULL

  for (j in 1:draws) {
    sample_means[j] <- mean(rnorm(i, mu, sd))
  }

  se <- sd(sample_means)

  results <- rbind(
    results,
    data.frame(
      sample_size = i,
      se = se
    )
  )
}
plot(results$sample_size, results$se)
Like we did earlier, first declare your variables and generate an object in which to store the results:

mu <- ?
sd <- ?
draws <- ? # how many draws are we doing?
min_sample <- ? # minimum sample size
max_sample <- ? # maximum sample size

# container
results <- data.frame(
  sample_size = NULL,
  se = NULL
)
Then we iterate over the sample sizes, and for each sample size do 1,000 "draws", then store the results for that sample size.

for (i in min_sample:max_sample) { # the sample sizes we want to iterate over
  sample_means <- NULL # somewhere to store the means of the 1,000 samples we'll draw for this sample size

  for (j in 1:draws) { # second loop where we draw 1,000 samples for this sample size
    sample_means[?] <- mean(rnorm(?, ?, ?)) # store the mean of this sample
  }

  se <- sd(?) # once we're done with the 1,000 draws, get the SD of the sample means for this sample size

  # store the results
  results <- rbind(
    results, # the previous results
    data.frame( # a new data row
      sample_size = ?, # what sample size did we use in this iteration of the loop? Where is it stored?
      se = ? # you calculated this above
    )
  )
}
That's it. You can plot the results with plot(results$sample_size, results$se).
0.2.6 Exercise
Turn the above into a function so that you can change the population mean, SD, number of simulations, and sample size range. The function should also return the plot from above.
sample_against_se <- function(
  mu = 168,
  sd = 20,
  draws = 1e3,
  min_sample = 20,
  max_sample = 100
) {
  results <- data.frame(
    sample_size = NULL,
    se = NULL
  )

  for (i in min_sample:max_sample) {
    sample_means <- NULL

    for (j in 1:draws) {
      sample_means[j] <- mean(rnorm(i, mu, sd))
    }

    se <- sd(sample_means)

    results <- rbind(
      results,
      data.frame(
        sample_size = i,
        se = se
      )
    )
  }

  return(plot(results$sample_size, results$se))
}
You basically do the variable declaration inside function(), then copy-paste everything else into the body of the function:

sample_against_se <- function( # or whatever function name you want; this should look familiar
  mu = ?,
  sd = ?,
  draws = ?,
  min_sample = ?,
  max_sample = ?
) {
  # here you put everything from above
  results <- data.frame(
    sample_size = NULL,
    se = NULL
  )
  # etc.

  # then don't forget to return
  return(plot(results$sample_size, results$se))
}
0.2.7 Exercise
Try out the function with two plots: when the population SD is 5 and when you do 10 draws. What changes?
sample_against_se(sd = 5)
sample_against_se(draws = 10)
0.2.8 Exercise
The average height of men in Spain is 173cm (says Wikipedia). The population standard deviation is probably around 7cm (source).
You draw a sample of men and want to test whether they’re significantly different from that mean (our H0). In fact, these men you have sampled are truly French (175.6cm, our true “effect size”). In other words, can we reject the null hypothesis that these men we sampled come from the Spanish distribution in favor of our alternative hypothesis that the true population value is greater than the Spanish population mean?
You calculate the z-statistic: \(z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}\). This simply tells us how far our sample mean is from the population mean (well, the suspected population mean under the null hypothesis) in terms of standard errors. \(\bar{X}\) is the sample mean, \(\mu_0\) is the population mean under H0, \(\sigma\) is the population standard deviation, and \(N\) is the sample size.
Then we can look up the z-score to see what the probability is of scoring this high or higher (does that definition ring a bell?). In R, you can simply do that with a built-in function: pnorm(). For example, if we have a z-score of 1.645, our probability of obtaining such a value (or higher) is pnorm(1.645, lower.tail = FALSE) = 0.0499849 – our p-value for a one-sided test.
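To make the formula concrete (a quick check with made-up numbers, not part of the exercise): suppose a sample of 30 men happens to have a mean of exactly 175.6cm. Then:

z <- (175.6 - 173) / (7 / sqrt(30)) # about 2.03 standard errors above the null mean
pnorm(z, lower.tail = FALSE) # roughly 0.021, so this sample would be significant at the 5% level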
We can simulate the power of our statistical test (in this case, the z-statistic). Take a sample of 30 people from the French population, calculate the z-statistic, its p-value, and store the p-value. Do this 1,000 times. Plot the distribution of p-values. What can we conclude about the sample size?
n <- 30 # sample size
h0 <- 173 # test against this null
h1 <- 175.6 # true population effect
sd <- 7
alpha <- 0.05
draws <- 1e3

pvalues <- NULL

for (i in 1:draws) {
  d <- rnorm(n, h1, sd)
  z <- (mean(d) - h0) / (sd / sqrt(n))
  pvalues[i] <- pnorm(z, lower.tail = FALSE)
}
plot(density(pvalues))
This is all much easier than it appears at first sight. Start with declaring your variables:

n <- ? # sample size
h0 <- ? # test against this null (read the description again to know which height is the null)
h1 <- ? # true population effect (true height)
sd <- ? # population SD
draws <- ? # how many draws?
Now you do what you've been doing this whole time: Use a loop to draw a sample, calculate the z-statistic (this could be any statistical test), extract its p-value, and store that p-value somewhere.

# somewhere to store the p-values
pvalues <- NULL

# then our loop
for (i in 1:draws) {
  d <- rnorm(?, ?, ?) # get a sample with the variables you declared above
  z <- ? # translate the formula above into R code (?sqrt)
  pvalues[?] <- ? # get the p-value with the code from the exercise instructions, but replace 1.645 with the z-value you calculated
}
That's it. Have a look at how the p-values are distributed with plot(density(pvalues)).
0.2.9 Exercise
Now calculate the proportion of p-values that are below 0.05. That's your power: the proportion of tests that will detect that there's a true effect. In our case, that effect is a difference of 2.6cm. (Tip: Count or sum how many p-values are below 0.05 and divide by the total number of p-values.)
sum(pvalues < 0.05)/length(pvalues)
[1] 0.668
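A small aside: since R treats TRUE as 1 and FALSE as 0, taking the mean of a logical vector gives you the same proportion directly:

mean(pvalues < 0.05) # proportion of significant tests, i.e. power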
0.2.10 Exercise
Now let's do what we did before: Put the loop from above inside another loop that iterates over different sample sizes. Then put it all into a function that lets you set the parameters of interest (sample size range, h0, h1, etc.). Then simulate power (1,000 simulations each) for samples between 30 and 100. Plot the sample size against power.
power_function <- function(
  h0 = 173,
  h1 = 175.6,
  sd = 7,
  alpha = 0.05,
  draws = 1e3,
  min_sample = 30,
  max_sample = 100
) {
  results <- data.frame(
    sample_size = NULL,
    power = NULL
  )

  for (i in min_sample:max_sample) {
    pvalues <- NULL

    for (j in 1:draws) {
      d <- rnorm(i, h1, sd)
      z <- (mean(d) - h0) / (sd / sqrt(i))
      pvalues[j] <- pnorm(z, lower.tail = FALSE)
    }

    results <- rbind(
      results,
      data.frame(
        sample_size = i,
        power = sum(pvalues < alpha)/length(pvalues)
      )
    )
  }

  return(plot(results$sample_size, results$power, type = "l"))
}
power_function()
You basically do the variable declaration inside function() like you've done before, then copy-paste everything else into the body of the function:

power_function <- function( # name as you please
  h0 = ?, # assign default values or scratch the "= ?" part
  h1 = ?,
  sd = ?,
  alpha = ?,
  draws = ?,
  min_sample = ?, # the minimum sample size
  max_sample = ? # the maximum sample size
) {
  # create a data frame to store results in
  results <- data.frame(
    sample_size = NULL,
    power = NULL
  )

  # first loop
  for (i in min_sample:max_sample) {
    # store p-values for each sample size
    pvalues <- NULL

    # second loop
    for (j in 1:draws) {
      # do what you did above and store the p-value of each draw in `pvalues`
    }

    # add the results for this sample size (aka pvalues)
    results <- rbind(
      results,
      data.frame(
        sample_size = ?,
        power = ? # proportion of p-values under 0.05
      )
    )
  }

  # don't forget to return the plot
  return(plot(results$sample_size, results$power, type = "l"))
}

# call the function
power_function()
0.2.11 Exercise
Now do the same thing with a one-sample t-test. (Tip: You only need to replace the z-score calculation with a t.test from which you can extract the p-value.) (Another tip: Use the $ sign on the object where you stored the t-test result.)
power_function_t <- function(
  h0 = 173,
  h1 = 175.6,
  sd = 7,
  alpha = 0.05,
  draws = 1e3,
  min_sample = 30,
  max_sample = 100
) {
  results <- data.frame(
    sample_size = NULL,
    power = NULL
  )

  for (i in min_sample:max_sample) {
    pvalues <- NULL

    for (j in 1:draws) {
      d <- rnorm(i, h1, sd)
      t <- t.test(d, mu = h0, alternative = "greater")
      pvalues[j] <- t$p.value
    }

    results <- rbind(
      results,
      data.frame(
        sample_size = i,
        power = sum(pvalues < alpha)/length(pvalues)
      )
    )
  }

  return(plot(results$sample_size, results$power, type = "l"))
}

power_function_t()
Instead of calculating the z-value yourself, you run t.test(data, mu = null hypothesis value, alternative = "greater"), then extract the p-value from the stored result with $p.value. All you need to do is change two lines in the code from above.
0.2.12 Exercise
Just for funsies (and for our next session), see what happens when the true effect is only 1cm in difference.
power_function_t(h1 = h0+1)