CMBBHT Package Tutorial

Posted 5/10/2017 7:52:19 PM
Tags: R Statistics Bayesian

This is a tutorial that shows an in-depth example of using the R package CMBBHT. The package name is an (excessive) acronym that expands to Cell Means Based Bayesian Hypothesis Tests. You can download the main package manual as a PDF.

CMBBHT provides a reasonably simple interface for performing Bayesian hypothesis tests in designs where cell means, differences between cells, or something similar are estimated in a factorial design. Factorial designs are very common in psychology. Many studies analyze their data directly with analysis of variance (ANOVA). I tend to use various kinds of psychometric model, so instead of having data to analyze, I instead have parameter estimates from the fitted model. It isn’t good practice to use ANOVA on parameter estimates, so I needed some other way to perform tests of main effects and interactions in parameter values for my psychometric models.

I worked out some not-too-complex logic that allows for the estimation Bayes factors related to tests of main effects and interactions (generically, ANOVA effects or just effects). The logic is explained in the package manual (see link above) along with code examples that perform the steps in the procedure.

This document contains an in-depth tutorial showing how to use CMBBHT with an example model. I will specify the model, fit it to simulated data, and show how to use CMBBHT to test hypotheses about model parameters. Additional topics are discussed, including how to deal with different kinds of factorial design and how to analyze within-participants and between-participants data.


This tutorial works with a binomial model for a factorial within-participants (repeated measures) design, although I will also show how to analyze between-participants designs without changing the model. Each participant will perform multiple trials of a task in each of several different experimental conditions. Each trial will result in a success or a failure, so the data are binomial, with the number of successes and the number of trials for each participant by experimental condition cell being the dependent variable. We are primarily interested in the effect of experimental condition, but we want to also estimate baseline participant ability to account for differences between participants and to better account for the differences between experimental conditions.

The data level of the model is

\[ S_{ij} \sim \text{Binomial}(N_{ij}, P_{ij}) \]

where \(S_{ij}\) are the number of successes for the \(i\)th participant in the \(j\)th experimental condition, \(N_{ij}\) is the number of trials, and \(P_{ij}\) is the probability of a success. Rather than directly estimating \(P_{ij}\), it is much more parsimonious to estimate a participant parameter and a condition parameter and combine those two parameters to get \(P_{ij}\). This is done as follows

\[ P_{ij} = \text{logit}^{-1}(\rho_i + \delta_j) \] where \(\text{logit}^{-1}\) is the CDF of the logistic distribution (i.e. plogis in R), \(\rho_i\) is the participant parameter and \(\delta_j\) is the condition parameter. The use of the logit transform guarantees that \(P_{ij}\) is bounded between 0 and 1 while \(\rho_i\) and \(\delta_j\) can take on any value.

To obtain hierarchical constraint on the estimation of the participant parameters, \(\rho_i\), a hierarchical prior is used. Note that I parameterize Normal distributions in terms of variance, while stan and R parameterize them in terms of standard deviation.

\[ \begin{aligned} \rho_i &\sim \text{Normal}(\mu_\rho, \sigma^2_\rho) \\ \mu_\rho &\sim \text{Normal}(0, 2) \\ \sigma^2_\rho &\sim \text{Inverse Gamma}(2, 2) \end{aligned} \]

Notice that an additional advantage of using the logit transformation and allowing \(\rho_i\) to take on any value is that a Normal prior can be placed on \(\rho_i\). If \(\rho_i\) was bounded between 0 and 1, a different prior would have to be used, probably with loss of conjugacy. In this case, the prior on \(\rho_i\) is conjugate.

The condition parameters, \(\delta_j\), have a moderately informative Cauchy prior placed on them.

\[ \delta_j \sim \text{Cauchy}(0, 1) \]

Note that the priors on \(\rho_i\) and \(\delta_j\) are not uninformative, instead being moderately informative. The use of moderately informative priors, or at least not uninformative priors, is very important to the calcuation of meaningful Bayes factors. See the Effect of Uninformative Priors section of the CMBBHT manual for more information on the effect of uninformative priors. See the Choosing Priors on \(\delta\) section of this document for information on how this specific prior was chosen.

As the model is written so far, the mean of the \(\delta_j\) parameters and the mean of the \(\rho_i\) parameters could trade off perfectly. To allow for identification of the parameters, it is neccessary to place an additional constraint on those parameters. I will use my favorite constraint, which is a cornerstone parameterization in which one of the \(\delta_j\) is set to the constant value 0. For this tutorial, I will set

\[ \delta_1 = 0 \] to identify the parameters of the model.

Stan Model

The model was specified and fit using stan through the rstan package interface. The stan model specification is included here for completeness. You can probably skip it for now.

data {
  int<lower=0> I; // Number of participants
  int<lower=0> J; // Number of conditions

  int<lower=0> trials[I,J]; // Number of trials
  int<lower=0> success[I,J]; // Number of successes
  int<lower=0> nuc; // Number of unique conditions
  // A vector where a 0 means that condition is the cornerstone condition.
  // All matching nonzero values share the same delta parameter value.
  int deltaEq[J]; 
parameters {
  real rho[I];
  real delta_base[nuc]; // Does not include the cornerstone condition
  real mu_rho;
  real<lower=0.0001> var_rho;
transformed parameters {
  real delta[J]; // Includes the cornerstone condition
  real<lower=0.0001, upper=0.9999> P[I,J];
  for (j in 1:J) {
    // Copy from delta_base to delta, observing the cornerstone condition
    int deltaInd = deltaEq[j];
    if (deltaInd >= 1) {
      delta[j] = delta_base[deltaInd];
    } else {
      delta[j] = 0;
    for (i in 1:I) {
      P[i,j] = inv_logit(rho[i] + delta[j]);
model {

  mu_rho ~ normal(0, sqrt(2));
  var_rho ~ inv_gamma(2, 2);
  // Prior on the non-cornerstone deltas
  for (k in 1:nuc) {
    delta_base[k] ~ cauchy(0, 1);
  for (i in 1:I) {
    rho[i] ~ normal(mu_rho, sqrt(var_rho));
    for (j in 1:J) {
      success[i,j] ~ binomial(trials[i,j], P[i,j]);

Choosing Priors on \(\delta\)

One of the difficulties in choosing reasonable moderately-informative priors on the \(\delta_j\) is that fact that probabilities of success, \(P_{ij}\), are related to \(\delta_j\) with the following expression (restated from the model definition).

\[ P_{ij} = \text{logit}^{-1}(\rho_i + \delta_j) \] To help understand how \(P\) changes with \(\delta\), define

\[ \lambda = \text{logit}^{-1}(\rho + \delta) - \text{logit}^{-1}(\rho + 0) \] Thus, \(\lambda\) is the difference in the probability space between \(\delta\) having some value versus \(\delta\) being 0, i.e. how \(P\) changes as \(\delta\) changes. Note that \(\lambda\) is defined for some \(\rho\), as the effect that \(\delta\) has on \(P\) depends on \(\rho\).

lambda = function(rho, delta) {
    plogis(rho + delta) - plogis(rho)

Using \(\lambda\), we can plot how \(P\) changes with \(\delta\) for some value of \(\rho\), which is shown in the left panel below for \(\rho = 0\), which transforms to a \(P\) of 0.5 if \(\delta = 0\). We can also plot \(\lambda\) with respect to \(\rho\) for for a constant value of \(\delta\), which is shown in the right panel.


deltas = seq(-3, 3, 0.01)
plot(deltas, lambda(rho=0, deltas), type='l', xlab=bquote(delta), 
  ylab=bquote(lambda), main=bquote(rho*" = "*0))

rhos = seq(-3, 3, 0.01)
plot(rhos, lambda(rhos, delta=2), type='l', xlab=bquote(rho), 
  ylab=bquote(lambda), main=bquote(delta*" = "*1))

The left plot shows that \(\lambda\) is a nonlinear function of \(\delta\), with the shape indicating that extreme values of \(\delta\) result in diminishing returns. The right plot shows the interesting result that a fixed value of \(\delta\) affects \(P\) most strongly when \(\rho + \delta / 2 = 0\). Since the \(\delta_j\) are the same for all participants, this indicates that participants with different \(\rho_i\) will experience different changes in \(P\) for the same \(\delta\).

With the preceding information in mind, it should be clear why choosing a moderately informative prior on \(\delta\) requires a little thought.

The value of the location parameter of the prior on \(\delta\) can be easily chosen to be 0. The Cauchy distribution is symmetrical, so choosing a location of 0 indicates the belief that the most likely outcome is that there is no difference between any of the task conditions (the mode of the Cauchy is at its location). Thus, the focus is on the scale parameter of the Cauchy.

I will use the strategy of varying \(\rho\) and the prior scale, sampling from the prior on \(\delta\), and calculating \(\lambda\). Plotted below are histograms of \(\lambda\) for some different values of \(\rho\) and the prior scale of \(\delta\).

sampleLambdaPrior = function(rho, loc0, scale0, n) {
  delta = rcauchy(n, loc0, scale0)
  lambda(rho, delta)

par(mfrow=c(2,2), mar=c(4.5,4,1.5,1))

ce = sampleLambdaPrior(0, 0, 1, n=1e5)
hist(ce, main=bquote(rho*" = "*0*", scale = "*1), prob=TRUE, xlab=bquote(lambda))

ce = sampleLambdaPrior(2, 0, 1, n=1e5)
hist(ce, main=bquote(rho*" = "*2*", scale = "*1), prob=TRUE, xlab=bquote(lambda))

ce = sampleLambdaPrior(0, 0, 3, n=1e5)
hist(ce, main=bquote(rho*" = "*0*", scale = "*3), prob=TRUE, xlab=bquote(lambda))

ce = sampleLambdaPrior(2, 0, 3, n=1e5)
hist(ce, main=bquote(rho*" = "*2*", scale = "*3), prob=TRUE, xlab=bquote(lambda))

A \(\rho\) of 0 corresponds to a \(P\) of 0.5 and a \(\rho\) of 2 corresponds to a \(P\) of 0.88. I find the distributions with a scale of 3 to put too much mass on extreme values of \(\lambda\), while a scale of 1 places a more appropriate amount of mass of moderate values of \(\lambda\). In addition, a scale of 1 is wide enough that when \(\rho\) is high, that the prior density is still reasonably high for large absolute magnitude values of \(\lambda\).

Fitting the Model

The model is implemented in Stan, a Bayesian modeling language, using the rstan package. A link to the Stan model file can be found at the end of this post. In addition, I use a couple of helper functions, sampleData, fitModel, and sampleDeltaPriors, which I include here, but which you can probably skip without much loss.


sampleData = function(trueRho, trueDelta, trialsPerCell = 50) {
  I = length(trueRho)
  J = length(trueDelta)
  success = trials = matrix(trialsPerCell, nrow=I, ncol=J)
  for (i in 1:I) {
    for (j in 1:J) {
      p = plogis(trueRho[i] + trueDelta[j])
      success[i,j] = rbinom(1, trials[i,j], p)
  list(success=success, trials=trials)

fitModel = function(success, trials, deltaEq = NULL) {
  I = nrow(success)
  J = ncol(success)
  if  (is.null(deltaEq)) {
    deltaEq =, J - 1)
  mod_data = list(I=I, J=J, trials=trials, success=success,
                  deltaEq = deltaEq)
  mod_data$nuc = length(unique(mod_data$deltaEq)) - 1
  rval = list(mod_data = mod_data)
  rval$fit = stan(file = 'binomialModel.stan', data = mod_data, 
             iter = 5000, chains = 2, control = list(adapt_delta = 0.8))
  rval$param = extract(rval$fit, permuted=TRUE)
  rval$iterations = nrow(rval$param$delta)


sampleDataAndFitModel = function(trueRho, trueDelta, trialsPerCell = 50, deltaEq = NULL) {
  d = sampleData(trueRho, trueDelta, trialsPerCell)
  fitModel(d$success, d$trials, deltaEq)

Choose true parameter values and some other basic configuration for the data simulation.


# Sample true rhor values
trueRho = rnorm(20, 0, 0.3)

# Specify true delta values
trueDelta = c(0, 0.5, 0.5, 0.5)

# Number of trials per participant by condition cell
trialsPerCell = 50

Given the true parameter values, sample data and then fit the model with that data.

dat = sampleData(trueRho, trueDelta, trialsPerCell)
##       [,1] [,2] [,3] [,4]
##  [1,]   19   30   30   30
##  [2,]   21   34   33   31
##  [3,]   33   33   41   37
##  [4,]   22   35   31   34
##  [5,]   30   29   27   33
##  [6,]   30   41   38   39
##  [7,]   24   33   30   30
##  [8,]   23   27   24   25
##  [9,]   24   40   29   31
## [10,]   22   29   31   34
## [11,]   32   34   36   33
## [12,]   31   33   25   28
## [13,]   22   36   36   31
## [14,]   27   30   33   35
## [15,]   26   34   29   29
## [16,]   31   38   37   32
## [17,]   27   29   28   32
## [18,]   17   20   29   22
## [19,]   33   28   32   37
## [20,]   24   24   29   30
fit = fitModel(dat$success, dat$trials)

Perform some basic diagnostics on the fitted model. First, visually check convergence of some of the parameters, which looks great.

par(mfrow=c(2,2), mar=c(4.5, 4.5, 0.5, 0.5))
plot(fit$param$rho[,10], type='l')
plot(fit$param$delta[,2], type='l')
plot(fit$param$P[,5,3], type='l')
plot(fit$param$mu_rho, type='l')

Second, are the \(\rho_i\) values properly recovered? We know the true parameter values used to generate the data, so compare those values to the posterior means from the fitted model.

par(mar=c(5, 4, 0, 0))
recRho = apply(fit$param$rho, 2, mean)
cor(trueRho, recRho)
## [1] 0.9132035
plot(trueRho, recRho)

Third, as inference will focus on the effects of task condition, check that the \(\delta_j\) parameter values are properly recovered.

recDelta = apply(fit$param$delta, 2, mean)
## [1] 0.0 0.5 0.5 0.5
round(recDelta, 2)
## [1] 0.00 0.49 0.45 0.47

In all cases, the diagnostics suggest that the model fitting was successful within reasonable tolerances. Remeber that noise is added during the data sampling step, so the recovered parameter values are likely to vary somewhat from the true parameter values just due to data sampling noise.

Collecting the Pieces

I want to test the hypothesis that the conditions differed from one another. This is conceptually equivalent to a one-way ANOVA, just on model parameters rather than data. We are interested in the differences between task conditions. Those differences are accounted for in the model by the condition parameters, \(\delta_j\). Thus, inference will focus on those parameters. For this test, no information about the participant parameters is required as the participant parameters are the same in all task conditions.

As an outline, to test this hypothesis with CMBBHT, you need:

  1. A matrix of samples from the prior distribution of the parameters of interest (e.g. \(\delta_j\)). This is usually just a mechanical process that requires a few lines of code (10 or less for simple cases and perhaps 50 lines for a complex case with hierarchical priors).
  2. A matrix of samples from the posterior distribution of the parameters of interest. You already have to sample these parameters to do most Bayesian analyses, so the only work you have to do here is to put the parameters into a matrix if they are not already in a matrix.
  3. A data frame with a number of rows equal to the number of parameters of interest and a column for each factor of the design. Each row provides the levels of the factors that correspond to the parameters of interest. Examples of this will be given and, as you will see, this is typically very easy to create.

Piece 1: Prior Samples

In this case, sampling from the priors is very simple. The prior on \(\delta_j\) is a Cauchy with location 0 and scale 1, except for \(\delta_1\), which is set to 0. To sample from the prior, we will use the sampleDeltaPriors function defined below.

sampleDeltaPriors = function(nCond, iterations) {
  # Set matrix to 0 and skip over the cornerstone condition
  pr = matrix(0, nrow=iterations, ncol=nCond)
  for (j in 2:nCond) {
    pr[,j] = rcauchy(iterations, 0, 1)

prior = sampleDeltaPriors(ncol(fit$param$delta), fit$iterations)

## [1] 5000    4
##      [,1]        [,2]       [,3]        [,4]
## [1,]    0  -0.3898615 -0.1026647  -0.5934911
## [2,]    0  -0.4188521 13.1088383   1.2610299
## [3,]    0   1.8005084  1.4693157  -0.8646756
## [4,]    0 -23.8958263 -0.5687443 -17.7704449
## [5,]    0  52.4804025 -3.2195704  -2.8649549
## [6,]    0  -0.7585241 -2.7939867  -6.7815133

As you can see, the first column of prior is all 0s because that is the cornerstone condition.

Piece 2: Posterior Samples

The samples from the posterior of the \(\delta_j\) were taken during model fitting and are stored in fit$param$delta.

post = fit$param$delta

## [1] 5000    4
## iterations [,1]      [,2]      [,3]      [,4]
##       [1,]    0 0.4269326 0.3942929 0.4208992
##       [2,]    0 0.3782313 0.3426868 0.3615241
##       [3,]    0 0.5587148 0.5235968 0.4784190
##       [4,]    0 0.4845315 0.3089328 0.3378329
##       [5,]    0 0.6440234 0.5422284 0.4982239
##       [6,]    0 0.5507246 0.5368799 0.5900135

For the default method of estimating the Bayes factor, the dimensions of the prior and posterior matrices should be the same. It is possible to use different numbers of prior and posterior samples, but for the defaults of CMBBHT, the number of iterations should be the same. You will get a warning if they are not equal.

all(dim(prior) == dim(post))
## [1] TRUE

Piece 3: Factors

We also need to create a data frame that tells CMBBHT about what factor levels each condition is related to. We are assuming a one-factor design, so we will make a data frame with one factor, called f.

factors = data.frame(f = 1:ncol(prior))
##   f
## 1 1
## 2 2
## 3 3
## 4 4

Even though the design is just a one-factor design, it is still neccessary to specify the design. I will give examples of specifying factors for multi-factor designs later.

Testing the Hypothesis

With the three pieces prior, post, and factors set up, we can test a hypothesis about whether there is a main effect of the single factor. This test uses the function testHypothesis from CMBBHT.


ht = testHypothesis(prior, post, factors, "f")
## $success
## [1] TRUE
## $bf01
## [1] 0.01261597
## $bf10
## [1] 79.26461
## $pKept
## [1] 0.9

The result of testHypothesis is (by default) a list with some elements. bf01 and bf10 are Bayes factors related to the hypothesis being tested. For bf01, “01” means that the Bayes factor is for the null hypothesis (0) over the alternative hypothesis (1), thus it is a Bayes factor in favor of the null hypothesis. For bf10, it is for the alternative hypothesis (1) over the null (0), so bf10 is the Bayes factor in favor of the hypothesis that there is an effect.

The Bayes factor in favor of there being a main effect, bf10, is 79.3, which is clear evidence in favor of there being a main effect, which is plausible as an effect was built into the true parameter values.

Notice that this is the first place we have used anything from the CMBBHT package. Sampling from the priors and posteriors and specifying the factorial design are done independently of CMBBHT. You are free to specify and fit your model with few restrictions. The two most important restrictions are that 1) your model includes cell means or something like cell means that account for differences between conditions and 2) that the priors on the parameters of interest are not uninformative.

See the Effect Parameters section of the package manual for information on performing follow up tests, such as pairwise comparisons of conditions.

The Cornerstone Condition

In the test performed above, the cornerstone parameter values were included in the test. The \(\delta\) parameter in the cornerstone condition is the constant value 0. A parameter with constant value 0 doesn’t seem like it should be important, which leads to a temptation to exclude the cornerstone condition. However, it is wrong to exclude the cornerstone condition, for reasons that will be demonstrated in an example here.

The cornerstone condition is at index 1, so exclude the first \(\delta\) parameter.

prior = prior[ , -1]
post = post[ , -1]
factors = data.frame(f = 1:ncol(prior))

ht = testHypothesis(prior, post, factors, "f")
## [1] 498.8274

When we perform this test, we find strong evidence in favor of the null hypothesis of no effect. The reason for differing results depending on the inclusion or exclusion of the cornerstone condition is that there is no difference between the non-cornerstone conditions in this example. The true condition effects are in trueDelta. As we can see, the non-cornerstone conditions all have the same difference from the cornerstone condition.

## [1] 0.0 0.5 0.5 0.5

This means that the non-cornerstone conditions do not differ from one another. When a test is performed on only the non-cornerstone conditions, all of those conditions have the same value of \(\delta\). Thus, there is no main effect when only the non-cornerstone conditions are considered. When the cornerstone condition is included, however, it is possible to detect that there is a difference between the cornerstone condition and the non-cornerstone conditions. True effects can be missed if the cornerstone condition is not included, so be sure to include it.

If you use a different constraint than a cornerstone parameterization, such as a sums-to-zero constraint, make sure that you include all of the condition effect parameters, not just the estimated parameters. In general, you should include one parameter per cell of the experimental design.

Factorial Design

Let us imagine that our design with four levels is actually a 2x2 factorial design. We will get the priors and posteriors as before.

prior = sampleDeltaPriors(ncol(fit$param$delta), fit$iterations)
post = fit$param$delta

When we create the factors data frame, we create two factors, let and num.

factors = data.frame(
  let = c('a', 'a', 'b', 'b'), 
  num = c( 1,   2,   1,   2)

##   let num
## 1   a   1
## 2   a   2
## 3   b   1
## 4   b   2

The critical thing whenever you make factors is that the rows of factors correspond to the columns of the prior and post. The jth column of prior and post corresponds to the jth row of factors. Thus j = 3 corresponds to the b level of the let factor and the 1 level of the num factor.

Since there are two factors, we can test 1) the main effect of let, 2) the main effect of num, and 3) the interaction between let and num. We can do those tests as follows.

testHypothesis(prior, post, factors, "let")$bf10     # Main of let
testHypothesis(prior, post, factors, "num")$bf10     # Main of num
testHypothesis(prior, post, factors, "let:num")$bf10 # Interaction

Note that interactions are specified by putting a colon between the factor names. The order of the factors is irrelevant.

Alternately, you can use the testHypotheses (plural) convenience function to test multiple hypotheses at once.

testHypotheses(prior, post, factors, c("let", "num", "let:num"))
##   testName success       bf01      bf10 pKept
## 1      let    TRUE 0.11174380  8.949042   0.9
## 2      num    TRUE 0.02089819 47.851024   0.9
## 3  let:num    TRUE 0.34268182  2.918159   0.9

If you have many effects to test, you may want to leave the specific tests unspecified, in which case all possible effects are tested. Sometimes, however, you might want to perform only a subset of all possible tests.

Unbalanced Factorial Design

This seems like a bad example because it is too simple of a design.

Let us imagine that our design with four cells is a 2x3 factorial design, but with two missing cells. We will get the priors and posteriors as before.

prior = sampleDeltaPriors(ncol(fit$param$delta), fit$iterations)
post = fit$param$delta

factors = data.frame(
  let = c('a', 'b', 'b', 'b'), 
  num = c( 1,   1,   2,   3)

factors still has the same two factors, but the factors have different levels. Note that the design is such that an interaction cannot be estimated, only the two main effects.

Below is a table of the true \(\delta\) values and how they map onto the cells of the design.

/ 1 2 3
a 0 . .
b 0.5 0.5 0.5

This table implies that we should find a main effect of let because the \(\delta\) is 0 for let = a and 0.5 for let = b. We may find a main effect of num because the marginal mean \(\delta\)s for num 1, 2, and 3 are 0.25, 0.5, and 0.5, respectively, but it should be a weak effect if present. We find:

testHypothesis(prior, post, factors, "let")$bf10
## [1] 56498.76
testHypothesis(prior, post, factors, "num")$bf10
## [1] 0.353742

If you were totry to test the interaction, you would get a helpful error message telling you that the design lacks the appropriate cells to be able to estimate interaction terms.

Given that the design is unbalanced, the result of the tests depends on what design matrix you choose to use. For more information about unbalanced designs, see the Non-Fully Crossed/Unbalanced Designs section of the CMBBHT manual. For example, you can use a design matrix with both main effects.

testHypothesis(prior, post, factors, "let", dmFactors = "let")$bf10
## [1] 56498.76
testHypothesis(prior, post, factors, "let", dmFactors = " ~ let + num")$bf10
## [1] 3020.122

Naturally, changing the design matrix affects the results when using non-orthogonal contrasts.

Between-Participants Design

Assume that we now have a mixed design with a between-participants factor with two levels and a within-participants factor with three levels. In this case, the analysis becomes more complex. It is possible, however, to analyze this design without modifying the model, which is convenient, but requires a minor fudge as will be explained.

Choosing the \(\rho\) Prior

One important difference between fully within-participants designs and between-participants or mixed between/within designs is that the prior on \(\rho\), and neccessarily also \(\mu_\rho\), becomes relevant. With a fully within-participants design, the only information needed to test the main effect of the within-participants factor are the \(\delta_j\). With a between-participants or mixed design, the mean of \(\rho\) is also important to determine which groups differ from one another. It is not useful to use uninformative priors on \(\rho\) (or \(\mu_\rho\)), so I will choose moderately-informative priors.

The following code allows you to play around with the values of the priors on \(\mu_\rho\) and \(\sigma^2_\rho\), both of which contribute to the prior on \(\rho\). Since \(\rho\) is transformed to the probability space to become the manifest quantity \(P\), \(\rho\) is harder to interpret than \(P\), so I will plot \(P\) rather than \(\rho\). The priors I have used result in an approximately uniform prior on \(P\), which can be seen in the left panel below.

sampleRhoPrior = function(mu0, var0, a0, b0, n = 1e5) {
  mu_rho = rnorm(n, mu0, sqrt(var0))
    var_rho = MCMCpack::rinvgamma(n, a0, b0)
    rnorm(n, mu_rho, sqrt(var_rho))

par(mfrow=c(1,2), mar=c(5, 4, 0.25, 0.5) )
rhoPrior = sampleRhoPrior(mu0=0, var0=sqrt(2), a0=2, b0=2)
hist(plogis(rhoPrior), main="", prob=TRUE, xlab="P")

rhoPrior = sampleRhoPrior(mu0=2, var0=1^2, a0=3, b0=0.5)
hist(plogis(rhoPrior), main="", prob=TRUE, xlab="P")

If you don’t like the shape of the prior I used, I encourage you to play around with the prior parameter values until you find a shape that you like. For example, if chance performance in your task results in 50% accuracy, you might want there to be only limited prior mass below 0.5, such as shown in the right panel above.

Fitting the Model

Once priors have been chosen, the next step is to fit the model separately for each group of participants. Note that the two groups’ \(\rho\) parameters are sampled with a different mean which should result in there being a main effect of group. In addition, within each group, there are three levels of some within-participants factor. Thus, this is a 2 (group) X 3 (condition) design.


# Group A
trueRho = rnorm(20, 0, 0.3)
trueDelta = c(0, -0.5, 0.5)

trialsPerCell = 50

grpA = sampleDataAndFitModel(trueRho, trueDelta, trialsPerCell)

# Group B
trueRho = rnorm(20, 0.5, 0.3)
trueDelta = c(0, 0.5, -0.5)

grpB = sampleDataAndFitModel(trueRho, trueDelta, trialsPerCell)

fitList = list(A = grpA, B = grpB)

Combining Group Means and Condition Effects

Once the model has been fitted, the next step is to collect prior and posterior parameter matrices. This is more complex than the within-participants designs because the analysis depnds on not only the condition parameters, \(\delta_j\), but also the value of \(\rho\) in each group.

We are interested in whether \(P\) depends on group. In order to calculate \(P\) in each group by condition cell of the design, we must combine the value of \(\rho\) of that group with the \(\delta_j\) for that condition. But what value of \(\rho\) should we use: The individual \(\rho_i\) or the mean of the \(\rho\) parameters, \(\mu_\rho\)?

As I will show later, using the individual \(\rho_i\) is equivalent to testing whether the sample means of \(\rho\) differ while using \(\mu_\rho\) is equivalent to testing whether the population means of \(\rho\) differ. We obviously want to test whether the population means of \(\rho\) differ, so we will use \(\mu_\rho\).

The function betweenParticipantsPieces collects the neccessary information to test hypotheses using CMBBHT for this model. The process is fairly mechanical, so I will not go into much detail about it here. The key point is that we are using \(\mu_\rho\) rather than the individual participant \(\rho\) to calculate \(P\). Thus, the equation for \(P\) becomes

\[ \bar{P}_{jk} = logit^{-1}(\mu_{\rho k} + \delta_{jk}) \] where \(\bar{P}_{jk}\) is the mean \(P\) in the \(j\)th within-participants condition and \(k\)th between-participants group.

betweenParticipantsPieces = function(fitList) {
    # Create variables to be added to
    factors = prior = post = NULL
    for (grp in names(fitList)) {
        fit = fitList[[grp]]
        J = ncol(fit$param$delta)
        ## 3: Factors
        fact = data.frame(grp=grp, cond=1:J)
        factors = rbind(factors, fact)
        # Name the matrix columns for clarity
        cellNames = paste(grp, fact$cond, sep=":")
        ## 2: Posterior samples
        postP = matrix(NA, nrow=fit$iterations, ncol=J)
        for (j in 1:J) {
            # Note that the plogis transfomration is not strictly neccessary here 
            # or below for the prior. Normally, I would omit unneccessary 
            # transformations, but include it here for clarity purposes.
            postP[,j] = plogis(fit$param$mu_rho + fit$param$delta[,j])
        colnames(postP) = cellNames
        post = cbind(post, postP)
        ## 1: Prior samples
        prior_mu_rho = rnorm(fit$iterations, 0, sqrt(2))
        priorDelta = sampleDeltaPriors(J, fit$iterations)
        priorP = priorDelta # Same dimensions
        for (j in 1:J) {
            priorP[,j] = plogis(prior_mu_rho + priorDelta[,j])
        colnames(priorP) = cellNames
        prior = cbind(prior, priorP)
    list(prior=prior, post=post, factors=factors)

# Get the pieces and copy them out of the list
pieces = betweenParticipantsPieces(fitList)

prior = pieces$prior
post = pieces$post
factors = pieces$factors

The completed prior and posterior matrices have been collected, the resulting matrices have one parameter per cell of this 2 X 3 design.

##            A:1       A:2       A:3       B:1       B:2       B:3
## [1,] 0.5376251 0.3600798 0.6414104 0.6535019 0.7389250 0.5279756
## [2,] 0.4790691 0.3480230 0.6527458 0.6168422 0.7276099 0.5256906
## [3,] 0.5539470 0.3966356 0.6762792 0.7021535 0.7275307 0.5300302
## [4,] 0.4923122 0.3220939 0.6105332 0.6710803 0.7675352 0.5339942
## [5,] 0.4792395 0.3218828 0.5698502 0.6640706 0.7405649 0.5426193
## [6,] 0.4890836 0.3875872 0.6368493 0.6735276 0.7424994 0.5063129

In addition, we can verify that the columns of prior and post correspond to the rows of factors.

##   grp cond
## 1   A    1
## 2   A    2
## 3   A    3
## 4   B    1
## 5   B    2
## 6   B    3

Once the three pieces have been collected, testing hypotheses is as easy as usual. Note that when the effects to be tested are not specified, all possible effects are tested.

testHypotheses(prior, post, factors)
##   testName success         bf01         bf10  pKept
## 1      grp    TRUE 8.379104e-02 1.193445e+01 1.0000
## 2     cond    TRUE 1.370797e+01 7.295026e-02 0.9000
## 3 grp:cond    TRUE 3.535645e-11 2.828338e+10 0.9994

As usual, you can do follow-up tests, etc., with post and factors. Here I examine the grouping of the levels of the within-participants cond factor.

ep = getEffectParameters(post, factors, "cond")
## cond.2 A  
## cond.3 A B
## cond.1   B

Whether to use \(\rho_i\) or \(\mu_\rho\)

Earlier, I claimed that using the individual \(\rho_i\) parameters rather than \(\mu_\rho\) was equivalent to testing whether the sample means of \(\rho\) differed between groups. To be more precise, using the \(\rho_i\) is equivalent to testing that the sample means differ in the asymptote as the amount of data per participant increases. I will now argue that point. I will present an example that makes the point for me and then explain why the example generalizes.

I will do two things in this example:

  1. Assume a large, near infinite amout of data per participant.
  2. Sample true \(\rho\) values for the two groups from the same population.

In addition, I will make the \(\delta_j\) equal between the two groups to prevent any difference in \(\delta\) being picked up as a difference between the groups, for example if the \(\delta_j\) do not sum to the same value in both groups.


# Infinite data
trialsPerCell = 1e5

# Group A
trueRho_A = rnorm(20, 0, 0.3)
trueDelta_A = c(0, 0.5, -0.5)

grpA = sampleDataAndFitModel(trueRho_A, trueDelta_A, trialsPerCell)

# Group B
trueRho_B = rnorm(20, 0, 0.3)
trueDelta_B = c(0, 0.5, -0.5)

grpB = sampleDataAndFitModel(trueRho_B, trueDelta_B, trialsPerCell)

fitList = list(A = grpA, B = grpB)

As the boxplot below indicicates, the two groups’ \(\rho\) parameters have very similar distributions and differ slightly in terms of the median value.

boxplot(trueRho_A, trueRho_B, names = c("A", "B"), ylab=expression(rho) )

## [1] 0.04248714
## [1] 0.007129911

The mean values of \(\rho\) are also very close. No trained statistician would look at the boxplot and think that the population means are probably different from one another, with their logic being that the individual variability is substantially larger than the difference between the means.

The function below is performs the same operations as betweenParticipantsPieces except that it uses \(\rho_i\) rather than \(\mu_\rho\).

# This function uses indivudal rho parameters
betweenParticipantsPieces_participant = function(fitList) {

    prior = post = factors = NULL
    for (fn in names(fitList)) {
        fit = fitList[[fn]]
        I = ncol(fit$param$rho)
        J = ncol(fit$param$delta)
        ## 3: Factors
        # Cross participant by condition
        fact = expand.grid(part=1:I, cond=1:J)
        fact$grp = fn
        factors = rbind(factors, fact)
        cellNames = paste(fn, fact$part, fact$cond, sep=":")
        ## 2: Posterior samples
        postP = matrix(NA, nrow=fit$iterations, ncol=nrow(fact))
        for (rr in 1:nrow(fact)) {
            i = fact$part[rr]
            j = fact$cond[rr]
            postP[,rr] = plogis(fit$param$rho[,i] + fit$param$delta[,j])
        colnames(postP) = cellNames
        post = cbind(post, postP)

        ## 1: Prior samples
        # Sample prior rho
        # First sample from priors on mu_rho and var_rho
        mu_rho = rnorm(fit$iterations, 0, sqrt(2))
        var_rho = MCMCpack::rinvgamma(fit$iterations, 2, 2)
        # Then given samples from mu_rho and var_rho, sample from prior on rho
        priorRho = matrix(NA, nrow=fit$iterations, ncol=I)
        for (i in 1:I) {
            priorRho[,i] = rnorm(fit$iterations, mu_rho, sqrt(var_rho))
        # Sample prior delta
        priorDelta = sampleDeltaPriors(J, fit$iterations)
        # Combine rho and delta
        priorP = matrix(NA, nrow=fit$iterations, ncol=nrow(fact))
        for (rr in 1:nrow(fact)) {
            i = fact$part[rr]
            j = fact$cond[rr]
            priorP[,rr] = plogis(priorRho[,i] + priorDelta[,j])
        colnames(priorP) = cellNames
        prior = cbind(prior, priorP)
    list(factors = factors, prior = prior, post = post)

Using the same data and fitted models, use the \(\rho_i\) approach and the \(\mu_\rho\) approach.

bp_rho = betweenParticipantsPieces_participant(fitList)

bp_mu = betweenParticipantsPieces(fitList)

One difference between the approaches is that when using \(\rho_i\), you end up with \(I * J * K\) (20 * 3 * 2 = 120) \(P\) parameters versus the \(J * K\) (3 * 2 = 6) mean \(P\) parameters when using \(\mu_\rho\). Correspondingly, when using \(\rho_i\), factors has three columns, including a participant column (part).

## [1] 5000  120
## [1] 5000    6
bp_rho$post[1:5, 1:6]
##          A:1:1     A:2:1     A:3:1     A:4:1     A:5:1     A:6:1
## [1,] 0.4563319 0.4851040 0.6153142 0.5038579 0.5121217 0.6246215
## [2,] 0.4573581 0.4821227 0.6146024 0.5059636 0.5130461 0.6246294
## [3,] 0.4584233 0.4816291 0.6140732 0.5025642 0.5123754 0.6245598
## [4,] 0.4577095 0.4832275 0.6155299 0.5036245 0.5128561 0.6249182
## [5,] 0.4579964 0.4841479 0.6144522 0.5046210 0.5123991 0.6242278
##   part cond grp
## 1    1    1   A
## 2    2    1   A
## 3    3    1   A
## 4    4    1   A
## 5    5    1   A

Using both approaches, let’s test a main effect of the grp factor. In both cases, I will just print bf10.

testHypothesis(bp_rho$prior, bp_rho$post, bp_rho$factors, "grp")$bf10
## Warning in polspline::logspline(x, lbound = lbound, ubound = ubound): too
## much data close together
## Warning in polspline::logspline(x, lbound = lbound, ubound = ubound): re-
## ran with oldlogspline
## [1] 1.706966e+27
testHypothesis(bp_mu$prior, bp_mu$post, bp_mu$factors, "grp")$bf10
## [1] 0.1115926

When using \(\rho_i\), we find a very substantial Bayes factor in favor of there being a difference between the groups. This goes against statistical common sense based on the boxplot of true \(\rho\) parameters. When using \(\mu_\rho\), we instead find a Bayes factor in favor of there being no difference between the groups, which is more sensible.

To examine why these results differ so dramatically, I will examine the effect parameters related to the grp main effect.

# rho
ep_rho = getEffectParameters(bp_rho$post, bp_rho$factors, "grp")
eps_rho = summarizeEffectParameters(ep_rho)
##   effect         mean         2.5%          50%        97.5%
## 1  grp.A  0.004126142  0.003845378  0.004128266  0.004396831
## 2  grp.B -0.004126142 -0.004396831 -0.004128266 -0.003845378
# mu_rho
ep_mu = getEffectParameters(bp_mu$post, bp_mu$factors, "grp")
eps_mu = summarizeEffectParameters(ep_mu)
##   effect        mean        2.5%         50%      97.5%
## 1  grp.A  0.00415496 -0.03440798  0.00406272 0.04292497
## 2  grp.B -0.00415496 -0.04292497 -0.00406272 0.03440798

Notice that the posterior mean of the effect parameters is very similar regardless of whether \(\rho_i\) or \(\mu_\rho\) is used. This makes sense because \(\mu_\rho\) estimates the mean of the \(\rho_i\).

As shown by the following plots, however, the credible intervals for the effect parameters are strikingly different.


The top panel uses \(\rho_i\) and the bottom panel uses \(\mu_\rho\). When using \(\mu_\rho\), the credible intervals for the grp main effect parameters are much wider than when \(\rho_i\) is used. This difference in credible interval widths is directly comparable to the difference in the Bayes factors. Again, using \(\mu_\rho\) produces plausible results while using \(\rho_i\) produces implausible results. In particular, using \(\rho_i\) produces results that seem to be an answer to the question of whether the sample means differ.

The logic of why using \(\rho_i\) tests whether the sample means differ but using \(\mu_\rho\) tests whether the population means differ goes as follows.

  1. By using a near infinite amount of data per participant, we are able to know each \(\rho_i\) with arbitrary precision. As such, the posterior distributions of each \(\rho_i\) will be arbitrarily narrow, approximately a point mass.
  2. When calculating the condition effects for the grp main effect, the individual \(\rho_i\) are used. If, from point 1, all of the \(\rho_i\) are essentially point masses, the arithmetic mean of the \(\rho_i\) will also be essentially constant.
  3. The grp effect parameters are calculated as a weighted sum of the \(\rho_i\), which is essentially taking the arithmetic mean of the \(\rho_i\). Because the \(\rho_i\) are essentially constant, the grp effect parameters will also be essentially constant.
  4. If the grp effect parameters are essentially constant, then any difference in the effect parameters between the groups will be very large with respect to the variability of the effect parameters. As a result, there will be very strong evidence in favor of there being a main effect of grp.
  5. Given the preceding logic, using the \(\rho_i\) when testing for a main effect of grp is a test of whether the sample means of \(\rho_i\) differ between the groups.

Now I turn to the argument that using \(\mu_\rho\) allows for a test about differences in the population.

  1. Regardless of how precisely each individual \(\rho_i\) is known, there is still some degree of uncertainty about the means of the populations of \(\rho\).
  2. Imagine for a moment that you have a single observation, \(y_i\) of a real-valued variable for each of some number of participants. The typical statistical approach is to treat \(y_i\) as known and fixed.
  3. Even if all of the \(y_i\) are known and fixed, there is still uncertainty about the mean of the population that the \(y_i\) were sampled from.
  4. Having arbitrarily precise posterior knowledge about \(\rho_i\) is analogous to having known, fixed \(y_i\).
  5. Thus, even if the \(\rho_i\) are all known perfectly, you still don’t have complete certainty about the population mean of \(\rho_i\).
  6. Uncertainty about the population mean of \(\rho\) is captured by the posterior uncertainty of \(\mu_\rho\). This (loosely) follows from the fact that \(\mu_\rho\) is not calculated from the \(\rho_i\) but instead sampled based on the values of the \(\rho_i\).
  7. Given the preceding logic, using \(\mu_\rho\) allows inferences about differences between groups to be made while accounting for uncertainty about the population mean of \(\rho\). In other words, using \(\mu_\rho\) allows for inferences about the populations that participants were sampled from.

Problems with using \(\mu_\rho\)

Although using the individual \(\rho_i\) is very problematic, that does not prove that using \(\mu_\rho\) is correct. In fact, there is an issue with using \(\mu_\rho\) that I would like to address. That issue is that the model nowhere states

\[ \bar{P}_{j} = logit^{-1}(\mu_\rho + \delta_j) \] where \(\bar{P}_{j}\) is the mean \(P\) in a cell. The model states that

\[ P_{ij} = logit^{-1}(\rho_i + \delta_j) \]

and given that

\[ \bar{P}_j = \frac{1}{I} \sum_i P_{ij} \]

it follows, via substitution, that the model states that

\[ \bar{P}_{j} = \frac{1}{I} \sum_i logit^{-1}(\rho_i + \delta_j) \]

It would be nice if the value of \(\bar{P}_j\) would have the same value regardless of whether it is calculated in the exact way that the model states or in terms of \(\mu_\rho\).

The logit transformation is nonlinear, however, which means that there is no reason to believe that the following equality holds (even if \(\mu_\rho\) is equal to the arithmetic mean of \(\rho_i\), which it is not).

\[ logit^{-1}(\mu_\rho + \delta_j) = \frac{1}{I} \sum_i logit^{-1}(\rho_i + \delta_j) \]

Thus, the value of \(\bar{P}_j\) according to the model is not the same as the value of \(\bar{P}_j\) that I used to test hypotheses about differences between groups.

There are at least three ways to try to address this problem.

  1. Approximate Value: Show that \(\bar{P}_j\) has approximately the same value, to sufficient tolerance, regardless of how it is calculated.
  2. Dominance: Show that \(\bar{P}_j\) is greater for one group than for another group regardless of how it is calculated.
  3. Omitting Transformations: Do not use the logit transformation when performing tests, to work with latent parameters rather than manifest parameters.

These potention solutions are discussed in the following sections.

Approximate Value

The first option is to show that \(\bar{P}_j\) has approximately the same value regardless of how it is calculated. The comparison is between two options: Taking the mean of the transformed values, as you do when the \(\rho_i\) are used, or transforming the mean, which is what you do when \(\mu_\rho\) is used. For this approach, I will begin by taking many samples of latent parameter values (i.e. \(\rho_i\)) for a group.


n = 1e4 # Number of samples
nPart = 20 # Participants per group

lat1 = matrix(0, nrow=n, ncol=nPart)
for (i in 1:nPart) {
    lat1[,i] = rnorm(n, 0, 0.5)

I will calculate the mean of the transformed parameters (mt; what we are ideally interested in) and the transformation of the mean value (tm; not actually stated by the model).

trans = plogis

# Like using rho_i
meanTrans = function(x) {

# Like using mu_rho
transMean = function(x) {

mt1 = apply(lat1, 1, meanTrans)
tm1 = apply(lat1, 1, transMean)

I will plot tm against mt (left panel) to show how similar the overall values are. In addition, I plot the difference between tm and mt against the mean of tm and mt (right panel) to clearly show how the difference clearly depends on the magnitude.


plot(mt1, tm1)
abline(0, 1, col="red")

plot((mt1 + tm1) / 2, mt1 - tm1)
abline(0, 0, col="red")

Whether or not you consider tm and mt to have approximately the same value is up to you and depends to some extent on the magnitude of differences between groups that you are expecting. In addition, you might want to try different means and standard deviations of the populations that are sampled from to better approximate what you expect your observed parameter values to look like.


Moving on to dominance, we want to answer the following question: If mt for group 1 is greater than mt for group 2, is tm for group 1 greater than tm for group 2? If the answer to this question is “yes” nearly all of the time, then it may be the case that inferences about differences between group do not depend very much on whether mt or tm is used.

I will begin by sampling true latent parameter values (i.e. \(\rho_i\)) from another group. The first group had a population mean of 0, while this second group has a population mean of 0.3.

lat2 = matrix(0, nrow=n, ncol=nPart)
for (i in 1:nPart) {
    lat2[,i] = rnorm(n, 0.3, 0.5)

mt2 = apply(lat2, 1, meanTrans)
tm2 = apply(lat2, 1, transMean)

With samples from the second group, we can determine which cases in which mt is greater for group 1 than for group 2, and the same for tm.

mt_gt = mt1 > mt2
tm_gt = tm1 > tm2

mean(mt_gt == tm_gt)
## [1] 0.9976

At least for the settings used to sample the true parameter values that I used, it appears that both mt and tm go the same direction nearly all of the time.

When there is disagreement, the difference between the groups is very small compared to when there is agreement, as shown below.

whichDisagree = which(mt_gt != tm_gt)

sd((mt1 - mt2)[-whichDisagree])
## [1] 0.03725513
sd((mt1 - mt2)[whichDisagree])
## [1] 0.001217672
sd((tm1 - tm2)[-whichDisagree])
## [1] 0.03923604
sd((tm1 - tm2)[whichDisagree])
## [1] 0.002214189

This suggests that even when there is disagreement about which of tm or mt is greater for the two groups, the difference between the groups is small enough that a test of group differences is likely to support the hypothesis that the groups do not differ.

Omitting Transformations

Another way to be able to use \(\mu_\rho\) is not using the logit transformation when performing tests. This approach works on the logic that we don’t really care about whether \(P\) per se differs between groups, but just whether some quantity that combines information about \(\rho\) and \(\delta\) differs between groups.

If we let

\[ \eta_{ij} = \rho_i + \delta_j \]


\[ P_{ij} = logit^{-1}(\eta_{ij}) \]

without changing the model in any way. Rather than focus on \(P\), the inferential focus can be on \(\eta\). Focusing on \(\eta\) could be a loss in some cases, given that \(\eta = 10\) and \(\eta = 100\) both result in \(P \approx 1\). On the other hand, the equality

\[ \mu_\rho + \delta_j = \frac{1}{I} \sum_i (\rho_i + \delta_j) \]

is more likely to hold, or hold more accurately, than an equality in which the logit transformation is used (it would hold exactly if \(\mu_\rho\) were equal to the arithmetic mean of the \(\rho_i\), but that is typically not exactly the case). By working with \(\eta\) rather than \(P\), it is possible for \(\mu_\rho\) to fairly accurately substitute for the \(\rho_i\).