Updated: 2025-03-26
To run an allelic series test, there are 4 key inputs:
A numeric annotation vector, of the same length as the number of variants, coded as 1 for benign missense variants (BMVs), 2 for damaging missense variants (DMVs), and 3 for protein truncating variants (PTVs).
A covariates matrix, with as many rows as subjects and including columns such as age and sex. If omitted, defaults to an intercept only. If provided, an intercept will automatically be added if not detected.
A genotype matrix, with subjects as rows and variants as columns. The number of columns should correspond to the length of the annotation vector.
A numeric phenotype vector, either continuous or binary.
The example data used below were generated using the DGP
function provided with the package. The data set includes 100 subjects,
300 variants, and a continuous phenotype. The true effect sizes follow
an allelic series, with magnitudes proportional to
c(1, 2, 3) for BMVs, DMVs, and PTVs respectively.
set.seed(101)
n <- 100
data <- AllelicSeries::DGP(
  n = n,
  snps = 300,
  beta = c(1, 2, 3) / sqrt(n),
)
# Annotations.
anno <- data$anno
head(anno)## rs1 rs2 rs3 rs4 rs5 rs6 
##   1   1   2   2   1   2##      intercept         age sex          pc1        pc2         pc3
## [1,]         1 -0.77465210   1  0.539531581  1.2134116  0.63478280
## [2,]         1 -0.03492052   0  0.009270313  1.5256400 -0.62027221
## [3,]         1 -0.20782643   1 -0.777242512  1.2319750 -0.73311857
## [4,]         1  0.69939529   0 -0.357743663  0.1747664 -0.29754242
## [5,]         1  0.47420055   1 -1.083115262 -1.3005348 -0.68338808
## [6,]         1 -1.15038152   0 -0.539141451 -0.5159589  0.01971706##      rs1 rs2 rs3 rs4 rs5
## [1,]   0   0   0   0   0
## [2,]   0   0   0   0   0
## [3,]   0   0   0   0   0
## [4,]   0   0   0   0   0
## [5,]   0   0   0   0   0
## [6,]   0   1   0   0   0## [1] 0.2958143 1.7294107 0.4886318 1.0201730 1.0304183 1.9704552The example data generated by the preceding are available under
vignettes/vignette_data.
The COding-variant Allelic Series Test (COAST) is run using the
COAST function. By default, the output of
COAST includes a data.frame of counts showing the number of
alleles, variants, and carriers in each class that contributed to the
test, and a data.frame of p-values, with the omni test
denoting the final, overall p-value. Inspection of the component
p-values may be useful for determining which model(s) detected evidence
of an association. In the present case, the baseline count model
provided the greatest power.
results <- AllelicSeries::COAST(
  anno = anno,
  geno = geno,
  pheno = pheno,
  covar = covar
)
show(results)## Effect Sizes:
##         test beta    se
## 1       base 0.05 0.068
## 2       base 0.06 0.079
## 3       base 0.59 0.175
## 4        ind 0.31 0.256
## 5        ind 0.22 0.216
## 6        ind 0.66 0.218
## 7  max_count 0.06 0.048
## 8    max_ind 0.44 0.130
## 9  sum_count 0.07 0.031
## 10   sum_ind 0.19 0.058
## 
## 
## Counts:
##   anno alleles variants carriers
## 1    1     186      164       84
## 2    2     127      110       74
## 3    3      27       26       23
## 
## 
## P-values:
##           test   type     pval
## 1     baseline burden 6.62e-03
## 2          ind burden 9.11e-03
## 3    max_count burden 2.32e-01
## 4      max_ind burden 8.15e-04
## 5    sum_count burden 1.48e-02
## 6      sum_ind burden 1.09e-03
## 7 allelic_skat   skat 6.83e-02
## 8         omni   omni 4.68e-03The effect sizes data.frame is accessed via:
##         test       beta         se
## 1       base 0.05309214 0.06836091
## 2       base 0.06427266 0.07917711
## 3       base 0.58612733 0.17472435
## 4        ind 0.30597919 0.25595872
## 5        ind 0.21943023 0.21607933
## 6        ind 0.66475646 0.21829831
## 7  max_count 0.05719340 0.04783470
## 8    max_ind 0.43655519 0.13041078
## 9  sum_count 0.07437087 0.03052869
## 10   sum_ind 0.19038906 0.05828998the counts data.frame via:
##   anno alleles variants carriers
## 1    1     186      164       84
## 2    2     127      110       74
## 3    3      27       26       23and the p-values data.frame via:
##           test   type        pval
## 1     baseline burden 0.006618953
## 2          ind burden 0.009107121
## 3    max_count burden 0.231834498
## 4      max_ind burden 0.000815325
## 5    sum_count burden 0.014846685
## 6      sum_ind burden 0.001089858
## 7 allelic_skat   skat 0.068349669
## 8         omni   omni 0.004683243Effect size estimation for ultra-rare variants is challenging due to
the scarcity of carriers on which to base the estimate. Following works
such as SAIGE-GENE+, we
recommend collapsing variants with a minor allele count (MAC) \(\leq 10\) into an aggregate variant,
separately within each annotation category. The
CollapseGeno utility is provided for this purpose. The
min_mac specifies the threshold for retention as an
individual variant (e.g. min_mac = 11 will collapse
variants with a MAC \(\leq 10\)). The
output is a list containing the annotation vector anno and
genotype matrix geno after collapsing. In addition, a
mapping vars is provided showing which variants were
collapsed to create each aggregate variant. Note that:
CollapseGeno will change the order of the variants
in the genotype matrix.
No aggregate variant will be created within annotation categories where no variants have a MAC below the threshold.
Collapsing ultra-rare variants does not guarantee that the
resulting aggregate variant will itself have a MAC \(\geq\) the threshold. To ensure that only
variants with a minimum MAC enter the association test, a
min_mac threshold should also be supplied to
COAST.
set.seed(102)
# Generate data.
n <- 1e3
data <- AllelicSeries::DGP(
  n = n,
  snps = 10,
  prop_anno = c(1, 1, 1) / 3
)
# Collapse ultra-rare variants.
collapsed <- AllelicSeries::CollapseGeno(
  anno = data$anno, 
  geno = data$geno,
  min_mac = 11
)
# Variants collapsed to form each aggregate variant.
head(collapsed$vars)##   anno            vars
## 1    1     rs1;rs4;rs5
## 2    2             rs8
## 3    3 rs2;rs3;rs7;rs9COAST was originally intended to operate on the benign
missense variants, damaging missense variants, and protein truncating
variants within a gene, but it has been generalized to allow for an
arbitrary number of discrete annotation categories. The following
example simulates and analyzes data with 4 annotation categories. The
main difference when analyzing a different number of annotation
categories is that the weight vector should be specified,
and should have length equal to the number of possible
annotation categories. COAST will run, albeit with a
warning, if there are possible annotation categories to which no
variants are assigned (e.g. a gene contains no PTVs).
set.seed(102)
# Generate data.
n <- 1e2
data <- AllelicSeries::DGP(
  n = n,
  snps = 400,
  beta = c(1, 2, 3, 4) / sqrt(n),
  prop_anno = c(0.4, 0.3, 0.2, 0.1),
  weights = c(1, 1, 1, 1)
)
# Run COAST-SS.
results <- AllelicSeries::COAST(
  anno = data$anno,
  covar = data$covar,
  geno = data$geno,
  pheno = data$pheno,
  weights = c(1, 2, 3, 4)
)
show(results)## Effect Sizes:
##         test  beta    se
## 1       base  0.01 0.065
## 2       base  0.23 0.062
## 3       base  0.19 0.096
## 4       base  0.61 0.116
## 5        ind -0.10 0.253
## 6        ind  0.53 0.205
## 7        ind  0.31 0.169
## 8        ind  0.96 0.186
## 9  max_count  0.18 0.035
## 10   max_ind  0.47 0.090
## 11 sum_count  0.11 0.018
## 12   sum_ind  0.19 0.035
## 
## 
## Counts:
##   anno alleles variants carriers
## 1    1     194      169       86
## 2    2     159      131       78
## 3    3      71       61       51
## 4    4      41       39       31
## 
## 
## P-values:
##           test   type     pval
## 1     baseline burden 9.67e-10
## 2          ind burden 5.86e-07
## 3    max_count burden 4.26e-07
## 4      max_ind burden 1.65e-07
## 5    sum_count burden 5.37e-10
## 6      sum_ind burden 8.24e-08
## 7 allelic_skat   skat 3.27e-06
## 8         omni   omni 4.11e-09apply_int = TRUE applies the rank-based inverse normal
transformation from the RNOmni package.
This transformation is expected to improve power for phenotypes that
have a skewed or kurtotic (e.g. long-tailed) distribution. It is applied
by default in the case of continuous phenotype, and is ignored in the
case of a binary phenotype.include_orig_skato_all = TRUE includes standard SKAT-O
applied to all variants as a component of the omnibus test, while
include_orig_skato_ptv = TRUE includes standard SKAT-O
applied to PTVs only. In cases where other annotation schemes are used,
the annotation for the PTV class can be specified via
ptv_anno. Including standard SKAT-O as a component of the
omnibus test can improve power to detect associations between the
phenotype and genes that may not be allelic series.AllelicSeries::COAST(
  anno = anno,
  geno = geno,
  pheno = pheno,
  covar = covar,
  include_orig_skato_all = TRUE,
  include_orig_skato_ptv = TRUE,
  ptv_anno = 3
)is_pheno_binary = TRUE is required to indicate that the
supplied phenotype is binary, and should be analyzed using a logistic
regression model.AllelicSeries::COAST(
  anno = anno,
  geno = geno,
  pheno = 1 * (pheno > 0),
  covar = covar,
  is_pheno_binary = TRUE
)min_mac is used to filter the variant set to those
containing a minimum minor allele count (MAC). The following example
filters to only those variants with a MAC of at least 2:return_omni_only = TRUE is used to return
p_omni only when the component p-values are not of
interest:AllelicSeries::COAST(
  anno = anno,
  geno = geno,
  pheno = pheno,
  covar = covar,
  return_omni_only = TRUE
)score_test = TRUE specifies the use of a score-type
allelic series burden test. The default of
score_test = FALSE specifies a Wald-type allelic series
burden test.weights specifies the relative phenotypic effects of
BMVs, DMVs, and PTVs. An increasing pattern such as the default setting
of weights = c(1, 2, 3) targets allelic series. Setting
weights = c(1, 1, 1) would target a genetic architecture
where all variants have equivalent expected magnitudes.