The package fitPoly
includes several helper functions to
input data from different SNP array providers as well as utilities to
analyse the quality of genotype scores obtained by the main model of
fitPoly
. This vignette shows how to use the second set of
functions and explains their output; another vignette deals with the data import and preparation.
The package offers more functions than shown here, and the functions we
use have additional parameters; check the help for extra
information.
In this vignette we will take the output of fitPoly for one Full-Sib family and its parents, and analyze it:
Load the package and the scores data.frame produced by fitPoly:
Function fitMarkers
in package fitPoly
returns (a.o.) a file with dosage scores for all markers where a model
could be fitted. These data
are in data.frame scores. , of
which we show a part:
## marker MarkerName SampleName ratio P0 P1 P2 P3 P4 maxgeno maxP geno
## 637 5 RhK5_1000_451Q K001 0.343 0 1 0.000 0.000 0 1 1.000 1
## 638 5 RhK5_1000_451Q K002 0.520 0 0 1.000 0.000 0 2 1.000 2
## 639 5 RhK5_1000_451Q K003 0.341 0 1 0.000 0.000 0 1 1.000 1
## 640 5 RhK5_1000_451Q K004 0.361 0 1 0.000 0.000 0 1 1.000 1
## 641 5 RhK5_1000_451Q K007 0.609 0 0 0.999 0.001 0 2 0.999 2
## 642 5 RhK5_1000_451Q K008 0.345 0 1 0.000 0.000 0 1 1.000 1
Columns P0 … P4 show the probabilities that the dosage is nulliplex …
quadruplex. The last column, geno, is the dosage assigned; it is NA if
the probability of the most probable dosage is less than 0.99 (the
threshold used in the fitPoly
analysis in this case). In
the first part of this vignette we will only use the geno values, but at
some point the P-values will also be considered.
The initial check for segregation type is carried out by function
checkF1
:
# specify parental and F1 samples:
par1 <- "P540a1"
par2 <- c("P867a1","P867a2","P867b")
F1 <- levels(scores$SampleName)[substr(levels(scores$SampleName),1,1)=="K"]
chk1 <- checkF1(scores, par1, par2, F1,
polysomic=TRUE, disomic=TRUE, mixed=FALSE,
ploidy=4, outfile=NA)
#some parts of the result:
knitr::kable(chk1[1:4,1:14])
m | MarkerName | parent1 | parent2 | F1_0 | F1_1 | F1_2 | F1_3 | F1_4 | F1_NA | P540a1 | P867a1 | P867a2 | P867b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | RhK5_1000_451Q | 3 | 0 | 2 | 57 | 94 | 1 | 0 | 1 | 3 | 0 | 0 | 0 |
6 | RhK5_1005_2503P | 2 | 3 | 0 | 21 | 62 | 52 | 12 | 8 | 2 | 3 | 3 | 3 |
7 | RhK5_1005_2503Q | 2 | NA | 0 | 21 | 59 | 51 | 12 | 12 | 2 | 3 | 3 | 2 |
8 | RhK5_1007_811P | 3 | 2 | 0 | 12 | 53 | 53 | 13 | 24 | 3 | 2 | 2 | NA |
m | MarkerName | bestParentfit | frqInvalid_bestParentfit | Pvalue_bestParentfit | matchParent_bestParentfit | qall_mult |
---|---|---|---|---|---|---|
5 | RhK5_1000_451Q | 11_1 | 0.0195 | 0.0026 | Yes | 0.1428 |
6 | RhK5_1005_2503P | 1331_1 | 0.0000 | 0.3054 | Yes | 0.8710 |
7 | RhK5_1005_2503Q | 1331_1 | 0.0000 | 0.3699 | OneOK | 0.4032 |
8 | RhK5_1007_811P | 1551_1 | 0.0000 | 0.8971 | Yes | 0.5108 |
This is a rose (Rosa x hybrida) dataset and we expect that
both polysomic and disomic inheritance may occur. By setting both
corresponding parameters to TRUE we consider both sets of segregation
types. The result is a data.frame with one row for each marker. The
first columns list the marker name, the inferred parental dosages, the
observed F1 segregation and the observed parental sample dosages. The
bestParentfit
column lists the selected segregation type
and the next 3 columns list 3 parameters describing how well this
segregation fits with the observations. The qall_mult
column is an overall quality assessment, from poor (0) to perfect (1).
There are more columns in chk1
, and with other parameter
settings even more columns will be produced; for a full explanation
refer to the help. For this demo we set outfile
to NA, but
we could also have specified a filename. In that case a tab-separated
text file would be saved which can be imported into Excel for easy
viewing.
The segregation codes need some explanation. They consist of 2 parts: the first part has the segregation ratios (e.g. 11 means 1:1, 141 means 1:4:1 and so on; a single 1 means no segregation, all F1 have the same dosage). An underscore separates the two parts and the second part is a single number specifying the lowest dosage occurring in the segregation. For example: 121_0 means 1 nulliplex : 2 simplex : 2 duplex; 11_3 means 1 triplex : 1 quadruplex. There is one tetrasomic segregation type 1:8:18:8:1 where one of the ratios is larger than 9 and cannot be represented by a digit; in this and similar cases we use letters A-Z to represent ratios 10-35, so this segregation is represented as 18I81_0, with the I meaning 18. For higher ploidy levels this system is extended, see the help for a full explanation.
When not all dosages are present among the samples it is possible
that fitPoly
assigns the peaks in the signal ratio
histogram to the wrong dosages. For example, if parents in reality have
dosages 0 and 1, also the F1 progeny will (almost) only have dosages 0
or 1. If no other samples are analyzed, only two of the five dosages are
present, and these might be mis-interpreted as beings dosages 1 and 2
instead of 0 and 1. These mis-assignments are called “shifts”. fitPoly
checks against such possible shifts whether or not parents and F1
populations are defined but it doesn’t catch them all.
The function correctDosages
is provided that based on
the result of checkF1 checks if shifting the dosages up or down by 1
would be worth to try. The result is a set of suggested shifts, that
should then again (using checkF1) be checked for their actual fit.
cordos <- correctDosages(chk1, scores, par1, par2, ploidy=4,
polysomic=TRUE, disomic=TRUE, mixed=FALSE)
# show the nonzero shifts:
cordos[cordos$shift!=0,]
## MarkerName segtype parent1 parent2 shift fracNotOk ParNA
## 10 RhK5_10102_270P 11_2 2 NA 1 0.000000000 1
## 15 RhK5_1011_3044Q 11_1 1 2 -1 0.000000000 0
## 22 RhK5_101_3700P 11_1 1 NA -1 0.012820513 1
## 25 RhK5_1020_824Q 11_1 1 2 -1 0.000000000 0
## 27 RhK5_1021_984Q 11_1 1 2 -1 0.000000000 0
## 35 RhK5_10334_506Q 11_2 NA 2 1 0.000000000 1
## 45 RhK5_10484_448Q 11_1 1 NA -1 0.007936508 1
## 48 RhK5_1054_473P 11_1 2 1 -1 0.000000000 0
## 55 RhK5_10619_199Q 11_1 1 2 -1 0.000000000 0
## 63 RhK5_10728_511P 11_1 2 1 -1 0.012658228 0
## 92 RhK5_11250_511P 11_1 2 1 -1 0.013071895 0
## 99 RhK5_11334_499Q 11_1 1 2 -1 0.000000000 0
## 101 RhK5_1145_1123Q 11_1 1 2 -1 0.000000000 0
The column shift
contains the suggested shifts to try: a
0 means no shift suggested, a 1 means: shift all dosages assigned by
fitPoly up (0 becomes 1, 1 becomes 2, 2 becomes 3, 3 and 4 become 4 (3
and 4 are merged)). A shift of -1 means that all dosages should be
decreased by 1 (merging previous dosages 0 and 1 into dosage 0). Note
that this is a rather simplistic way to check for possibly shifted
markers. If there are more samples available than only the F1 population
and its parents, those might also be informative about possible shifts.
Also, sometimes both the unshifted and the shifted version of the marker
might be acceptable. In such cases it may be useful to keep both
versions and decide later on (after mapping or haplotyping steps) which
version fits best.
The next step is to run checkF1
again with as additional
parameter these shifts; it will then perform the indicated dosage shift
on all samples (parental and F1) before selecting the most likely
segregation type and quality assessment:
#select the markers where a shift should be tried:
cordos <- cordos[cordos$shift != 0,]
subscores <- scores[scores$MarkerName %in% as.character(cordos$MarkerName),]
chk2 <- checkF1(subscores, par1, par2, F1,
polysomic=TRUE, disomic=TRUE, mixed=FALSE,
ploidy=4, outfile=NA, shiftmarkers=cordos)
Data.frame chk2
has the same format as ck1 but has an
additional column “shift” at the end. For the next steps it is useful to
merge chk1
and chk2
. In order to do that, chk1
needs also to get this extra column:
The data.frame chk
now has all unshifted and shifted
versions of the markers. Each of the (versions of the) markers has been
checked for its quality, which is (a.o.) expressed in its value of
qall_mult. We would like to select only those (versions of) markers that
are of sufficient quality. As a rule of thumb, all scores of markers
with qall_mult==0 are bad and all others might be acceptable, but
perhaps we can raise the threshold for qall_mult a bit for a more
stringent selection, without losing valuable markers. The approach is to
study XY-scatterplots for several markers at different levels of
qall_mult and see from which level good (well-scored) markers appear.
This approach is similar to that used to exclude markers based on total
signal intensity, as discussed in the vignette on data import and
preparation. In order to produce scatterplots with dots colored
according to assigned dosages we first combine the X and Y signals with
the geno (dosage) values.
data("XYdat")
XYgeno <- combineFiles(XYdata=XYdat, scores=scores)
# define qall levels of 0, 0.05, 0.10 up to 1 where we want to inspect some SNPs:
qall.levels <- seq(0, 1, by=0.05)
# select six SNPs with qall values near each of these values and draw their plots:
chkx <- selMarkers_qall(chk, qall.levels, mrkperlevel=6)
drawXYplots(dat=XYgeno, markers=chkx,
out="your-path-and filename",
genocol=get.genocol(ploidy=4), sample.groups=list(par1, par2),
groups.col=c("red", "blue"), ploidy=4)
This generates a series of pages, each with 6 plots; here is part of
one such page:
The drawXYplots
command here is quite complex. A full
explanation of all parameters can be found in the help. Some things to
note here: genocol
sets the (pastel) colors assigned to
each of the dosages, from red (nulliplex) through blue (duplex) to green
(quadruplex), and grey for samples with missing dosage score. The
parents are highlighted in red and blue via parameters sample.groups and
groups.col. The two plots show a duplex x duplex (1:8:18:8:1) and a
simplex x simplex (1:2:1) marker. A conclusion from this series of plots
is that already at qall_mult==0.05 most of the markers appear to be
well-scored. A next step could be to try to find a qall_mult threshold
between 0 and 0.05 to separate the well-scored and badly scored markers;
we leave that for the reader and for now we will accept all marker
(versions) with qall > 0, and save them to a dosage file:
#select only the markers with qall_mult > 0:
chk <- chk[!is.na(chk$qall_mult) & chk$qall_mult > 0,]
#write the dosage file, applying the shifts as listed in chk:
dosages <- writeDosagefile(chk, scores, par1, par2, F1,
polysomic=TRUE, disomic=TRUE, mixed=FALSE,
ploidy=4, scorefile=NA)
## batch 1 of 1
MarkerName | segtype | P540a1 | P867a1 | P867a2 | P867b | parent1 | parent2 | K001 | K002 | K003 | K004 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | RhK5_10102_270Pn | 11_2 | 2 | 3 | NA | 3 | 2 | 3 | 2 | 2 | 2 | 2 |
11 | RhK5_10102_270Ps | 11_3 | 3 | 4 | NA | 4 | 3 | 4 | 3 | 3 | 3 | 3 |
12 | RhK5_10107_247Pn | 121_0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
13 | RhK5_10107_247Qn | 121_0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
14 | RhK5_1011_3044Pn | 11_0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
Normally we would specify a filename for scorefile
, to
get the results also as a tab-separated text file. The dosages
data.frame (and file) list the marker, segregation type, the scores of
the parental samples, the consensus/inferred parental dosages and the
dosages of the F1 individuals. If chk
contains a column
“shift”, as here, the shift is applied to the dosages as listed in the
score data.frame. In that case also an n
or s
is appended to the MarkerNames
to indicate they were not
shifted (shift==0) or shifted (shift!=0); in this way we avoid having
duplicated markers if both the nonshifted and the shifted version passed
the quality filter.
Here we see that the two probes (P and Q) of marker RhK5_101_3700
were initially scored with a difference of 1 in the dosages (versions Pn
vs Qn, where n means non-shifted). Function correctDosages
found that the P probe should possibly be shifted and this resulted in
version Ps (s for shifted). Versions Ps and Qn generally have the same
scores, and both fit segregation type 11_0 (1:1 nulliplex:simplex).
Similar for the non-shifted and shifted versions of probe P of marker
RhK5_10102_270; for that marker the Q probe was rejected.
The Affymetrix Axiom array needs only one probe to detect a SNP. This
means that the same SNP can be detected by two different probes,
flanking the SNP on either side. If both probes are present on the array
they are initially treated as two separate markers by the array software
and fitPoly
. If all is well the two probes should produce
the same results (since they detect the same SNP) but due to various
factors a probe may fail entirely or for a subset of the samples.
Results that are identical between the probes validate each other, and
missing values from one probe might be filled in by the other. Also,
fitPoly
might produce shifted dosages for one or both
probes and a comparison might help to decide the correct version.
In order to compare the results of both probes we use function
compareProbes
. This function assumes that the
MarkerNames
of the two probes for the same SNP are
identical except for a short suffix; by default (and in the example)
these suffixes are P and Q.
cpp <- compareProbes(chk, scores, parent1=par1, parent2=par2, F1=F1,
polysomic=TRUE, disomic=TRUE, mixed=FALSE,
ploidy=4, compfile=NA, combscorefile=NA)
## batch 1 of 1
SNPname | segtypePn | qallPn | segtypePs | qallPs | segtypeQn | qallQn | segtypeQs | qallQs | segtypeRnn | segtypeRns | segtypeRsn | segtypeRss | countP | countQ | countR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RhK5_1000_451 | NA | NA | NA | NA | 11_1 | 0.1428 | NA | NA | NA | NA | NA | NA | 0 | 1 | 0 |
RhK5_1005_2503 | 1331_1 | 0.8710 | NA | NA | 1331_1 | 0.4032 | NA | NA | 1331_1 | NA | NA | NA | 1 | 1 | 1 |
RhK5_1007_811 | 1551_1 | 0.5108 | NA | NA | 1551_0 | 0.5000 | NA | NA | NA | NA | NA | NA | 1 | 1 | 0 |
RhK5_100_3510 | 11_0 | 1.0000 | NA | NA | 11_0 | 0.8333 | NA | NA | 11_0 | NA | NA | NA | 1 | 1 | 1 |
RhK5_100_531 | 141_2 | 0.6452 | NA | NA | 141_2 | 0.7742 | NA | NA | 141_2 | NA | NA | NA | 1 | 1 | 1 |
RhK5_10102_270 | 11_2 | 0.6845 | 11_3 | 0.6845 | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 0 |
MarkerName | segtype | P540a1 | P867a1 | P867a2 | P867b | parent1 | parent2 | K001 | K002 | K003 | K004 |
---|---|---|---|---|---|---|---|---|---|---|---|
RhK5_1000_451Qn | 11_1 | 3 | 0 | 0 | 0 | 3 | 0 | 1 | 2 | 1 | 1 |
RhK5_1005_2503Pn | 1331_1 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 1 | 3 | 1 |
RhK5_1005_2503Qn | 1331_1 | 2 | 3 | 3 | 2 | 2 | 3 | 2 | 1 | 3 | 1 |
RhK5_1005_2503Rnn | 1331_1 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 1 | 3 | 1 |
compareProbes
returns a list with two components
compstat
and combscores
, each a data.frame.
These are also saved as tab-separated text files if
compfile
and combscorefile
are specified.
Data.frame compstat
has an overview of the different
versions of each SNP (two probes, each unshifted and/or shifted). For
each of these (suffixed with P or Q for the probe and n or s for
(non-)shifted) the segregation type and qall_mult
quality
score is listed. If (a version of) the P and the Q probe are
sufficiently similar, a combined version of the SNP is produced which
gets the suffix R, and a further suffix nn
,
ns
, sn
or ss
depending on whether
the nonshifted or the shifted version of the P and Q probe was used for
merging. Also for such a merged (R) marker the segregation type is
given, and finally the number of versions of the P, Q and R markers. Two
probes are considered sufficiently similar to merge if they have the
same segregation type and if there are (by default) not more than 4%
conflicting sample scores. For more details and additional parameters
see the help.
The other component of the compareProbes
result is
data.frame combscores
. This lists the segregation type and
dosages for all markers (the original P and Q markers and the combined R
markers) in the same format as produced by writeDosagefile
(exercise 3). In the second table we can see that there is redundancy:
for the first SNP the P, Q and R markers are all present, but they are
almost completely identical (as expected, since a combined marker is
produced). If we are happy to accept the merging we need to remove the P
and Q marker data if an R marker is present. This is done by function
removeRedundant:
rr <- removeRedundant(compstat=cpp$compstat, combscores=cpp$combscores,
compfile=NA, combscorefile=NA)
knitr::kable(rr$combscores[1:4, 1:12])
MarkerName | segtype | P540a1 | P867a1 | P867a2 | P867b | parent1 | parent2 | K001 | K002 | K003 | K004 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | RhK5_1000_451Qn | 11_1 | 3 | 0 | 0 | 0 | 3 | 0 | 1 | 2 | 1 | 1 |
4 | RhK5_1005_2503Rnn | 1331_1 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 1 | 3 | 1 |
5 | RhK5_1007_811Pn | 1551_1 | 3 | 2 | 2 | NA | 3 | 2 | 3 | 2 | 3 | 2 |
6 | RhK5_1007_811Qn | 1551_0 | 2 | 1 | 0 | 1 | 2 | 1 | 2 | 1 | 2 | 1 |
removeRedundant
produces the same type of output as
compareProbes
. The compstat
data.frame is not
very interesting but the combined scores in data.frame
rr$combscores
are the result that we will want to use in
mapping studies and further genetic analyses.