library(andrews)
#> See the package vignette with `vignette("andrews")`
The andrews routine, implemented by Jaroslav Myslivec,
has been rewritten and extended. The old routine is still available as
andrews0 and the rewritten routine should give the same
graphics with the same parameter values.
op <- par(mfrow=c(1,2))
andrews0(iris, main="andrews0")
andrews(iris, main="andrews")
par(op)
Some more types of curves has been added. To see more comparisons of
andrews and andrews0, run interactively
zzz()
Full parameter list for andrews is:
andrews (df, type = 1, clr = NULL, step = 100, ymax = 10, # old parameters
alpha = NULL, palcol = NULL, lwd = 1, lty = "solid", ...) # new parameters
The parameters in the call to andrews that come in
... are passed to plot to draw the base
window:
x, and y cannot be set by
the user through ....main, sub,
xlim, ylim, xlab,
ylab, axes, asp and possibly
other parameters.Note that the setting of ylim has priority over the
setting of ymax.
The height of the window can be scaled by setting ymax
or ylim.
ymax is not set then ylim=c(-10,10) is
used.ymax is set to NA, then
ylim=c(-lim,lim), where the maximum height lim
is calculated from the curves. Note that this option doubles the
calculation time.ylim directly.op<-par(mfrow=c(1,3))
andrews(iris, main="no ymax")
andrews(iris, ymax=NA, main="ymax=NA")
andrews(iris, ylim=c(-1,3), main="ylim=c(-1,3)")
par(op)
The width of the window can be scaled by setting xlim,
which is particularly useful for type==3 or
type==5.
andrews(iris, type=3, xlim=c(0,6*pi), ymax=NA)
The parameters lty and lwd determine the
linetype and width. The length of the parameters must be either
1 or nrow(df). In the first case the values
are applied to each curve, in the second case lty[i] and
lwd[i] are used for the ith line.
andrews(iris, ymax=NA, lwd=3)
The curves can be coloured individually if
length(clr)==nrow(df).
andrews(iris, ymax=NA, clr=rainbow(nrow(iris)))
By default, the curves are coloured by a variable of the data frame.
andrews(iris, ymax=NA, clr=5) # iris$Species
If is.numeric(df[,clr]) is FALSE then
rainbow(nuv) is used for the colours, where
nuv is the number of unique values
(nuv = length(unique(df[,clr])). You can use
palcol to select another palette that gives
nuv colours.
andrews(iris, ymax=NA, clr=5, palcol=hcl.colors) # iris$Species
If is.numeric(df[,clr]) is TRUE, the
variable color(df[,clr]) (scaled to the interval \([0, 1]\)) is used. The default is
function(v) { hsv(1,1,v,1) }, but any other colouring
function can be used.
andrews(iris, ymax=NA, clr=1) # iris$Sepal.Length
andrews(iris, ymax=NA, clr=1, palcol=function(v) { gray(v) }) # iris$Sepal.Length
or alternatively
andrews(iris, ymax=NA, clr=1, palcol=gray) # iris$Sepal.Length
If you plot many curves then the curve details disappear by
overplotting. If the output device supports the alpha channel, the
parameter alpha (\(0<\alpha<1\)) is applied to all
curves (length(alpha)==1) or one for each observation
(length(alpha)==nrow(df)).
andrews(iris, ymax=NA, alpha=0.1)
andrews(iris, ymax=NA, clr=5, alpha=0.1, lwd=2)
All types of Andrews curves can be written as
\[ f_i(t) = \sum_{k=1}^p x_{ik} g_k(t). \]
Currently, six types of curves are implemented, which are determined
by the parameter type:
With deftype the list of types can be queried by
deftype(1)
#> $fun
#> function (n, t)
#> {
#> n <- as.integer(if (n < 1) 1 else n)
#> m <- matrix(1/sqrt(2), nrow = length(t), ncol = n)
#> if (n > 1) {
#> for (i in 2:n) {
#> m[, i] <- if (i%%2)
#> cos((i%/%2) * t)
#> else sin((i%/%2) * t)
#> }
#> }
#> m
#> }
#> <bytecode: 0x563c84433f98>
#> <environment: 0x563c8443fda0>
#>
#> $range
#> [1] -3.141593 3.141593
A curve type consists of a function function(n, t) which
computes a matrix of \((g_1(t), \ldots,
g_n(t))\) and a default range for plotting, usually \(-\pi\) till \(\pi\).
A new curve type and a default plot range can be implemented by number or name:
deftype("sine", xlim = c(-pi, pi), function(n, t) {
n <- as.integer(if (n<1) 1 else n)
m <- matrix(NA, nrow=length(t), ncol=n)
for (i in 1:n) m[,i] <- sin(i*t)
m
})
andrews(iris, "sine", ymax=NA, clr=5)
The representation of a set of multivariate observations as curves was proposed by Andrews (1972). For a set of orthonormal basis functions \(g_i(t)\) with \(i=1, 2, 3, \ldots\) in an interval \([a, b]\) it holds:
\[\int_a^b g_i(t) g_j(t) dt = \delta_{ij} = \begin{cases} 1 & \text{ if } i=j \\ 0 & \text{ if } i\neq j \end{cases}\]
If one creates from multivariate observations \(x_i=(x_{i1}, \ldots, x_{ip})\) with \(i=1,2,\ldots,n\) a function
\[ f_i(t) = \sum_{k=1}^p x_{ik} g_k(t) \]
then it follows:
Property 1 - The integral of the squared difference of two curves is equal to the squared Euclidean distance between the data points \(x_i\) and \(x_j\): \[\begin{align*} \int_a^b (f_i(t)-f_j(t))^2 dt & = \int_a^b \left(\sum_{k=1}^p (x_{ik}- x_{jk}) g_k(t)\right)^2 dt \\ &= \sum_{k=1}^p \sum_{l=1}^p (x_{ik}- x_{jk}) (x_{il}- x_{jl}) \underbrace{\int_a^b g_k(t)g_l(t) dt}_{= \delta_{kl}}\\ &= \sum_{k=1}^p (x_{ik}- x_{jk})^2 \end{align*}\]
This makes the Andrews curves useful for visualising multivariate outliers and clusters in the data.
None of the basis functions of the six types of Andrews curves form
an orthonormal system, but two (type %in% c(1,2)) of them
form an orthogonal system:
\[\int_{-\pi}^{\pi} g_i(t)g_j(t) dt = \begin{cases} \pi & \text{ if } i=j \\ 0 & \text{ if } i\neq j \end{cases}\] Thus, property 1 changes to
\[ \int_{-\pi}^{\pi} (f_i(t)-f_j(t))^2 dt = \pi \sum_{k=1}^p (x_{ik}- x_{jk})^2\]
Property 2 - The average curve is the mean value of the observations:
\[\begin{align*} \bar{f}(t) &= \frac1n \sum_{i=1}^n f_i(t) =\frac1n \sum_{i=1}^n \sum_{k=1}^p x_{ik} g_k(t)\\ &=\sum_{k=1}^p g_k(t) \left(\frac1n \sum_{i=1}^n x_{ik}\right)\\ &=\sum_{k=1}^p g_k(t) \bar{x}_k \end{align*}\]
Property 3 - For a given \(t\), the variable \(P_t=\sum_{k=1}^p X_k w_{kt}\) can be considered as a univariate projection of the variables \(X_1\), \(X_2\), …, \(X_p\). If \(Var(X_k)=\sigma\) and \(Cov(X_k,X_l)=\delta_{kl}\) hold then
\[ Var(P_t)= Var\left(\sum_{k=1}^p X_k
w_{kt}\right) = \sum_{k=1}^p w_{kt}^2 Var(X_k)\] For the curves
with type %in% c(1,2,6) \(Var(P_t)\approx \sigma^2 p/2\) holds, since
\(\sin^2(t)+\cos^2(t)=1\). For these
curves, however, we only cover a subset of the possible one-dimensional
projections.
For type %in% c(3,5), the choice of \(\sqrt{p_i}\) ensures that all possible
one-dimensional projections are used, but the range \([a,b]\) is not fixed. Usually \(a=0\) is chosen and \(b\) “appropriate” (in practice
\(b=4\pi\)).