While R is best known as an environment for statistical computing, it is also a great tool for numerical analysis (optimization, integration, interpolation, matrix operations, differential equations, etc.). Here is a flavour of the capabilities that R offers for analysing data.

For a basic introduction, see the analysis section of the getting started page. For a more thorough overview of regression using R in an astronomical context, see this tutorial.

# Parameter optimization

• To find the minimum value of a function within some interval, use optimize (optimise is a valid alias):
```fun <- function(x) x^2 + x - 1                   # create function
curve(fun, xlim=c(-2, 1))                        # plot function
( res <- optimize(fun, interval=c(-10, 10)) )
points(res$minimum, res$objective)               # plot point at minimum
```
Now, let's say you want to find the x value of the function at which the y value equals some number (1.234, say):
```#--Define an auxiliary minimizing function:
fun.aux <- function(x, target) ( fun(x) - target )^2
( res <- optimize(fun.aux, interval=c(-10, 10), target=1.234) )
fun(res$minimum)   # close enough
```
Of course, there are 2 solutions in this case, as seen by plotting the function to be minimized:
```curve(fun.aux(x, target=1.234), xlim=c(-3, 2))
points(res$minimum, res$objective)
```
We can get the other solution by giving a skewed search interval (see ?optimize for how the start point is determined):
```res2 <- optimize(fun.aux, interval=c(-10, 100), target=1.234)  # force higher start value
points(res2$minimum, res2$objective)    # plot other minimum

#--Show target values plotted with original function:
curve(fun, xlim=c(-3, 2))
abline(h=1.234, lty=2)
abline(v=c(res$minimum, res2$minimum), lty=3)
```
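An alternative for this kind of inverse problem is uniroot, which finds where a function crosses zero within an interval (the interval endpoints must give opposite signs); a minimal sketch using the same fun and target value:

```r
fun <- function(x) x^2 + x - 1
# find x such that fun(x) == 1.234, by locating a zero of the difference:
res <- uniroot(function(x) fun(x) - 1.234, interval=c(0, 10))
res$root       # the solution on the positive side
fun(res$root)  # very close to 1.234
```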
• For more general-purpose optimization, use nlm, optim or nlminb (the last of which I've found to be the most robust).
• The CRAN task views webpage has a very thorough overview of R packages relating to optimization
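• As a quick taster of general-purpose optimization, here is a minimal sketch using optim (with its default Nelder-Mead method) to minimize the two-dimensional Rosenbrock test function, whose minimum lies at c(1, 1); the name rosen is just an illustrative choice:

```r
# Rosenbrock "banana" function: minimum value 0, attained at x = c(1, 1)
rosen <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
res <- optim(par=c(-1.2, 1), fn=rosen)   # Nelder-Mead is the default method
res$par            # close to c(1, 1)
res$convergence    # 0 indicates successful convergence
```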

# Integration, differentiation and differential equations

• To integrate a function numerically, use integrate (note that the function must be able to accept and return a vector):
```fun <- function(x, const) x^2 + 2*x + const
> integrate(fun, lower=0, upper=10, const=1)
443.3333 with absolute error < 4.9e-12
```
integrate evaluates the function over the specified range (lower to upper) by passing a vector of these values to the function being integrated. Note that any other arguments to fun must also be specified, as extra named arguments to integrate; provided all extra arguments are supplied in this way, the order of fun's arguments does not matter, apart from the one being integrated over:
```fun2 <- function(A, b, x) A*x^b    # "x" doesn't have to be the first argument
integrate(fun2, lower=0, upper=10, A=1, b=2)   # "A" & "b" are given explicitly
```
Now, let's say you wanted to integrate this function for a series of values of b
```bvals <- seq(0, 2, by=0.2)    # create vector of b values
fun2.int <- function(b) integrate(fun2, lower=0, upper=10, A=1, b=b)$value
fun2.int(0.5)                 # works for a single value of b
fun2.int(bvals)               # FAILS for a vector of values of b
```
To make it work, you need to vectorize the function, so that it cycles through the elements of the vector and evaluates the function for each one:
```fun2.intV <- Vectorize(fun2.int, "b")    # Vectorize "fun2.int" over "b"
fun2.intV(bvals)                         # returns a vector of values
```
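integrate also accepts infinite limits, which is handy for checking probability densities; a quick sketch using only base R:

```r
# the standard normal density should integrate to 1 over the whole real line:
( res <- integrate(dnorm, lower=-Inf, upper=Inf) )
# extra arguments (here mean & sd) are passed through in the same way as above:
res2 <- integrate(dnorm, lower=-Inf, upper=Inf, mean=5, sd=2)
res2$value   # also 1
```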
• To compute symbolically the derivative of a simple expression, use D (see ?deriv for more info):
```> D(expression(sin(x)^2 - exp(x^2)), "x")  # differentiate with respect to "x"
2 * (cos(x) * sin(x)) - exp(x^2) * (2 * x)
```
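The expression returned by D can be evaluated with eval, which gives an easy numerical sanity check against a finite-difference approximation; a small sketch:

```r
dfun <- D(expression(sin(x)^2 - exp(x^2)), "x")   # symbolic derivative
x <- 0.5
eval(dfun)                                        # evaluate derivative at x = 0.5
# compare with a central finite difference:
f <- function(x) sin(x)^2 - exp(x^2)
h <- 1e-6
(f(x + h) - f(x - h)) / (2 * h)                   # agrees closely
```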
• To solve differential equations, use the deSolve package. You can read a helpful introduction to deSolve in The R Journal (Volume 2/2).
```install.packages("deSolve")
library("deSolve")
library(help="deSolve")    # see information on package
```
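As a flavour of what deSolve can do, here is a sketch (assuming the deSolve package is installed) that solves the logistic growth equation dN/dt = r N (1 - N/K) with ode and compares the result against the known analytic solution:

```r
library(deSolve)
# logistic growth: dN/dt = r * N * (1 - N / K)
logistic <- function(t, y, parms) {
    with(as.list(c(y, parms)), list(r * N * (1 - N / K)))
}
times <- seq(0, 10, by=0.1)
out <- ode(y=c(N=1), times=times, func=logistic, parms=c(r=1, K=100))
head(out)
# compare with the analytic solution N(t) = K / (1 + (K - 1) * exp(-r * t)):
exact <- 100 / (1 + 99 * exp(-times))
max(abs(out[, "N"] - exact))   # small numerical error
```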

# Interpolating data

• An example of spline interpolation:
```fun <- function(x) sqrt(3) * sin(2*pi*x)   # function to generate some data
x <- seq(0, 1, length=20)
set.seed(123)                              # allow reproducible random numbers
y <- jitter(fun(x), factor=20)             # add a small amount of random noise
plot(y ~ x)                                # plot noisy data
lines(spline(x, y))                        # add splined data
```
Now, compare with the prediction from a smoothing spline:
```f <- smooth.spline(x, y)
lines(predict(f), lty=2)
```
Why not also add the best-fit sine curve predicted from a linear regression with lm:
```lines(x, predict(lm( y ~ sin(2*pi*x))), col="red")
```
Note that, by default, the predicted values are evaluated at the x positions of the raw data, so you can end up with rather coarse curves, as seen above. To get round this, work with functions for the splines, which can then be evaluated at more finely-spaced x values for plotting:
```fun.spline <- splinefun(x, y)
fun.smooth <- function(xx, ...) predict(smooth.spline(x, y), x=xx, ...)$y

plot(y ~ x)
curve(fun.spline, add=TRUE)             # } "curve" uses n=101 points by default
curve(fun.smooth, add=TRUE, lty=2)      # }  at which to evaluate the function

#--And add a smoother best-fit sine curve:
fun.sine <- function(X) predict(lm( y ~ sin(2*pi*x)), newdata=list(x=X))
curve(fun.sine, add=TRUE, col="red")

#--Finally, just for completeness, plot the original function:
curve(fun, add=TRUE, lty=3)
```
• A wider range of splines is available in the splines package, accessed via library("splines"), including B-splines, natural splines, etc.
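• For example, a natural-spline basis from the splines package (which ships with R) can be used directly in a regression formula; a quick sketch on noisy sine data like that above:

```r
library(splines)                          # part of the standard R distribution
set.seed(123)                             # allow reproducible random numbers
x <- seq(0, 1, length=50)
y <- sqrt(3) * sin(2*pi*x) + rnorm(50, sd=0.3)
m <- lm(y ~ ns(x, df=5))                  # natural cubic spline basis, 5 df
plot(y ~ x)
lines(x, fitted(m))                       # smooth spline fit through the noise
```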

# Matrix operations

• t transposes a matrix; %*% is the usual (inner) matrix multiplication; diag extracts the diagonal of a matrix (or constructs a diagonal matrix); and upper.tri & lower.tri return logical arrays indicating which elements belong to the upper/lower triangles.
• To evaluate the classic five numbers from linear least squares regression (sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)) using matrices:
```N <- 10                     # }
x <- 1:N; y <- 10:19        # } create some X & Y data
M <- cbind(n=1, x, y)       # combine into a matrix
M2 <- t(M) %*% M            # matrix multiplication with transposed version
res <- M2[! lower.tri(M2)]  # length of x/y & famous five numbers:
identical(res, c(N, sum(x), sum(x^2), sum(y), sum(x*y), sum(y^2)))
```
• To calculate the outer product of 2 arrays, use outer:
```x <- seq(-1, 1, length=200)
A <- outer(x, x, function(x, y) cos(x)^2 - sin(y)^2 )
require(lattice)
levelplot(A)
```
You can also use this to generate a grid of values:
```outer(LETTERS[1:3], 1:5, paste, sep="")
```
• Matrix crossproduct:
```x <- rnorm(1e7)
system.time( a1 <- drop(crossprod(x)))
system.time( a2 <- sum(x^2))
# -> matrix version faster
identical(a1, a2)   # check the answer is the same
```
• You can also use solve to solve a system of equations, eigen to calculate eigenvalues and eigenvectors of a matrix, as well as svd to compute the singular-value decomposition of a rectangular matrix.
• There is also a dedicated Matrix package, which can handle sparse as well as dense matrices, accessible via library(help="Matrix").
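• A brief sketch of solve and eigen in action, solving a small linear system and checking the eigendecomposition identity A v = λ v:

```r
A <- matrix(c(2, 1, 1, 3), nrow=2)   # small symmetric matrix
b <- c(1, 2)
x <- solve(A, b)                     # solve the linear system A x = b
A %*% x                              # recovers b

e <- eigen(A)
e$values                             # eigenvalues, largest first
# check A v = lambda v for the first eigenvector (should be ~ zero vector):
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]
```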

# Statistical Analysis in R

Work in progress...This will be just a very brief taster of some of the many things that R can do in the way of statistical analysis.

# Fast bootstrap resampling to estimate regression parameter errors

Bootstrap resampling is a very useful method to determine parameter error estimates. This section makes use of the boot R package, which can be loaded with library(boot) or require(boot).

• A demonstration using simple linear regression:
```set.seed(123)   # allow reproducible random numbers
N <- 20
A <- data.frame(x=1:N, y=rnorm(N, mean=1:N))   # create some data
plot(y ~ x, A)   # plot the data
m <- lm(y ~ x, A)   # fit linear model
abline(m)    # add best-fit model to plot as a line
summary(m)   # show model best fit values & standard errors

#--Create a simple function to return the best-fit coefficients for the model
#  fitted to a subset of the original data ("A"), given a vector of row
#  numbers for the data frame ("indices"). "indices" will be the same length
#  as "nrow(A)", and will be supplied by the "boot" function, using random
#  sampling with replacement (i.e. "sample(nrow(A), replace=TRUE)")

mystat <- function(A, indices) {
    m <- lm(y ~ x, A[indices, ])
    return(coef(m))
}

#--Demonstrate function:
> mystat(A, 1:nrow(A))   # same as "coef(m)"
(Intercept)           x
0.3100925   0.9839554

> set.seed(123)   # allow reproducible random numbers
> mystat(A, sample(nrow(A), replace=TRUE))   # result for a single resample
(Intercept)           x
0.6143148   0.9416338

#--Run full set of "N.boot" bootstrap resamples:
N.boot <- 500
set.seed(123)   # allow reproducible random numbers
b <- boot(A, mystat, R=N.boot)

#--Plot results of bootstrapping:
plot(b, index=1)   # intercept
plot(b, index=2)   # slope
# see "?plot.boot" for details

#--Now compare the standard errors on the model parameters from
#   the bootstrap resampling with those from the normal summary method:
> b   # print results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = A, statistic = mystat, R = N.boot)

Bootstrap Statistics :
original        bias    std. error
t1* 0.3100925  0.0139150827  0.44749693
t2* 0.9839554 -0.0007816746  0.03979363

## NB the bias is the difference between the mean of the N.boot
##  resample parameter values and the original best-fit model
##  parameter values, i.e. "apply(b$t, 2, mean) - coef(m)"

#--Now show the standard errors computed (see "?summary.lm"):
> coef(summary(m))
Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 0.3100925 0.46199915  0.6711972 5.106173e-01
x           0.9839554 0.03856694 25.5129205 1.389147e-15

#--Or, quantify the difference by comparing the ratios:
> apply(b$t, 2, sd) / coef(summary(m))[, 2] # close to 1 in each case
(Intercept)           x
0.9686099   1.0318067
```
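The boot package can also turn these resamples into confidence intervals via boot.ci; a self-contained sketch repeating the regression bootstrap above and extracting normal-approximation and percentile intervals for the slope:

```r
library(boot)
set.seed(123)                                  # allow reproducible random numbers
N <- 20
A <- data.frame(x=1:N, y=rnorm(N, mean=1:N))   # same data as above
mystat <- function(A, indices) coef(lm(y ~ x, A[indices, ]))
b <- boot(A, mystat, R=500)
boot.ci(b, index=2, type=c("norm", "perc"))    # confidence intervals for the slope
```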
• A quicker version of the above example, using the method described in Section 4.3.1 from Chambers & Hastie, 1992 (see References section in ?lm for book details). This method exploits the fact that lm does a certain amount of initial processing prior to the actual regression (using lm.fit), and this represents a substantial overhead that need only be performed once (and need not be replicated during each bootstrap resampling iteration).
```# following Section 4.3.1 from Chambers & Hastie (1992):
m <- lm(y ~ x, A, x=TRUE, y=TRUE)  # "x=y=TRUE" returns extra data for lm.fit

mystat.fast <- function(dummy, i, model) coef(lm.fit(model$x[i, ], model$y[i]))

require(boot)
set.seed(123)   # allow reproducible random numbers
system.time(slow <- boot(A, mystat, 1e4))
set.seed(123)
system.time(fast <- boot(A, mystat.fast, 1e4, model=m))   # roughly 10x faster

#--Check results are identical:
slow
fast

#--Formal check if two objects are identical:
# the only differences reported are in components 6 & 8, which are due to the
# different statistic function & its name (i.e. "mystat" vs. "mystat.fast"),
# which are stored in the object returned by "boot":
identical(slow, fast)    # not completely identical
all.equal(slow, fast)    # only differences due to different mystat functions
```
• Note that there is no point having a very large number of bootstrap samples compared to the number of fitted values (i.e. the number of rows in the data frame), since the latter ultimately becomes the limiting factor in the accuracy of the recovered parameter error estimates.

• An example using non-linear regression (nls)
```#--Create some non-linear data:
N <- 20
set.seed(123)
B <- data.frame(x=1:N, y=(4 * log10(1:N)) + rnorm(N, mean=2, sd=0.2))
plot(y ~ x, B) # plot the data

#--Fit the non-linear model
m <- nls(y ~ a * log10(x) + b, data=B, start=list(a=1, b=1))
lines(B$x, fitted(m), lty=2)    # plot best-fit model values as a dashed line

#--A better way of plotting the best-fit model (as a smooth curve):
curve(predict(m, newdata=list(x=x)), add=TRUE, col="red")

summary(m)   # summarise the best-fit parameters and their errors etc.

#--There is no equivalent of the fast "lm.fit" shortcut for nls,
#  so set up the basic function to calculate the bootstrapped fit:
mystat <- function(A, indices) {
    m <- nls(y ~ a * log10(x) + b, data=A[indices, ], start=list(a=1, b=1))
    return(coef(m))
}

#--Run full set of "N.boot" bootstrap resamples:
N.boot <- 500
set.seed(123)
require(boot)
b <- boot(B, mystat, R=N.boot)

#--Plot results of bootstrapping:
# (NB note significant non-normal distribution of values;
#  i.e. right panel quantile-quantile plot values deviate from a straight line)
plot(b, index=1)    # parameter "a"
plot(b, index=2)    # parameter "b"

#--Now compare the standard errors on the model parameters from
#   the bootstrap resampling with those from the normal summary method:
> b    # print results of bootstrap

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = B, statistic = mystat, R = N.boot)

Bootstrap Statistics :
original      bias    std. error
t1* 3.986804 -0.02582234   0.1361342
t2* 2.040456  0.02679389   0.1332876

> summary(m) # print standard errors (see "?summary.nls")

Formula: y ~ a * log10(x) + b

Parameters:
Estimate Std. Error t value Pr(>|t|)
a   3.9868     0.1299   30.70  < 2e-16 ***
b   2.0405     0.1275   16.01 4.33e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1998 on 18 degrees of freedom

Number of iterations to convergence: 1
Achieved convergence tolerance: 1.234e-07

#--Or, quantify the difference by comparing the ratios:
apply(b$t, 2, sd) / coef(summary(m))[, 2]    # close to 1 in each case
a        b
1.048229 1.045583
```

# Permutation resampling to estimate p-values

Permutation resampling is a very useful method to estimate the p-value of a statistical test in cases that cause problems for conventional estimates. One example is the Kolmogorov-Smirnov test with tied data. The example here uses the 2-sample KS test to check whether two sets of numbers are drawn from the same distribution; the code is based on the example from Rizzo, 2008 (Statistical Computing with R). Under the null hypothesis the 2 vectors (x and y) are drawn from the same distribution, so the test works by pooling the data and repeatedly drawing 2 vectors, with lengths equal to those of x and y, from random permutations of the pooled sample, i.e. sampling without replacement (unlike bootstrapping, which uses sampling with replacement).

• Function to calculate the permutation p-value:
```PermTest <- function(x, y, R=999, testfun=ks.test) {
    z <- c(x, y)  # pooled sample
    myfun <- function(a, b) suppressWarnings(unname(testfun(a, b)$statistic))
    DoIt <- function() {
        i <- sample(length(z), length(x))
        myfun(z[i], z[-i])  # z[-i] is everything *except* the "i" elements of z
    }
    pstats <- replicate(R, DoIt())
    stat <- myfun(x, y)
    c("p-value" = mean(c(stat, pstats) >= stat))
}
```
• Now create some simulated counts data, which will have lots of ties:
```N <- 100
set.seed(123)    # allow reproducible random numbers
x <- rpois(N, lambda=1)
set.seed(123)    # allow reproducible random numbers
y <- rpois(N, lambda=1.5)
## Run KS test:
> ks.test(x, y)

Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.16, p-value = 0.1545
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-values will be approximate in the presence of ties

## Compare to the permutation result:
> PermTest(x, y)
p-value
0.04
```

In this case the conventional (but incorrect) p-value suggests that x and y are drawn from the same distribution, but the permutation test is able to determine that they are not (i.e. p-value < 0.05). Note the warning message reported by ks.test; this is suppressed inside PermTest using suppressWarnings.

• Now, to check the permuted p-values against the KS test exact p-values, run some simulations of data without ties:
```## Function to compare the 2 methods for calculating the p-value:
CompareTest <- function(i, testfun=ks.test, N=100) {
    # "i" is the (unused) iteration index supplied by mclapply below
    x <- rnorm(N); y <- rnorm(N)
    p.exact <- suppressWarnings(testfun(x, y)$p.value)
    p.perm <- unname(PermTest(x, y))
    c(p.exact = p.exact, p.perm = p.perm)
}
## Run the tests across multiple cores, to speed up the calculation:
require(parallel)
system.time( res <- mclapply(1:1000, CompareTest) )

## Combine separate list elements into a data frame:
res <- as.data.frame(do.call(rbind, res))

## Plot permuted vs. exact p-values & add line of equality:
plot(p.perm ~ p.exact, data=res); abline(0, 1)

## Fit least squares regression:
fm <- lm( p.perm ~ p.exact, data=res)
> coef(summary(fm))
Estimate   Std. Error     t value  Pr(>|t|)
(Intercept) 0.0006785381 0.0009002172   0.7537493 0.4511775
p.exact     1.0017142458 0.0014530249 689.3992438 0.0000000

## - you can see that the slope is not significantly different from 1

## Check for differences in the locations of the exact & permuted
## p-values using the non-parametric Kruskal-Wallis rank sum test:
> kruskal.test( values ~ ind, data=stack(res))

Kruskal-Wallis rank sum test

data:  values by ind
Kruskal-Wallis chi-squared = 0.1346, df = 1, p-value = 0.7137
```

So, there is no indication of any difference in the distributions of the p-values estimated by permutation compared to those computed exactly.

For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.

Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.