While R is best known as an environment for statistical computing, it is also a great tool for numerical analysis (optimization, integration, interpolation, matrix operations, differential equations etc). Here is a flavour of the capabilities that R offers in analysing data.

For a basic introduction, see the analysis section of the getting started page. For a more thorough overview of regression using R in an astronomical context, see this tutorial here.

Numerical Analysis in R

Parameter optimization

  • To find the minimum value of a function within some interval, use optimize (optimise is a valid alias):
    fun <- function(x) x^2 + x - 1                   # create function
    curve(fun, xlim=c(-2, 1))                        # plot f unction
    ( res <- optimize(fun, interval=c(-10, 10)) )
    points(res$minimum, res$objective)               # plot point at minimum
    Now, let's say you want to find the x value of the function at which the y value equals some number (1.234, say):
    #--Define an auxiliary minimizing function:
    fun.aux <- function(x, target) ( fun(x) - target )^2
    ( res <- optimize(fun.aux, interval=c(-10, 10), target=1.234) )
    fun(res$minimum)   # close enough
    Of course, there are 2 solutions in this case, as seen by plotting the function to be minimized:
    curve(fun.aux(x, target=1.234), xlim=c(-3, 2))
    points(res$minimum, res$objective)
    We can get the other solution by giving a skewed search interval (see ?optimize for how the start point is determined):
    res2 <- optimize(fun.aux, interval=c(-10, 100), target=1.234)  # force higher start value
    points(res2$minimum, res2$objective)    # plot other minimum
    #--Show target values plotted with original function:
    curve(fun, xlim=c(-3, 2))
    abline(h=1.234, lty=2)
    abline(v=c(res$minimum, res2$minimum), lty=3)
  • For more general-purpose optimization, use nlm, optim or nlminb (which I've found to be the most robust)
  • The CRAN task views webpage has a very thorough overview of R packages relating to optimization

  • Integration, differentiation and differential equations

  • To integrate a function numerically, use integrate (note that the function must be able to accept and return a vector):
    fun <- function(x, const) x^2 + 2*x + const
    > integrate(fun, lower=0, upper=10, const=1)
    443.3333 with absolute error < 4.9e-12
    Integrate will evaluate the function over the specified range (lower to upper) by passing a vector of these values to the function being integrated. Note that any other arguments to fun must also be specified, as extra arguments to integrate, and that the order of the arguments of fun does not matter, provided all arguments are supplied in this way, apart from the one being integrated over:
    fun2 <- function(A, b, x) A*x^b    # "x" doesn't have to be the first argument
    integrate(fun2, lower=0, upper=10, A=1, b=2)   # "A" & "b" are given explicitly
    Now, let's say you wanted to integrate this function for a series of values of b
    bvals <- seq(0, 2, by=0.2)    # create vector of b values
    fun2.int <- function(b) integrate(fun2, lower=0, upper=10, A=1, b=b)$value
    fun2.int(bvals[1])            # works for a single value of b
    fun2.int(bvals)               # FAILS for a vector of values of b
    to make it work, you need to force vectorization of the function, so it can cycle piecewise through the elements of the vector and evaluate the function for each one:
    fun2.intV <- Vectorize(fun2.int, "b")    # Vectorize "fun2.int" over "b"
    fun2.intV(bvals)                         # returns a vector of values
  • To compute symbolically the derivative of a simple expression, use D (see ?deriv for more info):
    > D(expression(sin(x)^2 - exp(x^2)), "x")  # differentiate with respect to "x"
    2 * (cos(x) * sin(x)) - exp(x^2) * (2 * x)
  • To solve differential equations, use the deSolve package. You can read a helpful introduction to the deSolve package in Volume 2/2 of the R journal.
    library(help="deSolve")    # see information on package

  • Interpolating data

  • A example of spline interpolation:
    fun <- function(x) sqrt(3) * sin(2*pi*x)   # function to generate some data
    x <- seq(0, 1, length=20)
    set.seed(123)                              # allow reproducible random numbers
    y <- jitter(fun(x), factor=20)             # add a small amount of random noise
    plot(y ~ x)                                # plot noisy data
    lines(spline(x, y))                        # add splined data
    Now, compare with the prediction from a smoothing spline:
    f <- smooth.spline(x, y)
    lines(predict(f), lty=2)
    why not also add the best-fit sine curve predicted from a linear regression with lm:
    lines(x, predict(lm( y ~ sin(2*pi*x))), col="red")
    Note that, by default, the predicted values are evaluated at the (X) positions of the raw data. This means that you can end up with rather coarse curves, as seen above. To get round this, you need to work with functions for the splines, which can be supplied with more finely-spaced X values for plotting:
    fun.spline <- splinefun(x, y)
    fun.smooth <- function(xx, ...) predict(smooth.spline(x, y), x=xx, ...)$y
    plot(y ~ x)
    curve(fun.spline, add=TRUE)             # } "curve" uses n=101 points by default
    curve(fun.smooth, add=TRUE, lty=2)      # }  at which to evaluate the function
    #--And add a smoother best-fit sine curve:
    fun.sine <- function(X) predict(lm( y ~ sin(2*pi*x)), newdata=list(x=X))
    curve(fun.sine, add=TRUE, col="red")
    #--Finally, just for completeness, plot the original function:
    curve(fun, add=TRUE, col="blue")
  • A wider range of splines is available in the package of the same name, accessed via library("splines"), including B splines, natural splines etc.

  • Matrix operations

  • t transposes a matrix; %*% is the usual (inner) matrix multiplication; diag returns the diagonal matrix and upper.tri & lower.tri return logical arrays indicating which elements belong to the upper/lower triangles.
  • To evaluate the classive five numbers from linear least squares regression (sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)), using matrices:
    N <- 10                     # } 
    x <- 1:N; y <- 10:19        # } create some X & Y data
    M <- cbind(n=1, x, y)       # combine into a matrix
    M2 <- t(M) %*% M            # matrix multiplication with transposed version
    res <- M2[! lower.tri(M2)]  # length of x/y & famous five numbers:
    identical(res, c(N, sum(x), sum(x^2), sum(y), sum(x*y), sum(y^2)))
  • To calculate the outer product of 2 arrays, use outer:
    x <- seq(-1, 1, length=200)
    A <- outer(x, x, function(x, y) cos(x)^2 - sin(y)^2 )
    You can also use this to generate a grid of values:
    outer(LETTERS[1:3], 1:5, paste, sep="")
  • Matrix crossproduct:
    x <- rnorm(1e7)
    system.time( a1 <- drop(crossprod(x)))
    system.time( a2 <- sum(x^2))
    # -> matrix version faster
    identical(a1, a2)   # check the answer is the same
  • You can also use solve to solve a system of equations, eigen to calculate eigenvalues and eigenvectors of a matrix, as well as svd to compute the singular-value decomposition of a rectangular matrix.
  • There is also a dedicated matrix package, which can handle sparse and dense matrices, accessible via library(help="Matrix")

  • Statistical Analysis in R

    Work in progress...This will be just a very brief taster of some of the many things that R can do in the way of statistical analysis.

    Fast bootstrap resampling to estimate regression parameter errors

    Bootstrap resampling is a very useful method to determine parameter error estimates. This section makes use of the boot R package, which can be loaded with library(boot) or require(boot).

    Permutation resampling to estimate p-values

    Permutation resampling is a very useful method to estimate the p-value of a statistical test in cases that cause problems for conventional estimates. An example of this is using the Kolmogorov-Smirnov test with tied data. The example here uses the 2-sample KS test to test if two sets of numbers are drawn from the same distribution. The code is based on the example from Rizzo, 2008 (Statistical Computing with R). The null hypothesis is that the 2 vectors (x and y) are drawn from the same distribution, so the test works by pooling the data and repeatedly drawing 2 vectors of length equal to x and y from random permutations of the pooled sample, i.e. sampling without replacement (unlike bootstrapping, which uses sampling with replacement)

    For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.

    Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.

    Quick links

    Jump to

    Copyright © 2010-2013 Alastair Sanderson