While R is best known as an environment for statistical computing, it is also a great tool for numerical analysis (optimization, integration, interpolation, matrix operations, differential equations etc). Here is a flavour of the capabilities that R offers in analysing data.
For a basic introduction, see the analysis section of the getting started page. For a more thorough overview of regression using R in an astronomical context, see this tutorial here.
Numerical Analysis in R
Parameter optimization
optimize(
optimiseis a valid alias):
fun <- function(x) x^2 + x - 1 # create function curve(fun, xlim=c(-2, 1)) # plot f unction ( res <- optimize(fun, interval=c(-10, 10)) ) points(res$minimum, res$objective) # plot point at minimum
#--Define an auxiliary minimizing function: fun.aux <- function(x, target) ( fun(x) - target )^2 ( res <- optimize(fun.aux, interval=c(-10, 10), target=1.234) ) fun(res$minimum) # close enough
curve(fun.aux(x, target=1.234), xlim=c(-3, 2)) points(res$minimum, res$objective)
?optimizefor how the start point is determined):
res2 <- optimize(fun.aux, interval=c(-10, 100), target=1.234) # force higher start value points(res2$minimum, res2$objective) # plot other minimum #--Show target values plotted with original function: curve(fun, xlim=c(-3, 2)) abline(h=1.234, lty=2) abline(v=c(res$minimum, res2$minimum), lty=3)
nlm,
optimor
nlminb(which I've found to be the most robust)
Integration, differentiation and differential equations
integrate(note that the function must be able to accept and return a vector):
fun <- function(x, const) x^2 + 2*x + const > integrate(fun, lower=0, upper=10, const=1) 443.3333 with absolute error < 4.9e-12
lowerto
upper) by passing a vector of these values to the function being integrated. Note that any other arguments to
funmust also be specified, as extra arguments to
integrate, and that the order of the arguments of
fundoes not matter, provided all arguments are supplied in this way, apart from the one being integrated over:
fun2 <- function(A, b, x) A*x^b # "x" doesn't have to be the first argument integrate(fun2, lower=0, upper=10, A=1, b=2) # "A" & "b" are given explicitly
b
bvals <- seq(0, 2, by=0.2) # create vector of b values fun2.int <- function(b) integrate(fun2, lower=0, upper=10, A=1, b=b)$value fun2.int(bvals[1]) # works for a single value of b fun2.int(bvals) # FAILS for a vector of values of b
fun2.intV <- Vectorize(fun2.int, "b") # Vectorize "fun2.int" over "b" fun2.intV(bvals) # returns a vector of values
D(see
?derivfor more info):
> D(expression(sin(x)^2 - exp(x^2)), "x") # differentiate with respect to "x" 2 * (cos(x) * sin(x)) - exp(x^2) * (2 * x)
deSolvepackage. You can read a helpful introduction to the deSolve package in Volume 2/2 of the R journal.
install.packages("deSolve")
library("deSolve")
library(help="deSolve") # see information on package
Interpolating data
fun <- function(x) sqrt(3) * sin(2*pi*x) # function to generate some data x <- seq(0, 1, length=20) set.seed(123) # allow reproducible random numbers y <- jitter(fun(x), factor=20) # add a small amount of random noise plot(y ~ x) # plot noisy data lines(spline(x, y)) # add splined data
f <- smooth.spline(x, y) lines(predict(f), lty=2)
lm:
lines(x, predict(lm( y ~ sin(2*pi*x))), col="red")
fun.spline <- splinefun(x, y) fun.smooth <- function(xx, ...) predict(smooth.spline(x, y), x=xx, ...)$y plot(y ~ x) curve(fun.spline, add=TRUE) # } "curve" uses n=101 points by default curve(fun.smooth, add=TRUE, lty=2) # } at which to evaluate the function #--And add a smoother best-fit sine curve: fun.sine <- function(X) predict(lm( y ~ sin(2*pi*x)), newdata=list(x=X)) curve(fun.sine, add=TRUE, col="red") #--Finally, just for completeness, plot the original function: curve(fun, add=TRUE, col="blue")
library("splines"), including B splines, natural splines etc.
Matrix operations
ttransposes a matrix;
%*%is the usual (inner) matrix multiplication;
diagreturns the diagonal matrix and
upper.tri&
lower.trireturn logical arrays indicating which elements belong to the upper/lower triangles.
N <- 10 # } x <- 1:N; y <- 10:19 # } create some X & Y data M <- cbind(n=1, x, y) # combine into a matrix M2 <- t(M) %*% M # matrix multiplication with transposed version res <- M2[! lower.tri(M2)] # length of x/y & famous five numbers: identical(res, c(N, sum(x), sum(x^2), sum(y), sum(x*y), sum(y^2)))
outer:
x <- seq(-1, 1, length=200) A <- outer(x, x, function(x, y) cos(x)^2 - sin(y)^2 ) require(lattice) levelplot(A)
outer(LETTERS[1:3], 1:5, paste, sep="")
x <- rnorm(1e7) system.time( a1 <- drop(crossprod(x))) system.time( a2 <- sum(x^2)) # -> matrix version faster identical(a1, a2) # check the answer is the same
solveto solve a system of equations,
eigento calculate eigenvalues and eigenvectors of a matrix, as well as
svdto compute the singular-value decomposition of a rectangular matrix.
library(help="Matrix")
Statistical Analysis in R
Work in progress...This will be just a very brief taster of some of the many things that R can do in the way of statistical analysis.
Fast bootstrap resampling to estimate regression parameter errors
Bootstrap resampling is a very useful method to determine parameter error estimates. This section makes use of the boot
R package, which can be loaded with library(boot)
or require(boot)
.
- A demonstration using simple linear regression:
set.seed(123) # allow reproducible random numbers N <- 20 A <- data.frame(x=1:N, y=rnorm(N, mean=1:N)) # create some data plot(y ~ x, A) # plot the data m <- lm(y ~ x, A) # fit linear model abline(m) # add best-fit model to plot as a line summary(m) # show model best fit values & standard errors #--Create a simple function to return the best-fit coefficients for the model # fitted to a subset of the original data ("A"), given a vector of row # numbers for the data frame ("indices"). "indices" will be the same length # as "nrow(A)", and will be supplied by the "boot" function, using random # sampling with replacement (i.e. "sample(nrow(A), replace=TRUE)") mystat <- function(A, indices) { m <- lm(y ~ x, A[indices, ]) return(coef(m)) } #--Demonstrate function: > mystat(A, 1:nrow(A)) # same as "coef(m)" (Intercept) x 0.3100925 0.9839554 > set.seed(123) # allow reproducible random numbers > mystat(A, sample(nrow(A), replace=TRUE)) # result for a single resample (Intercept) x 0.6143148 0.9416338 #--Run full set of "N.boot" bootstrap resamples: N.boot <- 500 require(boot) # load boot library set.seed(123) # allow reproducible random numbers b <- boot(A, mystat, R=N.boot) #--Plot results of bootstrapping: plot(b, index=1) # intercept plot(b, index=2) # slope # see "?plot.boot" for details #--Now compare the standard errors on the model parameters from # the bootstrap resampling with those from the normal summary method: > b # print results ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = A, statistic = mystat, R = N.boot) Bootstrap Statistics : original bias std. error t1* 0.3100925 0.0139150827 0.44749693 t2* 0.9839554 -0.0007816746 0.03979363 ## NB the bias is the difference between the mean of the N.boot ## resample parameter values and the original best-fit model ## parameter values, i.e. "apply(b$t, 2, mean) - coef(m)" #--Now show the standard errors computed (see "?summary.lm"): > coef(summary(m)) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.3100925 0.46199915 0.6711972 5.106173e-01 x 0.9839554 0.03856694 25.5129205 1.389147e-15 #--Or, quantify the difference by comparing the ratios: > sd(b$t) / coef(summary(m))[, 2] # close to 1 in each case (Intercept) x 0.9686099 1.0318067 - A quicker version of the above example, using the method described in Section 4.3.1 from Chambers & Hastie, 1992 (see
References
section in?lm
for book details). This method exploits the fact thatlm
does a certain amount of initial processing prior to the actual regression (usinglm.fit
), and this represents a substantial overhead that need only be performed once (and need not be replicated during each bootstrap resampling iteration).# following Section 4.3.1 from Chambers & Hastie (1992): m <- lm(y ~ x, A, x=TRUE, y=TRUE) # "x=y=TRUE" returns extra data for lm.fit mystat.fast <- function(dummy, i, model) coef(lm.fit(model$x[i, ], model$y[i])) require(boot) set.seed(123) # allow reproducible random numbers system.time(slow <- boot(A, mystat, 1e4)) set.seed(123) system.time(fast <- boot(A, mystat.fast, 1e4, model=m)) # roughly 10x faster #--Check results are identical: slow fast #--Formal check if two objects are identical: # the only differences reported are in components 6 & 8, which are due to the # different mystat function & name (i.e. "mystat" vs. "mstat.fast"), which is # stored in the object returned by "boot": identical(slow, fast) # not completely identical all.equal(slow, fast) # only differences due to different mystat functions
Note that there is no point having a very large number of bootstrap samples compared to the number of fitted values (i.e. the number of rows in the data frame), since the latter ultimately becomes the limiting factor in the accuracy of the recovered parameter error estimates.
-
An example using non-linear regression (
nls
)#--Create some non-linear data: N <- 20 set.seed(123) #--Create some data: B <- data.frame(x=1:N, y=(4 * log10(1:N)) + rnorm(N, mean=2, sd=0.2)) plot(y ~ x, B) # plot the data #--Fit the non-linear model m <- nls(y ~ a * log10(x) + b, data=B, start=list(a=1, b=1)) lines(B$x, fitted(m), lty=2) # plot best-fit model values as a dashed line #--A better way of plotting the best-fit model (as a smooth curve): curve(predict(m, newdata=data.frame(x=x)), add=TRUE) summary(m) # summarise the best-fit parameters and their errors etc. #--There is no equivalent possible for the fast version of "mystat" # for nls, so set up the basic function to calculate bootstrapped fit: mystat <- function(A, indices) { m <- nls(y ~ a * log10(x) + b, data=B[indices, ], start=list(a=1, b=1)) return(coef(m)) } #--Run full set of "N.boot" bootstrap resamples: N.boot <- 500 set.seed(123) require(boot) b <- boot(B, mystat, R=N.boot) #--Plot results of bootstrapping: # (NB note significant non-normal distribution of values; # i.e. right panel quantile-quantile plot values deviate from a straight line) plot(b, index=1) # parameter "a" plot(b, index=2) # parameter "b" #--Now compare the standard errors on the model parameters from # the bootstrap resampling with those from the normal summary method: > b # print results of bootstrap ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = B, statistic = mystat, R = N.boot) Bootstrap Statistics : original bias std. error t1* 3.986804 -0.02582234 0.1361342 t2* 2.040456 0.02679389 0.1332876 > summary(m) # print standard errors (see "?summary.nls") Formula: y ~ a * log10(x) + b Parameters: Estimate Std. Error t value Pr(>|t|) a 3.9868 0.1299 30.70 < 2e-16 *** b 2.0405 0.1275 16.01 4.33e-12 *** --- Signif. pres: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1998 on 18 degrees of freedom Number of iterations to convergence: 1 Achieved convergence tolerance: 1.234e-07 #--Or, quantify the difference by comparing the ratios: sd(b$t) / coef(summary(m))[, 2] # close to 1 in each case a b 1.048229 1.045583
Permutation resampling to estimate p-values
Permutation resampling is a very useful method to estimate the p-value of a statistical test in cases that cause problems for conventional estimates. An example of this is using the Kolmogorov-Smirnov test with tied data. The example here uses the 2-sample KS test to test if two sets of numbers are drawn from the same distribution. The code is based on the example from Rizzo, 2008 (Statistical Computing with R
). The null hypothesis is that the 2 vectors (x and y) are drawn from the same distribution, so the test works by pooling the data and repeatedly drawing 2 vectors of length equal to x and y from random permutations of the pooled sample, i.e. sampling without replacement (unlike bootstrapping, which uses sampling with replacement)
- Function to calculate the permutation p-value:
PermTest <- function(x, y, R=999, testfun=ks.test) { z <- c(x, y) # pooled sample myfun <- function(a, b) suppressWarnings(unname(testfun(a, b)$statistic)) DoIt <- function() { i <- sample(length(z), length(x)) myfun(z[i], z[-i]) # z[-i] is everything *except* the "i" elements of z } pstats <- replicate(R, DoIt()) stat <- myfun(x, y) c("p-value" = mean(c(stat, pstats) >= stat)) } -
Now create some simulated counts data, which will have lots of ties:
N <- 100 set.seed(123) # allow reproducible random numbers x <- rpois(N, lambda=1) set.seed(123) # allow reproducible random numbers y <- rpois(N, lambda=1.5) ## Run KS test: > ks.test(x, y) Two-sample Kolmogorov-Smirnov test data: x and y D = 0.16, p-value = 0.1545 alternative hypothesis: two-sided Warning message: In ks.test(x, y) : p-values will be approximate in the presence of ties ## Compare to the permutation result: > PermTest(x, y) p-value 0.04In this case the conventional (but incorrect) p-value suggests that x and y are drawn from the same distribution, but the permutation test is able to determine that they are not from the same distribution (i.e. p-value < 0.05). Note the warning message reported by ks.test - this is suppressed in PermTest, by using suppressWarnings.
- Now, to check the permuted p-values against the KS test exact p-values, run some simulations of data without ties:
## Function to compare the 2 methods for calculating the p-value: CompareTest <- function(x, y, testfun=ks.test, N=100) { x <- rnorm(N); y <- rnorm(N) p.exact <- suppressWarnings(testfun(x, y)$p.value) p.perm <- unname(PermTest(x, y)) c(p.exact = p.exact, p.perm = p.perm) } ## Run the tests across multiple cores, to speed up the calculation: require(parallel) system.time( res <- mclapply(1:1000, CompareTest) ) ## Combine separate list elements into a data frame: res <- as.data.frame(do.call(rbind, res)) ## Plot permuted vs. exact p-values & add line of equality: plot(p.perm ~ p.exact, data=res); abline(0, 1) ## Fit least squares regression: fm <- lm( p.perm ~ p.exact, data=res) > coef(summary(fm)) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.0006785381 0.0009002172 0.7537493 0.4511775 p.exact 1.0017142458 0.0014530249 689.3992438 0.0000000 ## - you can see that the slope is not significantly different from 1 ## Check for differences in the locations of the exact & permuted ## p-values using the non-parametric Kruskal-Wallis rank sum test: > kruskal.test( values ~ ind, data=stack(res)) Kruskal-Wallis rank sum test data: values by ind Kruskal-Wallis chi-squared = 0.1346, df = 1, p-value = 0.7137So, in there is no indication of any difference in the distributions of the p-values estimated by permutation compared to those computed exactly.
For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.
Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.