See also the introduction to R here. There's also an R wiki with lots of useful information, tips etc.
- The online help pages can be accessed as follows:
help.start() # Load HTML help pages into browser help(package) # List help page for "package" ?package # Shortform for "help(package)" help.search("keyword") # Search help pages for "keyword" ?help # For more options help(package=base) # List tasks in package "base" q() # Quit R - Some extra packages are not loaded by default, but can be loaded as follows:
library("package") library() # List available packages to load detach("package:pkg") # Unload the loaded package "pkg" library(help="package") # list package contents - Commands can be separated by a semicolon (
;
) or a newline - All characters after
#
are treated as comments (even on the same line as commands) - Variables are assigned using
<-
(although=
also works):x <- 4.5 y <- 2*x^2 # Use "<<-" for global assignment
- Vectors
If x is a vector, y will be a vector of the same length. The vector arithmetic in R means that an expression involving vectors of the same length (e.g. x+y, in the above example) will be evaluated piecewise for corresponding elements, without the need to loop explicitly over the values:
x <- c(1, 4, 5, 8) # Concatenate values & assign to x y <- 2*x^2 # Evaluate expression for each element in x > y # Typing the name of a variable evaluates it [1] 2 32 50 128
> x+y [1] 3 36 55 136 > 1:10 # Same as seq(1, 10) [1] 1 2 3 4 5 6 7 8 9 10 > a <- 1:10 a[-1] # Print whole vector *except* the first element a[-c(1, 5)] # Print whole vector except elements 1 & 5 a[1:5] # Print elements 1 to 5 a[length(a)] # Print last element, for any length of vector head(a, n=3) # Print the first n elements of a tail(a, n=3) # Print the last n elements of a
- Demonstrations of certain topics are available:
demo(graphics) # Run grapics demonstration demo() # List topics for which demos are available ?demo # List help for "demo" function
- Typing the name of a function lists its contents; typing a
variable evaluates it, or you can use
print(x)
. - First, generate a set of 10,000 Gaussian distributed random numbers:
See
data <- rnorm(1e4) # Gaussian distributed numbers with mean=0 & sigma=1 hist(data) # Plots a histogram, with sensible bin choices by default
?hist
for the full range of control parameters available, e.g. to specifiy 7 bins:hist(data, breaks=7)
- Assuming you have a file (
file.dat
) containing the following data, with either spaces or tabs between fields:Now read this file into R:r x y 1 4.2 14.2 2 2.4 64.8 3 8.76 63.4 4 5.9 32.2 5 3.4 89.6
Note that you can refer to R objects, names etc. using the first few letters of the full name, provided that is unambiguous, e.g.:> inp <- read.table("file.dat", header=T) # Read data into data frame > inp # Print contents of inp > inp[0] # Print the type of object that inp is: NULL data frame with 5 rows > colnames(inp) # Print the column names for inp: [1] "r" "x" "y" # This is simply a vector that can easily be changed: > colnames(inp) <- c("radius", "temperature", "time") > colnames(inp) [1] "radius" "temperature" "time"but note what happens if the information is ambiguous or if the column doesn't exist:> inp$r # Print the Radius column
Alternatively, you can refer to the columns by number:> inp$t # Could match "inp$temperature" or "inp$time" NULL > inp$wibble # "wibble" column doesn't exist NULL
> inp[[1]] # Print data as a vector (use "[[" & "]]") [1] 1 2 3 4 5 > inp[1] # Print data as a data.frame (only use "[" and "]")
- Writing data out to a file (if no filename is specified (the default), the output is written to the console)
By default the columns are separated by whitespace, but you can change this with the
> write.table(inp, quote=F, row.names=F, col.names=T) radius temperature time 1 4.2 14.2 2 2.4 64.8 3 8.76 63.4 4 5.9 32.2 5 3.4 89.6
sep=
option (see?write.table
for details), e.g. use a:
with a tab either side:> write.table(inp, quote=F, row.names=F, col.names=T, sep="\t:\t") radius : temperature : time 1 : 4.2 : 14.2 2 : 2.4 : 64.8 3 : 8.76 : 63.4 4 : 5.9 : 32.2 5 : 3.4 : 89.6
- Other types of connection
?connections # Help page on opening/closing connections write.table() # Write data to file # You can also refer to URLs for files, e.g. a <- read.table("http://www.sr.bham.ac.uk/~ajrs/R/datasets/file.dat", header=T) - Loading R commands from a file
source("commands.R") # Or, to load from a path specified by an environment variable # "$ENV_VAR/commands.R" source(paste(Sys.getenv("ENV_VAR"),"/commands.R",sep="")) - Saving data
Note that when you quit R (by typing
?save save(inp, t, file="data.RData") load(file="data.RData") # read back in the saved data: # - this will load the saved R objects into the current session, with # the same properties, i.e. all the variable will have the same names # and contents
q()
), it asks if you want to save the workspace image, if you specify yes (y
), it writes out two files to the current directory, called.RData
and.Rhistory
. The former contains the contents of the saved session (i.e. all the R objects in memory) and the latter is a list of all the command history (it's just a simple text file you can look at).
At any time you can save the history of commands using:savehistory(file="my.Rhistory") loadhistory(file="my.Rhistory") # load history file into current R session
- Useful functions:
Tip: to generate png, jpeg etc. files, I find it's best to create pdf versions in R, then convert them in gimp (having specified strong antialiasing, if asked, when loading the pdf file into gimp), otherwise the figures are "blocky" (i.e. suffer from aliasing).
?plot # Help page for plot command ?par # Help page for graphics parameter control ?Devices # or "?device" # (R can output in postscript, PDF, bitmap, PNG, JPEG and more formats) dev.list() # list graphics devices colours() # or "colors()" List all available colours ?plotmath # Help page for writing maths expressions in R
- To create an output file copy of a plot for printing or including in a document etc.
dev.copy2pdf(file="myplot.pdf") # } Copy contents of current graphics device dev.copy2eps(file="myplot.eps") # } to a PDF or Postscript file dev.copy()
- Adding more datasets to a plot:
x <- 1:10; y <- x^2 z <- 0.9*x^2 plot(x, y) # Plot original data points(x, z, pch="+") # Add new data, with different symbols lines(x,y) # Add a solid line for original data lines(x, z, col="red", lty=2) # Add a red dashed line for new data curve(1.1*x^2, add=T, lty=3, col="blue") # Plot a function as a curve text(2, 60, "An annotation") # Write text on plot abline(h=50) # Add a horizontal line abline(v=3, col="orange") # Add a vertical line
- Error bars (click here for source code)
Plot graph, but don't show the points (type="n") & plot errors:
source("http://www.sr.bham.ac.uk/~ajrs/R/scripts/errorbar.R") # source code # Create some data to plot: x <- seq(1,10); y <- x^2 xlo <- 0.9*x; xup <- 1.08*x; ylo <- 0.85*y; yup <- 1.2*yplot(x, y, log="xy", type="n")
errorbar(x, xlo, xup, y, ylo, yup)?plot
gives more info on plotting options (as does?par
). To use different line colours, styles, thicknesses & errorbar types for different points:errorbar(x, xlo, xup, y, ylo, yup, col=rainbow(length(x)), lty=1:5, type=c("b", "d", "c", "x"), lwd=1:3) - Plotting a function or equation
Plot chi-squared probability vs. reduced chi-squared for 10 and then 100 degrees of freedom
curve(x^2, from=1, to=100) # Plot a function over a given range curve(x^2, from=1, to=100, log="xy") # With log-log axes
Define & plot custom function:dof <- 10 curve(pchisq(x*dof,df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function") abline(v=1, lty=2, col="blue") # Plot dashed line for reduced chisq=1 dof <- 100 curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function") abline(v=1, lty=2, col="blue") # Plot dashed line for reduced chisq=1myfun <- function(x) x^3 + log(x)*sin(x) #--Plot dotted (lty=3) red line, with 3x normal line width (lwd=3): curve(myfun, from=1, to=10, lty=3, col="red", lwd=3) #--Add a legend, inset slightly from top left of plot: legend("topleft", inset=0.05, "My function", lty=3, lwd=3, col="red") - Plotting maths symbols:
demo(plotmath) # Shows a demonstration of many examples ?plotmath # Manual page with more information
- Basic stuff
Plot histogram in terms of probability & also overlay density plot:
x <- rnorm(100) # Create 100 Gaussian-distributed random numbers hist(x) # Plot histogram of x summary(x) # Some basic statistical info d <- density(x) # Compute kernel density estimate d # Print info on density analysis ("?density" for details) plot(d) # Plot density curve for x plot(density(x)) # Plot density curve directlyhist(x, probability=T) lines(density(x), col="red") # overlay smoothed density curve #--Standard stuff: mean(x); median(x); max(x); min(x); sum(x); weighted.mean(x) sd(x) # Standard deviation var(x) # Variance mad(x) # median absolute deviation (robust sd)
- Testing the robustness of statisical estimators
Calculate standard deviation of 100 Gaussian-distributed random numbers (NB default sd is 1; can specify with
?replicate # Repeatedly evaluate an expression
rnorm(100, sd=3)
or whatever)Now, let's run N=5 Monte Carlo simulations of the above test:sd(rnorm(100))
and then compute the mean of the N=5 values:replicate(5, sd(rnorm(100)))
Obviously 5 is rather small for a number of MC sims (repeat the above command & see how the answer fluctuates). Let's try 10,000 instead:mean(replicate(5, sd(rnorm(100))))
With a sample size of 100, the measured sd closely matches the actual value used to create the random data (sd=1). However, see what happens when the sample size decreases:mean(replicate(1e4, sd(rnorm(100))))
You can see the bias more clearly with a histogram or density plot:mean(replicate(1e4, sd(rnorm(10)))) # almost a 3% bias low mean(replicate(1e4, sd(rnorm(3)))) # >10% bias low
This bias is important to consider when esimating the velocity dispersion of a stellar or galaxy system with a small number of tracers. The bias is even worse for the robust sd estimatora <- replicate(1e4, sd(rnorm(3))) hist(b, breaks=100, probability=T) # Specify lots of bins lines(density(b), col="red") # Overlay kernel density estimate abline(v=1, col="blue", lwd=3) # Mark true value with thick blue line
mad()
:However, in the presence of outliers,mean(replicate(1e4, mad(rnorm(10)))) # ~9% bias low mean(replicate(1e4, mad(rnorm(3)))) # >30% bias low
mad()
is robust compared tosd()
as seen with the following demonstration. First, create a function to add a 5 sigma outlier to a Gaussian random distribution:Now compare the performance ofoutlier <- function(N,mean=0,sd=1) { a <- rnorm(N, mean=mean, sd=sd) a[1] <- mean+5*sd # Replace 1st value with 5 sigma outlier return(a) }mad()
vs.sd()
:You can also compare the median vs. mean:mean(replicate(1e4, mad(outlier(10)))) mean(replicate(1e4, sd(outlier(10))))
...and without the outlier:mean(replicate(1e4, median(outlier(10)))) mean(replicate(1e4, mean(outlier(10))))
mean(replicate(1e4, median(rnorm(10)))) mean(replicate(1e4, mean(rnorm(10))))
- Kolmogorov-Smirnov test, taken from manual page (
?ks.test
)x <- rnorm(50) # 50 Gaussian random numbers y <- runif(30) # 30 uniform random numbers # Do x and y come from the same distribution? ks.test(x, y)
- Correlation tests
Load in some data (see section on data frames, below):dfile <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt" A <- read.table(dfile, header=T, sep="|") cor.test(A$z, A$Tx) # Specify X & Y data separately cor.test(~ z + Tx, data=A) # Alternative format when using a data frame ("A") cor.test(A$z, A$Tx, method="spearman") # } Alternative methods cor.test(A$z, A$Tx, method="kendall") # } - Spline interpolation of data
x <- 1:20 y <- jitter(x^2, factor=20) # Add some noise to vector sf <- splinefun(x, y) # Perform cubic spline interpolation # - note that "sf" is a function: sf(1.5) # Evaluate spline function at x=1.5 plot(x,y); curve(sf(x), add=T) # Plot data points & spline curve
- Scatter plot smoothing
#--Using above data frame ("A"): plot(Tx ~ z, data=A) #--Return (X & Y values of locally-weighted polynomial regression lowess(A$z, A$Tx) #--Plot smoothed data as line on graph: lines(lowess(A$z, A$Tx), col="red") - Numerical integration of a function (see
Functions
section below)fun <- function(x, norm, index) norm*x^index # Create simple function # See "?integrate" for details. Note that: # 1) the names of arguments in "fun" must not match those of arguments # in "integrate()" itself. # 2) fun must be able to return a vector, if supplied with a vector # (i.e. not just a single value) # i <- integrate(fun, lower=1, upper=10, norm=0.5, index=2) > i 166.5 with absolute error < 1.8e-12 > i[0] list() # Note that integrate() returns a list > names(i) # Show names components in list [1] "value" "abs.error" "subdivisions" "message" "call" > i$value # If you want the integral value alone [1] 166.5
- Regression
(see also the tutorial at Penn State's Center for Astrostatistics here, which is more thorough.) Load in some data (see section on data frames, below; this dataset is described in more detail in the section on factors, below):Specify 4 plots on same page, in 2x2 configuration:dfile <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt" A <- read.table(dfile, header=T, sep="|") names(A) # Print column names in data frame m <- lm(Tx ~ z, data=A) # Fit linear model summary(m) # Show best-fit parameters & errors
Plot data & best-fit model:layout(matrix(1:4, nrow=2, ncol=2)) layout.show(4) # Show plot layout for 4 plots plot(m) # Shows useful diagnostics of the fit (4 plots)
Plot residuals:plot(Tx ~ z, A) abline(m, col="blue") # add best-fit straight line model
Something fancier: show data, model & residuals in 2 panels:plot(A$z, resid(m))
Nonlinear least-squares estimates of the parameters of a nonlinear model:layout(matrix(c(1,1,2)),heights=c(1,1)) # Specify layout for panels layout.show(2) # Show plot layout for 2 plots plot(Tx ~ z, data=A) abline(m, col="blue") plot(A$z, resid(m), main="Residuals", col="blue", pch=20)
Since R was developed with experimental (rather than observational) sciences in mind, it is not geared up to handle errors in both X and Y directions. Therefore the existing algorithms all perform regression in one direction only. However, unweighed orthogonal regression is equivalent to principal components analysis.?nls
- Create a data frame:
A <- data.frame(a=1:3, b=4:6, c=7:9) colnames(A) # List column names for A rownames(A) # List row names (will just be numbers if not specified)
Accessing data frames
To retrieve a named column, where the column name is itself a variable:A$a # Print the column named "a" A[c("a", "c")] # Print the columns listed in the square brackets A[2] # Print column 2 of A (type=data frame) A[[2]] # Print column 2 of A (type=vector) A[,2] # Print column 2 of A (type=vector) A[2,] # Print row 2 of A A[1, ] # Print the 1st row of data frame A A[2:3, ] # Print rows 2 to 3 A[, -1] # Print everything except the 1st column of A A[0] # Print type of object (data frame) A[[1]][0] # (numeric, i.e. vector) A[A$a%%2 == 1, ] # Print rows of A where column "a" has an odd number A[A$b<6 & A$c>7, ] # Print rows of A according to a combination of constraints A[A$b<6 | A$c>7, ] # ("&" = logical AND; "|" = logical OR) A$a <- NULL # Delete a column from a data framename <- "b" A[name] # Produces a data frame A[[name]] # Produces a vector
Manipulating data frames
To join data frame B to A (side by side), you could use:B <- data.frame(d=11:13, e=14:16, f=17:19) A + 1 # Perform arithmetic on each element of A A * B # Operates on each element, if A & B have same dimensions A[order(A$a, decreasing=T), ] # Sort A according to a specified column t(A) # Transpose A (swap rows for columns & vice versa) # NB the type of t(A) is no longer a data frame, but you can make it so: as.data.frame(t(A))
However, in a situation where this operation might be occuring iteratively, you could end up with multiple, repeated instances of B:A <- data.frame(A, B) # Can also use "cbind(A, B)"
A better way might be to use:A <- data.frame(A, B) # To recover the original data, use: A <- A[, 1:3]
To append one data frame below another (instead of side-by-side), use:A[colnames(B)] <- B # - this will replace any existing named columns from B that are already # present in A
Note that if column names are specified, they must match in all data.frames for rbind to work:rbind(A, A, A)
rbind(A, B) # Fails! colnames(B) <- colnames(A) # Use same column names for B as for A rbind(A, B) # Works
- You can use
attach(A)
anddetach(A)
to add/remove data frame to/from the current search path, allowing you to use the column names as if they were standalone vectors. However, this approach is usually best avoided, since it can cause confusion if any objects exist with the same name as a column of an attached data frame. It's better to usewith
to achieve the same effect within a controlled setting: - The
with
function enables you to work with column names without having to prefix the data frame name, as followsA <- data.frame(a=1:20, b=rnorm(20)) with(A, a^2 + 2*b)
- This principle is used in the
transform
function, to allow you to construct new columns in the data frame using references to the other column names:transform(A, c=a^2 + 2*b) # Add new column to data frame
- Similarly, the
subset
function works in the same way to enable easy filtering of data frames:subset(A, b >= 0 & a%%2 == 0) # Returns rows with positive b & even numbered a
- You can also easily perform database join operations, using
merge
:B <- data.frame(a=1:7, x=runif(7)) merge(A, B) # Return rows with same "a", combining unique columns from A & B
- Basics
cat # Type function name without brackets to list contents args(cat) # Return arguments of any function body(cat) # Return main body of function formals(fun) # Get or set the formal arguments of a function debug(fun); undebug(fun) # Set or unset the debugging flag on a function
- Create your own function
> fun <- function(x, a, b, c) (a*x^2) + (b*x^2) + c > fun(3, 1, 2, 3)
[1] 30
> fun(5, 1, 2, 3) [1] 78 - A more complicated example of a function. First, create some data:
Now create a simple sigma clipping function (you can paste the following straight into R):
set.seed(123) # allow reproducible random numbers a <- rnorm(1000, mean=10, sd=1) # 1000 Gaussian random numbers b <- rnorm(100, mean=50, sd=15) # smaller population of higher numbers x <- c(a, b) # Combine datasets hist(x) # Shows outlier population clearly sd(x) # Strongly biased by outliers mad(x) # Robustly estimates sd of main sample mean(x) # biased median(x) # robust
Now apply the function to the test dataset:sigma.clip <- function(vec, nclip=3, N.max=5) { mean <- mean(vec); sigma <- sd(vec) clip.lo <- mean - (nclip*sigma) clip.up <- mean + (nclip*sigma) vec <- vec[vec < clip.up & vec > clip.lo] # Remove outliers if ( N.max > 0 ) { N.max <- N.max - 1 # Note the use of recursion here (i.e. the function calls itself): vec <- Recall(vec, nclip=nclip, N.max=N.max) } return(vec) }new <- sigma.clip(x) hist(new) # Only the main population remains! length(new) # } Compare numbers of length(x) # } values before & after clipping
- Basics
List type of vector b ("factor"): also lists the "levels", i.e. all the different types of value:
a <- rpois(100, 3) # Create 100 Poisson distributed random numbers (lambda=3) hist(a) # Plot frequency histogram of data a[0] # List type of vector ("numeric") b <- as.factor(a) # Change type to factorNow, plot b, to get a barchart showing the frequency of entries in each level:b[0]
plot(b) table(b) # a tabular summary of the same information
- A more practical example. Read in table of data (from Sanderson et al. 2006), which refers to a sample of 20 clusters of galaxies with Chandra X-ray data:
By default,
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson06/table1.txt" A <- read.table(file, header=T, sep="|")
read.table()
will treat non-numeric values as factors. Sometimes this is annoying, in which case useas.is=T
; you can specify the type of all the input columns explicitly -- see?read.table
.TheA[1,] # Print 1st row of data frame (will also show column names): plot(A$det) # Plot the numbers of each detector type plot(A$cctype) # Plot the numbers of cool-core and non-cool core clusters
det
column is the detector used for the observation of each cluster:S
for ACIS-S andI
for ACIS-I. Thecctype
denotes the cool core type:CC
for cool core,NCC
for non-cool core.
Now, plot one factor against another, to get a plot showing the number of overlaps between the two categories (the cool-core clusters are almost all observed with ACIS-S):You can easily explore the properties of the different categories, by plotting a factor against a numeric variable. The result is to show a boxplot (seeplot(cctype ~ det, data=a, xlab="ACIS Detector", ylab="Cool-core type")
?boxplot
) of the numeric variable for each catagory. For example, compare the mean temperature of cool-core & non-cool core clusters:and compare the redshifts of clusters observed with ACIS-S & ACIS-I:plot(kT ~ cctype, data=A)
You can define your own factor categories by partitioning a numeric vector of data into discrete bins usingplot(z ~ det, data=A)
cut()
, as shown here: - Converting a numeric vector into factors
Split cluster redshift range into 2 bins of equal width (nearer & further clusters), and assign each z point accordingly; create new column of data frame with the bin assignments (A$z.bin
):Plot mean temperatures of nearer & further clusters. Since this is essentially a flux-limited sample of galaxy clusters, the more distant clusters (higher redshift) are hotter (and hence more X-ray luminous - an example of the common selection effect known as Malmquist bias):A$z.bin <- cut(A$z, 2) A$z.bin # Show levels & bin assignments plot(A$z.bin) # Show bar chart
Check if redshifts of (non) cool-core clusters differ:plot(z ~ T.bin, data=A) plot(Tx ~ z, data=A) # More clearly seen!
Now, let's say you wanted to colour-code the different CC types. First, create a vector of colours by copying the cctype column (you can add this as a column to the data frame, or keep it as a separate vector):plot(cctype ~ z.bin, data=A) # Equally mixed
Now all you need to do is change the names of the levels:colour <- A$cctype
Now change thelevels(colour) # Show existing levels levels(colour) <- c("blue", "red") # specify blue=CC, red=NCC colour[0] # Show data type (& levels, since it's is a factor)type
of the vector tocharacter
(i.e. not a factor):Now plot mean temperature vs. redshift, colour coded by CC type:col <- as.character(colour) col[0] # Confirm change of data type col # } Print values and colour # } note the difference
Why not add a plot legend:plot(Tx ~ z, A, col=col)
Now for something a little bit fancier:legend(x="topleft", legend=c(levels(A$cctype)), text.col=levels(colour))
plot(Tx ~ z, A, col=col, xlab="Redshift", ylab="Temperature (keV)") legend(x="topleft", c(levels(A$cctype)), col=levels(colour), text.col=levels(colour), inset=0.02, pch=1)
A great feature of R functions is that they offer a lot of control to the user, but provide sensible hidden defaults. A good example is the histogram function, which works very well without any explicit control:
Data Input/Output
See also here
Plotting data
See also here
Analysis
Data Frames
See also here
Working with data frames
Functions
See also here
Factors
See also here
Factors are a vector type which treats the values as character strings which form part of a base set of values (called "levels"). The result of plotting a factor is to produce a frequency histogram of the number of entries in each level (see ?factor for more info).
For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.
Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.