One of the most important aspects of computing with data is the ability to manipulate it, to enable subsequent analysis and visualization. R offers a wide range of tools for this purpose. Note that the `plyr` package provides an even more powerful and convenient means of manipulating and processing data, which I hope to describe in later updates to this page.

# Add and remove data

First create a data frame, then remove a column and create a new one:

    A <- data.frame(a=LETTERS[1:5], b=1:5, c=rnorm(5))
    A$d <- NULL  # deletes column "d" if present (a no-op here, as "A" has no such column)
    A$e <- 1:5   # add in a new column "e"

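Now create a second data frame, `B`, with the same column names as the first two columns of `A` (1 & 2). Note the use of the same column names to see what happens when `A` & `B` are joined together: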

    set.seed(123)  # allow reproducible random numbers
    B <- data.frame(a=letters[1:5], b=sample(1:2, size=5, replace=TRUE))

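Check the class of each column. Note that character vectors are converted to factors, as in the output below, unless `stringsAsFactors=FALSE` is set when using `data.frame` (this became the default in R 4.0.0, where column `a` would instead be reported as "character"):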

    > sapply(A, class)
            a         b         c         e
     "factor" "integer" "numeric" "integer"
    > sapply(B, class)
            a         b
     "factor" "integer"

Data frames can be combined using `c`, but the result will be a list:

    c(A, B)  # creates a list
    > class(c(A, B))
    [1] "list"

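To obtain a data frame instead, either convert the list with `as.data.frame` or pass both data frames to `data.frame`: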

    AB1 <- as.data.frame(c(A, B))
    AB2 <- data.frame(A, B)
    > identical(AB1, AB2)
    [1] TRUE

Another option is `cbind`:

    AB3 <- cbind(A, B)
    colnames(AB2)  # } note the
    colnames(AB3)  # } difference

The column names shared by `A` & `B` are rendered unambiguous when using `as.data.frame(c(A, B))`, by appending `.1` to the 2nd data frame column names. It does this using `make.unique`, which is useful if you need to generate unique elements, given a vector containing duplicated character strings.
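As a small illustration of this renaming, `make.unique` can be applied directly to a character vector containing duplicates. The vector `nms` below is a made-up example that mirrors the combined column names of `A` and `B`:

```r
# Hypothetical vector mirroring the combined column names of A and B
nms <- c("a", "b", "c", "e", "a", "b")
unique_nms <- make.unique(nms)
unique_nms  # "a" "b" "c" "e" "a.1" "b.1"

# A different separator can be supplied via the "sep" argument
make.unique(nms, sep="_")
```

Only the repeated entries are altered; the first occurrence of each name is left as it is.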

### do.call

`do.call` constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is an extremely useful function, which can be used to join together data frames stored in a list, for example:

    l <- list(first=A[1, ], second=A[2, ], rest=A[-c(1:2), ])
    do.call(rbind, l)

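This is an alternative to flattening the list (e.g. with `unlist` or `c`), and uses `rbind` to combine the components without losing the 2 dimensional structure of the data stored within each component of the list.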
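To make the contrast concrete, here is a minimal sketch using a made-up list called `pieces` (not part of the original example). It assumes every component has the same columns, which `rbind` requires:

```r
# Illustrative list of small data frames with matching columns
pieces <- list(
  data.frame(id=1:2, value=c(0.5, 1.5)),
  data.frame(id=3L, value=2.5),
  data.frame(id=4:5, value=c(3.5, 4.5))
)
# Equivalent to rbind(pieces[[1]], pieces[[2]], pieces[[3]])
combined <- do.call(rbind, pieces)
dim(combined)                       # 5 rows, 2 columns
class(c(pieces[[1]], pieces[[2]]))  # "list": the 2-d structure is lost
```

The advantage of `do.call` is that the call does not need to know in advance how many components the list holds.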

- add columns to df, c, data.array(A, newdf)
- subset, transform, "[", "[[", methods
- rbind, cbind

# Joining data frames

`merge` is used to perform a database join operation, merging together rows of 2 data frames which share common entries in one or more columns:

    A <- data.frame(letter=LETTERS[1:5], a=1:5)
    B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
    merge(A, B)            # return rows with same "letter", combining unique columns from A & B
    merge(A, B, all=TRUE)  # see how non-overlapping "letter" values are handled
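Because the example above relies on random data, the effect of `all` can be easier to see with fixed values. The following sketch uses two made-up data frames, `left` and `right`, whose only shared column is `letter`:

```r
# Illustrative data: "letter" is the only shared column, so it is the join key
left  <- data.frame(letter=c("A", "B", "C"), a=1:3)
right <- data.frame(letter=c("B", "C", "D"), x=c(0.2, 0.3, 0.4))

inner     <- merge(left, right)              # only letters present in both: B, C
outer     <- merge(left, right, all=TRUE)    # all letters, with NA where absent
left_only <- merge(left, right, all.x=TRUE)  # keep every row of "left"
```

Here `inner` has 2 rows, `left_only` has 3 and `outer` has 4, with `NA` filling the cells that have no counterpart in the other data frame.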

`match` identifies common elements between 2 vectors and returns the positions in the second vector of these matching elements, in the order they appear in the first vector.

    match(c("B", "E"), LETTERS)
    match(c("B", "3", "E"), LETTERS)  # returns "NA" if no corresponding match

    B[match(A$letter, B$letter), ]  # rows of "B" in the order of "A$letter": like "merge(A, B)", but without column "a" and with row names from "B"

    x <- rep(LETTERS[1:3], each=3)
    match(LETTERS[1:3], x)  # "match" *only* returns position of *first* match
    match(x, LETTERS[1:3])  # match returns a vector as long as its first argument

To test for membership, returning a logical vector rather than positions, use `%in%`:

    x[x %in% "B"] <- "b"  # change elements of x equal to "B"
    #--Alternatively:
    x[grep("C", x)] <- "c"  # change elements of x containing "C"
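The relationship between the two functions can be checked directly. In this short sketch the vector `lookup` is an arbitrary example; the point is that `%in%` reports whether `match` found a position at all:

```r
lookup <- c("B", "3", "E")
pos <- match(lookup, LETTERS)   # positions in LETTERS: 2, NA, 5
present <- lookup %in% LETTERS  # TRUE, FALSE, TRUE
# %in% gives the same answer as testing whether match() found a position
identical(present, !is.na(pos))
```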

    intersect(1:5, 3:8)  # elements common to both vectors
    union(1:5, 3:8)      # all distinct elements from both vectors

    x <- c(1:5, 3:8)
    duplicated(x)         # logical vector
    which(duplicated(x))  # return duplicate element numbers
    unique(x)             # same as "x[! duplicated(x)]"

# Rearranging data structures

To sort a vector:

    a <- sample(1:10)
    sort(a)
    sort(a, decreasing=TRUE)  # reverse order
    order(a)                  # the element numbers of "a" in order of the values of "a"
    a[order(a)]               # same as "sort(a)"

To sort a data frame, use `order` to specify the row order of the data frame:

    A <- data.frame(a=sample(LETTERS[1:5]), b=sample(1:5))
    A[order(A$a), ]  # } compare
    A[order(A$b), ]  # }
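The same indexing idea extends to sorting by more than one column, since `order` accepts several keys and uses the later ones to break ties. A minimal sketch, using a made-up data frame `df` (not part of the original example):

```r
# Hypothetical data: group labels with a numeric value
df <- data.frame(g=c("b", "a", "b", "a"), v=c(1, 4, 3, 2))
# Sort by "g" ascending, then by "v" descending within each group
sorted <- df[order(df$g, -df$v), ]
rownames(sorted)  # original row names show the reordering: "2" "4" "3" "1"
```

Negating a numeric key is a simple way to reverse the direction of that key alone.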

    A <- matrix(1:6, nrow=3)  # a matrix with 3 rows and 2 columns
    t(A)                      # transpose: 2 rows and 3 columns

`t` returns a matrix; the equivalent for a data frame is as follows:

    #--Create a data frame with column *and* row names:
    B <- data.frame(a=1:3, b=LETTERS[1:3], row.names=c("one", "two", "three"))
    > as.data.frame(t(B))
      one two three
    a   1   2     3
    b   A   B     C

To transpose an array with more than 2 dimensions, use `aperm`:

    A <- array(1:12, dim=c(2, 2, 3))  # create a 3d array
    aperm(A, perm=1:3)                # return original structure
    aperm(A, perm=c(1, 3, 2))         # swap 2nd & 3rd dimensions
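It can help to confirm what the permutation does to the dimensions and to individual elements. The sketch below reuses the same array and introduces two illustrative names, `P` and `M`:

```r
A <- array(1:12, dim=c(2, 2, 3))
P <- aperm(A, perm=c(1, 3, 2))  # swap 2nd & 3rd dimensions
dim(P)                          # 2 3 2
# Element [i, j, k] of A becomes element [i, k, j] of P
A[2, 1, 3] == P[2, 3, 1]
# For a matrix, aperm() with its default "perm" reverses the dimensions, like t()
M <- matrix(1:6, nrow=3)
all(aperm(M) == t(M))
```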

# Reshaping data

First create some multi-column data:

    set.seed(123)  # allow reproducible random numbers
    A <- data.frame(a=letters[1:3], x=rnorm(3), y=runif(3))
    > A
      a          x         y
    1 a -0.5604756 0.5281055
    2 b -0.2301775 0.8924190
    3 c  1.5587083 0.5514350

    > stack(A)
          values ind
    1 -0.5604756   x
    2 -0.2301775   x
    3  1.5587083   x
    4  0.5281055   y
    5  0.8924190   y
    6  0.5514350   y
    # NB, the "ind" column is now a factor:
    > class(stack(A)$ind)
    [1] "factor"

Note that column `a` is lost in the stacking:

    > unstack(stack(A))
               x         y
    1 -0.5604756 0.5281055
    2 -0.2301775 0.8924190
    3  1.5587083 0.5514350

There is also `reshape`, which converts between so-called long and wide format data (i.e. columns stacked below each other vs. columns arranged beside each other). However, the documentation for `reshape` is remarkably opaque! A much more convenient function is `melt` from the excellent `reshape` package:

    install.packages("reshape")
    library(reshape)  # load the package which provides "melt"
    melt(A)           # retains column "a", unlike "stack(A)"
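For comparison, a similar long format can be produced with base R's `reshape`, which also shows why its interface is considered awkward. This is a sketch using a small made-up data frame `wide`; the column names `variable` and `value` are arbitrary choices made here, not something the function requires:

```r
# Hypothetical wide-format data: one row per "a", measurements in "x" and "y"
wide <- data.frame(a=letters[1:3], x=c(1, 2, 3), y=c(4, 5, 6))
long <- reshape(wide, direction="long",
                varying=c("x", "y"),  # columns to stack
                v.names="value",      # name of the stacked values column
                timevar="variable",   # name of the column identifying the source
                times=c("x", "y"),    # labels stored in that column
                idvar="a")            # column identifying each original row
long  # 6 rows, with columns "a", "variable" and "value"
```

Every argument after `direction` has to be spelled out to get a tidy result, whereas `melt` works out sensible defaults for itself.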

# Truncating and rounding data

    set.seed(123)            # allow reproducible random numbers
    x <- rnorm(20, sd=2)     # default mean is zero
    round(x)                 # round to nearest integer
    round(x, 1)              # round to 1 decimal place
    format(x, digits=1)      # format with at least 1 significant digit per element (and convert to character)
    trunc(x)                 # round towards zero (discard the fractional part)
    x[round(x) != trunc(x)]  # elements whose fractional part is larger than 0.5 in magnitude
    floor(x)                 # round down to nearest integer
    ceiling(x)               # round up to nearest integer
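The differences between these functions are easiest to see side by side, particularly for negative numbers. The following sketch uses a few hand-picked values (the names `v` and `rounding` are illustrative) and avoids exact halves, whose rounding can be surprising:

```r
# Hand-picked values (no ties at .5) so the results are easy to verify
v <- c(-1.7, -0.2, 0.4, 1.6)
rounding <- rbind(
  round   = round(v),    # nearest integer:  -2  0  0  2
  trunc   = trunc(v),    # towards zero:     -1  0  0  1
  floor   = floor(v),    # downwards:        -2 -1  0  1
  ceiling = ceiling(v)   # upwards:          -1  0  1  2
)
colnames(rounding) <- v
rounding
```

Note that `trunc` agrees with `floor` for positive values and with `ceiling` for negative ones.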

    i <- seq_along(x)                     # vector of x element numbers
    plot(i, x)                            # same as "plot(x)"
    abline(h=-4:3, lty=2)                 # add dashed lines to mark the integers
    segments(i, floor(x), i, ceiling(x))  # plot floor/ceiling values

    x2 <- pmax(pmin(x, 1), 0)  # threshold x to the range 0 to 1, using the nifty parallel maximum & minimum functions
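To see why the parallel versions are needed for thresholding, compare them with plain `max` and `min` on a few made-up values (the names `v` and `clamped` are illustrative):

```r
# Made-up values straddling the thresholds at 0 and 1
v <- c(-0.5, 0.2, 1.7, 1, 0)
# pmin caps each element at 1, then pmax raises each element to at least 0
clamped <- pmax(pmin(v, 1), 0)
clamped  # 0.0 0.2 1.0 1.0 0.0
# By contrast, max() and min() collapse the whole vector to a single value
max(min(v, 1), 0)  # 0
```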

    plot(x, pch=3)                      # plot original data as "+" symbols
    abline(h=c(0, 1), lty=2)            # show thresholds as dashed lines
    points(x2)                          # show thresholded data as default hollow points
    elms <- x2 %in% c(0, 1)             # elements of x2 which have been thresholded
    points(i[elms], x2[elms], pch=19)   # highlight thresholded points

# Miscellaneous commands

If a data frame contains any missing values (`NA`), you can exclude the corresponding entire row:

    A$y[4:9] <- A$x[2] <- NA
    > na.omit(A)
                x         y
    1  -0.5604756 0.8895393
    3   1.5587083 0.6405068
    10 -0.4456620 0.1471136
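A self-contained sketch of the same idea, using a small made-up data frame `df` rather than the one above, shows that a row is dropped if any of its columns is missing and that the surviving rows keep their original row names:

```r
# Illustrative data frame with missing values in both columns
df <- data.frame(x=c(1, NA, 3, 4), y=c("a", "b", NA, "d"))
clean <- na.omit(df)  # drops rows 2 and 3
rownames(clean)       # original row names are retained: "1" "4"
complete.cases(df)    # TRUE FALSE FALSE TRUE
```

`complete.cases` returns the underlying logical vector, which is handy if you want to count or inspect the incomplete rows instead of discarding them.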

    > unlist(list(a=1, b=2:5, c=6))
     a b1 b2 b3 b4  c
     1  2  3  4  5  6

Use `split` to pull out separate list entries for each group:

    A <- data.frame(group=LETTERS[rep(1:3, 1:3)], x=rnorm(6))  # 3 groups: "A", "B", "C"
    a <- split(A$x, A$group)
    > a
    $A
    [1] 0.6849361
    $B
    [1] -0.3200564 -1.3115224
    $C
    [1] -0.5996083 -0.1294107  0.8867361

To reverse the process, use `unsplit`:

    unsplit(a, f=A$group)
    unname(unlist(a))  # same result here, because the groups are already in sorted order
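The two approaches only agree when the groups are already sorted. A short sketch with made-up, interleaved groups (the names `group`, `vals` and `parts` are illustrative) shows the difference:

```r
# Hypothetical data in which the groups are interleaved, not sorted
group <- c("B", "A", "B", "C", "A")
vals <- c(10, 20, 30, 40, 50)
parts <- split(vals, group)  # $A: 20 50   $B: 10 30   $C: 40
sapply(parts, sum)           # per-group totals: A=70, B=40, C=40
unsplit(parts, group)        # original order restored: 10 20 30 40 50
unname(unlist(parts))        # grouped order instead:   20 50 10 30 40
```

This makes `unsplit` the safer choice when results computed per group need to be put back alongside the original data.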

For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.

Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.