One of the most important aspects of computing with data is the ability to manipulate it, to enable subsequent analysis and visualization. R offers a wide range of tools for this purpose. Note that the plyr
package provides an even more powerful and convenient means of manipulating and processing data, which I hope to describe in later updates to this page.
Add and remove data
First create a data frame, then remove a column and create a new one:
A <- data.frame(a=LETTERS[1:5], b=1:5, c=rnorm(5))
A$d <- NULL # to delete column "d" (has no effect if "d" is absent, as here)
A$e <- 1:5 # add in a new column "e"
Next, create a second data frame, B, with 2 columns. Note the use of the same column names to see what happens when A & B are joined together:
set.seed(123) # allow reproducible random numbers
B <- data.frame(a=letters[1:5], b=sample(1:2, size=5, replace=TRUE))
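Now check the class of each column (in R versions before 4.0.0, character vectors are converted to factors, unless you specify stringsAsFactors=FALSE when using data.frame):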
> sapply(A, class)
        a         b         c         e
 "factor" "integer" "numeric" "integer"
> sapply(B, class)
        a         b
 "factor" "integer"
Data frames can be joined together using c, but the result will be a list:
c(A, B) # creates a list
> class(c(A, B))
[1] "list"
To obtain a data frame instead, convert the list using as.data.frame, or pass both data frames to data.frame:
AB1 <- as.data.frame(c(A, B))
AB2 <- data.frame(A, B)
> identical(AB1, AB2)
[1] TRUE
Alternatively, join the columns using cbind:
AB3 <- cbind(A, B)
colnames(AB2) # } note the
colnames(AB3) # } difference
Note that the duplicated column names of A & B are rendered unambiguous when using as.data.frame(c(A, B)) or data.frame(A, B), by appending .1 to the 2nd data frame column names, whereas cbind keeps the duplicated names. It does this using make.unique, which is useful if you need to generate unique elements, given a vector containing duplicated character strings.
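As a small illustration of how make.unique handles duplicated strings, the sketch below uses a made-up vector of names (nms is purely illustrative, not part of the example above):

```r
# Hypothetical vector of names containing duplicates
nms <- c("a", "b", "a", "b", "a")

# Later duplicates get a numeric suffix; first occurrences are left alone
unique_nms <- make.unique(nms)
print(unique_nms)

# make.names(..., unique=TRUE) also guarantees syntactically valid names
valid_nms <- make.names(c("a", "b", "a", "b"), unique=TRUE)
print(valid_nms)
```

The separator can be changed with the sep argument, e.g. make.unique(nms, sep="_").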
do.call
do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is an extremely useful function, which can be used to join together data frames stored in a list, for example:
l <- list(first=A[1, ], second=A[2, ], rest=A[-c(1:2), ])
do.call(rbind, l)
Here do.call passes each component of the list to rbind as a separate argument, so the rows are joined without losing the 2 dimensional structure of the data stored within each component of the list.
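To make the mechanics more concrete, here is a minimal sketch showing that do.call(f, list(x, y)) behaves like writing f(x, y) by hand. The object names (pieces, joined, by_hand) are illustrative only:

```r
# An unnamed list of small data frames sharing the same columns
pieces <- list(
  data.frame(id=1:2, val=c(0.5, 1.5)),
  data.frame(id=3L, val=2.5)
)

# do.call supplies each list element as a separate argument to rbind
joined <- do.call(rbind, pieces)

# The equivalent call written out by hand
by_hand <- rbind(pieces[[1]], pieces[[2]])

print(joined)
print(dim(joined)) # still 2 dimensional: rows x columns

# Named list elements become named arguments of the function
label <- do.call(paste, list("a", "b", sep="-"))
print(label)
```

The advantage over the hand-written form is that the list can be of any length, which is not known in advance when it is built in a loop or by lapply.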
- add columns to df, c, data.array(A, newdf)
- subset, transform, "[", "[[", methods
- rbind, cbind
Joining data frames
merge is used to perform a database join operation to merge together rows of 2 data frames which share common entries in one or more columns:
A <- data.frame(letter=LETTERS[1:5], a=1:5)
B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
merge(A, B) # Return rows with same "letter", combining unique columns from A & B
merge(A, B, all=TRUE) # see how non-overlapping "letter" values are handled
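The different kinds of join are easier to compare on a tiny example where the two tables only partly overlap. The sketch below uses made-up data frames (left and right are illustrative names) and names the key column explicitly with the by argument:

```r
# Two small tables that share only the keys "B" and "C"
left  <- data.frame(letter=c("A", "B", "C"), a=1:3, stringsAsFactors=FALSE)
right <- data.frame(letter=c("B", "C", "D"), x=c(10, 20, 30), stringsAsFactors=FALSE)

# Inner join: only keys present in both tables
inner <- merge(left, right, by="letter")

# Left join: keep every row of "left", filling gaps with NA
left_join <- merge(left, right, by="letter", all.x=TRUE)

# Full outer join: keep every key from either table
full <- merge(left, right, by="letter", all=TRUE)

print(inner)
print(left_join)
print(full)
```

Use all.y=TRUE for the corresponding right join, and by.x / by.y when the key columns have different names in the two data frames.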
match identifies common elements between 2 vectors and returns the positions in the second vector of these matching elements in the order they appear in the first vector.
match(c("B", "E"), LETTERS)
match(c("B", "3", "E"), LETTERS) # returns "NA" if no corresponding match
B[match(A$letter, B$letter), ] # rows of "B" in the same order as "merge(A, B)", but without column "a" and with row names from "B"
x <- rep(LETTERS[1:3], each=3)
match(LETTERS[1:3], x) # "match" *only* returns position of *first* match
match(x, LETTERS[1:3]) # match returns a vector as long as its first argument
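A common use of match is to look up values in a reference table. The following sketch assumes a hypothetical lookup table and vector of codes, neither of which appears in the examples above:

```r
# Hypothetical lookup table mapping codes to labels
lookup <- data.frame(code=c("x", "y", "z"),
                     label=c("ex", "why", "zed"),
                     stringsAsFactors=FALSE)

# Codes to translate; "q" has no entry in the table
codes <- c("z", "x", "q", "x")

# Position of each code in the lookup table (NA where there is no match)
idx <- match(codes, lookup$code)

# Index the label column by those positions
labels <- lookup$label[idx]

print(idx)
print(labels)
```

Because the result is as long as the first argument, labels lines up element by element with codes, and repeated codes are handled naturally.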
To test which elements of one vector are present in another, use %in%:
x[x %in% "B"] <- "b" # change elements of x equal to "B"
#--Alternatively:
x[grep("C", x)] <- "c" # change elements of x containing "C"
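The two approaches above are not quite interchangeable: %in% tests for exact equality, whereas grep matches a pattern anywhere within each element. A short sketch with made-up vectors (v, targets and words are illustrative names):

```r
v <- c("A", "B", "C", "B")
targets <- c("B", "C")

# %in% returns one logical value per element of its left-hand argument
in_set <- v %in% targets

# It is equivalent to asking whether match() found a position
via_match <- match(v, targets, nomatch=0L) > 0L

# Negate to select the elements that are *not* in the set
not_in_set <- v[!(v %in% targets)]

# grep matches substrings, so "CAT" is picked up as well as "C"
words <- c("C", "CAT", "DOG")
exact   <- which(words %in% "C")
partial <- grep("C", words)

print(in_set)
print(not_in_set)
print(exact)
print(partial)
```

For an exact match with grep, the pattern would need to be anchored, e.g. grep("^C$", words).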
intersect(1:5, 3:8) # elements common to both vectors
union(1:5, 3:8) # all distinct elements from either vector
x <- c(1:5, 3:8)
duplicated(x) # logical vector
which(duplicated(x)) # return duplicate element numbers
unique(x) # same as "x[! duplicated(x)]"
Rearranging data structures
To sort a vector:
a <- sample(1:10)
sort(a)
sort(a, decreasing=TRUE) # reverse order
order(a) # the element numbers of "a" in order of the values of "a"
a[order(a)] # same as "sort(a)"
To sort a data frame by the values in one of its columns, use order to specify the row order of the data frame:
A <- data.frame(a=sample(LETTERS[1:5]), b=sample(1:5))
A[order(A$a), ] # } compare
A[order(A$b), ] # }
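order also accepts several vectors, in which case later ones are used to break ties in earlier ones. The sketch below uses a made-up data frame (df, grp and val are illustrative names) to sort by two columns:

```r
df <- data.frame(grp=c("b", "a", "b", "a"),
                 val=c(2, 9, 1, 5),
                 stringsAsFactors=FALSE)

# Sort by "grp", then by "val" within each group (both ascending)
asc <- order(df$grp, df$val)

# Negate a numeric key to sort that key in descending order
desc_val <- order(df$grp, -df$val)

print(df[asc, ])
print(df[desc_val, ])
```

Negation only works for numeric keys; to reverse every key at once, use the decreasing=TRUE argument of order.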
To transpose a matrix, use t:
A <- matrix(1:6, nrow=3)
t(A)
Since t returns a matrix, the equivalent for a data frame is as follows:
#--Create a data frame with column *and* row names:
B <- data.frame(a=1:3, b=LETTERS[1:3], row.names=c("one", "two", "three"))
> as.data.frame(t(B))
one two three
a 1 2 3
b A B C
To rearrange the dimensions of an array with more than 2 dimensions, use aperm:
A <- array(1:12, dim=c(2, 2, 3)) # create a 3d array
aperm(A, perm=1:3) # return original structure
aperm(A, perm=c(1, 3, 2)) # swap 2nd & 3rd dimensions
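The effect of aperm is easier to follow when each dimension has a different length. This sketch (arr, swapped and m are illustrative names) checks how the dimensions and indices move:

```r
# A 3d array whose dimensions all differ, so they are easy to track
arr <- array(1:24, dim=c(2, 3, 4))

# Swap the 2nd & 3rd dimensions
swapped <- aperm(arr, perm=c(1, 3, 2))
print(dim(swapped))

# The element at [i, j, k] in the original moves to [i, k, j]
same_element <- arr[2, 3, 4] == swapped[2, 4, 3]
print(same_element)

# With the default "perm", aperm reverses the dimensions,
# which for a matrix amounts to a transpose
m <- matrix(1:6, nrow=3)
print(aperm(m))
```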
Reshaping data
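First create some multi-column data:
set.seed(123) # allow reproducible random numbers
A <- data.frame(a=letters[1:3], x=rnorm(3), y=runif(3))
> A
  a          x         y
1 a -0.5604756 0.5281055
2 b -0.2301775 0.8924190
3 c  1.5587083 0.5514350
The x and y columns can then be stacked into a single column, with a second column recording which column each value came from: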
> stack(A)
values ind
1 -0.5604756 x
2 -0.2301775 x
3 1.5587083 x
4 0.5281055 y
5 0.8924190 y
6 0.5514350 y
# NB, the "ind" column is now a factor:
> class(stack(A)$ind)
[1] "factor"
The process can be reversed using unstack, but note that column a is lost in the stacking:
> unstack(stack(A))
x y
1 -0.5604756 0.5281055
2 -0.2301775 0.8924190
3 1.5587083 0.5514350
R also provides the function reshape, which converts between so-called long and wide format data (i.e. columns stacked below each other vs. columns arranged beside each other). However, the documentation for reshape is remarkably opaque! A much more convenient function is melt from the excellent reshape package:
install.packages("reshape")
require(reshape) # "melt" is a function in the "reshape" package
melt(A) # retains column "a", unlike "stack(A)"
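For comparison, here is a rough sketch of the same wide-to-long conversion using the base reshape function, with each argument annotated. The data frame and the output column names ("variable" and "value") are chosen for illustration only:

```r
wide <- data.frame(a=letters[1:3],
                   x=c(1.1, 2.2, 3.3),
                   y=c(10, 20, 30),
                   stringsAsFactors=FALSE)

long <- reshape(wide,
                direction="long",
                varying=list(c("x", "y")), # the columns to stack
                v.names="value",           # name of the stacked column
                timevar="variable",        # name of the indicator column
                times=c("x", "y"),         # indicator value for each stacked column
                idvar="a")                 # column identifying each original row

print(long)
```

Unlike stack, this keeps the identifier column a alongside the stacked values, at the cost of a rather more verbose call.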
Truncating and rounding data
set.seed(123) # allow reproducible random numbers
x <- rnorm(20, sd=2) # default mean is zero
round(x) # round to nearest integer
round(x, 1) # round to 1 decimal place
format(x, digits=1) # format to at least 1 significant digit (and convert to character)
trunc(x) # discard the fractional part (round towards zero)
x[round(x) != trunc(x)] # elements of x whose fractional part exceeds 0.5 in magnitude
floor(x) # round down to nearest integer
ceiling(x) # round up to nearest integer
i <- seq(along=x) # vector of x element numbers
plot(i, x) # same as "plot(x)"
abline(h=-4:3, lty=2) # add dashed lines to mark the integers
segments(i, floor(x), i, ceiling(x)) # plot floor/ceiling values
x2 <- pmax(pmin(x, 1), 0) # uses nifty parallel maximum & minimum functions
plot(x, pch=3) # plot original data as "+" symbols
abline(h=c(0, 1), lty=2) # show thresholds as dashed lines
points(x2) # show thresholded data as default hollow points
elms <- x2 %in% c(0, 1) # elements of x2 which have been thresholded
points(i[elms], x2[elms], pch=19) # highlight thresholded points
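The pmax(pmin(...)) idiom used for thresholding can be wrapped in a small helper for reuse. The function name clamp and its arguments are hypothetical, not part of base R:

```r
# Constrain every element of "v" to lie within [lower, upper]
clamp <- function(v, lower, upper) {
  pmax(pmin(v, upper), lower)
}

vals <- c(-2, 0.3, 5)
clamped <- clamp(vals, lower=0, upper=1)
print(clamped)

# Unlike max & min, pmax & pmin work element by element,
# so the result is as long as the input
print(length(clamped) == length(vals))
```

By default missing values are passed through unchanged, since pmin and pmax return NA for any element that is NA.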
Miscellaneous commands
If a data frame contains any missing values (NA), you can exclude the corresponding entire row:
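set.seed(123) # allow reproducible random numbers
A <- data.frame(x=rnorm(10), y=runif(10)) # 10-row data frame, as implied by the output below
A$y[4:9] <- A$x[2] <- NA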
> na.omit(A)
x y
1 -0.5604756 0.8895393
3 1.5587083 0.6405068
10 -0.4456620 0.1471136
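If you need to know which rows are complete, rather than just dropping the incomplete ones, complete.cases returns the corresponding logical vector. A minimal sketch with a made-up data frame (df is an illustrative name):

```r
df <- data.frame(x=c(1, NA, 3, 4),
                 y=c("a", "b", NA, "d"),
                 stringsAsFactors=FALSE)

# TRUE for rows with no missing values in any column
keep <- complete.cases(df)
print(keep)

# Subsetting by this vector gives the same rows as na.omit
clean <- df[keep, ]
print(clean)

# Count the rows that would be discarded
n_dropped <- sum(!keep)
print(n_dropped)
```

The original row names are retained in both cases, which makes it easy to see which rows were removed.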
A list can be flattened into a vector using unlist:
> unlist(list(a=1, b=2:5, c=6))
 a b1 b2 b3 b4  c
 1  2  3  4  5  6
To divide a vector according to a grouping variable, use split to pull out separate list entries for each group:
A <- data.frame(group=LETTERS[rep(1:3, 1:3)], x=rnorm(6)) # 3 groups: "A", "B", "C"
a <- split(A$x, A$group)
> a
$A
[1] 0.6849361
$B
[1] -0.3200564 -1.3115224
$C
[1] -0.5996083 -0.1294107 0.8867361
The reverse operation can be performed using unsplit:
unsplit(a, f=A$group)
unname(unlist(a)) # same result (here, because the rows of "A" are already ordered by group)
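When the groups are interleaved rather than sorted, unsplit and unlist no longer agree, and split becomes handy for computing per-group summaries. A short sketch with made-up data (scores is an illustrative name):

```r
scores <- data.frame(group=factor(c("B", "A", "B", "C", "A")),
                     x=c(2, 10, 4, 7, 20))

# One list entry per group, in the order of the factor levels
by_group <- split(scores$x, scores$group)

# Apply a summary function to each group
group_means <- sapply(by_group, mean)
print(group_means)

# unsplit puts every value back in its original position ...
restored <- unsplit(by_group, f=scores$group)
print(restored)

# ... whereas unlist simply concatenates the groups one after another
concatenated <- unname(unlist(by_group))
print(concatenated)
```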
For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.
Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.