One of the most important aspects of computing with data is the ability to manipulate it, to enable subsequent analysis and visualization. R offers a wide range of tools for this purpose. Note that the plyr package provides an even more powerful and convenient means of manipulating and processing data, which I hope to describe in later updates to this page.


Add and remove data

First create a data frame, then remove a column and create a new one:

A <- data.frame(a=LETTERS[1:5], b=1:5, c=rnorm(5), d=runif(5))
A$d <- NULL     # delete column "d"
A$e <- 1:5      # add in a new column "e"
Now create a second data frame (the last column is simply a random mix of 1 & 2). Note the use of the same column names to see what happens when A & B are joined together:
set.seed(123)     # allow reproducible random numbers
B <- data.frame(a=letters[1:5], b=sample(1:2, size=5, replace=TRUE))
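Note that, in R versions before 4.0.0, the non-numeric columns of both data frames are treated as factors (unless you use stringsAsFactors=FALSE when calling data.frame). From R 4.0.0 onwards the default is stringsAsFactors=FALSE, so these columns stay as character vectors. The output below is from an older version: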
> sapply(A, class)
        a         b         c         e 
 "factor" "integer" "numeric" "integer" 

> sapply(B, class)
        a         b 
 "factor" "integer" 
To join them together, you could use c, but the result will be a list:
c(A, B)    # creates a list
> class(c(A, B))
[1] "list"
You can either convert this list to a data frame, or else use data.frame:
AB1 <- as.data.frame(c(A, B))
AB2 <- data.frame(A, B)
> identical( AB1, AB2 )
[1] TRUE
Compare this to what happens when using cbind:
AB3 <- cbind(A, B)
colnames(AB2)      # } note the
colnames(AB3)      # }  difference
The identical column names for A & B are rendered unambiguous when using data.frame(A, B) or as.data.frame(c(A, B)), by appending .1 to the column names of the 2nd data frame, whereas cbind keeps the duplicated names. The renaming is done using make.unique, which is useful whenever you need to generate unique elements from a vector containing duplicated character strings.
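As a minimal sketch of what make.unique does on its own (the vector name nms is only illustrative), repeated strings receive a numeric suffix while first occurrences are left untouched:

```r
# Column names as they would look after naively joining A and B
nms <- c("a", "b", "c", "e", "a", "b")

# make.unique() appends .1, .2, ... to repeated elements only
unique_nms <- make.unique(nms)
print(unique_nms)

# The separator can be changed with the "sep" argument
underscore_nms <- make.unique(nms, sep = "_")
print(underscore_nms)
```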

do.call

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is an extremely useful function, which can be used to join together data frames stored in a list, for example:

l <- list(first=A[1, ], second=A[2, ], rest=A[-c(1:2), ])
do.call(rbind, l)
This cannot be done by passing the list itself to c or rbind (i.e. c(l) or rbind(l)): both treat l as a single argument, so the rows stored in its components are never combined into one 2 dimensional data frame.
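Other ways to add and extract data, not described in detail here:
   - add columns to a data frame with "$<-", c, or data.frame(A, newdf)
   - subset, transform, and the "[" and "[[" methods
   - rbind, cbind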
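To see why do.call is needed for the rbind example, the following sketch compares passing the list directly with unpacking it. The data frame is a cut-down stand-in for A, and the names direct and joined are illustrative only:

```r
# Small example data frame, split into pieces stored in a list
A <- data.frame(a = LETTERS[1:5], b = 1:5)
l <- list(first = A[1, ], second = A[2, ], rest = A[-(1:2), ])

# Passing the list itself: rbind() sees ONE argument (the list)
direct <- rbind(l)
dim(direct)             # a 1 x 3 matrix whose cells hold the list components

# do.call() unpacks the list, i.e. rbind(first=..., second=..., rest=...)
joined <- do.call(rbind, l)
dim(joined)             # 5 rows, 2 columns
is.data.frame(joined)
```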


Joining data frames

merge is used to perform a database join operation to merge together rows of 2 data frames which share common entries in one or more columns:

A <- data.frame(letter=LETTERS[1:5], a=1:5)
B <- data.frame(letter=LETTERS[sample(10)], x=runif(10))
merge(A, B)   # Return rows with same "letter", combining unique columns from A & B
merge(A, B, all=TRUE)   # see how non-overlapping "letter" values are handled
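The different kinds of join can be sketched with a small fixed example (the data and the names inner, left and full are made up for illustration). The by argument names the key column explicitly, while all.x and all control which unmatched rows are kept:

```r
A <- data.frame(letter = c("A", "B", "C"), a = 1:3)
B <- data.frame(letter = c("B", "C", "D"), x = c(0.2, 0.3, 0.4))

# Inner join: only letters present in both data frames
inner <- merge(A, B, by = "letter")

# Left join: keep every row of A, filling "x" with NA where B has no match
left <- merge(A, B, by = "letter", all.x = TRUE)

# Full outer join: keep every row of both
full <- merge(A, B, by = "letter", all = TRUE)

print(inner)
print(left)
print(full)
```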
match identifies common elements between 2 vectors and returns the positions in the second vector of these matching elements in the order they appear in the first vector.
match(c("B", "E"), LETTERS)
match(c("B", "3", "E"), LETTERS)   # returns NA where there is no match
Using the above example, the rows of B that appear in merge(A, B) can be picked out with match (the result lacks the columns of A):
B[match(A$letter, B$letter), ]   # rows of "B" in the order of A$letter, with row names from "B"
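As a rough sketch of how match can reproduce the columns of a merge, the matched values of x can be bolted onto A. This assumes every letter in A occurs exactly once in B and that A is already sorted by letter, which holds for the example data; idx, manual and merged are illustrative names:

```r
set.seed(1)   # allow reproducible random numbers
A <- data.frame(letter = LETTERS[1:5], a = 1:5)
B <- data.frame(letter = LETTERS[sample(10)], x = runif(10))

idx <- match(A$letter, B$letter)        # row of B corresponding to each row of A
manual <- data.frame(A, x = B$x[idx])   # add the matched "x" values to A

merged <- merge(A, B)
all.equal(manual$x, merged$x)           # the two approaches agree on "x"
```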
Some other examples:
x <- rep(LETTERS[1:3], each=3)
match(LETTERS[1:3], x)   # "match" *only* returns position of *first* match
match(x, LETTERS[1:3])   # match returns a vector as long as its first argument 
A more intuitive version of match is %in%:
x[x %in% "B"] <- "b"      # change elements of x equal to "B"
#--Alternatively:
x[grep("C", x)] <- "c"    # change elements of x containing "C" (grep matches a pattern)
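The difference between the two approaches matters once elements are longer than one character. A small sketch (the vector and result names are illustrative) shows that %in% compares whole elements, whereas grep matches a pattern anywhere within an element unless it is anchored:

```r
x <- c("B", "AB", "C", "b")

exact   <- x %in% "B"                      # whole-element, case-sensitive comparison
via_match <- match(x, "B", nomatch = 0) > 0  # the same test written with match
partial <- grep("B", x)                    # pattern match: also finds "AB"
anchored <- grep("^B$", x)                 # anchored pattern: only the exact "B"

print(exact)
print(partial)
print(anchored)
```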
On a related theme, the following set operators are also useful:
intersect(1:5, 3:8)
union(1:5, 3:8)
and to identify or remove duplicate entries from a vector:
x <- c(1:5, 3:8)
duplicated(x)         # logical vector
which(duplicated(x))  # return duplicate element numbers
unique(x)             # same as "x[! duplicated(x)]"


Rearranging data structures

To sort a vector:
a <- sample(1:10)
sort(a)
sort(a, decreasing=TRUE)  # reverse order
order(a)                  # the element numbers of "a" in order of the values of "a"
a[order(a)]               # same as "sort(a)"
To reorder the rows of a data frame according to the contents of one of its columns you just need to use order to specify the row order of the data frame:
A <- data.frame(a=sample(LETTERS[1:5]), b=sample(1:5))
A[order(A$a), ]   # } compare
A[order(A$b), ]   # }
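order also accepts several keys, which are used in turn to break ties. A minimal sketch with made-up data (columns g and v are illustrative) sorts by group and then by decreasing value within each group:

```r
A <- data.frame(g = c("b", "a", "b", "a"), v = c(2, 4, 1, 3))

# First key: g (ascending); second key: v (descending, via the minus sign)
sorted <- A[order(A$g, -A$v), ]
print(sorted)
```

Negating a key only works for numeric columns; for other types the decreasing argument of order is an alternative.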
To transpose the rows and columns of a matrix:
A <- matrix(1:6, nrow=3)
t(A)
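Since t returns a matrix, the result must be converted back to obtain a data frame (note that columns of mixed type are coerced to a common type, here character, along the way):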
#--Create a data frame with column *and* row names:
B <- data.frame(a=1:3, b=LETTERS[1:3], row.names=c("one", "two", "three"))
> as.data.frame(t(B))
  one two three
a   1   2     3
b   A   B     C
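The coercion that happens when transposing a data frame can be checked directly. In this sketch (N, tB and tN are illustrative names) a mixed-type data frame loses its numeric columns, whereas an all-numeric one survives the round trip:

```r
# Mixed types: t() goes via a character matrix
B <- data.frame(a = 1:3, b = LETTERS[1:3], row.names = c("one", "two", "three"))
tB <- as.data.frame(t(B))
sapply(tB, is.numeric)      # no column is numeric any more

# All-numeric columns: the values stay numeric
N <- data.frame(a = 1:3, b = 4:6, row.names = c("one", "two", "three"))
tN <- as.data.frame(t(N))
sapply(tN, is.numeric)
```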
A more general function for rearranging the dimensions of an array is aperm:
A <- array(1:12, dim=c(2, 2, 3))   # create a 3d array
aperm(A, perm=1:3)                 # return original structure
aperm(A, perm=c(1, 3, 2))          # swap 2nd & 3rd dimensions
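A quick way to confirm what a permutation has done is to compare dimensions and individual elements. In this sketch (P and M are illustrative names), swapping the 2nd and 3rd dimensions means A[i, j, k] moves to P[i, k, j], and for a matrix aperm with perm=c(2, 1) reduces to t:

```r
A <- array(1:12, dim = c(2, 2, 3))   # 3d array, as above
P <- aperm(A, perm = c(1, 3, 2))     # swap 2nd & 3rd dimensions

dim(A)                               # original dimensions
dim(P)                               # 2nd & 3rd extents exchanged
A[2, 1, 3] == P[2, 3, 1]             # same element, indices swapped

M <- matrix(1:6, nrow = 3)
all.equal(aperm(M, c(2, 1)), t(M))   # transposition as a special case
```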


Reshaping data

First create some multi-column data:
set.seed(123)     # allow reproducible random numbers
A <- data.frame(a=letters[1:3], x=rnorm(3), y=runif(3))
> A
  a          x         y
1 a -0.5604756 0.5281055
2 b -0.2301775 0.8924190
3 c  1.5587083 0.5514350
Now stack the columns:
> stack(A)
      values ind
1 -0.5604756   x
2 -0.2301775   x
3  1.5587083   x
4  0.5281055   y
5  0.8924190   y
6  0.5514350   y

# NB, the "ind" column is now a factor:
> class(stack(A)$ind)
[1] "factor"
But note that the column a is lost in the stacking:
> unstack(stack(A))
           x         y
1 -0.5604756 0.5281055
2 -0.2301775 0.8924190
3  1.5587083 0.5514350
There is also a function reshape which converts between so-called long and wide format data (i.e. columns stacked below each other vs. columns arranged beside each other). However, the documentation for reshape is remarkably opaque! A much more convenient function is melt from the excellent reshape package:
install.packages("reshape")
require(reshape)           # "melt" is a function in the "reshape" package
melt(A)                    # retains column "a", unlike "stack(A)"
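If installing an extra package is not an option, the identifying column can be kept with base R alone by stacking only the numeric columns and repeating the identifier alongside them. This is a sketch of one possible approach, not a replacement for melt; value_cols and long are illustrative names:

```r
A <- data.frame(a = letters[1:3], x = c(1, 2, 3), y = c(10, 20, 30))

value_cols <- c("x", "y")            # columns to be stacked

# Repeat the identifier once per stacked column, then bind to the stacked values
long <- data.frame(a = rep(A$a, times = length(value_cols)),
                   stack(A[value_cols]))
print(long)
```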


Truncating and rounding data

  • Create a set of Gaussian-distributed random numbers:
    set.seed(123)            # allow reproducible random numbers
    x <- rnorm(20, sd=2)     # default mean is zero
    round(x)                 # round to nearest integer
    round(x, 1)              # round to 1 decimal place
    format(x, digits=1)      # format to at least 1 significant digit (and convert to character)
    trunc(x)                 # discard the fractional part (round towards zero)
    x[round(x) != trunc(x)]  # elements of x whose fractional part exceeds 0.5 in magnitude
    floor(x)                 # round down to nearest integer
    ceiling(x)               # round up to nearest integer
    
    Show the floor and ceiling values around each point:
    i <- seq(along=x)        # vector of x element numbers
    plot(i, x)               # same as "plot(x)"
    abline(h=-4:3, lty=2)    # add dashed lines to mark the integers
    segments(i, floor(x), i, ceiling(x))   # plot floor/ceiling values
    
    To truncate data above and below some thresholds (e.g. set all values below zero to zero and above 1 to 1):
    x2 <- pmax(pmin(x, 1), 0)   # uses nifty parallel maximum & minimum functions
    
    The result can be visualised as follows:
    plot(x, pch=3)              # plot original data as "+" symbols
    abline(h=c(0, 1), lty=2)    # show thresholds as dashed lines
    points(x2)                  # show thresholded data as default hollow points
    elms <- x2 %in% c(0, 1)     # elements of x2 which have been thresholded
    points(i[elms], x2[elms], pch=19)   # highlight thresholded points
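The rounding functions above differ mainly for negative values and exact halves, which a few hand-picked numbers make visible. The sketch below also wraps the pmax/pmin idiom in a small helper; clamp is a hypothetical name, not a base R function:

```r
v <- c(-1.7, -0.5, 0.5, 1.5, 2.5)

tr <- trunc(v)      # towards zero
fl <- floor(v)      # towards minus infinity
ce <- ceiling(v)    # towards plus infinity
ro <- round(v)      # to nearest; exact halves go to the even integer
print(rbind(v, tr, fl, ce, ro))

# Hypothetical helper: restrict values to the interval [lower, upper]
clamp <- function(x, lower, upper) pmax(pmin(x, upper), lower)
clamped <- clamp(c(-2, 0.3, 5), lower = 0, upper = 1)
print(clamped)
```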
    

  • Miscellaneous commands

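    If a data frame contains any missing values (NA), you can exclude the corresponding entire row with na.omit:

    set.seed(123)                              # allow reproducible random numbers
    A <- data.frame(x=rnorm(10), y=runif(10))  # (assumed setup: 10 rows, 2 numeric columns)
    A$y[4:9] <- A$x[2] <- NA                   # blank out row 2 of "x" and rows 4-9 of "y"
    > na.omit(A)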
                x         y
    1  -0.5604756 0.8895393
    3   1.5587083 0.6405068
    10 -0.4456620 0.1471136
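The same rows can be selected with complete.cases, which returns the logical vector that na.omit effectively applies. A short sketch on made-up data (keep and cleaned are illustrative names) also shows where na.omit records the dropped rows:

```r
A <- data.frame(x = c(1, NA, 3, 4), y = c(NA, 2, 3, 4))

keep <- complete.cases(A)         # TRUE for rows without any NA
A[keep, ]                         # same rows as na.omit(A)

cleaned <- na.omit(A)
attr(cleaned, "na.action")        # row numbers that were removed

na_counts <- colSums(is.na(A))    # number of missing values per column
print(na_counts)
```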
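    
    To flatten a list into a single vector, use unlist (the element names are built from the list names):
    > unlist(list(a=1, b=2:5, c=6))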
     a b1 b2 b3 b4  c 
     1  2  3  4  5  6 
    
    When dealing with long format data, where a vector of values has an associated grouping vector, you can use split to pull out separate list entries for each group:
    A <- data.frame(group=LETTERS[rep(1:3, 1:3)], x=rnorm(6))  # 3 groups: "A", "B", "C"
    a <- split(A$x, A$group)
    > a
    $A
    [1] 0.6849361
    
    $B
    [1] -0.3200564 -1.3115224
    
    $C
    [1] -0.5996083 -0.1294107  0.8867361
    
    You can reverse the splitting with unsplit:
    unsplit(a, f=A$group)
    unname(unlist(a))       # same result here, as the rows of A are already ordered by group
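When the groups are interleaved, unsplit and unlist no longer agree: unsplit restores the original element order, whereas unlist simply concatenates the groups. A minimal sketch with made-up data (means is an illustrative name) also shows a typical use of the split list, namely a per-group summary:

```r
A <- data.frame(group = c("B", "A", "B", "C"), x = c(1, 2, 3, 4))
a <- split(A$x, A$group)

restored  <- unsplit(a, f = A$group)   # original order of A$x
flattened <- unname(unlist(a))         # group by group: A, then B, then C

means <- sapply(a, mean)               # one value per group

print(restored)
print(flattened)
print(means)
```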
    

    For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.

    Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.


    Copyright © 2010-2013 Alastair Sanderson