This section deals with the basic structures R uses to store data and how to assemble them, as well as how to get data into and out of R.

# Data structures in R

All R objects have a type or mode, as well as a class, which can be determined with `typeof`, `mode` & `class`.
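As a brief illustration (a minimal sketch; the object names `x`, `m` and `f` are made up for this example), the three functions can give different answers for the same object:

```r
x <- c(1.5, 2, 3)            # a plain numeric vector
typeof(x)                    # "double": the internal storage type
mode(x)                      # "numeric"
class(x)                     # "numeric"

m <- matrix(1:4, nrow = 2)   # integers arranged in 2 dimensions
typeof(m)                    # "integer"
class(m)                     # "matrix" "array" in R >= 4.0.0 (just "matrix" before)

f <- factor(c("a", "b"))
typeof(f)                    # "integer": factors are stored as integer codes
class(f)                     # "factor"
```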

• Vector

Vectors are the basic structure and come in the following atomic modes (data types):

numeric, integer, character, logical, complex, raw

These modes have corresponding functions which test whether an object is of that mode (`is.*`, e.g. `is.numeric`) and convert an object to that mode (`as.*`, e.g. `as.character`).
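A short sketch of testing and converting modes follows; the vector `v` is invented for illustration. Values that cannot be interpreted in the target mode become `NA`:

```r
v <- c("1", "2", "three")    # a character vector
is.character(v)              # TRUE
is.numeric(v)                # FALSE

# Conversion yields NA (with a warning) where a value cannot be interpreted
n <- suppressWarnings(as.numeric(v))
n                            # 1 2 NA

as.logical(c("TRUE", "no", "T"))   # TRUE NA TRUE
as.integer(3.9)              # 3: truncates towards zero rather than rounding
```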

You can assemble and combine vectors using the often-used function `c`. Note that vectors must consist of values of the same data type:

```
c(1, "a", TRUE)      # all values coerced to character
list(1, "a", TRUE)   # preserves different types (see below)
```
• Factors

Factors encode categorical data, and are a useful and efficient way of handling categories with multiple entries. Note that before R 4.0.0, R coerced character data to factors by default in several functions (e.g. `read.table` and `data.frame`); since R 4.0.0 the default is `stringsAsFactors = FALSE`. The corresponding test and conversion functions are `is.factor` & `as.factor`.

```
chars <- strsplit("the cat sat on the mat", "")[[1]] # create vector of characters
chars <- factor(chars)     # convert from character to a factor
levels(chars)              # show factor levels (i.e. different letters here)
plot(chars)                # show barchart of factor level frequencies
levels(chars)[1] <- "_"    # replace whitespaces with underscores
paste(chars, collapse="")  # collapse to a single character string
```
One thing to watch out for with factors is converting them to numeric mode. Factors are actually stored as a vector of integer codes, each referring to the position of an element in the vector of factor levels. In the following example, there are 3 levels ("100", "200" & "300"), which are represented as characters, and the numeric values of the factor comprise the integers 1-3, referring to the elements of the vector of levels.
```
Xvector <- c(1, 2, 2, 3, 3, 3) * 100
Xfactor <- factor(Xvector)
levels(Xfactor)       # show levels, which are "100" "200" "300"
as.numeric(Xfactor)   # reports "1 2 2 3 3 3" - the elements of the levels vector
x <- as.numeric(levels(Xfactor)[Xfactor])     # retrieve actual numeric values
identical(x, Xvector)     # same as original numeric vector
```
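The same pitfall can be shown with a smaller example. This is a sketch using a made-up factor `sizes`; it compares the integer codes with two common ways of recovering the original values:

```r
sizes <- factor(c("10", "2.5", "10", "40"))   # hypothetical example data
levels(sizes)                        # "10" "2.5" "40" (sorted as character strings)
codes <- as.integer(sizes)           # integer codes 1 2 1 3: NOT the values

via_char   <- as.numeric(as.character(sizes))   # convert labels, then to numeric
via_levels <- as.numeric(levels(sizes))[sizes]  # convert each level only once
identical(via_char, via_levels)      # TRUE
```

Both conversions give the same result; the second converts each level only once, which may be slightly faster for long factors with few levels.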
• Matrix/arrays

Matrices are 2-dimensional arrays, which are themselves generalisations of a vector to more than 1 dimension. The corresponding test and conversion functions are `is.matrix`, `is.array` & `as.matrix`, `as.array`.

```
M <- matrix(1:12, nrow=3)      # create a matrix with 3 rows & 4 columns
dim(M)        # show dimensions
M[2, 3]       # print element in 2nd row & 3rd column
2 * matrix(rep(1, 12), nrow=3)     # multiply every element by a constant

A <- array(1:12, dim=c(2, 2, 3))   # create a 3d array
A[1, 2, 1]     # print single element
A[1, , ]       # print a matrix subset
```
Arrays are actually stored in a 1 dimensional structure, so you can still access their elements with a single subscript:
```
A[5]
```
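To make the single-subscript behaviour concrete, here is a minimal sketch that rebuilds the 3d array from the example above and relates the linear index to its (row, column, slice) position. The first index varies fastest (column-major order):

```r
A <- array(1:12, dim = c(2, 2, 3))   # same 3d array as in the example above
A[5]                                 # 5th stored value
A[1, 1, 2]                           # the same element, addressed by 3 subscripts
pos <- arrayInd(5, dim(A))           # convert linear index 5 to (row, column, slice)
pos
flat <- as.vector(A)                 # the underlying 1-dimensional storage
```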
• List

Lists are used to store data of any type or dimensions in a free-form structure. The corresponding test and conversion functions are `is.list` & `as.list`.

```
l <- list(functions=c(mean, median), chars=month.abb, numbers=rnorm(7))
l$chars       # print "chars" element
l[2]          # print 2nd element *as a single-item list*
l[[2]]        # print element as a *vector*
l["chars"]       # } compare and
l[["chars"]]     # }  contrast
```
To assemble a list cumulatively, e.g. in a loop:
```
l <- list()   # create empty list
for (i in 1:3) l[[i]] <- LETTERS[i]   # add one element per iteration
```
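When the final length is known in advance, the list can be pre-allocated or built in one step with `lapply`. The following is a sketch only; the names `squares` and `letters_list` are hypothetical:

```r
n <- 3
squares <- vector("list", n)          # pre-allocate a list of n NULL elements
for (i in seq_len(n)) squares[[i]] <- i^2

letters_list <- lapply(1:3, function(i) LETTERS[i])    # same idea without a loop
names(letters_list) <- c("first", "second", "third")   # hypothetical names
letters_list$second                   # "B"
```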
• Data frame

Data frames are widely used in R to store data in a variety of formats with related entries in each row and different attributes in each column, much like a table or spreadsheet. A data frame is essentially a special type of list whose elements (the columns) all have the same length, so its elements can be accessed in the same way as for a list. The corresponding test and conversion functions are `is.data.frame` & `as.data.frame`.

```
A <- data.frame(a=LETTERS[1:4], b=1:4, c=c(TRUE, TRUE, FALSE, TRUE))
sapply(A, class)    # show data types for each column
A$b^2               # perform arithmetic on a numeric column as a vector
dim(A); nrow(A); ncol(A)     # show dimensions of data frame (rows, columns)
A[1, ]              # print first row
A[, 2]              # print 2nd column

as.list(A)          # convert to a list
# Note that matrices must contain data of the same type, so the following
#  command converts all the values to character format:
as.matrix(A)        # convert to a matrix
```
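Because a data frame is a list of columns, the list-style accessors shown earlier apply directly. A minimal sketch, using a made-up data frame `df`:

```r
df <- data.frame(id = 1:3, score = c(2.5, 4, 3.5))   # hypothetical example data
is.list(df)                 # TRUE
length(df)                  # 2: the number of columns
df[["score"]]               # column as a vector, like l[["chars"]]
df["score"]                 # one-column data frame, like l["chars"]
col_means <- sapply(df, mean)   # apply a function to every column
col_means
```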
Data frames can have both row and column names (the default row names are the row numbers). Row names behave like the names of a named vector, and named vectors passed to `data.frame` supply the row names, as seen in the following example:
```
# create separate, named vectors of data:
planets.mass <- c("Mercury"=0.33, "Venus"=4.87, "Earth"=5.98, "Mars"=0.64,
"Jupiter"=1899, "Saturn"=569, "Uranus"=87, "Neptune"=102, "Pluto"=0.13) * 1e24

planets.semimajoraxis <- c("Mercury"=57.9, "Venus"=108, "Earth"=150,
"Mars"=228, "Jupiter"=778, "Saturn"=1430, "Uranus"=2870, "Neptune"=4500,
"Pluto"=5900) * 1e9

# Now create a data frame:
planets <- data.frame(mass=planets.mass, semimajoraxis=planets.semimajoraxis)
planets["Earth", ]         # show all data for the Earth
planets["Mars", "mass"]    # show the mass of Mars; same as planets[4, 1]
rownames(planets) <- paste("planet", 1:9)     # change row names
dimnames(planets); rownames(planets); colnames(planets)     # show info
```
Data frames can be filtered, extended and used directly in expressions:
```
subset(planets, mass > mean(mass))
subset(planets, mass > 1e24 & semimajoraxis < 1e12 )
# Adding new columns to the data frame:
planets <- transform(planets, log10mass = log10(mass),
wibble = mass * semimajoraxis)

# You can access the columns without including the data frame name, using with:
with(planets, mass^2 + 3 * semimajoraxis)
# which is more convenient than:
planets$mass^2 + 3 * planets$semimajoraxis

# Similarly, you can often access column data within other functions
# e.g. plotting with a data frame:
plot( semimajoraxis ~ mass, data=planets, log="xy")
```
Columns can be excluded from a data frame by reference to the column number or name:
```
A <- transform(planets, dummy = seq_len(nrow(planets)))   # add an extra column
A[, -ncol(A)]                           # exclude extra (last) column by number
A[, -c(2:3)]                            # exclude multiple columns by number
subset(A, select = -dummy)              # exclude extra column by name
subset(A, select = -c(dummy, mass))     # exclude multiple columns by name
```
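Columns can also be excluded by name with ordinary indexing, which may be preferable inside functions where the non-standard evaluation used by `subset` is awkward. This is a sketch with a made-up data frame `df` and a hypothetical vector `unwanted`:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(0.1, 0.2, 0.3))  # made-up data
unwanted <- c("y", "z")                                  # columns to exclude
kept <- df[, !(names(df) %in% unwanted), drop = FALSE]   # keep all other columns
names(kept)                                              # "x"
no_z <- df[setdiff(names(df), "z")]                      # exclude a single column
df$z <- NULL                                             # remove a column in place
names(df)                                                # "x" "y"
```

Here `drop = FALSE` keeps the result as a data frame even when only one column remains.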

# Data input/output in R

For a basic introduction, see getting started. See also the R Data Import/Export manual.

R recognises a variety of formats for reading in data. For tabular data, the basic command `read.table` offers a powerful range of options; it also underlies the shortform commands `read.csv` and `read.delim`, which read comma-separated values (e.g. output from a spreadsheet) and tab-delimited data, respectively. Similarly, the command `write.table` is used to output tabular data.
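A minimal round-trip sketch follows, writing a small made-up data frame to a temporary file and reading it back. The data and the names `df`, `path` and `back` are assumptions for illustration only:

```r
df <- data.frame(name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)      # made-up data
path <- tempfile(fileext = ".csv")              # temporary file, removed below

write.table(df, file = path, sep = ",", row.names = FALSE, quote = FALSE)
readLines(path)                                 # inspect the raw text written

back <- read.csv(path, stringsAsFactors = FALSE)   # read.table with CSV defaults
str(back)
same <- all.equal(df, back)                     # TRUE if the round trip preserved the data
unlink(path)                                    # clean up
```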

For fixed-width format data, use `read.fwf`. A more powerful method is to read in data directly into a vector or list, using `scan`. The following are useful functions for reading and writing a variety of data types. See their respective help pages for details.

• `source` : read in R commands from a file *ideal for loading pre-written chunks of code*
• `save ; load` : write / read R objects to / from a file (see below) *ideal for storing R data*
• `scan`: basic core function to read in data into a list/vector
• `read.table ; write.table` : generic table-format data
• `read.csv` : comma-separated values data (e.g. exported from spreadsheet)
• `read.fwf` : fixed-width format data
• `read.fortran` : fixed-format data files using Fortran-style format specifications
• `read.DIF` : Data Interchange Format (DIF) for data frames from single spreadsheets
• `read.dcf` : Debian Control File format
• `read.ftable / write.ftable` : flat contingency tables
• `readBin ; writeBin` : binary data
• `readChar ; writeChar` : character strings
• `readLines ; writeLines`: lines
• `write` : write data to a file
• `dump` : write text representation of an object
• `dget ; dput` : read / write a text (ASCII) representation of an R object
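A few of the functions listed above can be sketched together as follows. The file contents and the object `obj` are invented for the example, and temporary files are used so that nothing is left behind:

```r
path <- tempfile(fileext = ".txt")          # temporary file for the example
writeLines(c("3 1 4", "1 5 9"), path)       # write two lines of text
readLines(path)                             # read them back as character strings
nums <- scan(path, quiet = TRUE)            # parse all whitespace-separated numbers
nums                                        # 3 1 4 1 5 9

obj <- list(id = 1L, tags = c("x", "y"))    # made-up object
rpath <- tempfile(fileext = ".R")
dput(obj, file = rpath)                     # write a text representation
restored <- dget(rpath)                     # recreate the object from the file
identical(restored, obj)                    # TRUE
unlink(c(path, rpath))                      # clean up
```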

# Other packages for R data input / output

There are a number of separate packages for reading and writing data in different formats. The following are some common examples; see the R Data Import/Export manual for more information.

• `library(help="foreign") # Minitab, S, SAS, SPSS, Stata, Systat, dBase, Octave format`
• RODBC package : for database sources supporting an ODBC interface
• gdata package : various tools, e.g. `read.xls` for reading data from Excel
• xtable package : Export tables to LaTeX or HTML

# Entering & editing data within R

• `data.entry ; de` : convenient GUI tools for entering data
• `edit` : use text editor to modify an R object
• `fix` : invoke `edit` to change & overwrite an R object

# Saving & loading R objects

• `save` writes an external representation of R objects to the specified file; these can then be loaded back into R using `load`, e.g.
```
a <- 1:10; b <- a^2
save(a,b,file="mydata.RData")
rm(a,b)                        # Remove (delete) objects
load("mydata.RData")           # Load data into R
tmp <- load("mydata.RData")
tmp                            # Lists names of objects in file
[1] "a" "b"
```
• At any time you can save the history of commands using `savehistory(file="my.Rhistory")`
• and you can load such commands using `loadhistory(file="my.Rhistory")`
• `ls` & `objects` list the objects currently defined
• `apropos` finds objects with names containing the specified string, e.g.
```
apropos("max")
[1] "cummax"    "max"       "max.col"   "pmax"      "pmax.int"  "promax"
[7] "varimax"   "which.max"
```
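Returning to `save` and `load`: since `load` restores objects under their original names, it can silently overwrite objects already in the workspace. One way to avoid this, sketched below with a temporary file and a hypothetical environment `e`, is to load into a separate environment via the `envir` argument:

```r
a <- 1:10; b <- a^2
path <- tempfile(fileext = ".RData")   # temporary file instead of "mydata.RData"
save(a, b, file = path)

e <- new.env()                         # separate environment to load into
loaded <- load(path, envir = e)        # objects go into e, not the workspace
loaded                                 # "a" "b"
ls(e)                                  # names of the objects now held in e
identical(e$b, b)                      # TRUE
unlink(path)                           # clean up
```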

For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.

Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.


Copyright © 2010-2013 Alastair Sanderson