R provides a variety of methods for summarising data in tabular and other forms.

# View data structure

- Before you do anything else, it is important to understand the structure of your data and that of any objects derived from it.

      A <- data.frame(a=LETTERS[1:10], x=1:10)
      class(A)          # "data.frame"
      sapply(A, class)  # show classes of all columns
      typeof(A)         # "list"
      names(A)          # show list components
      dim(A)            # dimensions of object, if any
      head(A)           # extract first few (default 6) parts
      tail(A, 1)        # extract last row
      head(1:10, -1)    # extract everything except the last element
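
As a complement to the calls above, `str()` gives a compact overview of an object in a single call. A minimal sketch, reusing the same small data frame:

```r
A <- data.frame(a=LETTERS[1:10], x=1:10)

str(A)              # compact overview: dimensions, column names, types, first values
nrow(A); ncol(A)    # number of rows and columns
is.list(A)          # TRUE: a data frame is a list of equal-length columns
length(A)           # 2: the length of a data frame is its number of columns
```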

- It is sometimes useful to work with a smaller version of a large data frame by creating a representative subset of the data via random sampling:

      A.small <- A[sample(nrow(A), 4), ]  # select 4 rows at random
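
Because `sample()` draws different rows on each call, a random subset is only reproducible if the random seed is fixed first. A minimal sketch, reusing the 10-row data frame from above; the seed value 42 is arbitrary, and sorting the sampled indices is an optional choice that keeps the rows in their original order:

```r
A <- data.frame(a=LETTERS[1:10], x=1:10)

set.seed(42)                      # arbitrary seed, fixes the random draw
idx <- sort(sample(nrow(A), 4))   # 4 distinct row numbers, in ascending order
A.small <- A[idx, ]               # the subset keeps all columns

nrow(A.small)                     # 4
set.seed(42)
identical(idx, sort(sample(nrow(A), 4)))  # TRUE: same seed, same rows
```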

# Basic numerical summaries

- Generate and summarise some random numbers:

      a <- rnorm(50)
      summary(a)              # gives min, max, mean, median, 1st & 3rd quartiles
      min(a); max(a)          # }
      range(a)                # } self-explanatory
      mean(a); median(a)      # }
      sd(a); mad(a)           # standard deviation, median absolute deviation
      IQR(a)                  # interquartile range
      quantile(a)             # quartiles (by default)
      quantile(a, c(1, 3)/4)  # specific percentiles (25% & 75% in this case)
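
Some of these spread measures are directly related to each other. The sketch below (with an arbitrary seed, used only for reproducibility) checks that `IQR()` is the difference between the 3rd and 1st quartiles, and that `mad()` by default multiplies the raw median absolute deviation by 1.4826, so that it is comparable to the standard deviation for normally distributed data:

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
a <- rnorm(50)

q <- quantile(a, c(0.25, 0.75))
all.equal(IQR(a), unname(diff(q)))        # TRUE

# mad() scales the raw median absolute deviation by 1.4826 by default
raw.mad <- median(abs(a - median(a)))
all.equal(mad(a), 1.4826 * raw.mad)       # TRUE
all.equal(mad(a, constant=1), raw.mad)    # TRUE: constant=1 disables the scaling
```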

- Data frame summaries:

      A <- data.frame(a=rnorm(10), b=rpois(10, lambda=10))
      summary(A)         # summarise data frame
      apply(A, 1, mean)  # calculate row means: same as "rowMeans(A)"
      apply(A, 2, mean)  # calculate column means: same as "colMeans(A)"
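
Note that `apply()` converts a data frame to a matrix first, so it only behaves as expected when every column is numeric. A minimal sketch, assuming a hypothetical extra character column `id` (not part of the example above), that restricts the calculation to the numeric columns:

```r
set.seed(1)   # arbitrary seed, for reproducibility only
# 'id' is a hypothetical non-numeric column added for illustration
A <- data.frame(id=letters[1:10], a=rnorm(10), b=rpois(10, lambda=10))

num <- sapply(A, is.numeric)   # logical vector: which columns are numeric
colMeans(A[num])               # column means of the numeric columns only
sapply(A[num], mean)           # same values, computed column by column
rowMeans(A[num])               # row means across the numeric columns
```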

- `which.min` & `which.max` return the element number of the lowest/highest value:

      set.seed(123)  # allow reproducible random numbers
      x <- sample(10)
      > which.max(x)
      [1] 7
      > x[which.max(x)]
      [1] 10

  The position returned (7 here) depends on the R version, because the default algorithm used by `sample()` changed in R 3.6.0; the maximum value itself is always 10.

  This can be used in a data frame to extract the corresponding row containing the min/max value of one of the columns:

      A <- data.frame(x=rnorm(10), y=runif(10))
      A[which.min(A$x), ]
      #--Alternatively:
      subset(A, x == min(x))
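
The two approaches differ when the extreme value occurs more than once: `which.min()` and `which.max()` return only the first matching position, whereas a comparison against `min()` or `max()` picks up every tie. A small sketch with a hand-made vector:

```r
x <- c(5, 2, 8, 2, 8)

which.min(x)         # 2: position of the first minimum only
which(x == min(x))   # 2 4: positions of all minima
which(x == max(x))   # 3 5: positions of all maxima
```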

- Other summaries:

      x <- rnorm(100)
      fivenum(x)  # Tukey's five number summary, used to construct a boxplot:
      boxplot(x)  # see ?boxplot.stats for more details
      stem(x)     # a stem-and-leaf plot

- Matrix summaries:

      A <- matrix(rnorm(50), nrow=10)  # create 10x5 random number matrix
      colSums(A); rowSums(A); colMeans(A); rowMeans(A)  # self-explanatory
      max.col(A)  # maximum position for each row of a matrix (ties broken at random), same as:
      which.max(A[1,]); which.max(A[2,])  # etc.
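
The equivalence between `max.col()` and row-wise `which.max()` can be checked directly. In this sketch (arbitrary seed), `ties.method="first"` is set explicitly so that any ties are resolved the same way as `which.max()`, rather than at random as `max.col()` does by default:

```r
set.seed(1)   # arbitrary seed, for reproducibility only
A <- matrix(rnorm(50), nrow=10)

by.row <- apply(A, 1, which.max)                   # which.max() applied to each row
all(max.col(A, ties.method="first") == by.row)     # TRUE
```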

# Tables

- Load some data on a sample of 20 galaxy clusters with a categorical classification status (`cctype`) indicating whether there is a cool core or not and a factor (`det`) specifying which of two detectors was used to make the X-ray observation of the cluster:

      file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt"
      a <- read.table(file, header=TRUE, sep="|")
      table(a$cctype)                                 # count numbers in each cctype category
      table(a$cctype, a$det)                          # 2-way table
      xtabs(~ cctype + det, data=a)                   # alternative (formula) syntax
      addmargins(xtabs(~ cctype + det, data=a))       # add row/col summary (default is sum)
      prop.table(xtabs(~ cctype + det, data=a))       # show counts as proportions of total
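
Proportions can also be calculated within rows or columns rather than over the whole table, using the `margin` argument of `prop.table()`. The sketch below uses a small made-up data frame `d` so that it runs without the download; its values are invented for illustration and are not the cluster sample:

```r
# Made-up data for illustration only (NOT the galaxy cluster sample)
d <- data.frame(cctype=c("CC", "CC", "non-CC", "CC", "non-CC", "non-CC"),
                det=c("I", "S", "I", "S", "I", "I"))

tab <- xtabs(~ cctype + det, data=d)
tab                          # counts for each cctype/det combination
prop.table(tab, margin=1)    # proportions within each row (each row sums to 1)
prop.table(tab, margin=2)    # proportions within each column
addmargins(prop.table(tab))  # proportions of the grand total, with margins added
```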

- To test whether the input factors are independent of each other:

      chisq.test(xtabs(~ det + cctype, data=a), simulate.p.value=TRUE)

  There is marginal evidence (p=0.07) of an interaction: clusters observed with ACIS-S are more likely to have a cool core than not.
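
With `simulate.p.value=TRUE` the p-value is estimated by Monte Carlo simulation, so it changes slightly from run to run unless the random seed is fixed. The following sketch uses made-up 2x2 counts (not the cluster data) and an arbitrary seed; `B` sets the number of simulated tables, and `fisher.test()` is shown as one possible exact alternative for small tables:

```r
# Made-up 2x2 counts for illustration only (NOT the cluster data)
tab <- matrix(c(8, 2, 3, 7), nrow=2,
              dimnames=list(cctype=c("CC", "non-CC"), det=c("I", "S")))

set.seed(1)   # arbitrary seed
p1 <- chisq.test(tab, simulate.p.value=TRUE, B=10000)$p.value
set.seed(1)
p2 <- chisq.test(tab, simulate.p.value=TRUE, B=10000)$p.value
identical(p1, p2)          # TRUE: same seed, same estimate

fisher.test(tab)$p.value   # exact test, no simulation needed
```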

# Calculate aggregate statistics

- Calculate numerical summaries for subsets of a data frame (using the above dataset):

      > aggregate(kT ~ cctype, data=a, FUN=mean)
        cctype       kT
      1     CC 5.121111
      2 non-CC 6.146364
      # mean cluster redshift of each cctype for each detector:
      > aggregate(z ~ cctype + det, data=a, FUN=mean)
        cctype det          z
      1     CC   I 0.06070000
      2 non-CC   I 0.05137500
      3     CC   S 0.04105714
      4 non-CC   S 0.03636667
      #--Show mean values of a few quantities, for each cctype:
      aggregate(. ~ cctype, data=a[c("cctype", "z", "kT", "Z", "S01", "index")], mean)

- You can also apply multi-number summaries:

      > aggregate(index ~ cctype, data=a, FUN=range)
        cctype index.1 index.2
      1     CC   0.714   1.120
      2 non-CC   0.283   0.944
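
When the summary function returns more than one value, `aggregate()` stores the results in a single matrix column, even though it prints as `index.1` and `index.2`. A minimal sketch, using made-up data (not the cluster sample), of how to expand that matrix into ordinary columns with `do.call(data.frame, ...)`:

```r
# Made-up data for illustration only (NOT the cluster sample)
d <- data.frame(cctype=rep(c("CC", "non-CC"), each=3),
                index=c(0.9, 1.1, 0.7, 0.3, 0.5, 0.9))

res <- aggregate(index ~ cctype, data=d, FUN=range)
names(res)       # "cctype" "index": both range values share one matrix column
dim(res$index)   # 2 2

flat <- do.call(data.frame, res)   # expand the matrix into ordinary columns
names(flat)                        # "cctype" "index.1" "index.2"
flat$index.2 - flat$index.1        # width of the range for each cctype
```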

For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data using R.

Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source code used to create them.