Data visualization and outlier detection
Any dataset can contain outliers. Because outliers can distort statistical analysis, they should be identified and, where they turn out to be data errors, excluded before the analysis is run. There are multiple ways to do that. One way is to draw a box-and-whisker plot and pick out the outliers visually.
Basic concept
What do the box and whiskers represent in a box-and-whisker plot? The box spans the first quartile (Q1) to the third quartile (Q3), so its length is the interquartile range, Q3 - Q1. Each whisker extends from the edge of the box to the most extreme data point that lies within 1.5 times (Q3 - Q1) of that edge. Outliers are the points beyond the whiskers: those more than 1.5 times (Q3 - Q1) below Q1 or above Q3.
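The rule above can be checked by hand with a few lines of R. This is a minimal sketch on made-up values (the vector x is hypothetical, not taken from the seed weight file). It uses quantile() for the quartiles, whereas R's box plot functions compute the box from hinges, so the fences may differ slightly on some datasets.

```r
# Hypothetical seed weights for a single genotype (made-up values)
x <- c(20, 21, 22, 23, 24, 25, 26, 27, 60)

# Quartiles and the length of the box (interquartile range)
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

Here the box runs from 22 to 26, the fences sit at 16 and 32, and only the value 60 falls outside them.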
Example
One dataset contains several seed weight samples for each genotype in an experiment with a large number of varieties. For a given genotype, a box-and-whisker plot can be drawn and the outliers found visually. Here we show how to do this in R, using the lattice library to generate a separate box-and-whisker plot for each genotype. Suppose the dataset is stored in the file "C:/seedweight.csv", with seed weight in a column called VALUE and genotype in a column called GENOTYPE. Here are the R commands to generate the plots with bwplot.
library("lattice")
# Read the seed weight data (columns VALUE and GENOTYPE)
seedweight <- read.csv("C:/seedweight.csv")
# One box-and-whisker panel per genotype
bwplot(~ VALUE | factor(GENOTYPE), data = seedweight)
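With many genotypes, it can help to list the candidate outliers as a table alongside the plots. The sketch below is one possible way to do that, not part of the original commands. It builds a small made-up data frame in place of seedweight.csv, and the helper is_outlier is a hypothetical name introduced here. It applies the 1.5 times (Q3 - Q1) rule within each genotype using quantile(), so borderline points may not match the plotted ones exactly.

```r
# Hypothetical stand-in for seedweight.csv (made-up values)
seedweight <- data.frame(
  GENOTYPE = rep(c("A", "B"), each = 6),
  VALUE = c(30, 31, 32, 33, 34, 80,   # 80 is a planted outlier
            40, 41, 42, 43, 44, 45)
)

# TRUE where a value lies more than 1.5 * (Q3 - Q1) outside the quartiles
is_outlier <- function(v) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

# ave() applies the rule within each genotype and keeps the row order
flag <- as.logical(ave(seedweight$VALUE, seedweight$GENOTYPE, FUN = is_outlier))

# Rows to inspect by eye before deciding whether to exclude them
flagged <- seedweight[flag, ]
print(flagged)
```

The flagged rows are only candidates. Whether a point is a data error is still a judgment made by looking at the plot and the experiment records.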
Second method
There is an individual leaf area dataset from corn plants. The relationship between leaf number and leaf area within an individual plant is curvilinear, with a well-defined bell shape. Because the relationship is known, it can be used to detect data errors and outliers. Suppose the dataset is stored in the file "C:/indla.csv", which has three columns: id (plant identifier), ln (leaf number), and la (leaf area). Here are the R commands to generate a scatter plot for each individual plant.
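library("lattice")
# Read the individual leaf area data (columns id, ln, la)
la <- read.csv("C:/indla.csv")
# One panel per plant; type = "b" draws both points and connecting lines
xyplot(la ~ ln | id, data = la, type = "b")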
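When there are many plants, a rough numeric screen can suggest which panels to look at first. The sketch below is a simple heuristic introduced here, not the method described above. It assumes a bell-shaped profile rises to a single peak and then falls, so it counts how often the leaf area switches between rising and falling along the leaf number. The data frame indla, its values, and the helper count_turns are all hypothetical. Real measurements are noisy, so a flagged plant is only a prompt to inspect its plot.

```r
# Hypothetical stand-in for indla.csv: two plants, nine leaves each (made-up values)
clean <- c(50, 120, 250, 400, 520, 480, 350, 200, 90)
typo  <- replace(clean, 3, 25)   # leaf 3 entered as 25 instead of 250
indla <- data.frame(
  id = rep(c(1, 2), each = 9),
  ln = rep(1:9, times = 2),
  la = c(clean, typo)
)

# Number of times the profile switches between rising and falling
count_turns <- function(y) {
  s <- sign(diff(y))
  s <- s[s != 0]          # ignore flat steps
  sum(diff(s) != 0)
}

# Put leaves in order within each plant before differencing
indla <- indla[order(indla$id, indla$ln), ]
turns <- tapply(indla$la, indla$id, count_turns)

# A single-peaked profile turns once; more turns mark a plant worth inspecting
suspect_ids <- names(turns)[turns > 1]
print(suspect_ids)
```

In this made-up example plant 1 turns once, at its peak, while the mistyped leaf in plant 2 adds two extra turns, so only plant 2 is listed.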
