[R] Help with PCA data file prep and R code

Albin Blaschka albin.blaschka at standortsanalyse.net
Wed Sep 28 15:51:55 CEST 2016


Hi,

maybe the package vegan with its tutorials is a good starting point, 
too...

http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

http://cc.oulu.fi/~jarioksa/opetus/metodi/sessio2.pdf

all the best,
Albin

Am 22.09.2016 10:23 PM, schrieb David L Carlson:
> Looking at your data there are several issues.
> 
> 1. Tank is an integer, but it sounds like you intend to use it as a
> categorical measure. If so that it should be a factor, but factors
> cannot be used in pca. Is Tank 10 10 times more of something than Tank
> 1?
> 
> 2. Date is a factor. That means you are not measuring time, just the
> fact that 2 rows are the same time or different time. Factors cannot
> be used in pca.
> 
> 3. Treatment is a factor, but factors cannot be used in pca.
> 
> 4. Your log transformed data has many 0's and no negative values. Did
> you add 1 to each value before taking logarithms?
> 
> First line of your code after reading the data:
> 
>> meso.pca <- prcomp(mesocleaned, center=TRUE, scale.=TRUE)
> Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
>>                    scale. = TRUE)
> 
> -------------------------------------
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> 
> 
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Sarah 
> Stinson
> Sent: Wednesday, September 21, 2016 10:25 AM
> To: r-help at r-project.org
> Subject: Re: [R] Help with PCA data file prep and R code
> 
> Hello DRUGs,
> I'm new to R and would appreciate some expert advice on prepping files 
> for,
> and running, PCA...
> 
> My data set consists of aquatic invertebrate and zooplankton count data 
> and
> physicochemical measurements from an ecotoxicology study. Four chemical
> treatments were applied to mesocosm tanks, 4 replicates per treatment 
> (16
> tanks total), then data were collected weekly over a 3 month period.
> 
> I cleaned the data in excel by removing columns with all zero values, 
> and
> all rows with NA values.
> All zooplankton values were volume normalized, then log normalized. All
> other data was log normalized in excel prior to analysis in R. All 
> vectorss
> are numeric. I've attached the .txt file to this email rather that 
> using
> dput(dataframe).
> 
> My questions are:
> 
> 1. Did I do the cleaning step appropriately? I know that there are ways 
> to
> run PCA's using data that contain NA values (pcaMethods), but wasn't 
> able
> to get the code to work...
> (I understand that this isn't strictly an R question, but any help 
> would be
> appreciated.)
> 2. Does my code look correct for the PCA and visualization (see below)?
> 
> Thanks in advance,
> Sarah
> 
> #read data
> mesocleaned <- read.csv("MesoCleanedforPCA.9.16.16.csv")
> 
> #run PCA
> meso.pca <- prcomp(mesocleaned,
>                    center = TRUE,
>                    scale. = TRUE)
> 
> # print method
> print(meso.pca)
> 
> #compute standard deviation of each principal component
> std_dev <- meso.pca$sdev
> 
> #compute variance
> pr_var <- std_dev^2
> 
> #check variance of first 10 components
> pr_var[1:10]
> 
> #proportion of variance explained
> prop_varex <- pr_var/sum(pr_var)
> prop_varex[1:20]
> 
> #The first principal component explains 12.7% of the variance
> #The second explains 8.1%
> 
> #visualize
> biplot(meso.pca)
> 
> #for visualization, make Treatment vector a factor instead of numeric
> meso.treatment <- as.factor(mesocleaned[, 3])
> 
> #ggbiplot to visualize by Treatment group
> #reference: 
> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
> 
> library(devtools)
> install_github("ggbiplot", "vqv")
> library(ggbiplot)
> 
> print(ggbiplot(meso.pca, obs.scale = 1, var.scale = 1, groups =
> meso.treatment, ellipse = TRUE, circle = TRUE))
> g <- ggbiplot(meso.pca, obs.scale = 1, var.scale = 1,
>               groups = meso.treatment, ellipse = TRUE,
>               circle = TRUE)
> g <- g + scale_color_brewer(name = deparse(substitute(Treatments)), 
> palette
> = 'Dark2') #must change meso.treatment to a factor for this to work
> g <- g + theme(legend.direction = 'horizontal',
>                legend.position = 'top')
> print(g)
> 
> #Circle plot
> #plot each variables coefficients inside a unit circle to get insight 
> on a
> possible interpretation for PCs.
> #reference: 
> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
> 
> theta <- seq(0,2*pi,length.out = 100)
> circle <- data.frame(x = cos(theta), y = sin(theta))
> p <- ggplot(circle,aes(x,y)) + geom_path()
> 
> loadings <- data.frame(meso.pca$rotation,
>                        .names = row.names(meso.pca$rotation))
> p + geom_text(data=loadings,
>               mapping=aes(x = PC1, y = PC2, label = .names, colour =
> .names)) +
>   coord_fixed(ratio=1) +
>   labs(x = "PC1", y = "PC2")
> 
> On Tue, Sep 20, 2016 at 10:28 PM, Sarah Stinson <sastinson at ucdavis.edu>
> wrote:
> 
>> Hello DRUGs,
>> I'm new to R and would appreciate some expert advice on prepping files
>> for, and running, PCA...
>> 
>> My data set consists of aquatic invertebrate and zooplankton count 
>> data
>> and physicochemical measurements from an ecotoxicology study. Four 
>> chemical
>> treatments were applied to mesocosm tanks, 4 replicates per treatment 
>> (16
>> tanks total), then data were collected weekly over a 3 month period.
>> 
>> I cleaned the data in excel by removing columns with all zero values, 
>> and
>> all rows with NA values.
>> All zooplankton values were volume normalized, then log normalized. 
>> All
>> other data was log normalized in excel prior to analysis in R. All 
>> vectorss
>> are numeric. I've attached the .csv file to this email rather that 
>> using
>> dput(dataframe). I hope that's acceptable.
>> 
>> My questions are:
>> 
>> 1. Did I do the cleaning step appropriately? I know that there are 
>> ways to
>> run PCA's using data that contain NA values (pcaMethods), but wasn't 
>> able
>> to get the code to work...
>> (I understand that this isn't strictly an R question, but any help 
>> would
>> be appreciated.)
>> 2. Does my code look correct for the PCA and visualization (see 
>> below)?
>> 
>> Thanks in advance,
>> Sarah
>> 
>> #read data
>> mesocleaned <- read.csv("MesoCleanedforPCA.9.16.16.csv")
>> 
>> #run PCA
>> meso.pca <- prcomp(mesocleaned,
>>                    center = TRUE,
>>                    scale. = TRUE)
>> 
>> # print method
>> print(meso.pca)
>> 
>> #compute standard deviation of each principal component
>> std_dev <- meso.pca$sdev
>> 
>> #compute variance
>> pr_var <- std_dev^2
>> 
>> #check variance of first 10 components
>> pr_var[1:10]
>> 
>> #proportion of variance explained
>> prop_varex <- pr_var/sum(pr_var)
>> prop_varex[1:20]
>> 
>> #The first principal component explains 12.7% of the variance
>> #The second explains 8.1%
>> 
>> #visualize
>> biplot(meso.pca)
>> 
>> #for visualization, make Treatment vector a factor instead of numeric
>> meso.treatment <- as.factor(mesocleaned[, 3])
>> 
>> #ggbiplot to visualize by Treatment group
>> #reference: 
>> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
>> 
>> library(devtools)
>> install_github("ggbiplot", "vqv")
>> library(ggbiplot)
>> 
>> print(ggbiplot(meso.pca, obs.scale = 1, var.scale = 1, groups =
>> meso.treatment, ellipse = TRUE, circle = TRUE))
>> g <- ggbiplot(meso.pca, obs.scale = 1, var.scale = 1,
>>               groups = meso.treatment, ellipse = TRUE,
>>               circle = TRUE)
>> g <- g + scale_color_brewer(name = deparse(substitute(Treatments)),
>> palette = 'Dark2') #must change meso.treatment to a factor for this to 
>> work
>> g <- g + theme(legend.direction = 'horizontal',
>>                legend.position = 'top')
>> print(g)
>> 
>> #Circle plot
>> #plot each variables coefficients inside a unit circle to get insight 
>> on a
>> possible interpretation for PCs.
>> #reference: 
>> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
>> 
>> theta <- seq(0,2*pi,length.out = 100)
>> circle <- data.frame(x = cos(theta), y = sin(theta))
>> p <- ggplot(circle,aes(x,y)) + geom_path()
>> 
>> loadings <- data.frame(meso.pca$rotation,
>>                        .names = row.names(meso.pca$rotation))
>> p + geom_text(data=loadings,
>>               mapping=aes(x = PC1, y = PC2, label = .names, colour =
>> .names)) +
>>   coord_fixed(ratio=1) +
>>   labs(x = "PC1", y = "PC2")
>> 
>> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
| Dr.rer.nat. Albin Blaschka
| Etrichstrasse 26, A-5020 Salzburg
| * www.standortsanalyse.net *
| * www.researchgate.net/profile/Albin_Blaschka *
| - It's hard to live in the mountains, hard but not hopeless!



More information about the R-help mailing list