[R] Help with PCA data file prep and R code
Albin Blaschka
albin.blaschka at standortsanalyse.net
Wed Sep 28 15:51:55 CEST 2016
Hi,
maybe the package vegan with its tutorials is a good starting point,
too...
http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf
http://cc.oulu.fi/~jarioksa/opetus/metodi/sessio2.pdf
all the best,
Albin
Am 22.09.2016 10:23 PM, schrieb David L Carlson:
> Looking at your data there are several issues.
>
> 1. Tank is an integer, but it sounds like you intend to use it as a
> categorical measure. If so that it should be a factor, but factors
> cannot be used in pca. Is Tank 10 10 times more of something than Tank
> 1?
>
> 2. Date is a factor. That means you are not measuring time, just the
> fact that 2 rows are the same time or different time. Factors cannot
> be used in pca.
>
> 3. Treatment is a factor, but factors cannot be used in pca.
>
> 4. Your log transformed data has many 0's and no negative values. Did
> you add 1 to each value before taking logarithms?
>
> First line of your code after reading the data:
>
>> meso.pca <- prcomp(mesocleaned, center=TRUE, scale.=TRUE)
> Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
>> scale. = TRUE)
>
> -------------------------------------
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
>
>
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Sarah
> Stinson
> Sent: Wednesday, September 21, 2016 10:25 AM
> To: r-help at r-project.org
> Subject: Re: [R] Help with PCA data file prep and R code
>
> Hello DRUGs,
> I'm new to R and would appreciate some expert advice on prepping files
> for,
> and running, PCA...
>
> My data set consists of aquatic invertebrate and zooplankton count data
> and
> physicochemical measurements from an ecotoxicology study. Four chemical
> treatments were applied to mesocosm tanks, 4 replicates per treatment
> (16
> tanks total), then data were collected weekly over a 3 month period.
>
> I cleaned the data in excel by removing columns with all zero values,
> and
> all rows with NA values.
> All zooplankton values were volume normalized, then log normalized. All
> other data was log normalized in excel prior to analysis in R. All
> vectorss
> are numeric. I've attached the .txt file to this email rather that
> using
> dput(dataframe).
>
> My questions are:
>
> 1. Did I do the cleaning step appropriately? I know that there are ways
> to
> run PCA's using data that contain NA values (pcaMethods), but wasn't
> able
> to get the code to work...
> (I understand that this isn't strictly an R question, but any help
> would be
> appreciated.)
> 2. Does my code look correct for the PCA and visualization (see below)?
>
> Thanks in advance,
> Sarah
>
> #read data
> mesocleaned <- read.csv("MesoCleanedforPCA.9.16.16.csv")
>
> #run PCA
> meso.pca <- prcomp(mesocleaned,
> center = TRUE,
> scale. = TRUE)
>
> # print method
> print(meso.pca)
>
> #compute standard deviation of each principal component
> std_dev <- meso.pca$sdev
>
> #compute variance
> pr_var <- std_dev^2
>
> #check variance of first 10 components
> pr_var[1:10]
>
> #proportion of variance explained
> prop_varex <- pr_var/sum(pr_var)
> prop_varex[1:20]
>
> #The first principal component explains 12.7% of the variance
> #The second explains 8.1%
>
> #visualize
> biplot(meso.pca)
>
> #for visualization, make Treatment vector a factor instead of numeric
> meso.treatment <- as.factor(mesocleaned[, 3])
>
> #ggbiplot to visualize by Treatment group
> #reference:
> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
>
> library(devtools)
> install_github("ggbiplot", "vqv")
> library(ggbiplot)
>
> print(ggbiplot(meso.pca, obs.scale = 1, var.scale = 1, groups =
> meso.treatment, ellipse = TRUE, circle = TRUE))
> g <- ggbiplot(meso.pca, obs.scale = 1, var.scale = 1,
> groups = meso.treatment, ellipse = TRUE,
> circle = TRUE)
> g <- g + scale_color_brewer(name = deparse(substitute(Treatments)),
> palette
> = 'Dark2') #must change meso.treatment to a factor for this to work
> g <- g + theme(legend.direction = 'horizontal',
> legend.position = 'top')
> print(g)
>
> #Circle plot
> #plot each variables coefficients inside a unit circle to get insight
> on a
> possible interpretation for PCs.
> #reference:
> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
>
> theta <- seq(0,2*pi,length.out = 100)
> circle <- data.frame(x = cos(theta), y = sin(theta))
> p <- ggplot(circle,aes(x,y)) + geom_path()
>
> loadings <- data.frame(meso.pca$rotation,
> .names = row.names(meso.pca$rotation))
> p + geom_text(data=loadings,
> mapping=aes(x = PC1, y = PC2, label = .names, colour =
> .names)) +
> coord_fixed(ratio=1) +
> labs(x = "PC1", y = "PC2")
>
> On Tue, Sep 20, 2016 at 10:28 PM, Sarah Stinson <sastinson at ucdavis.edu>
> wrote:
>
>> Hello DRUGs,
>> I'm new to R and would appreciate some expert advice on prepping files
>> for, and running, PCA...
>>
>> My data set consists of aquatic invertebrate and zooplankton count
>> data
>> and physicochemical measurements from an ecotoxicology study. Four
>> chemical
>> treatments were applied to mesocosm tanks, 4 replicates per treatment
>> (16
>> tanks total), then data were collected weekly over a 3 month period.
>>
>> I cleaned the data in excel by removing columns with all zero values,
>> and
>> all rows with NA values.
>> All zooplankton values were volume normalized, then log normalized.
>> All
>> other data was log normalized in excel prior to analysis in R. All
>> vectorss
>> are numeric. I've attached the .csv file to this email rather that
>> using
>> dput(dataframe). I hope that's acceptable.
>>
>> My questions are:
>>
>> 1. Did I do the cleaning step appropriately? I know that there are
>> ways to
>> run PCA's using data that contain NA values (pcaMethods), but wasn't
>> able
>> to get the code to work...
>> (I understand that this isn't strictly an R question, but any help
>> would
>> be appreciated.)
>> 2. Does my code look correct for the PCA and visualization (see
>> below)?
>>
>> Thanks in advance,
>> Sarah
>>
>> #read data
>> mesocleaned <- read.csv("MesoCleanedforPCA.9.16.16.csv")
>>
>> #run PCA
>> meso.pca <- prcomp(mesocleaned,
>> center = TRUE,
>> scale. = TRUE)
>>
>> # print method
>> print(meso.pca)
>>
>> #compute standard deviation of each principal component
>> std_dev <- meso.pca$sdev
>>
>> #compute variance
>> pr_var <- std_dev^2
>>
>> #check variance of first 10 components
>> pr_var[1:10]
>>
>> #proportion of variance explained
>> prop_varex <- pr_var/sum(pr_var)
>> prop_varex[1:20]
>>
>> #The first principal component explains 12.7% of the variance
>> #The second explains 8.1%
>>
>> #visualize
>> biplot(meso.pca)
>>
>> #for visualization, make Treatment vector a factor instead of numeric
>> meso.treatment <- as.factor(mesocleaned[, 3])
>>
>> #ggbiplot to visualize by Treatment group
>> #reference:
>> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
>>
>> library(devtools)
>> install_github("ggbiplot", "vqv")
>> library(ggbiplot)
>>
>> print(ggbiplot(meso.pca, obs.scale = 1, var.scale = 1, groups =
>> meso.treatment, ellipse = TRUE, circle = TRUE))
>> g <- ggbiplot(meso.pca, obs.scale = 1, var.scale = 1,
>> groups = meso.treatment, ellipse = TRUE,
>> circle = TRUE)
>> g <- g + scale_color_brewer(name = deparse(substitute(Treatments)),
>> palette = 'Dark2') #must change meso.treatment to a factor for this to
>> work
>> g <- g + theme(legend.direction = 'horizontal',
>> legend.position = 'top')
>> print(g)
>>
>> #Circle plot
>> #plot each variables coefficients inside a unit circle to get insight
>> on a
>> possible interpretation for PCs.
>> #reference:
>> https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
>>
>> theta <- seq(0,2*pi,length.out = 100)
>> circle <- data.frame(x = cos(theta), y = sin(theta))
>> p <- ggplot(circle,aes(x,y)) + geom_path()
>>
>> loadings <- data.frame(meso.pca$rotation,
>> .names = row.names(meso.pca$rotation))
>> p + geom_text(data=loadings,
>> mapping=aes(x = PC1, y = PC2, label = .names, colour =
>> .names)) +
>> coord_fixed(ratio=1) +
>> labs(x = "PC1", y = "PC2")
>>
>>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
| Dr.rer.nat. Albin Blaschka
| Etrichstrasse 26, A-5020 Salzburg
| * www.standortsanalyse.net *
| * www.researchgate.net/profile/Albin_Blaschka *
| - It's hard to live in the mountains, hard but not hopeless!
More information about the R-help
mailing list