[R-sig-phylo] PIC or PGLS for genome-wide SNP screening

Theodore Garland Jr theodore.garland at ucr.edu
Wed May 23 17:30:40 CEST 2012

>Is this the way in which one decides for OLS vs. PGLS?

If you have the same set of independent variables, then you simply prefer whichever model (OLS or PGLS) has the higher likelihood.  As far as I have been told by Joe Felsenstein, you cannot do a likelihood ratio test on the ln maximum likelihoods because the number of parameters is the same, although this paper seems to suggest otherwise:
Mooers, A. O., S. M. Vamosi, and D. Schluter. 1999. Using phylogenies to test macroevolutionary hypotheses of trait evolution in cranes (Gruinae). American Naturalist 154:249-259.

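For two models with the same set of independent variables, AIC does not add anything: the number of parameters is the same, so the penalty term cancels and the difference in AIC is just -2 times the difference in ln likelihood.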
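
To make the point about AIC concrete, here is a minimal sketch in Python. The ln likelihood values and the parameter count are made up purely for illustration; in practice they would come from whatever software fitted the OLS and PGLS models.

```python
def aic(log_lik, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_lik

# Hypothetical ln maximum likelihoods for the same response and predictors.
lnl_ols = -152.4
lnl_pgls = -147.9
k = 3  # assumed: intercept, one slope, residual variance (same for both models)

delta_lnl = lnl_pgls - lnl_ols
delta_aic = aic(lnl_pgls, k) - aic(lnl_ols, k)

# With equal k, ranking by AIC is the same as ranking by ln likelihood.
preferred = "PGLS" if lnl_pgls > lnl_ols else "OLS"
print(preferred, delta_lnl, delta_aic)
```

Because k is the same in both models, delta_aic is always -2 * delta_lnl, so AIC ranks the two models exactly as the ln likelihoods do.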

If you go to something like a regression in which the residuals are modeled as an OU process, then you do have an additional parameter being estimated, and so you can do a likelihood ratio test (on the ln maximum likelihoods) of that model versus OLS and versus PGLS.  For example, see:
Lavin, S. R., W. H. Karasov, A. R. Ives, K. M. Middleton, and T. Garland, Jr. 2008. Morphometrics of the avian small intestine, compared with non-flying mammals: A phylogenetic perspective. Physiological and Biochemical Zoology 81:526-550. [provides Matlab Regressionv2.m, released as part of the PHYSIG package]
Gartner, G. E. A., J. W. Hicks, P. R. Manzani, D. V. Andrade, A. S. Abe, T. Wang, S. M. Secor, and T. Garland, Jr. 2010. Phylogeny, ecology, and heart position in snakes. Physiological and Biochemical Zoology 83:43-54.
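
The arithmetic of such a test can be sketched as follows. This is only an illustration in Python, not the Regressionv2.m code: the ln likelihood values are invented, and the sketch assumes the usual chi-square reference distribution with 1 degree of freedom for the one extra parameter.

```python
import math

def lrt_one_extra_parameter(lnl_simple, lnl_complex):
    """Return (statistic, p_value) for a likelihood ratio test with 1 df.

    statistic = 2 * (lnL_complex - lnL_simple). For 1 degree of freedom the
    chi-square upper tail probability equals erfc(sqrt(statistic / 2)).
    """
    statistic = 2.0 * (lnl_complex - lnl_simple)
    statistic = max(statistic, 0.0)  # guard against tiny negative values from optimizer noise
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical ln maximum likelihoods (not taken from any of the cited papers).
lnl_pgls = -147.9
lnl_ou = -145.2  # regression with OU residuals, one extra parameter

stat, p = lrt_one_extra_parameter(lnl_pgls, lnl_ou)
print(stat, p)
```

The same function can be called with the OLS ln likelihood as the simpler model to test the OU regression against OLS.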


Theodore Garland, Jr.
Department of Biology
University of California, Riverside
Riverside, CA 92521

From: r-sig-phylo-bounces at r-project.org [r-sig-phylo-bounces at r-project.org] on behalf of Mattia Prosperi [ahnven at gmail.com]
Sent: Wednesday, May 23, 2012 8:05 AM
To: r-sig-phylo at r-project.org
Subject: [R-sig-phylo] PIC or PGLS for genome-wide SNP screening

Dear all,

I am working on a data set composed of bacterial genomic sequences (a
few genes) associated with phenotypic values (in-vitro resistance to
antibiotics, a numerical value discretised into a binary class). Of
note, the bacterial isolates were sampled non-uniformly at different
times and locations, so there may be sampling bias. The data set
has ~1,000 variables and ~1,000 observations.

I have been applying several methods for developing a model to predict
antibiotic resistance from the single nucleotide polymorphisms (SNP)
extracted from a multiple alignment, applying classical statistical
learning and feature selection methods.
Eventually, I found that a logistic regression with main effects,
where the variables were selected first by a univariable chi-square
screening and then by stepwise AIC, was as good as other more complex
and non-linear methods (such as random forests) when comparing the
distributions of different performance measures (AUROC, specificity,
sensitivity) over multiple cross-validation runs. Also, the SNP sets
selected by different approaches were highly similar and consistent
across several bootstrap evaluations.
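
For readers unfamiliar with the univariable chi-square screening step, a minimal sketch in Python for a single binary SNP is given here. The counts are invented, the function name is hypothetical, and the Pearson statistic without continuity correction is an assumption; the exact test used in the analysis is not stated above.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Rows: SNP present / absent; columns: resistant / susceptible.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # a constant SNP or phenotype carries no information
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: 40 resistant and 10 susceptible isolates carry the SNP,
# 20 resistant and 30 susceptible isolates do not.
stat, p = chi_square_2x2(40, 10, 20, 30)
print(stat, p)
```

Note that this test treats isolates as independent, which is exactly the assumption that shared ancestry can violate.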

I found that a few SNPs that remained significant (even after
Bonferroni correction) were located in gene regions that are not
thought to be related to antibiotic resistance. I thought that this
might be a consequence of neutral mutations that became fixed in the
population by chance after a genetic bottleneck (e.g. antibiotic
pressure).
I'd like to understand whether such a SNP, associated with antibiotic
resistance but not expected to be, is in fact just a random mutation
of an early isolate that was carrying the true resistance SNP (in
another gene region) and that was selected by the antibiotic pressure,
thus transferring both the true resistance SNP and the "hitchhiking"
ones to the offspring. Unfortunately it is not easy to cross-tabulate
SNPs in different genes because not all isolates have been sequenced
for the same set of genes.
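
The Bonferroni correction mentioned above amounts to comparing each p-value with alpha divided by the number of tests. A minimal Python sketch, with hypothetical SNP names and p-values and an assumed alpha of 0.05:

```python
def bonferroni_significant(p_values, alpha=0.05, n_tests=None):
    """Return the names whose p-value is below alpha / n_tests.

    p_values: dict mapping SNP name -> unadjusted p-value.
    n_tests: total number of SNPs screened (defaults to len(p_values)).
    """
    m = len(p_values) if n_tests is None else n_tests
    threshold = alpha / m
    return sorted(name for name, p in p_values.items() if p < threshold)

# Hypothetical p-values for three of ~1,000 screened SNPs.
pvals = {"snp_0012": 3e-7, "snp_0345": 2e-4, "snp_0789": 0.03}
hits = bonferroni_significant(pvals, alpha=0.05, n_tests=1000)
print(hits)
```

With 1,000 tests the threshold is 0.05 / 1000 = 5e-5, so only the first SNP in this made-up example would survive.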

In order to distinguish spurious from true SNP associations with
resistance, I thought I might use a PIC or PGLS approach (after
estimating a phylogenetic tree from the multiple alignment), in the
same setting as the original analysis, i.e. a model selection approach
with both feature and performance evaluation (well, since the
coefficients of PGLS/OLS are the same, it's just a matter of standard
errors and feature set selection), regressing the resistance class as
the dependent variable and using the SNPs as covariates.

Is this a reasonable approach? Does it make sense to set up, for
instance, a stepwise AIC selection within a PGLS model?
I know that there is a way to check for phylogenetic signal and
thereby decide whether the PGLS approach should be used. Is this the
way in which one decides for OLS vs. PGLS?

Last but not least, which is the most appropriate covariance matrix
calculation and PGLS implementation for this input-output set (i.e.
categorical variables, binary class)? The "brunch" function within
caper, or compar.gee within ape?

Thanks and apologies if some of the questions are silly.

M. Prosperi.
