[R-sig-phylo] Inconsistencies with DNAbin object and its affect on functions

Emmanuel Paradis Emmanuel.Paradis at mpl.ird.fr
Wed Apr 7 15:09:11 CEST 2010


Here are some possible improvements following Alastair's request.

1. Force read.dna(  , format = "fasta") to return a matrix as suggested
by Christoph. This would be controlled with an option 'as.matrix':

as.matrix=NULL -> return a list or a matrix
as.matrix=TRUE -> return a matrix or stop with an error
as.matrix=FALSE -> return a list

2. Change the default of '[' for DNAbin objects to have drop=FALSE. This
will solve Alastair's problem below (see ?"[.DNAbin" and ?"[" in R):

     woodmouse[attr(hx,"index")[[3]], , drop = FALSE]

I guess most users do not want to drop the labels of their sequences.

3. Use the generic function labels for DNAbin objects. That's quite simple:

labels.DNAbin <- function(object, ...)
{
     if (is.list(object)) return(names(object))
     if (is.matrix(object)) return(rownames(object))
     NULL
}

4. Change print.DNAbin for summary.DNAbin (which will be removed), so
instead of having only:

> woodmouse
15 DNA sequences in binary format.

this will print:

> woodmouse
15 DNA sequences in binary format stored in a matrix.

All sequences of same length: 965

Labels: No305 No304 No306 No0906S No0908S No0909S ...

Base composition:
     a     c     g     t
0.307 0.261 0.126 0.306


Any other ideas or suggestions?

Emmanuel

Alastair Potts wrote on 02/04/2010 08:11:
> Hi Christoph,
> Thanks very much for the response. Maybe the block should also give a 
> warning if there are unequal lengths and tell users that the DNAbin 
> object is a list?
> I still find that my problem #2 is not solved: I cannot find out the 
> sample name if I extract a single sample. Below is code illustrating my 
> point...
> 
> data(woodmouse)
> x <- woodmouse[c(1, 1, 2, 2, 2,3), ]
> x
> hx <- haplotype(x)
> summary(woodmouse[attr(hx,"index")[[1]],])  # sample names for haplotype 
> present
> summary(woodmouse[attr(hx,"index")[[2]],])  # sample names for haplotype 
> present
> summary(woodmouse[attr(hx,"index")[[3]],])  # what sample name is 
> associated with this haplotype?
> 
> Thank you for your time,
> Alastair
> 
> Christoph Heibl wrote:
>> Hi Alastair (and Emmanuel),
>>
>> DNAbin objects can either be lists or matrices. The idea behind lists 
>> is to be able to store sequences of different length, like those 
>> supported by the FASTA format.
>> What happens here is that 'read.dna' reads your FASTA file into a list 
>> because a priori it does not know wether all sequences are of equal 
>> length or not and it treats them as if they were not.
>>
>> This is easy to fix. We have to add this piece of code just before 
>> returning 'obj':
>>
>> if (format == "fasta")
>>    if (length(unique(sapply(obj, length))) == 1)       obj <- 
>> as.matrix(obj)
>>
>> I have uploaded a fixed version called 'read.dna.new' to my website:
>> http://www.christophheibl.de/read.dna.new.R
>>
>> I hope there is nothing I have overlooked ...
>> Christoph
>>
>>  
>> On Apr 1, 2010, at 4:37 PM, Alastair Potts wrote:
>>
>>  
>>> Good day all,
>>> I've noted a strange inconsistency with DNAbin objects. Emmanuel 
>>> Paradis asked me to send this out to the list to see if others are 
>>> affected by this issue.
>>> I have included example code below and attached in a .R file.
>>>
>>> # Summary of problem:
>>> # 1)  There are actually two different 'objects' that exist under the 
>>> class DNAbin.
>>> #     This can be very confusing as some functions work with the one 
>>> object and
>>> #     not the other. I think that not many people will realise this.
>>> # 2)  The is.matrix()=TRUE object cannot seem to extract a single 
>>> sample with
>>> #     the sample name. This has been a real problem for me when I am 
>>> trying to
>>> #     obtain the names of all of samples for each haplotype, and some 
>>> haplotypes
>>> #     only have one sample.
>>> # See example code below...
>>>
>>> library(pegas)
>>> library(ape)
>>> data(woodmouse)
>>> # creating the two DNAbin files, one from fasta the other from txt
>>> # 1) txt is the original woodmouse data
>>> txt <- woodmouse
>>> # 2) Creating a DNAbin object from a fasta file
>>> setwd('D:') #wherever you have write priveledges
>>> write.dna(woodmouse,file="woodmouse.temp.fasta",format="fasta")
>>> fasta <- read.dna("woodmouse.temp.fasta",format="fasta")
>>>
>>> # the fasta file from the read.dna() function is not a matrix
>>> is.matrix(fasta) # not a matrix
>>> is.matrix(txt)   # is a matrix
>>>
>>> # Thus the following functions do not work consistently...
>>>
>>> # Trying to print ALL names using names()
>>> names(fasta)  # works
>>> names(txt)    # does not work
>>>
>>> # Trying to print ALL names using rownames()
>>> rownames(fasta)  # does not work
>>> rownames(txt)    # works
>>>
>>> # Suggested alternative from Emmanuel
>>> labels(fasta)   # Gives vector of sample names
>>> labels(txt)     # Give list of sample names [[1]] and base numbers [[2]]
>>>
>>> # Trying to print name of the first sample
>>> names(fasta[1,])      # works
>>> rownames(txt[1,])    # does not work
>>>
>>> # So this results in a problem when you want to try and access the name
>>> # of a single sample
>>> summary(fasta[1,])    # sample name is present
>>> summary(txt[1,])      # sample name is NOT present
>>>
>>> # This is not a problem when you try and access two samples
>>> summary(fasta[1:2,])   # sample name is present
>>> summary(txt[1:2,])     # sample name is present
>>>
>>> # Extracting bases also gives conflicting results
>>> summary(fasta[,50:100])   # sequence length = 965
>>> summary(txt[,50:100])     # sequence length = 51
>>>
>>> # Emmanual suggests using the as.matrix() function on any fasta files 
>>> that
>>> # are imported by read.dna()
>>> fasta2txt <- as.matrix(fasta)
>>> # This fasta2txt DNAbin object now behaves exactly like the txt 
>>> DNAbin object
>>> # However, I still cannot get hold of the sample name for a single 
>>> object.
>>> # Even using the labels() function cannot get it.
>>>
>>> # So, now my questions for all of you...
>>> # 1) Have any of you found that this affects your results? I wonder
>>> # whether the two different DNAbin objects will act differently in other
>>> # functions (e.g. dist.dna, nuc.div etc. etc.)
>>> # 2) Do you have a suggestion as to how I should be accessing the name
>>> # for a single sample in a matrix DNAbin object?
>>> # 3) What are your thoughts about the same class object having different
>>> # properties? Do you think this is a problem? Any ideas onto how to 
>>> solve it?
>>>
>>>
>>> -- 
>>> Alastair Potts
>>> PhD candidate
>>> Botany Department University of Cape Town
>>>
>>> Cell: 082 491-7275 | Alastair.Potts at uct.ac.za
>>> University Private Bag, Rondebosch 7700, South Africa or
>>> PO Box 115, Loxton 6985, South Africa
>>>
>>> # Summary of problem:
>>> # 1)  There are actually two different 'objects' that exist under the 
>>> class DNAbin.
>>> #     This can be very confusing as some functions work with the one 
>>> object and
>>> #     not the other. I think that not many people will realise this.
>>> # 2)  The is.matrix()=TRUE object cannot seem to extract a single 
>>> sample with
>>> #     the sample name. This has been a real problem for me when I am 
>>> trying to
>>> #     obtain the names of all of samples for each haplotype, and some 
>>> haplotypes
>>> #     only have one sample.
>>> # See example code below...
>>>
>>> library(pegas)
>>> library(ape)
>>> data(woodmouse)
>>> # creating the two DNAbin files, one from fasta the other from txt
>>> # 1) txt is the original woodmouse data
>>> txt <- woodmouse
>>> # 2) Creating a DNAbin object from a fasta file
>>> setwd('D:') #wherever you have write priveledges
>>> write.dna(woodmouse,file="woodmouse.temp.fasta",format="fasta")
>>> fasta <- read.dna("woodmouse.temp.fasta",format="fasta")
>>>
>>> # the fasta file from the read.dna() function is not a matrix
>>> is.matrix(fasta) # not a matrix
>>> is.matrix(txt)   # is a matrix
>>>
>>> # Thus the following functions do not work consistently...
>>>
>>> # Trying to print ALL names using names()
>>> names(fasta)  # works
>>> names(txt)    # does not work
>>>
>>> # Trying to print ALL names using rownames()
>>> rownames(fasta)  # does not work
>>> rownames(txt)    # works
>>>
>>> # Suggested alternative from Emmanuel
>>> labels(fasta)   # Gives vector of sample names
>>> labels(txt)     # Give list of sample names [[1]] and base numbers [[2]]
>>>
>>> # Trying to print name of the first sample
>>> names(fasta[1,])      # works
>>> rownames(txt[1,])    # does not work
>>>
>>> # So this results in a problem when you want to try and access the name
>>> # of a single sample
>>> summary(fasta[1,])    # sample name is present
>>> summary(txt[1,])      # sample name is NOT present
>>>
>>> # This is not a problem when you try and access two samples
>>> summary(fasta[1:2,])   # sample name is present
>>> summary(txt[1:2,])     # sample name is present
>>>
>>> # Extracting bases also gives conflicting results
>>> summary(fasta[,50:100])   # sequence length = 965
>>> summary(txt[,50:100])     # sequence length = 51
>>>
>>> # Emmanual suggests using the as.matrix() function on any fasta files 
>>> that
>>> # are imported by read.dna()
>>> fasta2txt <- as.matrix(fasta)
>>> # This fasta2txt DNAbin object now behaves exactly like the txt 
>>> DNAbin object
>>> # However, I still cannot get hold of the sample name for a single 
>>> object.
>>> # Even using the labels() function cannot get it.
>>>
>>> # So, now my questions for all of you...
>>> # 1) Have any of you found that this affects your results? I wonder
>>> # whether the two different DNAbin objects will act differently in other
>>> # functions (e.g. dist.dna, nuc.div etc. etc.)
>>> # 2) Do you have a suggestion as to how I should be accessing the name
>>> # for a single sample in a matrix DNAbin object?
>>> # 3) What are your thoughts about the same class object having different
>>> # properties? Do you think this is a problem? Any ideas onto how to 
>>> solve it?
>>>
>>>
>>> _______________________________________________
>>> R-sig-phylo mailing list
>>> R-sig-phylo at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>>     
>>
>>
>>
> 
> _______________________________________________
> R-sig-phylo mailing list
> R-sig-phylo at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> 

-- 
Emmanuel Paradis
IRD, Montpellier, France
   ph: +33 (0)4 67 16 64 47
  fax: +33 (0)4 67 16 64 40
http://ape.mpl.ird.fr/



More information about the R-sig-phylo mailing list