[Bioc-devel] restrictToSNV for VCF

Fri Mar 21 23:35:56 CET 2014

Hi Martin,

On 03/21/2014 01:45 PM, Martin Morgan wrote:
> On 03/20/2014 05:20 PM, Hervé Pagès wrote:
>> Hi,
>>
>> On 03/19/2014 01:10 PM, Michael Lawrence wrote:
>>> You can apparently use 1D extraction for VCF, which is a little
>>> surprising;
>>> I learned it from restrictToSNV.
>>
>> This is inherited from SummarizedExperiment:
>>
>>    > example(SummarizedExperiment)
>>
>>    > se1
>>    class: SummarizedExperiment
>>    dim: 200 6
>>    exptData(0):
>>    assays(1): counts
>>    rownames: NULL
>>    rowData metadata column names(0):
>>    colnames(6): A B ... E F
>>    colData names(1): Treatment
>>
>>    > se1[1:4]
>>    class: SummarizedExperiment
>>    dim: 4 6
>>    exptData(0):
>>    assays(1): counts
>>    rownames: NULL
>>    rowData metadata column names(0):
>>    colnames(6): A B ... E F
>>    colData names(1): Treatment
>>
>> To me that means that a SummarizedExperiment has a length
>> (conceptually), and that this length is the number of rows.
>> It would actually help if a "length" method was defined:
>>
>>    > length(se1)
>>    [1] 1
>
> I think of a SummarizedExperiment as fundamentally a matrix with row and
> column annotations. 'length' would then be prod(dim(se1))

But it's not defined as such either.

Note that findOverlaps() on SummarizedExperiment objects returns a
Hits object with indices in the 1:nrow(query) or 1:nrow(subject)
range. I'd like to be able to say "in the seq_along(query) or
seq_along(subject) range" because that's what findOverlaps() does
on any other object defined in IRanges/GenomicRanges/GenomicAlignments.
But I can't because that would be inaccurate.

However, it's conceptually true: I can use the indices in the Hits
object to do 1D extractions from the query or subject. This is good
and consistent with any other type of query or subject.

> col- and
> rownames() are defined but names() is NULL. I guess 1-D sub-setting
> isn't matrix-like, but I don't think that removing this 'feature' simply
> for consistency sake is worth it;

I was not suggesting that.

> I guess the subsetting logical was
> copy/pasted from other code without enough thought. head(), tail() could
> be implemented if this were somehow useful (I usually use these for
> compact display, and that's irrelevant here...);

I still find it sometimes useful to be able to do head() on a big
object when I just want to try things on a few elements first:

   > dim(vcf)
   [1] 1000000       3

   toy <- head(vcf)
   rowData(toy)
   assay(toy)
   isSNV(toy)
   findOverlaps(toy, exons)

It's more convenient and much quicker than having to truncate the
individual results:

   head(rowData(vcf))
   head(assay(vcf))
   head(isSNV(vcf))
   head(findOverlaps(vcf, exons))

I guess what I'm trying to say is that while it helps thinking of
a SummarizedExperiment as fundamentally a matrix, there are already
enough differences with the matrix API to suggest that, unlike for a
matrix, the length of a SummarizedExperiment object is its nb of rows.
It's implicit in many ways and I think that formalizing it would help
rather than hurt. It will still be somewhat a surprise for the
end-user, but not a bigger surprise than the ones s/he gets right
now with seq_along(vcf), vcf[i], isSNV(vcf), findOverlaps(), head(),
etc.. And once s/he gets over it, there won't be anymore surprises:
all these things will be in agreement with length(vcf) and behave
as expected.

Thanks,
H.

> rev() on a matrix
> doesn't do anything useful.
>
> Martin
>
>>
>> That would automatically fix many convenience [ wrappers like head(),
>> tail(), rev(), etc...
>>
>>    > head(se1)
>>    class: SummarizedExperiment
>>    dim: 1 6
>>    exptData(0):
>>    assays(1): counts
>>    rownames: NULL
>>    rowData metadata column names(0):
>>    colnames(6): A B ... E F
>>    colData names(1): Treatment
>>
>>    > rev(se1)
>>    class: SummarizedExperiment
>>    dim: 1 6
>>    exptData(0):
>>    assays(1): counts
>>    rownames: NULL
>>    rowData metadata column names(0):
>>    colnames(6): A B ... E F
>>    colData names(1): Treatment
>>
>> Following that logic names(se1) also probably return colnames(se1).
>>
>> H.
>>
>>>
>>>
>>>
>>>
>>> On Wed, Mar 19, 2014 at 1:07 PM, Vincent Carey
>>> <stvjc at channing.harvard.edu>wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 19, 2014 at 4:00 PM, Michael Lawrence <
>>>> lawrence.michael at gene.com> wrote:
>>>>
>>>>> It would be nice to have functions like isSNV, isIndel, isDeletion,
>>>>> etc
>>>>> that at least provide precise definitions of the terminology. I've
>>>>> added
>>>>> these, but they're designed only for VRanges. Should work for
>>>>> ExpandedVCF.
>>>>>
>>>>> Also, it would be nice if restrictToSNV just assumed that alt(x)
>>>>> must be
>>>>> something with nchar() support (with special handling for any
>>>>> List), so
>>>>> that the 'character' vector of alt,VRanges would work immediately.
>>>>> Basically restrictToSNV should just be x[isSNV(x)]. Is there even a
>>>>> use-case for the restrictToSNV abstraction if we did that?
>>>>>
>>>>>
>>>> for VCF instance it would be x[isSNV(x),] and indeed I think that
>>>> would be
>>>> sufficient.  i like the idea of having this family of predicates for
>>>> variant classes to allow such selections
>>>>
>>>>
>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 18, 2014 at 10:36 AM, Valerie Obenchain
>>>>> <vobencha at fhcrc.org>wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've added a restrictToSNV() function to VariantAnnotation
>>>>>> (1.9.46). The
>>>>>> return value is a subset VCF object containing SNVs only. The
>>>>>> function
>>>>>> operates on CollapsedVCF or ExapandedVCF and the alt(VCF) value
>>>>>> must be
>>>>>> nucleotides (i.e., no structural variants).
>>>>>>
>>>>>> A variant is considered a SNV if the nucleotide sequences in both
>>>>>> ref(vcf) and alt(x) are of length 1. I have a question about how
>>>>>> variants
>>>>>> with multiple 'ALT' values should be handled.
>>>>>>
>>>>>> Should we consider row 4 a SNV? One 'ALT' is length 1, the other
>>>>>> is not.
>>>>>>
>>>>>> ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
>>>>>> REF <- DNAStringSet(c("G", c("AA"), "T", "G"))
>>>>>>
>>>>>>> DataFrame(REF, ALT)
>>>>>>>>
>>>>>>> DataFrame with 4 rows and 2 columns
>>>>>>>               REF                ALT
>>>>>>>    <DNAStringSet> <DNAStringSetList>
>>>>>>> 1              G                  A
>>>>>>> 2             AA                 TT
>>>>>>> 3              T                G,A
>>>>>>> 4              G               TT,C
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>> Valerie
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319