[Bioc-devel] restrictToSNV for VCF
Martin Morgan
mtmorgan at fhcrc.org
Fri Mar 21 21:45:52 CET 2014
On 03/20/2014 05:20 PM, Hervé Pagès wrote:
> Hi,
>
> On 03/19/2014 01:10 PM, Michael Lawrence wrote:
>> You can apparently use 1D extraction for VCF, which is a little surprising;
>> I learned it from restrictToSNV.
>
> This is inherited from SummarizedExperiment:
>
> > example(SummarizedExperiment)
>
> > se1
> class: SummarizedExperiment
> dim: 200 6
> exptData(0):
> assays(1): counts
> rownames: NULL
> rowData metadata column names(0):
> colnames(6): A B ... E F
> colData names(1): Treatment
>
> > se1[1:4]
> class: SummarizedExperiment
> dim: 4 6
> exptData(0):
> assays(1): counts
> rownames: NULL
> rowData metadata column names(0):
> colnames(6): A B ... E F
> colData names(1): Treatment
>
> To me that means that a SummarizedExperiment has a length
> (conceptually), and that this length is the number of rows.
> It would actually help if a "length" method was defined:
>
> > length(se1)
> [1] 1
I think of a SummarizedExperiment as fundamentally a matrix with row and column
annotations. 'length' would then be prod(dim(se1)) col- and rownames() are
defined but names() is NULL. I guess 1-D sub-setting isn't matrix-like, but I
don't think that removing this 'feature' simply for consistency sake is worth
it; I guess the subsetting logical was copy/pasted from other code without
enough thought. head(), tail() could be implemented if this were somehow useful
(I usually use these for compact display, and that's irrelevant here...); rev()
on a matrix doesn't do anything useful.
Martin
>
> That would automatically fix many convenience [ wrappers like head(),
> tail(), rev(), etc...
>
> > head(se1)
> class: SummarizedExperiment
> dim: 1 6
> exptData(0):
> assays(1): counts
> rownames: NULL
> rowData metadata column names(0):
> colnames(6): A B ... E F
> colData names(1): Treatment
>
> > rev(se1)
> class: SummarizedExperiment
> dim: 1 6
> exptData(0):
> assays(1): counts
> rownames: NULL
> rowData metadata column names(0):
> colnames(6): A B ... E F
> colData names(1): Treatment
>
> Following that logic names(se1) also probably return colnames(se1).
>
> H.
>
>>
>>
>>
>>
>> On Wed, Mar 19, 2014 at 1:07 PM, Vincent Carey
>> <stvjc at channing.harvard.edu>wrote:
>>
>>>
>>>
>>>
>>> On Wed, Mar 19, 2014 at 4:00 PM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>> It would be nice to have functions like isSNV, isIndel, isDeletion, etc
>>>> that at least provide precise definitions of the terminology. I've added
>>>> these, but they're designed only for VRanges. Should work for ExpandedVCF.
>>>>
>>>> Also, it would be nice if restrictToSNV just assumed that alt(x) must be
>>>> something with nchar() support (with special handling for any List), so
>>>> that the 'character' vector of alt,VRanges would work immediately.
>>>> Basically restrictToSNV should just be x[isSNV(x)]. Is there even a
>>>> use-case for the restrictToSNV abstraction if we did that?
>>>>
>>>>
>>> for VCF instance it would be x[isSNV(x),] and indeed I think that would be
>>> sufficient. i like the idea of having this family of predicates for
>>> variant classes to allow such selections
>>>
>>>
>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> On Tue, Mar 18, 2014 at 10:36 AM, Valerie Obenchain <vobencha at fhcrc.org>wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've added a restrictToSNV() function to VariantAnnotation (1.9.46). The
>>>>> return value is a subset VCF object containing SNVs only. The function
>>>>> operates on CollapsedVCF or ExapandedVCF and the alt(VCF) value must be
>>>>> nucleotides (i.e., no structural variants).
>>>>>
>>>>> A variant is considered a SNV if the nucleotide sequences in both
>>>>> ref(vcf) and alt(x) are of length 1. I have a question about how variants
>>>>> with multiple 'ALT' values should be handled.
>>>>>
>>>>> Should we consider row 4 a SNV? One 'ALT' is length 1, the other is not.
>>>>>
>>>>> ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
>>>>> REF <- DNAStringSet(c("G", c("AA"), "T", "G"))
>>>>>
>>>>>> DataFrame(REF, ALT)
>>>>>>>
>>>>>> DataFrame with 4 rows and 2 columns
>>>>>> REF ALT
>>>>>> <DNAStringSet> <DNAStringSetList>
>>>>>> 1 G A
>>>>>> 2 AA TT
>>>>>> 3 T G,A
>>>>>> 4 G TT,C
>>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>> Valerie
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list