Yes, the chr.select parameter will allow you to import selected chromosomes
only from your bam files. I recommend to create an according bam index file
(samtools) and to store it together with the bam file in the same directory.

Lukas


On Tue, Apr 1, 2014 at 1:49 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote:

> On 04/01/2014 12:52 PM, Vining, Kelly wrote:
>
>> This is a follow-up question about the most efficient way to use
>> functions in Rsamtools to filter my bam files so that only alignments to
>> chromosomes are included (extrachromosomal scaffold alignments are
>> excluded).
>>
>
> I think from your original question you are really looking to provide the
> names of the sequences in your BSgenome object as a value to the chr.select
> argument of MEDIPS.createSet, I *think* chr.select=seqnames(BSgenome) (be
> sure to only include seqnames defined in the bam file, too).
>
>
>>  From the Rsamtools documentation, it looks like there are two ways to
>>> accomplish this:
>>>
>> 1) bamFile(bamFilter())
>>
>
> ??? I'm not sure what you're talking about here, neither bamFile nor
> bamFilter are functions defined in Rsamtools ???
>
> Use filterBam() to create a new bam file from an old bam files. It's use
> is illustrated in an example on the help page ?filterBam.
>
>  2) bamFile(ScanBamParam())
>>
>
> ScanBamParam() is a way to selectively input data, e.g., with the
> readGAlignments or scanBam functions. Its use is illustrated for instance
> on p. 2 of the vignette available in the Rsamtools package and at
>
> http://bioconductor.org/packages/release/bioc/
> vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf
>
> MEDIPS.createSet uses a ScanBamParam internally, but as a user you do not
> have access to it.
>
> Martin
>
>
>> For the first method, a "destination" output file name has to be
>> designated. I don't want to create a separate file, so it appears that the
>> second method is what I want, as it allows control of which records are
>> imported and doesn't output a separate file.
>>
>> I set up a FilterRules object like so:
>>
>>> filt <- FilterRules(list(isChr = function(x) rname(x) == "^Chr"))
>>>
>>
>> I also set up a "which" variable to contain the seqnames in my BSgenome:
>>
>>> which <- seqnames(BSgenome)
>>>
>>
>> However, I'm getting stuck trying to apply either of these to
>> ScanBamParam. Perhaps I want ScanBamWhat instead of ScanBamParam? Can
>> someone clue me in about the proper syntax to accomplish this simple
>> filtering operation?
>>
>> Thanks for any help,
>> --Kelly
>>
>> ________________________________________
>> From: bioconductor-bounces@r-project.org [bioconductor-bounces@r-
>> project.org] on behalf of Vining, Kelly [Kelly.Vining@oregonstate.edu]
>> Sent: Tuesday, April 01, 2014 8:12 AM
>> To: 'Lukas Chavez'
>> Cc: Steve Lianoglou; bioconductor@r-project.org
>> Subject: Re: [BioC] MEDIPS.createSet error
>>
>> I see, Lukas. That makes sense. Thanks for explaining it. So, MEDIPS
>> looks in the BSgenome object for seqnames only, and doesn't look in
>> mseqnames. It seems like a lot of users of the MEDIPS package must
>> encounter this problem, because most genomes have sets of unassembled
>> scaffolds that, following the BSgenome forging instructions, inevitably end
>> up as mseq objects. It would be great if a future version of MEDIPS looked
>> for both types of sequences in the BSgenome.
>>
>> I also meant to reply to Steve's question about this line from the
>> vignette:
>> BSgenome="BSgenome.Hsapiens.UCSC.hg19"
>>
>> In my case, with my custom genome, this would be:
>> BSgenome = "BSgenome.Ptrichocarpa.Phytozome.v3"
>>
>> But when I do this, the literal BSgenome.Ptrichocarpa.Phytozome.v3 is
>> assigned to the BSgenome variable as a character vector. So instead, I do
>> this:
>> BSgenome = BSgenome.Ptrichocarpa.Phytozome.v3
>>
>> That works. So maybe quotes are only required for genomes listed under
>> available_genomes()?
>>
>> Meanwhile, for my MEDIPS analysis, I'm willing to go back to the bam
>> files and filter out the scaffold alignments. It looks like Rsamtools has
>> methods for this. Should I be trying to set up a FilterRules instance, or a
>> ScanBamParam instance? Any help with the syntax for this step will be
>> appreciated.
>>
>> I'm sure this is not going to work, but I'll start with something like
>> this:
>> bamFilt_Fall1 <- BamFile(filterBam(file, filter=FilterRules("scaffold")))
>>
>>
>> Thanks much,
>> --Kelly
>>
>> From: Lukas Chavez [mailto:lukas.chavez.mailings@googlemail.com]
>> Sent: Tuesday, April 01, 2014 12:32 AM
>> To: Vining, Kelly
>> Cc: Steve Lianoglou; bioconductor@r-project.org
>> Subject: Re: [BioC] MEDIPS.createSet error
>>
>> Thank you Steve for explaining in detail how to look at the content of
>> the BSgenome object and the bam file.
>>
>> When MEDIPS imports the bam file, it will try to extract information for
>> the 'scaffold_' chromosomes from the BSgenome object. The initially
>> reported error
>>
>> Calculating genomic coordinates...Error in vector(length =
>> supersize_chr[length(
>> chromosomes)], mode = "character") :
>>    vector size cannot be NA/NaN
>> occurs because MEDIPS is trying to find the length of a chromosome given
>> in the bam file, but does not find the chromosome in the BSgenome object.
>>  Subsequently, MEDIPS tries to calculate genomic coordinates for the
>> chromosome wide windows, but the length of the chromosome is NA. I
>> understand that it will be necessary to catch this error and add a warning
>> message in a future version of MEDIPS.
>> If you want to keep your hits on the 'scaffold_' chromosomes in the bam
>> file, and to include all scaffolds in your further analysis, you will have
>> to recreate your BSgenome object treating each scaffold as an individual
>> chromosome just as the other assembled chromosomes.
>> Lukas
>>
>>
>>
>> On Mon, Mar 31, 2014 at 3:05 PM, Vining, Kelly <
>> Kelly.Vining@oregonstate.edu<mailto:Kelly.Vining@oregonstate.edu>> wrote:
>> Thanks, Steve, for your advice - appreciate you working with a package
>> that isn't familiar to you.
>> Thanks also for catching that I didn't have Rsamtools installed.
>>
>> I think this output that you suggested I generate gives an idea about
>> what might be going on:
>>
>>
>>  si.bsg
>>>
>> Seqinfo of length 19
>> seqnames seqlengths isCircular genome
>> Chr01      50495391      FALSE    3.0
>> Chr02      25263035      FALSE    3.0
>> Chr03      21816808      FALSE    3.0
>> Chr04      24267051      FALSE    3.0
>> Chr05      25890704      FALSE    3.0
>> ...             ...        ...    ...
>> Chr15      15278577      FALSE    3.0
>> Chr16      14494361      FALSE    3.0
>> Chr17      16080358      FALSE    3.0
>> Chr18      16958300      FALSE    3.0
>> Chr19      15942145      FALSE    3.0
>>
>>> si.bam
>>>
>> Seqinfo of length 1446
>> seqnames      seqlengths isCircular genome
>> Chr01           50495391       <NA>   <NA>
>> Chr02           25263035       <NA>   <NA>
>> Chr03           21816808       <NA>   <NA>
>> Chr04           24267051       <NA>   <NA>
>> Chr05           25890704       <NA>   <NA>
>> ...                  ...        ...    ...
>> scaffold_3584       3654       <NA>   <NA>
>> scaffold_3595       2008       <NA>   <NA>
>> scaffold_3648       2796       <NA>   <NA>
>> scaffold_3664       3053       <NA>   <NA>
>> scaffold_3681       1022       <NA>   <NA>
>>
>>
>> When I created the reference genome structure, the scaffolds were all
>> entered in a single, multiple seq object, following instructions. But in my
>> bam files, the scaffold hits are as shown above. So that could be one
>> problem, but I don't know how to solve that other than filtering out all of
>> the alignments to the scaffolds from the bam files. I really would like to
>> retain the scaffold alignments and not lose all of that precious data.
>>
>> The other potential issue is the "NA" entries in the "isCircular" and
>> "genome" columns. I'm not sure whether those would be a problem.
>>
>> Continued thanks,
>> --Kelly
>>
>> ________________________________________
>> From: Steve Lianoglou [lianoglou.steve@gene.com<mailto:
>> lianoglou.steve@gene.com>]
>> Sent: Monday, March 31, 2014 1:15 PM
>> To: Vining, Kelly
>> Cc: Lukas Chavez; bioconductor@r-project.org<mailto:
>> bioconductor@r-project.org>
>> Subject: Re: [BioC] MEDIPS.createSet error
>> Hi,
>>
>> Caveat being that I've never used MEDIPS, and I'm just going along with
>> the vignette.
>>
>> So, comments inline:
>>
>>
>> On 31 Mar 2014, at 13:09, Vining, Kelly wrote:
>>
>>  Hi,
>>> Thanks for the advice thus far! To confirm what is in my BSgenome
>>> variable, I did this:
>>>
>>>  class(BSgenome)
>>>>
>>> [1] "BSgenome"
>>> attr(,"package")
>>> [1] "BSgenome"
>>>
>>>> names(BSgenome)
>>>>
>>> [1] "Chr01" "Chr02" "Chr03" "Chr04" "Chr05" "Chr06" "Chr07" "Chr08"
>>> "Chr09"
>>> [10] "Chr10" "Chr11" "Chr12" "Chr13" "Chr14" "Chr15" "Chr16" "Chr17"
>>> "Chr18"
>>> [19] "Chr19" "scaf"
>>>
>>> And then:
>>>
>>>> print(BSgenome)
>>>>
>>> Black cottonwood genome
>>> |
>>> | organism: Populus trichocarpa (Black cottonwood)
>>> | provider: Phytozome (JGI)
>>> | provider version: 3.0
>>> | release date: January 2010
>>> | release name: Populus trichocarpa v3.0
>>> |
>>> | single sequences (see '?seqnames'):
>>> |   Chr01  Chr02  Chr03  Chr04  Chr05  Chr06  Chr07  Chr08  Chr09
>>> Chr10  Chr11
>>> |   Chr12  Chr13  Chr14  Chr15  Chr16  Chr17  Chr18  Chr19
>>> |
>>> | multiple sequences (see '?mseqnames'):
>>> |   scaf
>>> |
>>> | (use the '$' or '[[' operator to access a given sequence)
>>>
>>>
>>> So that looks ok. Interestingly, when I followed the vignette and did
>>> the equivalent of
>>> BSgenome="BSgenome.Hsapiens.UCSC.hg19"
>>>
>>
>> The vignette suggests that BSgenome should be the package name of the
>> BSgenome package to open, so yours should be something like
>> "BSgenome.Ptrichocarpa.XXX.YY" or something -- I guess you built this
>> packge by yourself, or something, so you'd know its name ...
>>
>>  That didn't work for me. It only worked without quotes. If I included
>>> quotes, it just assigned a character vector to that variable.
>>>
>>
>> Sorry, what exactly didn't work for you? Can you show me the code that
>> failed?
>>
>>  Then, following your advice:
>>>
>>>  si.bsg <- seqinfo(BSgenome.Ptrichocarpa.Phytozome.v3)
>>>> si.bam <- seqinfo(BamFile("FallBud_brep1_aln_sorted.bam"))
>>>>
>>> Error in seqinfo(BamFile("FallBud_brep1_aln_sorted.bam")) :
>>> error in evaluating the argument 'x' in selecting a method for
>>> function 'seqinfo': Error: could not find function "BamFile"
>>>
>>
>> The BamFile function is defined in the Rsamtools package, you need to
>> load that first (the first line of code I suggested you run was to load
>> the Rsamtools package). Load it first, then redo the
>> seqinfo(BamFile(...)) stuff.
>>
>> -steve
>>
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.
>> science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.
>> science.biology.informatics.conductor
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

	[[alternative HTML version deleted]]