[BioC] creating GRanges and reading BAM files
hpages at fhcrc.org
Thu Mar 8 21:11:36 CET 2012
Hi Michael, Dave,
On 03/07/2012 09:23 AM, Michael Lawrence wrote:
> Herve, do you think we could support the P in the CIGAR strings? Or does it
> complicate things too much?
According to the SAM spec:
P: padding (silent deletion from padded reference)
Not clear to me what they mean exactly. In particular, what's
the "padded reference"?
So I looked at the example provided in section 1.1 of the spec
(please use a fixed-width font to display this message):
Coor 12345678901234 5678901234567890123456789012345
The corresponding SAM format is:
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
As you can see, P is used in the cigar of read r002.
So yes the reference sequence is padded with 2 consecutive * but
that's just a displaying trick here that makes it easier to display
the 2-letter insertion in r001/1.
So I'm confused about how relevant the 1P operation in the r002
cigar is. The same cigar but without the 1P operation would be
3S6M1I4M, and it tells me all I need to know about how the read
is aligned to the reference (not to the *padded* reference, why
should I care?).
Am I missing something or are there situations where the P
actually carries more interesting/important meaning? Otherwise,
adapting the cigar utils in GenomicRanges will be easy: they'll
just ignore the P's :-)
But I'd really like to have a closer look at those BAM files
produced by the Roche's software first. Dave do you think you
can put this file (the one with P's), or a subset of it,
somewhere online where I can access it?
> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at hotmail.com> wrote:
>> Thanks a lot Michael.
>> I will modify my pseudo-bed file accordingly.
> If you do actually get your BED file into spec, then you could use
> rtracklayer::import to load it directly as a GRanges. You'll still need to
> set the seqlengths on it.
>> Regarding padding not being supported, and while waiting to see if it ever
>> gets support so that I can continue with analysis, is there any tool that
>> will automatically unpad the alignment? or do people create their own
>> scripts to remove the P flags? That is all it needs right?
>> Date: Wed, 7 Mar 2012 07:21:31 -0800
>> Subject: Re: [BioC] creating GRanges and reading BAM files
>> From: lawrence.michael at gene.com
>> To: dasolexa at hotmail.com
>> CC: bioconductor at r-project.org
>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at hotmail.com> wrote:
>> Hi list,
>> sorry for a simple question, but I am a newbie a bit lost reading all the
>> information on how to handle NGS data using R tools. I have a set of BAM
>> files from Junior sequencer, one BAM per amplicon per sample (Roche's
>> software does not output one single BAM file per sample including all
>> amplicons). I also have a bed file with the features sequenced. At the
>> moment I am only testing with two amplicons for one sample. My final aim is
>> to get the coverage per amplicon per sample
>> I am using R version 2.14.2 (2012-02-29)
>> I read in the bed file and tried to create a GRanges object:
>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE)
>> chromosome start end
>> 1 APOBex26_M13 1 478
>> 2 APOBex29_M13 1 448
>> This does not appear to be a valid BED file. There should not be any
>> column names, and the intervals should be 0-based. But anyway, let's just
>> say it is BED-like.
>> GRanges with 2 ranges and 0 elementMetadata values:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>>  APOBex26_M13 [1, 478] *
>>  APOBex29_M13 [1, 448] *
>> APOBex26_M13 APOBex29_M13
>> NA NA
>> I also read in the list of BAM files vand convert to BamFileList:
>>> fls = list.files(pattern="*bam",full=TRUE)
>>  "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam"
>>  "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam"
>> bfs<- BamFileList(fls)
>> I then try to summarizeOverlaps but gives me the following error.
>>> olap<-summarizeOverlaps(gr, bfs)
>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") :
>> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported
>> yet, sorry!
>> What is the meaning of this error and how can I overcome it?
>> Well I would just take it at face value: the P cigar operation is not
>> supported yet. Maybe it is time to start supporting it.. :)
>> Is the gr object not created properly? Is Metadata necessary? Why is not
>> seqlengths filled automatically using the IRanges?
>> There is no way to know the seqlengths automatically. You've constructed a
>> GRanges with features that lay on some typically longer sequence, but the
>> sequence length was never specified. You can pass your "end" values as the
>> "seqlengths" to the GRanges constructor.
>> Thanks for your help,
>> [[alternative HTML version deleted]]
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> Search the archives:
> [[alternative HTML version deleted]]
> Bioconductor mailing list
> Bioconductor at r-project.org
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor