[BioC] readGappedAlignments, package:GenomicRanges

Hervé Pagès hpages at fhcrc.org
Fri Mar 16 21:12:04 CET 2012


A small correction. See below...

On 03/16/2012 01:03 PM, Hervé Pagès wrote:
> Hi Milica,
>
> On 03/16/2012 07:28 AM, Milica Krunic wrote:
>> Hello!
>>
>> I am working on RNA Seq. I mapped the reads to a genome using bwa. Now I
>> want to get the count tables (number of reads per genomic feature),
>> and one
>> of the steps include conversion of .sam files into sorted .bam files, and
>> creating a GRangesList out of sorted .bam files. Before the last
>> mentioned
>> step, I also applied readGappedAlignments to read in the sorted .bam
>> files.
>> Code, tail of tested .sam file, and contest of variable bwa_output are
>> following:
>>
>> *Code*:
>>
>> fcatusDb=makeTranscriptDbFromBiomart(biomart="ensembl",
>> dataset="fcatus_gene_ensembl")
>> tx=transcriptsBy(fcatusDb)
>>
>> bwa_output=readGappedAlignments(file_name1, format="BAM")
>>
>> bwa_grglist=grglist(bwa_output)
>>
>> rc=countOverlaps(tx, bwa_grglist[countOverlaps(bwa_grglist,tx)==1])
>>
>>
>>
>> *test.sam*:
>>
>> @SQ SN:GeneScaffold_295 LN:1999586
>> @SQ SN:GeneScaffold_5264 LN:2008247
>> @SQ SN:GeneScaffold_3594 LN:2061208
>> @SQ SN:GeneScaffold_3107 LN:2081283
>> HWUSI-EAS475:1:1:9547:3828#0 0 GeneScaffold_1986 3069 0
>> 41M * 0 0
>> CCACATCTCTCCCAGTTTCTTTGCAACATCACCAATGGATA * XT:A:R NM:i:0
>> X0:i:3 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:41
>> XA:Z:scaffold_214963,-224762,41M,0;GeneScaffold_4609,+41298,41M,0;GeneScaffold_886,-415265,41M,1;
>>
>> HWUSI-EAS475:1:1:5401:3825#0 16 scaffold_96256 459 0 41M
>> * 0 0 GTTGGCTCTCAGGGGTACATGATGAACTGGGGAAGGAGTAT
>> * XT:A:R NM:i:0 X0:i:3 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:41
>> XA:Z:scaffold_166015,+60068,41M,0;scaffold_5553,+4693,41M,0;
>> HWUSI-EAS475:1:1:8428:7999#0 0 GeneScaffold_4103 98907 37
>> 41M * 0 0
>> CCTGCACGTTCATTGTGTGTGTCTTGAGTTGATTTGTCACC * XT:A:U NM:i:1
>> X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:11T29
>> HWUSI-EAS475:1:1:5814:7999#0 0 scaffold_75884 112 14 41M
>> * 0 0 CCCAGAAGTCCCAGGTGTTCCCATTCTGTGATCAGCACAGA
>> * XT:A:U NM:i:0 X0:i:1 X1:i:8 XM:i:0 XO:i:0 XG:i:0 MD:Z:41
>> XA:Z:scaffold_99844,+4698,41M,1;scaffold_20760,+365,41M,1;scaffold_95772,+1923,41M,1;scaffold_16284,-3782,41M,1;scaffold_22266,-52,41M,1;scaffold_82279,-5356,41M,1;scaffold_75127,-757,41M,1;scaffold_82292,-3298,41M,1;
>>
>>
>>
>>
>> *bwa_output:*
>> *
>> *
>> *
>> GappedAlignments with 4 alignments and 0 elementMetadata values:
>> seqnames strand cigar qwidth start end
>> <Rle> <Rle> <character> <integer> <integer> <integer>
>> [1] scaffold_75884 + 41M 41 112 152
>> [2] scaffold_96256 - 41M 41 459 499
>> [3] GeneScaffold_1986 + 41M 41 3069 3109
>> [4] GeneScaffold_4103 + 41M 41 98907 98947
>> width ngap
>> <integer> <integer>
>> [1] 41 0
>> [2] 41 0
>> [3] 41 0
>> [4] 41 0
>> ---
>>
>>
>> *
>> *
>> *
>>
>> According the manual ("Multi-reads won't receive any special treatment
>> i.e.
>> the various SAM/BAM records corresponding to a multi-read will show up in
>> the GappedAlignments object as if they were coming from
>> different/unrelated
>> queries.") and taken into account previously mentioned, I would be
>> grateful
>> if someone can confirm me that readGappedAlignments (output of bwa gives
>> .sam files which contain all information about mapping locations of one
>> read in one line) will only take the first hit, no matter what is
>> following
>> (whether that read is mapped to several equally scored positions...).
>
> This part of the manual is trying to emphasize the fact that
> readGappedAlignments() is returning a GappedAlignments object with
> *one* element per record in the BAM file. readGappedAlignments()
> is pretty low-level i.e. it doesn't try to merge or to drop records
> when a read has multiple hits or when reads are paired-end.
> By default it loads all the records that can be loaded. The only
> exception is for unmapped reads: those are never loaded in the
> GappedAlignments object.
>
> So, by default, records corresponding to the same read (i.e. records
> with the same QNAME value, note that you can use use.names=TRUE to
> set this field as the names of the returned object) will be loaded.
> And you will typically end up with duplicated names on your
> GappedAlignments object.
>
> AFAIK the reasons for finding more than 1 record with the same QNAME
> value in a BAM file are because:
> (a) more than 1 hit were reported in the BAM file by the aligner;
> (b) or the reads are paired-end, then 2 records are needed to
> report a "paired-hit";
> (c) or a combination of (a) and (b), e.g. if the aligner found 3
> "paired-hits" for a given paired-end read, then 6 records
> are needed.
>
> Here are the 2 most common strategies for dealing with single-end
> multi-reads (i.e. single-end reads with more than 1 record):
>
> 1. AFAIU most aligner should/will set flag bit 0x100 to 1
> for hits that are considered secondary. So you can filter
> out those records at load time with:
>
> library(Rsamtools)
> param <- ScanBamParam(flag=scanBamFlag(isNotPrimaryRead=FALSE))
> bwa_output <- readGappedAlignments(file_name1, param=param)
>
> My understanding of the SAM spec is that, in the case of
> single-end reads, there should be only 1 primary record in the
> BAM file for a given QNAME.
>
> 2. Aligners will sometime set predefined tag NH (number of hits)
> for each record. You can load this tag with:
>
> param <- ScanBamParam(tag="NH")
> bwa_output <- readGappedAlignments(file_name1, param=param)
>
> and then subset 'bwa_output' based on the value of this tag:
>
> bwa_output[elementMetadata(bwa_output)$NH == 1L]
>
> Note that this filtering is more stringent than 1.: it will not
> only drop all the secondary records but it will also drop the
> primary record of a read that has secondary records.
>
> For paired-end reads the situation is a little bit more complicated
> but there are tools currently in development in Rsamtools/GenomicRanges
> that will make it easy to load them, keep the pairing information
> and filter out "secondary pairs".
>
>>
>> If this is the case, would you filter the .sam files for uniquely mapped
>> reads before reading it via readGappedAlignments? Or will you let the
>> countOverlaps do it?
>
> The precise meaning of "uniquely mapped" depends on the context. When
> we talk about reads with unique/multiple SAM records, it means "uniquely
> mapped to the reference genome". But when you do:
>
> countOverlaps(bwa_grglist,tx)==1
>
> you are selecting alignments (i.e. elements in your GappedAlignments
> object 'bwa_grglist') that hit only 1 gene (because of how you made
> 'bwa_grglist', i.e. by grouping transcripts by gene, the hit must
> actually be in one of the gene exons in order to be counted).

Sorry that's not correct. I should say: the hit is counted if it's
withing the boundaries of at least 1 of the transcripts of the gene.

>
> So the 2 filterings (i.e. filter the .sam files for uniquely mapped
> reads and filtering with countOverlaps(bwa_grglist,tx)==1) are
> independent. A read can have multiple hits in the reference genome
> while still hitting a single gene. And, inversely, a read can have
> a unique hit in the reference genome and hit more than 1 gene (e.g.
> if 2 genes have exons that overlap, probably a rare but not

if 2 genes have transcripts that overlap (looking at their boundaries
only, regardless of the exon/intron structure)

Cheers,
H.

> impossible event). So you can do either filtering or both of them.
>
> I'm not sure there is a "one size fits all" strategy. I would try
> different solutions and compare them... Note that there are also
> higher level tools available for doing this kind of counting e.g.
> summarizeOverlaps() in GenomicRanges (see the summarizeOverlaps.pdf
> vignette in that package) and the htseq-count script distributed
> with the HTSeq Python framework (see the DESeq.pdf vignette in the
> DESeq package). It's a tough question so it's always good to see how
> other people have tried to solve it.
>
> HTH,
> H.
>
>>
>>
>>
>> Thank you!
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list