[BioC] readGappedAlignments, package:GenomicRanges

Fri Mar 16 21:03:11 CET 2012

Hi Milica,

On 03/16/2012 07:28 AM, Milica Krunic wrote:
> Hello!
>
> I am working on RNA Seq. I mapped the reads to a genome using bwa. Now I
> want to get the count tables (number of reads per genomic feature), and one
> of the steps include conversion of .sam files into sorted .bam files, and
> creating a GRangesList out of sorted .bam files. Before the last mentioned
> step, I also applied readGappedAlignments to read in the sorted .bam files.
> Code, tail of tested .sam file, and contest of variable bwa_output are
> following:
>
> *Code*:
>
> fcatusDb=makeTranscriptDbFromBiomart(biomart="ensembl",
> dataset="fcatus_gene_ensembl")
> tx=transcriptsBy(fcatusDb)
>
> bwa_output=readGappedAlignments(file_name1, format="BAM")
>
> bwa_grglist=grglist(bwa_output)
>
> rc=countOverlaps(tx, bwa_grglist[countOverlaps(bwa_grglist,tx)==1])
>
>
>
> *test.sam*:
>
> @SQ     SN:GeneScaffold_295     LN:1999586
> @SQ     SN:GeneScaffold_5264    LN:2008247
> @SQ     SN:GeneScaffold_3594    LN:2061208
> @SQ     SN:GeneScaffold_3107    LN:2081283
> HWUSI-EAS475:1:1:9547:3828#0    0       GeneScaffold_1986       3069    0
>      41M     *       0       0
> CCACATCTCTCCCAGTTTCTTTGCAACATCACCAATGGATA       *       XT:A:R  NM:i:0
>   X0:i:3  X1:i:1  XM:i:0  XO:i:0  XG:i:0  MD:Z:41
> XA:Z:scaffold_214963,-224762,41M,0;GeneScaffold_4609,+41298,41M,0;GeneScaffold_886,-415265,41M,1;
> HWUSI-EAS475:1:1:5401:3825#0    16      scaffold_96256  459     0       41M
>      *       0       0       GTTGGCTCTCAGGGGTACATGATGAACTGGGGAAGGAGTAT
> *       XT:A:R  NM:i:0  X0:i:3      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:41
> XA:Z:scaffold_166015,+60068,41M,0;scaffold_5553,+4693,41M,0;
> HWUSI-EAS475:1:1:8428:7999#0    0       GeneScaffold_4103       98907   37
>       41M     *       0       0
> CCTGCACGTTCATTGTGTGTGTCTTGAGTTGATTTGTCACC       *       XT:A:U  NM:i:1
>   X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:11T29
> HWUSI-EAS475:1:1:5814:7999#0    0       scaffold_75884  112     14      41M
>      *       0       0       CCCAGAAGTCCCAGGTGTTCCCATTCTGTGATCAGCACAGA
> *       XT:A:U  NM:i:0  X0:i:1      X1:i:8  XM:i:0  XO:i:0  XG:i:0  MD:Z:41
> XA:Z:scaffold_99844,+4698,41M,1;scaffold_20760,+365,41M,1;scaffold_95772,+1923,41M,1;scaffold_16284,-3782,41M,1;scaffold_22266,-52,41M,1;scaffold_82279,-5356,41M,1;scaffold_75127,-757,41M,1;scaffold_82292,-3298,41M,1;
>
>
>
> *bwa_output:*
> *
> *
> *
> GappedAlignments with 4 alignments and 0 elementMetadata values:
>                 seqnames strand       cigar    qwidth     start       end
>                    <Rle>   <Rle>  <character>  <integer>  <integer>  <integer>
>    [1]    scaffold_75884      +         41M        41       112       152
>    [2]    scaffold_96256      -         41M        41       459       499
>    [3] GeneScaffold_1986      +         41M        41      3069      3109
>    [4] GeneScaffold_4103      +         41M        41     98907     98947
>            width      ngap
>        <integer>  <integer>
>    [1]        41         0
>    [2]        41         0
>    [3]        41         0
>    [4]        41         0
>    ---
>
>
> *
> *
> *
>
> According the manual ("Multi-reads won't receive any special treatment i.e.
> the various SAM/BAM records corresponding to a multi-read will show up in
> the GappedAlignments object as if they were coming from different/unrelated
> queries.") and taken into account previously mentioned, I would be grateful
> if someone can confirm me that readGappedAlignments (output of bwa gives
> .sam files which contain all information about mapping locations of one
> read in one line) will only take the first hit, no matter what is following
> (whether that read is mapped to several equally scored positions...).

This part of the manual is trying to emphasize the fact that
readGappedAlignments() is returning a GappedAlignments object with
*one* element per record in the BAM file. readGappedAlignments()
is pretty low-level i.e. it doesn't try to merge or to drop records
when a read has multiple hits or when reads are paired-end.
By default it loads all the records that can be loaded. The only
exception is for unmapped reads: those are never loaded in the
GappedAlignments object.

So, by default, records corresponding to the same read (i.e. records
with the same QNAME value, note that you can use use.names=TRUE to
set this field as the names of the returned object) will be loaded.
And you will typically end up with duplicated names on your
GappedAlignments object.

AFAIK the reasons for finding more than 1 record with the same QNAME
value in a BAM file are because:
   (a) more than 1 hit were reported in the BAM file by the aligner;
   (b) or the reads are paired-end, then 2 records are needed to
       report a "paired-hit";
   (c) or a combination of (a) and (b), e.g. if the aligner found 3
       "paired-hits" for a given paired-end read, then 6 records
       are needed.

Here are the 2 most common strategies for dealing with single-end
multi-reads (i.e. single-end reads with more than 1 record):

   1. AFAIU most aligner should/will set flag bit 0x100 to 1
      for hits that are considered secondary. So you can filter
      out those records at load time with:

        library(Rsamtools)
        param <- ScanBamParam(flag=scanBamFlag(isNotPrimaryRead=FALSE))
        bwa_output <- readGappedAlignments(file_name1, param=param)

      My understanding of the SAM spec is that, in the case of
      single-end reads, there should be only 1 primary record in the
      BAM file for a given QNAME.

   2. Aligners will sometime set predefined tag NH (number of hits)
      for each record. You can load this tag with:

        param <- ScanBamParam(tag="NH")
        bwa_output <- readGappedAlignments(file_name1, param=param)

      and then subset 'bwa_output' based on the value of this tag:

        bwa_output[elementMetadata(bwa_output)$NH == 1L]

      Note that this filtering is more stringent than 1.: it will not
      only drop all the secondary records but it will also drop the
      primary record of a read that has secondary records.

For paired-end reads the situation is a little bit more complicated
but there are tools currently in development in Rsamtools/GenomicRanges
that will make it easy to load them, keep the pairing information
and filter out "secondary pairs".

>
> If this is the case, would you filter the .sam files for uniquely mapped
> reads before reading it via readGappedAlignments? Or will you let the
> countOverlaps do it?

The precise meaning of "uniquely mapped" depends on the context. When
we talk about reads with unique/multiple SAM records, it means "uniquely
mapped to the reference genome". But when you do:

   countOverlaps(bwa_grglist,tx)==1

you are selecting alignments (i.e. elements in your GappedAlignments
object 'bwa_grglist') that hit only 1 gene (because of how you made
'bwa_grglist', i.e. by grouping transcripts by gene, the hit must
actually be in one of the gene exons in order to be counted).

So the 2 filterings (i.e. filter the .sam files for uniquely mapped
reads and filtering with countOverlaps(bwa_grglist,tx)==1) are
independent. A read can have multiple hits in the reference genome
while still hitting a single gene. And, inversely, a read can have
a unique hit in the reference genome and hit more than 1 gene (e.g.
if 2 genes have exons that overlap, probably a rare but not
impossible event). So you can do either filtering or both of them.

I'm not sure there is a "one size fits all" strategy. I would try
different solutions and compare them... Note that there are also
higher level tools available for doing this kind of counting e.g.
summarizeOverlaps() in GenomicRanges (see the summarizeOverlaps.pdf
vignette in that package) and the htseq-count script distributed
with the HTSeq Python framework (see the DESeq.pdf vignette in the
DESeq package). It's a tough question so it's always good to see how
other people have tried to solve it.

HTH,
H.

>
>
>
> Thank you!
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319