[Bioc-sig-seq] Rsamtools readPileup with Mosaik alignments

Martin Morgan mtmorgan at fhcrc.org
Thu Apr 22 14:01:59 CEST 2010


Hi James --

On 04/21/2010 03:34 PM, James MacDonald wrote:
> Hi,
> 
> I have found what I think is an infelicity with readPileup() when the 
> alignment is done with software other than say BWA or Bowtie.

yes readPileup was written to parse output from samtools pileup, which
it seems is quite different from the pileup format documented at, e.g.,

http://samtools.sourceforge.net/pileup.shtml

I will add support for this to Rsamtools, probably not before early next
week.

Martin

> 
> I am aligning SOLiD data, so my choices are rather limited. In order to 
> do gapped alignments, I used Mosaik to do the alignment and then 
> converted to bam files and proceeded from there using samtools.
> 
> When I use the readPileup() function, if I choose variant="SNP", it 
> appears I am not reading in all the SNPs, and if I choose 
> variant="indel", it appears I am reading in SNPs as well as indels.
> 
> I think this arises because of an expectation for the pileup format that 
> is outlined on this page:
> 
> http://samtools.sourceforge.net/cns0.shtml 
> 
> where an indel should look like this, in the pileup output:
> 
> seq2  156 A  A  10  0  99  11  .$......+2AG.+2AG.+2AGGG <975;:<<<<<
> seq2  156 *  +AG/+AG  71  252  99  11  +AG  *  3  8  0
> 
> In this case, both lines indicate an indel at the same position, so 
> readPileup() with variant="SNP" will search for lines containing "*" in 
> the third column, and remove a given line and the line preceding it. So 
> in this case, both of these lines would be removed, as expected.
> 
> If variant="indel", then data from both lines are used, which is OK for 
> the data above, but not so much for me.
> 
> This is because the output from pileup that I get when using the Mosaik 
> aligned data doesn't follow that convention. Instead it looks like this:
> 
> chr10 50620982 G A 45 72 31 16 AAAAaaaaAaaaAAac  =<><@=9>@'7:@@:&
> chr10 50623836 G A 60 63 30 19 AA.CAAAAAAAAAaAAAAA <9/::@<@A?:A@>4=:AB
> chr10 50650827 * */-cca 162 162 22 26 * -cca 24 2 0 0 0
> chr10 50650919 * +A/* 182 182 20 27 +A * 5 20 2 0 0
> chr10 50651118 * */-t 72 72 29 51 * -t 48 3 0 0 0
> chr10 50651155 * */-t 145 145 19 21 * -t 13 4 4 0 0
> chr10 50651269 C Y 46 53 42 16 ,tT,,,,TtttTt,,, @A9;<<7<@>8>;0,,
> 
> So when I read these data using readPileup(<thefile>, variant="SNP"), I 
> lose the SNP in the second line. If I read in indels using 
> variant="indel", then the SNP in the second line will be read in.
> 
> I don't know if there are also instances where an indel takes two lines 
> as in the samtools example above, but that would be the minority if so.
> 
> Best,
> 
> Jim
> 
> 
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list