[BioC] [devteam-bioc] Create consensus fasta file from indexed Bam file
Jacques LEGOUIS
Jacques.LEGOUIS at clermont.inra.fr
Sat Jun 22 10:30:28 CEST 2013
Hi Hervé,
Just tested what you proposed on my files. It worked very fine.
Thanks a lot for the help.
Regards
> Hi Jacques,
>
> I just added a tool to the Rsamtools package, that makes this easy.
> It's the sequenceLayer() function. Here is how to use it:
>
> (1) Load the BAM file into a GAlignments object:
>
> library(Rsamtools)
> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what="seq")
> gal <- readGAlignmentsFromBam(bamfile, param=param)
> qseq <- mcols(gal)$seq # the query sequences
>
> (2) Use sequenceLayer() to "lay" the query sequences on the reference
> space. This will remove the parts from the query sequences that
> correspond to insertions and soft clipping, and it will fill them
> with - where deletions and/or skipped regions occurred:
>
> qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
> from="query", to="reference")
>
> (3) Compute 1 consensus matrix per chromosome:
>
> qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
> qseq_pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
>
> cm_by_chrom <- lapply(names(qseq_pos_by_chrom),
> function(seqname)
> consensusMatrix(qseq_on_ref_by_chrom[[seqname]],
> as.prob=TRUE,
> shift=qseq_pos_by_chrom[[seqname]]-1,
> width=seqlengths(gal)[[seqname]]))
> names(cm_by_chrom) <- names(qseq_pos_by_chrom)
>
> ## 'cm_by_chrom' is a list of consensus matrices. Each matrix
> ## has 17 rows (1 per letter in the DNA alphabet) and 1 column
> ## per chromosome position.
>
> (4) Compute the consensus string from each consensus matrix. We'll
> put "+" ins the strings wherever there is no coverage for that
> position, and "N" where there is coverage but no consensus.
>
> cs_by_chrom <- lapply(cm_by_chrom,
> function(cm) {
> ## Because consensusString() doesn't like consensus
> ## matrices with columns that contain only zeroes (and
> ## you will have columns like for chromosome positions
> ## that don't receive any coverage), we need to "fix"
> ## 'cm' first.
> idx <- colSums(cm) == 0
> cm["+", idx] <- 1
> DNAString(consensusString(cm, ambiguityMap="N"))
> })
>
> (5) Write 'cs_by_chrom' to a FASTA file:
>
> writeXStringSet(DNAStringSet(cs_by_chrom), "consensus.fa")
>
> Please let us know how this goes with your 454 data.
>
> Note that you will need the 1.13.18 version of Rsamtools for this,
> which is the latest devel version of the package. This means you need
> to install BioC devel. See our website for how to do this.
> Rsamtools 1.13.18 will be available in the next 24 hours or so thru
> biocLite().
>
> Cheers,
> H.
>
>
> On 06/19/2013 02:54 AM, Maintainer wrote:
>>
>> Hello,
>> I have an indexed bam files from 454 sequencing on short reference
>> sequences. I would like to create a consensus sequence in fasta
>> format from this bam file using R Bioconductor. Is there a way to do
>> that simply ?
>> Thanks in advance
>> Jacques
>>
>> -- output of sessionInfo():
>>
>> R version 2.15.1 (2012-06-22)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
>> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
>> [5] LC_TIME=French_France.1252
>>
>> attached base packages:
>> [1] grDevices datasets splines graphics stats tcltk utils
>> [8] methods base
>>
>> other attached packages:
>> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7
>> [4] IRanges_1.16.6 BiocGenerics_0.4.0 TinnR_1.0-5
>> [7] R2HTML_2.2.1 Hmisc_3.10-1.1 survival_2.36-14
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5 cluster_1.14.4 grid_2.15.1 lattice_0.20-15
>> [5] parallel_2.15.1 stats4_2.15.1 tools_2.15.1 zlibbioc_1.4.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> ________________________________________________________________________
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>
--
Jacques Le Gouis
INRA, UMR INRA/UBP 1095
Génétique, Diversité et Ecophysiologie des Céréales
Domaine de Crouelle
5 Chemin de Beaulieu
63039 Clermont Ferrand Cedex 2
Tel : +33 (0)4-73-62-43-11
Fax : +33 (0)4-73-67-18-39
e-mail : jlegouis at clermont.inra.fr
http://www4.clermont.inra.fr/umr1095/abc
More information about the Bioconductor
mailing list