[BioC] alignShortReads
Martin Morgan
mtmorgan at fhcrc.org
Thu Aug 22 06:42:08 CEST 2013
On 08/21/2013 02:54 PM, Sonja Althammer wrote:
> Hi!
> I am having trouble aligning a DNAStringSet to a BSgenome. The example in
> the help works for me....
> Could anyone explain to me the error I am getting, please and how I can
> make it work? Actually I would like to align these reads to a specific
> gene... Is there maybe a better way?
> Thanks!
> Sonja
>
>
>> alignShortReads(reads, Hsapiens, seqNames="chr1")
> Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,
alignShortReads is from R453Plus1Toolbox.
DNAStringSet can contain ambiguity letters such as 'M', to indicate that the
nucleotide could be either A or C
> IUPAC_CODE_MAP
A C G T M R W S Y K V
"A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
H D B N
"ACT" "AGT" "CGT" "ACGT"
but the aligner in alignShortReads expects only
> DNA_BASES
[1] "A" "C" "G" "T"
If your sequences do contain just bases, it would make sense to run
Biostrings::matchPDict directly on the sequence of the gene of interest. You
could obtained the sequence of your gene(s) of interest by, e.g., using the
function ?select and the Homo.sapiens package to get coordinates of the gene,
?getSeq to extract the sequence from the Hsapiens BSgenome object you already
have, and matchPDict to do the alignment.
Probably in a way that requires more care, I did the following to arrive at the
sequence of the transcripts of BRCA1
library(BSgenome.Hsapiens.UCSC.hg19)
library(Homo.sapiens)
Discover what I can do:
## 'keytypes' I can query with...
keytypes(Homo.sapiens)
## discover the correct way to specify a gene symbol I'm interested in
grep("BRCA", keys(Homo.sapiens, "SYMBOL"))
## discover what info I can extract
cols(Homo.sapiens)
Now create ranges for transcripts
## select genomic coordinates for BRCA1
xx = select(Homo.sapiens, "BRCA1",
c("TXID", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"),
"SYMBOL")
## turn coordinates into a GRanges
yy = with(xx, GRanges(TXCHROM, IRanges(TXSTART, TXEND), TXSTRAND))
or alternatively map to the central 'Entrez' identifier and...
## figure out Entrez ID corresponding to symbol
eid = select(Homo.sapiens, "BRCA1", "ENTREZID", "SYMBOL")$ENTREZID
## extract transcript coordinates from TxDb
yy = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")[eid]
What's a gene? Maybe it's just the range of transcripts:
gn = range(yy)
Finally retrieve the relevant sequence(s)
seq = getSeq(Hsapiens, yy) ## or getSeq(Hsapiens, gn)
Judging by the package you're using, you might have longer 454 reads, where the
ungapped high fidelity matches of matchPDict are not appropriate. If you're
interested in more general alignments, or if the sequences you're trying to
align do have ambiguity letters, there is Biostrings::pairwiseAlignment. It
might still be that this general purpose aligner is not appropriate for the task
you're trying to accomplish, so some detail (and perhaps input from others on
the list!) might be needed.
Hope that provides some guidance.
Martin
> :
> non base DNA letter found in Trusted Band for pattern 248
>> class(reads)
> [1] "DNAStringSet"
> attr(,"package")
> [1] "Biostrings"
>
>> class(Hsapiens)
> [1] "BSgenome"
> attr(,"package")
> [1] "BSgenome"
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list