[BioC] R Package for retrieving sequence features
Charles C. Berry
cberry at tajo.ucsd.edu
Thu Mar 23 19:26:59 CET 2006
On Thu, 23 Mar 2006, Wuming Gong wrote:
> Dear list,
>
> I want to retrieve the sequence features, such as start/end position
> of UTR and CDS, according to given genbank accession numbers. I found
> that several R functions have the ability to retrieve the sequences
> alone from genbank, such as getSEQ() from annotate package,
> read.GenBank() from ape package (not in Bioconductor) and seqNCBI()
> from GeneR package, but none of them could retrieve the information on
> sequence features.
>
> Is there any R package that can retrieve the sequence features (just
> like get_SeqFeatures() function in BioPerl)?
I do not know of such a capability in any package, but it is not hard to
'roll your own' using the file
http://hgdownload.cse.ucsc.edu/goldenPath/hg17/database/genscan.txt.gz
Like this:
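# Column layout of the table: "a" marks character fields, 1 numeric fields.
genePred.fmt <- list(name = "a", chrom = "a", strand = "a",
                     txStart = 1, txEnd = 1, cdsStart = 1,
                     cdsEnd = 1, exonCount = 1, exonStarts = "a",
                     exonEnds = "a")
# my.path is the directory that holds the downloaded file.
genPred.dat <- scan( gzfile( file.path( my.path, "genscan.txt.gz" ) ),
                     what = genePred.fmt )
# Look up rows by name; names that are not found come back as NA.
get.features <-
  function(x, y = genPred.dat) {
    indx <- match( x, y$name )
    sapply( y, "[", indx )
  }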
> get.features( c( "NT_077402.1", "NT_077402.4") )
     name          chrom  strand txStart  txEnd    cdsStart cdsEnd   exonCount
[1,] "NT_077402.1" "chr1" "+"    "2052"   "4012"   "2052"   "4012"   "3"
[2,] "NT_077402.4" "chr1" "+"    "121020" "124696" "121020" "124696" "5"
     exonStarts
[1,] "2052,2475,3913,"
[2,] "121020,121450,122181,122997,124620,"
     exonEnds
[1,] "2090,2584,4012,"
[2,] "121200,121708,122244,123179,124696,"
>
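The exonStarts and exonEnds fields come back as comma-separated strings, so
one more step is needed to get per-exon coordinates. What follows is only a
rough sketch: parse.coords is a hypothetical helper name, not something from
any package, and the input strings are copied from the first row of the
output above. It also assumes the usual UCSC table convention of 0-based
starts, so that exon length is simply end minus start.

```r
# Hypothetical helper (an illustration, not part of any package):
# turn a string such as "2052,2475,3913," into a numeric vector.
# strsplit() drops the empty piece after the trailing comma.
parse.coords <- function(x) {
  as.numeric(strsplit(x, ",", fixed = TRUE)[[1]])
}

# Values copied from the first row of the get.features() output.
exon.starts <- parse.coords("2052,2475,3913,")
exon.ends   <- parse.coords("2090,2584,4012,")

# One row per exon. Assumption: starts are 0-based, so length = end - start.
exons <- data.frame(start  = exon.starts,
                    end    = exon.ends,
                    length = exon.ends - exon.starts)
print(exons)
```

The same helper could be applied to the "exonStarts" and "exonEnds" columns
of whatever get.features() returns, one row at a time.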
HTH,
Chuck
>
> Thanks,
>
> Wuming
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717