[BioC] Reading gff files in R
Kasper Daniel Hansen
khansen at stat.berkeley.edu
Tue Oct 21 19:13:56 CEST 2008
There used to be a package for this, by Oleg Skylar, but it was
deprecated. There is a nice function in the davidTiling experimental
data package for extracting attributes from GFF files (probably taken
from the gff code base?), I'll do a copy and paste here, since
davidTiling is a large download.
getAttributeField <- function (x, field, attrsep = ";") {
s = strsplit(x, split = attrsep, fixed = TRUE)
sapply(s, function(atts) {
a = strsplit(atts, split = "=", fixed = TRUE)
m = match(field, sapply(a, "[", 1))
if (!is.na(m)) {
rv = a[[m]][2]
}
else {
rv = as.character(NA)
}
return(rv)
})
}
and here is quick parser
gffRead <- function(gffFile, nrows = -1) {
cat("Reading ", gffFile, ": ", sep="")
gff = read.table(gffFile, sep="\t", as.is=TRUE, quote="",
header=FALSE, comment.char="#", nrows = nrows,
colClasses=c("character", "character", "character", "integer",
"integer",
"character", "character", "character", "character"))
colnames(gff) = c("seqname", "source", "feature", "start", "end",
"score", "strand", "frame", "attributes")
cat("found", nrow(gff), "rows with classes:",
paste(sapply(gff, class), collapse=", "), "\n")
stopifnot(!any(is.na(gff$start)), !any(is.na(gff$end)))
return(gff)
}
Now you can do stuff like
gff <- gffRead(gfffile)
gff$Name <- getAttributeField(gff$attributes, "Name")
gff$ID <- getAttributeField(gff$attributes, "ID")
gfffile is just an object holding the file name.
Kasper
On Oct 21, 2008, at 4:56 , Tobias Straub wrote:
> hi Naira
> just parse the gff with
>
> read.delim("gff_file.gff", header=F, comment.char="#") -> gff
>
> and you get a table that you can filter for gene entries, sth like
> gff.genes <- gff[gff[,2]=="gene",]
> depending on where and how the gene is specified as a gene
>
> T.
>
> On Oct 21, 2008, at 11:00 AM, Naira Naouar wrote:
>
>> Dear all,
>>
>> I was wondering if there were already tools to read gff files in R?
>> I am looking for a fast way to extract gene coordinates from a gff
>> file.
>>
>> Regards,
>> Naira
>>
>> --
>> ==================================================================
>> Naïra Naouar
>> Tel:+32 (0)9 331 38 63
>> VIB Department of Plant Systems Biology, Ghent University
>> Technologiepark 927, 9052 Gent, BELGIUM
>> nanao at psb.ugent.be http://www.psb.ugent.be
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> ----------------------------------------------------------------------
> Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list