[BioC] annotationTools: character vector clean-up

Thu Feb 28 23:24:35 CET 2013

Oh, and x = rat2human, of course.
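
If you would rather avoid the stringr dependency, here is a minimal base-R sketch of the same idea with the spacing around the separator relaxed. The vector x below is a made-up stand-in for rat2human (only "6173 /// 100529097" comes from the original post), and the name first_only is just illustrative. It assumes the separator is always three forward slashes and that entries without a homolog are NA.

```r
# Toy stand-in for rat2human; only the second value is from the thread.
x <- c("54552", "6173 /// 100529097", "6173///100529097",
       "80212  ///  1 /// 2", NA)

# Drop everything from the first "///" onwards, tolerating any amount
# of whitespace (including none) in front of the separator.
first_only <- sub("[[:space:]]*///.*$", "", x)

# grepl() returns FALSE for NA input, so missing entries do not
# trip the sanity check.
stopifnot(!any(grepl("///", first_only, fixed = TRUE)))

print(first_only)
```

NA entries pass through sub() unchanged, so the result keeps the same length and order as the input.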

On Thu 28 Feb 2013 02:23:31 PM PST, Ryan C. Thompson wrote:
> You can try this:
>
> library(stringr)
> x <- str_replace(string=x, pattern=" /// .*$", replacement="")
> stopifnot(!any(str_detect(x, "///")))
>
> You might want to adjust the pattern to allow arbitrary spacing rather
> than just single spaces.
>
> On Thu 28 Feb 2013 02:11:21 PM PST, Hooiveld, Guido wrote:
>> Hi,
>> I have a simple problem that's driving me nuts... Any hints are
>> appreciated!
>>
>> I am retrieving the human homologues of rat genes. I use the
>> functions 'getHOMOLOG' and 'listToCharacterVector' from the library
>> annotationTools. Everything is going fine, except for one thing:
>> Some rows (genes) contain multiple entries (homologues); for such rows
>> I would like to get rid of all entries except the first one.
>> Example: for row 18634 I currently have:
>> [18634] "6173 /// 100529097"
>>
>> I would like to get rid of everything except the first entry, so to
>> get this:
>> [18634] "6173"
>>
>> How can I do this for all relevant rows? Basically, I would like to
>> remove everything after the first number, starting from the space
>> followed by three forward slashes.
>> Thanks,
>> Guido
>>
>>
>> library(annotationTools)
>> library(hugene11stv1hsentrezg.db)
>> library(ragene11stv1rnentrezg.db)
>>
>> #Download HomoloGene data from:
>> #ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/
>> # (date of file manually added to name when saving download)
>> homologene <- read.delim("homologene.data.121212.data", header=FALSE)
>> colnames (homologene) <- c ("HomologyGroupID", "TaxonID", "EgID",
>> "Symbol", "ProteinGI", "ProteinAcc")
>>
>> # Read rat probesets that are on the array as Entrez IDs; this
>> # returns a list which is converted to a character vector.
>> # Next the probesets that don't have an EntrezID are removed
>> rat.eg.array <- mget(ls(ragene11stv1rnentrezgENTREZID),
>> ragene11stv1rnentrezgENTREZID)
>> rat.eg.array <- listToCharacterVector(rat.eg.array)
>> rat.eg.array <- rat.eg.array[!is.na(rat.eg.array)]
>>
>> # Convert rat EG IDs into human (9606) homologs; this returns a list
>> # which is converted to a character vector
>>> rat2human <- getHOMOLOG(rat.eg.array,9606,homologene) # this takes some time
>> Warning messages:
>> 1: In getHOMOLOG(rat.eg.array, 9606, homologene) :
>>    One or more gene input gene ID/cluster not found in homologue table
>> 2: In getHOMOLOG(rat.eg.array, 9606, homologene) :
>>    One or more gene ID/cluster with no target provided in homologue table
>>> rat2human <- listToCharacterVector(rat2human)
>>> class(rat2human)
>> [1] "character"
>>>
>>> head(rat2human)
>> [1] "54552"  "80212"  "11277"  "10663"  "199692" "399947"
>>>
>>> #example of multiple entries
>>> rat2human[18634]
>> [1] "6173 /// 100529097"
>>>
>>
>>
>>
>> ---------------------------------------------------------
>> Guido Hooiveld, PhD
>> Nutrition, Metabolism & Genomics Group
>> Division of Human Nutrition
>> Wageningen University
>> Biotechnion, Bomenweg 2
>> NL-6703 HD Wageningen
>> the Netherlands
>> tel: (+)31 317 485788
>> fax: (+)31 317 483342
>> email:      guido.hooiveld at wur.nl
>> internet:   http://nutrigene.4t.com
>> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
>> http://www.researcherid.com/rid/F-4912-2010
>>
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor