[BioC] annotationTools: character vector clean-up
Hooiveld, Guido
Guido.Hooiveld at wur.nl
Fri Mar 1 22:34:11 CET 2013
Thanks, that did the trick. Workflow finished as expected.
One question, though, to fully understand: what is the exact meaning of .*$ in the argument pattern? I tried to look it up but only found that:
" * The preceding item will be matched zero or more times. "
Thanks,
Guido
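
For reference, a minimal sketch of how the pattern " /// .*$" reads, using base R's sub() rather than stringr so it runs without extra packages. The example values are made up for illustration. The leading " /// " matches the literal separator, "." matches any single character, "*" repeats that zero or more times, and "$" anchors the match at the end of the string, so the whole pattern means "the first separator and everything after it, up to the end".

```r
# Hypothetical example values, not taken from the real data set.
x <- c("6173 /// 100529097", "54552", "1 /// 2 /// 3")

# " /// "  the literal separator (space, three slashes, space)
# ".*"     any character, zero or more times (greedy)
# "$"      end of the string
first_id <- sub(" /// .*$", "", x)

# Strings without the separator are returned unchanged.
print(first_id)
```

Because ".*" is greedy and the match starts at the first separator, a string with several separators is still cut back to its first entry.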
-----Original Message-----
From: Ryan C. Thompson [mailto:rct at thompsonclan.org]
Sent: Thursday, February 28, 2013 23:24
To: Hooiveld, Guido
Cc: bioconductor at r-project.org
Subject: Re: [BioC] annotationTools: character vector clean-up
You can try this:
library(stringr)
x <- str_replace(string=x, pattern=" /// .*$", replacement="")
stopifnot(!any(str_detect(x, "///")))
You might want to adjust the pattern to allow arbitrary spacing rather than just single spaces.
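
One way the spacing could be relaxed, sketched here with base R's sub() instead of stringr (the same pattern string should also work as the pattern argument of str_replace). The variable names and example values are assumptions for illustration only.

```r
# Hypothetical example values with different spacing around the separator.
x <- c("6173 /// 100529097", "6173///100529097",
       "6173   ///  100529097", "54552", NA)

# "[[:space:]]*" allows zero or more whitespace characters before "///";
# ".*$" then removes everything up to the end of the string.
cleaned <- sub("[[:space:]]*///.*$", "", x)

# Sanity check: no separator is left. grepl() returns FALSE for NA,
# so missing values do not trip the check.
stopifnot(!any(grepl("///", cleaned, fixed = TRUE)))
print(cleaned)
```

Missing values stay NA, so they can still be filtered afterwards with is.na().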
On Thu 28 Feb 2013 02:11:21 PM PST, Hooiveld, Guido wrote:
> Hi,
> I have a simple problem that's driving me nuts... Any hints are appreciated!
>
> I am retrieving the human homologues of rat genes. I use the functions 'getHOMOLOG' and 'listToCharacterVector' from the library annotationTools. Everything is going fine, except for one thing:
> Some rows (genes) contain multiple entries (homologues); for such rows I would like to get rid of all entries except the first one.
> Example: for row 18634 I currently have:
> [18634] "6173 /// 100529097"
>
> I would like to get rid of everything except the first entry, so to get this:
> [18634] "6173"
>
> How can I do this for all relevant rows? Basically, I would like to remove everything after the first number, starting with the space followed by three forward slashes.
> Thanks,
> Guido
>
>
> library(annotationTools)
> library(hugene11stv1hsentrezg.db)
> library(ragene11stv1rnentrezg.db)
>
> #Download HomoloGene data from:
> #ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/
> homologene <- read.delim("homologene.data.121212.data", header=FALSE) # (date of file manually added to name when saving download)
> colnames(homologene) <- c("HomologyGroupID", "TaxonID", "EgID", "Symbol", "ProteinGI", "ProteinAcc")
>
> # Read rat probesets that are on the array as Entrez IDs; this returns a list which is converted to a character vector
> # Next the probesets that don't have an EntrezID are removed
> rat.eg.array <- mget(ls(ragene11stv1rnentrezgENTREZID), ragene11stv1rnentrezgENTREZID)
> rat.eg.array <- listToCharacterVector(rat.eg.array)
> rat.eg.array <- rat.eg.array[!is.na(rat.eg.array)]
>
> # Convert rat EG IDs into human (9606) homologs; this returns a list which is converted to a character vector
>> rat2human <- getHOMOLOG(rat.eg.array,9606,homologene) #this takes some time
> Warning messages:
> 1: In getHOMOLOG(rat.eg.array, 9606, homologene) :
> One or more gene input gene ID/cluster not found in homologue table
> 2: In getHOMOLOG(rat.eg.array, 9606, homologene) :
> One or more gene ID/cluster with no target provided in homologue table
>> rat2human <- listToCharacterVector(rat2human)
>> class(rat2human)
> [1] "character"
>>
>> head(rat2human)
> [1] "54552" "80212" "11277" "10663" "199692" "399947"
>>
>> #example of multiple entries
>> rat2human[18634]
> [1] "6173 /// 100529097"
>>
>
>
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group Division of Human Nutrition
> Wageningen University Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email: guido.hooiveld at wur.nl
> internet: http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010