[BioC] annotationTools: character vector clean-up

Fri Mar 1 22:34:11 CET 2013

Thanks, that did the trick. Workflow finished as expected.
One question, though, to fully understand: what is the exact meaning of .*$ in the argument pattern? I tried to look it up but only found that:
 " *    The preceding item will be matched zero or more times. " 

Thanks,
Guido

-----Original Message-----
From: Ryan C. Thompson [mailto:rct at thompsonclan.org] 
Sent: Thursday, February 28, 2013 23:24
To: Hooiveld, Guido
Cc: bioconductor at r-project.org
Subject: Re: [BioC] annotationTools: character vector clean-up

You can try this:

library(stringr)
x <- str_replace(string=x, pattern=" /// .*$", replacement="")
stopifnot(!any(str_detect(x, "///")))

You might want to adjust the pattern to allow arbitrary spacing rather than just single spaces.
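
As a rough sketch of that adjustment (not tested against the real rat2human vector): the pattern below uses \\s* in place of the single spaces, so the separator is matched with any amount of surrounding whitespace, including none. It uses base R's sub() instead of stringr so that it runs without extra packages; the example vector and the name first_only are made up for illustration.

```r
# Hypothetical example vector; the last two elements hold multiple entries,
# one with single spaces around the separator and one with irregular spacing.
x <- c("54552", "6173 /// 100529097", "11277///10663  ///  199692")

# Pattern pieces:
#   \\s*  zero or more whitespace characters before the separator
#   ///   the literal separator
#   .*    any character (.), repeated zero or more times (*)
#   $     the end of the string
# Together: match from the first separator through to the end of the string.
first_only <- sub(pattern = "\\s*///.*$", replacement = "", x = x)

print(first_only)

# Sanity check: no separator should be left in any element.
stopifnot(!any(grepl("///", first_only, fixed = TRUE)))
```

Elements without a separator are left unchanged, since sub() only replaces where the pattern matches.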

On Thu 28 Feb 2013 02:11:21 PM PST, Hooiveld, Guido wrote:
> Hi,
> I have a simple problem that's driving me nuts... Any hints are appreciated!
>
> I am retrieving the human homologues of rat genes. I use the functions 'getHOMOLOG' and 'listToCharacterVector' from the library annotationTools. Everything is going fine, except for one thing:
> Some rows (genes) contain multiple entries (homologues); for such rows I would like to get rid of all entries except the first one.
> Example: for row 18634 I currently have:
> [18634] "6173 /// 100529097"
>
> I would like to get rid of everything except the first entry, so to get this:
> [18634] "6173"
>
> How can I do this for all relevant rows? Basically, I would like to remove everything after the first number, starting from the space followed by three forward slashes.
> Thanks,
> Guido
>
>
> library(annotationTools)
> library(hugene11stv1hsentrezg.db)
> library(ragene11stv1rnentrezg.db)
>
> #Download HomoloGene data from:
> #ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/
> homologene <- read.delim("homologene.data.121212.data", header=FALSE)  # (date of file manually added to name when saving download)
> colnames(homologene) <- c("HomologyGroupID", "TaxonID", "EgID", "Symbol", "ProteinGI", "ProteinAcc")
>
> # Read rat probesets that are on the array as Entrez IDs; this returns a list which is converted to a character vector
> # Next the probesets that don't have an EntrezID are removed
> rat.eg.array <- mget(ls(ragene11stv1rnentrezgENTREZID), ragene11stv1rnentrezgENTREZID)
> rat.eg.array <- listToCharacterVector(rat.eg.array)
> rat.eg.array <- rat.eg.array[!is.na(rat.eg.array)]
>
> # Convert rat EG IDs into human (9606) homologs; this returns a list which is converted to a character vector
>> rat2human <- getHOMOLOG(rat.eg.array,9606,homologene) #this takes some time
> Warning messages:
> 1: In getHOMOLOG(rat.eg.array, 9606, homologene) :
>    One or more gene input gene ID/cluster not found in homologue table
> 2: In getHOMOLOG(rat.eg.array, 9606, homologene) :
>    One or more gene ID/cluster with no target provided in homologue table
>> rat2human <- listToCharacterVector(rat2human)
>> class(rat2human)
> [1] "character"
>>
>> head(rat2human)
> [1] "54552"  "80212"  "11277"  "10663"  "199692" "399947"
>>
>> #example of multiple entries
>> rat2human[18634]
> [1] "6173 /// 100529097"
>>
>
>
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group Division of Human Nutrition 
> Wageningen University Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:   http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010