[BioC] Gene names
Christopher Wilkinson
christopher.wilkinson at adelaide.edu.au
Mon Nov 7 01:13:35 CET 2005
If you want to do this in R, the function you want is strsplit, telling
it to split on the "|" character. However, strsplit treats its split
argument as a regular expression, and "|" is special there (it means
"or"), so we have to protect it with backslashes: two of them, because
the backslash itself has to be escaped inside an R string. As a word of
advice, look up regular expressions (?regexp); they are extremely
powerful for manipulating strings.
> geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
> strsplit(geneName,"\\|")
[[1]]
[1] "SFTPB"   "NM_000542.1"
[3] "4506904" "surfactant, pulmonary-associated protein B"
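As an aside, here is a minimal sketch of an alternative that avoids the
escaping altogether. It assumes you only ever need to split on the
literal "|" character, in which case strsplit's fixed = TRUE argument
switches off regular-expression matching. The names viaRegex and
viaFixed are just illustrative.

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"

# Regular-expression split: "|" has to be escaped.
# strsplit returns a list; [[1]] takes its first (and only) element.
viaRegex <- strsplit(geneName, "\\|")[[1]]

# Literal split: fixed = TRUE treats "|" as an ordinary character.
viaFixed <- strsplit(geneName, "|", fixed = TRUE)[[1]]

# Both approaches should give the same four fields.
identical(viaRegex, viaFixed)
length(viaFixed)
```

The regular-expression form is still the one to reach for if the
separator is more complicated than a single fixed character.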
Note that it returns a list, where you probably want a vector or a
matrix, so something like
t(as.matrix(strsplit(geneName,"\\|")[[1]])) or
unlist(strsplit(geneName,"\\|")) will give
"SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
(the first as a one-row matrix, the second as a character vector).
Now let's assume you have a vector of gene names to be split; you can
use the sapply function.
> geneNames <- rep(geneName,3)
> geneNamesAsMatrix <- t(sapply(geneNames,function(x){unlist(strsplit(x,"\\|"))}))
> rownames(geneNamesAsMatrix) <- NULL ## otherwise whole str is the row name
> geneNamesAsMatrix
     [,1]    [,2]          [,3]      [,4]
[1,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[2,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[3,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
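A possible variation on the above, offered as a sketch rather than the
only way to do it: strsplit is itself vectorised over its first
argument, so it can take the whole vector at once and the pieces can be
stacked with do.call(rbind, ...). This assumes every gene name has the
same number of fields; the column labels are hypothetical and not part
of the original data.

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
geneNames <- rep(geneName, 3)

# One list element per gene name, each a character vector of fields
parts <- strsplit(geneNames, "|", fixed = TRUE)

# Guard: rbind would recycle short rows (with only a warning),
# so stop early if the field counts differ
stopifnot(length(unique(lengths(parts))) == 1)

# Stack the vectors as rows; no row names are created this way
geneNamesAsMatrix <- do.call(rbind, parts)

# Hypothetical column labels, purely for readability
colnames(geneNamesAsMatrix) <- c("symbol", "accession", "id", "description")
dim(geneNamesAsMatrix)
```

Because the list elements are unnamed, there is no need to reset the
row names afterwards.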
Of course you could do this on the command line with perl using
something like
perl -ne 'my @F=split /\|/,$_;print join("\t",@F)' infile > outfile
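If you would rather stay in R for the file conversion too, a rough
equivalent of the perl one-liner might look like the sketch below. The
file names are hypothetical (temporary files are used here so the
example runs on its own), and it assumes every line has the same number
of "|"-separated fields.

```r
# Hypothetical input and output files; tempfile() keeps the sketch self-contained
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 2),
           infile)

# quote = "" and comment.char = "" stop quotes or "#" inside a
# description from being interpreted; colClasses keeps IDs as text
genes <- read.table(infile, sep = "|", quote = "", comment.char = "",
                    header = FALSE, stringsAsFactors = FALSE,
                    colClasses = "character")

# Write the same fields back out, tab-separated, without quotes or headers
write.table(genes, outfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(outfile)
```

Reading with sep = "|" does the splitting at import time, so no
strsplit call is needed in this case.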
Cheers
Chris
>Date: Sun, 06 Nov 2005 02:13:39 +0000
>From: J.delasHeras at ed.ac.uk
>Subject: Re: [BioC] Gene names
>To: bioconductor at stat.math.ethz.ch
>Message-ID: <20051106021339.3x6viekhogs0w8w0 at www.staffmail.ed.ac.uk>
>Content-Type: text/plain; charset=ISO-8859-1; format="flowed"
>
>Quoting Narendra Kaushik <kaushiknk at Cardiff.ac.uk>:
>
>
>
>>I have a gene file in this format, everything in one column (no spaces at all):
>>SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
>>Is there any way to convert it in this format (into four columns) except
>>manually?
>>
>>SFTPB NM_000542.1 4506904 surfactant, pulmonary-associated protein B
>>
>>Any suggestions?
>>
>>Narendra
>>
>>
>
>Maybe too obvious, but Excel is very good for this sort of thing.
>Functions like Search allow you to obtain the position of a particular
>character (like "|"), and knowing that you can select the text to the
>left or right of it... if you do that consecutively you can sort it
>like that. It'll take a minute.
>
>Jose
>
>
>
--
Dr Chris Wilkinson
Senior Research Officer | ARC Research Associate
Child Health Research Institute (CHRI)| Microarray Analysis Group
7th floor, Clarence Rieger Building | Room 121
Women's and Children's Hospital | School of Mathematical Sciences
72 King William Rd, | The University of Adelaide, 5005
North Adelaide, 5006 | CRICOS Provider Number 00123M
Math's Office (Room 121) Ph: 8303 3714
CHRI Office (CR2 52A) Ph: 8161 6363
Christopher.Wilkinson at adelaide.edu.au
http://mag.maths.adelaide.edu.au/crwilkinson.html
Organising Committee Member, 5th Australian Microarray Conference
29th Sept to 1st Oct 2005, Novatel Barossa Valley Resort
http://www.sapmea.asn.au/conventions/microarray/index.html