[BioC] Small bug in function 'countskip.FASTA.entries' from package altcdfenvs

Norman Pavelka norman.pavelka at unimib.it
Wed Nov 16 11:13:10 CET 2005


Hi Lingsheng,

On 15 Nov 2005, at 19:05, Lingsheng Dong wrote:
> Hi, Norman,
>
> Nice to see you are working on a project similar to mine.
>
> Another bug I found was in the function "get.RNA.IDs":
> get.RNA.IDs <- function(x) {
> 	reg <- regexpr("(Hs#|NM)[^[:blank:]|]+", x)
> 	r <- substr(my.entries$headers, reg, reg + attr(reg, "match.length") - 1)
> 	return(r)
> }
> I am not sure how to correct it yet, but it can't get an ID for 
> sequences without an "NMxxxxxx" ID in the header.

I wouldn't call that a bug. You simply have to change the regular 
expression so that it matches the IDs you have in your particular FASTA 
file.
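
As a rough sketch of what I mean (not part of altcdfenvs): the variant 
below takes the regular expression as an argument. The function name 
get.ids.by.pattern, the second pattern and the two headers are made up 
for illustration, so adapt them to your own file.

```r
# Hypothetical variant of get.RNA.IDs: the pattern is an argument instead
# of being hard-coded, and the headers are passed in explicitly.
get.ids.by.pattern <- function(x, pattern = "(Hs#|NM)[^[:blank:]|]+") {
    reg <- regexpr(pattern, x)
    ids <- rep(NA_character_, length(x))
    hit <- reg > 0    # regexpr returns -1 where nothing matched
    ids[hit] <- substr(x[hit], reg[hit],
                       reg[hit] + attr(reg, "match.length")[hit] - 1)
    ids
}

# Made-up headers, only to show the effect of changing the pattern
headers <- c(">gi|1|ref|NM_000014.4| example", ">ENST00000367770 example")
get.ids.by.pattern(headers)
get.ids.by.pattern(headers, "(NM|ENST)[^[:blank:]|]+")
```

With the default pattern the second header gives NA, because it 
contains neither "Hs#" nor "NM"; with the extended pattern both headers 
yield an ID.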

I'm using the following function, which takes the first string after 
the ">" sign in a FASTA header and strips the first space character 
together with everything that follows it. This way you get the ID 
regardless of how it begins. You only have to check whether the space 
character is the right separator in your case, or whether another one 
would be more appropriate. Often "|" or ";" signs are used to separate 
the different pieces of information in a FASTA header.

get.transcript.ids <- function(x) {
	# drop the leading ">" of the FASTA header
	tmpstring <- sub("^>", "", x)
	# drop the first space and everything after it
	tmpstring <- sub(" .*", "", tmpstring)
	return(tmpstring)
}
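
If your headers use "|" or ";" instead of a space, the same idea can be 
sketched with the separator as an argument. The name get.ids.by.sep, 
the "sep" argument and the example headers are my own assumptions for 
illustration, not something from altcdfenvs.

```r
# Hypothetical generalisation of get.transcript.ids: 'sep' is a regular
# expression describing the character(s) that end the ID.
get.ids.by.sep <- function(x, sep = " ") {
    tmpstring <- sub("^>", "", x)
    # remove the first separator and everything after it
    sub(paste0("(", sep, ").*"), "", tmpstring)
}

# Made-up headers with three different separators
headers <- c(">NM_000014 example description",
             ">Hs#S1234|example description",
             ">AB000001;example description")
get.ids.by.sep(headers, sep = "[ |;]")
```

Inside the bracket expression "[ |;]" the "|" is a literal character, 
so a single call handles headers that mix the three separators.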

> Still another problem you may want to consider:
> The "matchprobes" function gives all possible matches. In my case, a 
> lot of probes match hundreds of target sequences. This means there will 
> be too many cross-hybridizing probes if you put all probes 
> matching a target sequence into one probe set.
> I couldn't find a ready-to-use function to solve this problem yet. I 
> am thinking of exporting the matching results into database software and 
> manually deleting the cross-hybridizing probes.
> Not sure if this is a quick solution.
> Hope you can give some suggestions.

I also thought of that problem, but Laurent Gautier already gave some 
clues in his BMC Bioinformatics paper on how to handle this situation. 
Though I haven't tried it yet, I guess that everything could be done very 
quickly inside R, without the need to export into an external 
database. If you like, I can share my experience with you as soon as I 
have done some trials...
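
Just to show that this kind of filtering stays inside R, here is an 
untested-on-real-data sketch of one simple filter: keep only the probes 
that match exactly one target. This is not necessarily the approach 
described in the paper. I also assume the matching result can be 
brought into the form of a list with one element per target sequence, 
each holding the indices of the matching probes; the "matches" object 
below is made-up data, so check the structure of your own result first.

```r
# Hypothetical match result (made-up data): one element per target
# sequence, each holding the indices of the probes that match it.
matches <- list(targetA = c(1L, 2L, 3L),
                targetB = c(3L, 4L),
                targetC = c(5L))

# Count how many different targets each probe matches
hits.per.probe <- table(unlist(lapply(matches, unique)))

# Probes that match exactly one target
unique.probes <- as.integer(names(hits.per.probe)[hits.per.probe == 1])

# Keep only those probes in each target's probe set
filtered <- lapply(matches, function(idx) idx[idx %in% unique.probes])
filtered
```

Here probe 3 matches both targetA and targetB, so it is dropped from 
both probe sets. A less strict rule (for example, tolerating a few 
shared probes) would only change the threshold in the comparison.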

> Thanks.
> Lingsheng

Good luck!
Norman


