[BioC] NA geneSymbol with lumi

Fri Nov 16 02:06:39 CET 2007

Hi Sebastien,
Yes, I get this all the time as well. It does not seem to matter whether
you use nuID or other Illumina annotations (illuminaMousev1p1, for example).

As a quick fix I have ended up using the Illumina annotations as
supplemental data: where there is an "NA", I look up the targetID
and use the Illumina annotation for that targetID. In most cases the
missing ones are Riken cDNAs. This code will get you started, but note the
pitfalls flagged in its comments. The table ann used below is just the
Illumina annotation data read into a data frame:
> dim(ann)
[1] 46643    13
> colnames(ann)
 [1] "Search_key"     "Target"         "ProbeId"        "Gid"           
 [5] "Transcript"     "Accession"      "Symbol"         "Type"          
 [9] "Start"          "Probe_Sequence" "Definition"     "Ontology"      
[13] "Synonym"       
(note in ann "Target" and "ProbeId" do not contain unique entries so
can't be used as rownames in the table)
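
A quick way to confirm that kind of duplication before relying on a column
as a key, sketched on a made-up toy table (ann_toy and its values are
invented for illustration and are not the real Illumina data):

```r
# Toy stand-in for the Illumina annotation table; all values are invented.
ann_toy <- data.frame(
  Target  = c("Abc1", "Abc1", "Xyz2"),
  ProbeId = c(101, 102, 102),
  Symbol  = c("Abc1", "Abc1", "Xyz2"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every entry is unique,
# otherwise the index of the first duplicated entry.
has_dup_target <- anyDuplicated(ann_toy$Target) > 0
has_dup_probe  <- anyDuplicated(ann_toy$ProbeId) > 0

# All rows whose Target occurs more than once
dup_targets <- ann_toy$Target[duplicated(ann_toy$Target)]
dup_rows    <- ann_toy[ann_toy$Target %in% dup_targets, ]
```

If either flag is TRUE, the column cannot serve as rownames, and any lookup
against it has to decide which of the duplicated rows to use.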

Rough solution:

# Look up the gene symbol for every ID in the results table
LL <- mget(as.character(results[, "ID"]), envir = illuminaMousev1p1SYMBOL,
           ifnotfound = NA)
na_idx <- is.na(LL)
lost_targets <- names(LL)[na_idx]
# match() returns exactly one row per target: the first exact hit, or NA if
# the targetID is absent from ann. grep(x, ..., fixed = TRUE) matches
# substrings and can return zero or several rows, which breaks the ordering.
locations <- match(lost_targets, ann[, "Target"])
# WARNING: any NA in locations means that targetID was not found in ann
# (have not had this happen yet); its symbol simply stays NA.
lost_ann <- ann[locations, ]
# Same order as lost_targets, so the symbols line up with the NA slots
LL[na_idx] <- as.character(lost_ann[, "Symbol"])
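
The pitfall with grep(..., fixed = TRUE) is that it matches substrings, so
one targetID can hit several unrelated rows. A minimal sketch on toy data,
comparing it with an exact lookup via match(). Here sym_env, ann_toy and the
IDs are hypothetical stand-ins for illuminaMousev1p1SYMBOL, ann and
results[,"ID"]:

```r
# Hypothetical stand-in for an annotation environment; T10 and T2 map to NA.
sym_env <- list2env(list(T1 = "Gene1", T10 = NA, T2 = NA))

# Toy annotation table with a duplicated Target; values are invented.
ann_toy <- data.frame(
  Target = c("T1", "T10", "T10", "T3"),
  Symbol = c("Gene1", "Rik10", "Rik10", "Gene3"),
  stringsAsFactors = FALSE
)

ids <- c("T1", "T10", "T2")
LL  <- mget(ids, envir = sym_env, ifnotfound = NA)

na_idx <- is.na(LL)
lost   <- names(LL)[na_idx]

# Substring matching: "T1" also hits both "T10" rows.
grep_hits <- grep("T1", ann_toy$Target, fixed = TRUE)

# Exact matching: one row per lost target, NA when the target is absent.
loc <- match(lost, ann_toy$Target)

# Fill the NA slots; targets missing from ann_toy stay NA.
LL[na_idx] <- ann_toy$Symbol[loc]
symbols <- unlist(LL)
```

Because match() always returns one position per input, the filled-in
symbols stay aligned with the NA slots even when a target is duplicated in
or missing from the annotation table.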

Cheers
Paul

> 
> Hi,
> I am using the lumi package to analyse illumina microarray data.
> When it finally comes to getting the top 10 DE genes with topTable I get
> many hits with the geneSymbol <NA>. However, if I look up the ProbeID
> corresponding to the nuID that provide <NA>, I find that they do
> correspond to genes. Why aren't they being displayed in the topTable?
> thanks,
> Sebastien
> 
>                       ID geneSymbol     logFC         t      P.Value    adj.P.Val        B
> 1917  fwfUovXT3rjAjqbpJU     S100A8 -5.307223 -50.43759 9.854174e-09 0.0001383625 8.724832
> 12632 Qd_S7V4OkLjsX3jkt4      KRT6B -5.281406 -39.54237 3.896317e-08 0.0002735409 8.229157
> 12149 BjSTT6BOqGLhpKKFGI       <NA> -3.118669 -30.01505 1.844180e-07 0.0008631377 7.451766
> 7474  6ipCUUDxcp4ryIj6Uk       <NA> -3.155916 -24.45685 5.835502e-07 0.0013366890 6.716048
> 3831  3nivfFfvk55Rd18lLk       <NA> -2.690362 -24.10891 6.324511e-07 0.0013366890 6.659617
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 

I have looked into this problem a little more...

I downloaded the Human6_v2_sequence spreadsheet from the Illumina
website and found that many of the targets that provide NA as
gene symbol have no symbol in the Illumina database either.

For example:

	ID	geneSymbol
5903	ILMN_21212	FAM43A
3103	ILMN_1425	FOXO4
11993	ILMN_6504	PPL
5153	ILMN_19390	ST3GAL4
1723	ILMN_12716	CREB3L2
4484	ILMN_17676	TNS3
2700	ILMN_138461	<NA>
1358	ILMN_12133	FSCN1
3507	ILMN_15271	CITED4
12401	ILMN_73087	<NA>

ILMN_73087 provides NA as gene symbol and does not have a gene
symbol in the Illumina DB either.

However, ILMN_138461 provides NA as gene symbol but does have a
gene symbol in the Illumina DB. It is APM-1.

In addition, ILMN_73087 has no entry in either the Illumina or the
BioC DB, but when I search for ILMN_73087 in Ensembl I get a hit
that has multiple EntrezGene listings.
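
The comparison described above can be sketched roughly as follows. The
probe IDs and symbols are the ones quoted in this message; the data frame
layout and the column names (Target, Symbol) are assumptions for
illustration, not the actual layout of the Human6_v2 spreadsheet:

```r
# Subset of the results shown above.
res <- data.frame(
  ID         = c("ILMN_1425", "ILMN_138461", "ILMN_73087"),
  geneSymbol = c("FOXO4", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical excerpt of the vendor spreadsheet; column names are assumed.
# An empty string stands for "no symbol listed".
illumina <- data.frame(
  Target = c("ILMN_1425", "ILMN_138461", "ILMN_73087"),
  Symbol = c("FOXO4", "APM-1", ""),
  stringsAsFactors = FALSE
)

# IDs that came back as NA from the BioC annotation
na_ids <- res$ID[is.na(res$geneSymbol)]

# Exact lookup in the vendor table; treat empty strings as missing.
vendor_symbol <- illumina$Symbol[match(na_ids, illumina$Target)]
vendor_symbol[!nzchar(vendor_symbol)] <- NA

recoverable <- na_ids[!is.na(vendor_symbol)]  # symbol exists at the vendor
unresolved  <- na_ids[is.na(vendor_symbol)]   # no symbol in either source
```

Splitting the NA probes this way separates the ones that can be filled in
from the vendor annotation from those that need another source, such as
Ensembl.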

Is there any fix for the NA entries? Is this problem being addressed?
thanks,
Sebastien
