[BioC] NA geneSymbol with lumi
Paul Leo
p.leo at uq.edu.au
Fri Nov 16 02:06:39 CET 2007
Hi Sebastien,
Yes, I get this all the time as well. It does not seem to matter whether
you use nuID or other Illumina annotations (illuminaMousev1p1, for
example). As a quick fix I have ended up using the Illumina annotations as
supplemental data: where there is an "NA", I look up the targetID and use
the Illumina annotation for that targetID. In most cases the missing ones
are Riken cDNAs. This code will get you started, but note its pitfalls.
The table ann used below is just the Illumina annotation data read into a
data frame:
> dim(ann)
[1] 46643 13
> colnames(ann)
[1] "Search_key" "Target" "ProbeId" "Gid"
[5] "Transcript" "Accession" "Symbol" "Type"
[9] "Start" "Probe_Sequence" "Definition" "Ontology"
[13] "Synonym"
(note in ann "Target" and "ProbeId" do not contain unique entries so
can't be used as rownames in the table)
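
A quick way to confirm that a column has repeated entries before relying on it as a key is sketched below. This is only an illustration on a toy data frame; the values are made up and are not real Illumina annotation.

```r
# Toy stand-in for the Illumina annotation table (hypothetical values).
ann <- data.frame(
  Target = c("T1", "T2", "T2", "T3"),
  Symbol = c("A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# TRUE if any Target value occurs more than once.
has_dups <- anyDuplicated(ann[, "Target"]) > 0

# The Target values that are repeated.
dup_targets <- unique(ann[duplicated(ann[, "Target"]), "Target"])

print(has_dups)
print(dup_targets)
```

If has_dups is TRUE, the column cannot serve as rownames, and any lookup on it has to decide which of the repeated rows to use.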
Rough solution:
# Look up a symbol for every ID in the results table; unmapped IDs give NA.
LL <- mget(results[,"ID"], envir=illuminaMousev1p1SYMBOL, ifnotfound=NA)
lost_targets <- names(LL)[is.na(LL)]
# For each lost target, find the matching row(s) in ann[,"Target"].
locations <- lapply(lost_targets, function(x)
    grep(x, ann[,"Target"], fixed=TRUE))
######## WARNING: if length(unlist(locations)) != length(lost_targets)
######## this will screw up, as it means a targetID was not found or
######## was found multiple times (have not had this happen yet).
locations <- lapply(locations, function(x) x[1])  # in case more than one
lost_ann <- ann[unlist(locations),]
# Yes, this assumes lost_ann is in the same order as the NA entries of LL.
LL[is.na(LL)] <- as.character(lost_ann[,"Symbol"])
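
One way to sidestep the pitfalls flagged in the warning is to use match() instead of grep(): match() compares whole strings, returns only the first hit, and returns NA when there is none. The following is a minimal sketch on toy data, not the method used above. The function name fill_missing_symbols and all values are hypothetical.

```r
# Toy data (hypothetical values, not real Illumina annotation).
ann <- data.frame(
  Target = c("GENE_A", "GENE_AB", "GENE_C", "GENE_C"),
  Symbol = c("SymA", "SymAB", "SymC", "SymC"),
  stringsAsFactors = FALSE
)
# Stand-in for the list returned by mget(); NA marks a missing symbol.
LL <- list(GENE_A = NA, GENE_AB = "SymAB", GENE_C = NA, GENE_X = NA)

fill_missing_symbols <- function(LL, ann) {
  lost <- names(LL)[is.na(LL)]
  # Exact match on the whole string: first hit only, NA if not found.
  idx <- match(lost, ann[, "Target"])
  found <- !is.na(idx)
  # Assign by name, so the result does not depend on row order.
  LL[lost[found]] <- as.character(ann[idx[found], "Symbol"])
  LL
}

filled <- fill_missing_symbols(LL, ann)
print(unlist(filled))
```

Here grep("GENE_A", ..., fixed=TRUE) would also hit "GENE_AB", because it matches substrings. With match(), GENE_A gets its own symbol, and GENE_X, which is absent from ann, simply stays NA.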
Cheers
Paul
>
> Hi,
> I am using the lumi package to analyse illumina microarray data.
> When it finally comes to getting the top 10 DE genes with topTable I get
> many hits with the geneSymbol <NA>. However, if I look up the ProbeIDs
> corresponding to the nuIDs that give <NA>, I find that they do
> correspond to genes. Why aren't they being displayed in the topTable?
> thanks,
> Sebastien
>
>       ID                 geneSymbol logFC     t         P.Value      adj.P.Val    B
> 1917  fwfUovXT3rjAjqbpJU S100A8     -5.307223 -50.43759 9.854174e-09 0.0001383625 8.724832
> 12632 Qd_S7V4OkLjsX3jkt4 KRT6B      -5.281406 -39.54237 3.896317e-08 0.0002735409 8.229157
> 12149 BjSTT6BOqGLhpKKFGI <NA>       -3.118669 -30.01505 1.844180e-07 0.0008631377 7.451766
> 7474  6ipCUUDxcp4ryIj6Uk <NA>       -3.155916 -24.45685 5.835502e-07 0.0013366890 6.716048
> 3831  3nivfFfvk55Rd18lLk <NA>       -2.690362 -24.10891 6.324511e-07 0.0013366890 6.659617
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
I have looked into this problem a little more...
I downloaded the Human6_v2_sequence spreadsheet from the Illumina
website and found that many of the targets that provide NA as
gene symbol have no symbol in the Illumina database either.
For example:
ID geneSymbol
5903 ILMN_21212 FAM43A
3103 ILMN_1425 FOXO4
11993 ILMN_6504 PPL
5153 ILMN_19390 ST3GAL4
1723 ILMN_12716 CREB3L2
4484 ILMN_17676 TNS3
2700 ILMN_138461 <NA>
1358 ILMN_12133 FSCN1
3507 ILMN_15271 CITED4
12401 ILMN_73087 <NA>
ILMN_73087 provides NA as gene symbol and does not have a gene
symbol in the Illumina DB either.
However, ILMN_138461 provides NA as gene symbol but does have a
gene symbol in the Illumina DB. It is APM-1.
In addition, ILMN_73087 has no entries in either the
Illumina or BioC DB, but when I search for ILMN_73087 in
Ensembl I get a hit that has multiple EntrezGene listings.
Is there any fix for the NA entries? Is this problem being addressed?
thanks,
Sebastien