[BioC] NA geneSymbol with lumi

Lynn Amon lamon at fhcrc.org
Fri Nov 16 18:32:20 CET 2007


The illuminaMousev1p1 annotation package is made using the RefSeq
identifiers provided by Illumina.  You can see which identifier was used
for each probe by looking at as.list(illuminaMousev1p1ACCNUM).  Some of
these are RefSeq IDs and some are GenBank IDs.  Only the RefSeq IDs are
used for getting the rest of the annotations by searching against NCBI
databases.  The probes with GenBank IDs are not used because those
probes are not in exonic sections of the transcript.  If you are looking
for more information on a probe that does not have a gene symbol in the
annotation package, you should start with illuminaMousev1p1ACCNUM
rather than going back to the Illumina manifest file.  Or, as was suggested
earlier, BLAST the probe sequence.  I may do that for the next
Bioconductor release rather than using the RefSeq IDs provided by Illumina.
Lynn Amon
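
A minimal sketch of the check described above: splitting probes by whether
their accession looks like a RefSeq ID or a GenBank ID.  The list `acc` is a
toy stand-in for the accession mapping in the annotation package; its probe
names and accession numbers are made up.  The pattern assumes that RefSeq
accessions have a two-letter prefix followed by an underscore (NM_, XM_, NR_
and so on) and that everything else is a GenBank accession.

```r
# Toy stand-in for the probe -> accession mapping (all values are made up)
acc <- list(probeA = "NM_000001", probeB = "AK000001",
            probeC = "XM_000002", probeD = NA)

acc_vec <- unlist(acc)   # named character vector; NA stays NA

# RefSeq-style accessions: two capital letters, an underscore, then digits
is_refseq <- !is.na(acc_vec) & grepl("^[A-Z]{2}_[0-9]+", acc_vec)

refseq_probes  <- names(acc_vec)[is_refseq]
genbank_probes <- names(acc_vec)[!is.na(acc_vec) & !is_refseq]
no_accession   <- names(acc_vec)[is.na(acc_vec)]
```

Probes that land in `genbank_probes` are the ones that would end up without
a gene symbol under the procedure described above.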

Paul Leo wrote:
> Hi Sebastien,
> Yes, I get this all the time as well. It does not seem to matter whether
> you use nuID or other Illumina annotations (illuminaMousev1p1, for example).
>
> As a quick fix, I have ended up using the Illumina annotations as
> supplemental data: where there is an "NA", I look up the targetID
> and use the Illumina annotation for that targetID. In most cases the
> missing ones are Riken cDNAs. This code will get you started, but note
> the pitfalls flagged in it. The table ann used below is just the Illumina
> annotation data read into a data frame:
> > dim(ann)
> [1] 46643    13
> > colnames(ann)
>  [1] "Search_key"     "Target"         "ProbeId"        "Gid"
>  [5] "Transcript"     "Accession"      "Symbol"         "Type"
>  [9] "Start"          "Probe_Sequence" "Definition"     "Ontology"
> [13] "Synonym"
> (note in ann "Target" and "ProbeId" do not contain unique entries so
> can't be used as rownames in the table)
>
> Rough solution:
>
> # Look up symbols; IDs without a symbol come back as NA
> LL <- mget(results[, "ID"], envir = illuminaMousev1p1SYMBOL, ifnotfound = NA)
> lost_targets <- names(LL)[is.na(LL)]
> # Rows of ann whose "Target" contains each lost ID (substring match)
> locations <- lapply(lost_targets, function(x)
>   grep(x, ann[, "Target"], fixed = TRUE))
> # WARNING: if any element of locations does not have length 1, the
> # targetID was either not found or found multiple times
> # (have not had this happen yet)
> # Keep the first hit when there is more than one; NA when there is none
> locations <- vapply(locations, function(x) x[1], integer(1))
> lost_ann <- ann[locations, ]
> # lost_ann is in the same order as lost_targets, so this lines up
> LL[is.na(LL)] <- as.character(lost_ann[, "Symbol"])
>
> Cheers
> Paul
>
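
The pitfall flagged in the code above (a targetID found several times or not
at all) can be sidestepped with exact matching.  The following is a sketch of
that alternative, not the code from the message: it swaps grep() for match(),
which returns the first exact match or NA.  The objects `ann` and `LL` are toy
stand-ins with made-up values, using only the "Target" and "Symbol" columns.

```r
# Toy stand-ins for the objects in the message above (values are made up)
ann <- data.frame(Target = c("T1", "T10", "T2", "T2"),
                  Symbol = c("GeneA", "GeneB", "GeneC", "GeneC"),
                  stringsAsFactors = FALSE)
LL <- list(T1 = "GeneA", T10 = NA, T2 = NA, T99 = NA)

# A substring search for "T1" also hits "T10", which is the risk with grep()
grep_hits <- grep("T1", ann$Target, fixed = TRUE)

lost  <- names(LL)[is.na(LL)]
idx   <- match(lost, ann$Target)   # first exact match; NA when absent
found <- !is.na(idx)

# Fill in only the IDs that were found; the rest stay NA
LL[lost[found]] <- ann$Symbol[idx[found]]
not_found <- lost[!found]
```

Because the assignment is done by name, it does not depend on the rows of
`ann` being in any particular order, and `not_found` lists the IDs that still
need a manual lookup.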
>   
>> Hi,
>> I am using the lumi package to analyse illumina microarray data.
>> When it finally comes to getting the top 10 DE genes with topTable I get
>> many hits with the geneSymbol <NA>. However, if I look up the ProbeID
>> corresponding to the nuID that provide <NA>, I find that they do
>> correspond to genes. Why aren't they being displayed in the topTable?
>> thanks,
>> Sebastien
>>
>>                       ID geneSymbol     logFC         t      P.Value    adj.P.Val        B
>> 1917  fwfUovXT3rjAjqbpJU     S100A8 -5.307223 -50.43759 9.854174e-09 0.0001383625 8.724832
>> 12632 Qd_S7V4OkLjsX3jkt4      KRT6B -5.281406 -39.54237 3.896317e-08 0.0002735409 8.229157
>> 12149 BjSTT6BOqGLhpKKFGI       <NA> -3.118669 -30.01505 1.844180e-07 0.0008631377 7.451766
>> 7474  6ipCUUDxcp4ryIj6Uk       <NA> -3.155916 -24.45685 5.835502e-07 0.0013366890 6.716048
>> 3831  3nivfFfvk55Rd18lLk       <NA> -2.690362 -24.10891 6.324511e-07 0.0013366890 6.659617
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> I have looked into this problem a little more...
>
> I downloaded the Human6_v2_sequence spreadsheet from the Illumina
> website and found that many of the targets that provide NA as
> gene symbol have no symbol in the Illumina database either.
>
> For example:
>
>         ID            geneSymbol
> 5903    ILMN_21212    FAM43A
> 3103    ILMN_1425     FOXO4
> 11993   ILMN_6504     PPL
> 5153    ILMN_19390    ST3GAL4
> 1723    ILMN_12716    CREB3L2
> 4484    ILMN_17676    TNS3
> 2700    ILMN_138461   <NA>
> 1358    ILMN_12133    FSCN1
> 3507    ILMN_15271    CITED4
> 12401   ILMN_73087    <NA>
>
> ILMN_73087 provides NA as gene symbol and does not have a gene
> symbol in the Illumina DB either.
>
> However, ILMN_138461 provides NA as gene symbol but does have a
> gene symbol in the Illumina DB. It is APM-1.
>
> In addition, ILMN_73087 has no entries in either the
> Illumina or BioC DB, but when I do a search for ILMN_73087 in
> Ensembl I get a hit that has multiple EntrezGene listings.
>
> Is there any fix for the NA entries? Is this problem being addressed?
> thanks,
> Sebastien
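
The comparison described above can be sketched as a small table join that
sorts each ID into one of three cases.  The IDs and the symbols FOXO4 and
APM-1 come from the message; the data frame layout, the column names and the
Illumina entry for ILMN_1425 are assumptions made for illustration.

```r
# Symbols as reported by the annotation package (from the table above)
bioc <- data.frame(ID = c("ILMN_1425", "ILMN_138461", "ILMN_73087"),
                   geneSymbol = c("FOXO4", NA, NA),
                   stringsAsFactors = FALSE)

# Hypothetical extract of the Illumina spreadsheet (column names assumed)
illumina <- data.frame(ID = c("ILMN_1425", "ILMN_138461", "ILMN_73087"),
                       Symbol = c("FOXO4", "APM-1", NA),
                       stringsAsFactors = FALSE)

# Left join on ID, then classify each probe
both <- merge(bioc, illumina, by = "ID", all.x = TRUE)
both$status <- ifelse(!is.na(both$geneSymbol), "annotated",
                ifelse(!is.na(both$Symbol), "symbol only in Illumina sheet",
                       "missing in both"))

status <- setNames(both$status, both$ID)
```

IDs in the "missing in both" group are the ones for which an external search
or a BLAST of the probe sequence is the remaining option.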


