[BioC] NA geneSymbol with lumi
Lynn Amon
lamon at fhcrc.org
Fri Nov 16 18:32:20 CET 2007
The illuminaMousev1p1 annotation package is built from the RefSeq
identifiers provided by Illumina. You can see which identifier was used
for each probe by looking at as.list(illuminaMousev1p1ACCNUM). Some of
these are RefSeq IDs and some are GenBank IDs. Only the RefSeq IDs are
used to retrieve the rest of the annotations from the NCBI databases.
The probes with GenBank IDs are not used because those probes are not
in exonic sections of the transcript. If you are looking for more
information on a probe which does not have a gene symbol in the
annotation package, you should start with illuminaMousev1p1ACCNUM
rather than going back to the Illumina manifest file. Or, as was
suggested earlier, BLAST the probe sequence. I may do that for the next
Bioconductor release rather than using the RefSeq IDs provided by
Illumina.
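
As a rough sketch of that lookup: the snippet below pulls the accession
for every probe whose symbol is NA and flags whether it looks like a
RefSeq ID (two-letter prefix plus underscore, e.g. NM_) or a GenBank ID.
The two environments, the probe IDs and the accession values are made-up
stand-ins so the example runs on its own; with the real package you
would pass its SYMBOL and ACCNUM mappings instead.

```r
# Hypothetical stand-ins for the package's SYMBOL and ACCNUM environments.
# Probe IDs and accession values are invented for illustration only.
symbol_env <- list2env(list(probeA = "GeneA",
                            probeB = NA_character_,
                            probeC = NA_character_))
accnum_env <- list2env(list(probeA = "NM_000001",
                            probeB = "AK000001",
                            probeC = "NM_000002"))

ids <- c("probeA", "probeB", "probeC")

# Symbols for all probes; NA marks the ones the package could not annotate.
symbols <- unlist(mget(ids, envir = symbol_env, ifnotfound = NA))
missing_ids <- names(symbols)[is.na(symbols)]

# Accession that was used for each unannotated probe.
acc <- unlist(mget(missing_ids, envir = accnum_env, ifnotfound = NA))

# RefSeq accessions start with two capital letters and an underscore.
is_refseq <- grepl("^[A-Z]{2}_", acc)

report <- data.frame(probe = missing_ids, accession = acc,
                     refseq = is_refseq, row.names = NULL)
print(report)
```

Probes flagged as non-RefSeq are the ones that were skipped when the
package was built, so they are the natural candidates for a BLAST search.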
Lynn Amon
Paul Leo wrote:
> Hi Sebastien,
> Yes, I get this all the time as well. It does not seem to matter whether
> you use nuID or other Illumina annotations (illuminaMousev1p1, for example).
>
> As a quick fix I have ended up using the Illumina annotations as
> supplemental data: where there is an "NA", I look up the targetID
> and use the Illumina annotation for that targetID. In most cases the
> missing ones are Riken cDNAs. This code will get you started; note the
> pitfalls flagged in its comments. The table ann used below is just the
> Illumina annotation data read into a data frame:
>
> > dim(ann)
> [1] 46643 13
>
> > colnames(ann)
>  [1] "Search_key"     "Target"         "ProbeId"        "Gid"
>  [5] "Transcript"     "Accession"      "Symbol"         "Type"
>  [9] "Start"          "Probe_Sequence" "Definition"     "Ontology"
> [13] "Synonym"
> (note in ann "Target" and "ProbeId" do not contain unique entries so
> can't be used as rownames in the table)
>
> Rough solution:
>
> # Look up a symbol for every ID in the results table; misses come back as NA.
> LL <- mget(results[,"ID"], envir=illuminaMousev1p1SYMBOL, ifnotfound=NA)
> lost_targets <- labels(LL[is.na(LL)])
> # Find each lost target in the Illumina annotation (literal substring match).
> locations <- lapply(lost_targets, function(x)
>     grep(x, ann[,"Target"], fixed=TRUE))
> # WARNING: if length(unlist(locations)) != length(lost_targets), a
> # targetID was not found or was found multiple times
> # (have not had this happen yet).
> locations <- lapply(locations, function(x) x[1])  # in case more than one
> lost_ann <- ann[unlist(locations),]
> LL[is.na(LL)] <- as.character(lost_ann[,"Symbol"])  # yes, assumes same order
>
> Cheers
> Paul
>
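
A possible variation on the rough solution above, offered only as a
sketch: match() compares whole strings and always returns exactly one
position (or NA) per lost target, which sidesteps both pitfalls noted in
the warning. grep() with fixed=TRUE matches substrings, so a target such
as "T1" would also hit "T10". The ann table and the LL list below are
small made-up stand-ins so the example runs on its own.

```r
# Made-up stand-in for the Illumina annotation data frame.
ann <- data.frame(
  Target = c("T1", "T2", "T2", "T10"),
  Symbol = c("GeneA", "GeneB", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the named list returned by mget();
# NA marks a probe without a symbol in the annotation package.
LL <- list(T1 = "GeneA", T10 = NA, T2 = NA, T99 = NA)

lost <- names(LL)[is.na(LL)]

# One index per lost target, in the same order: the first exact match,
# or NA when the target is absent from the Illumina table.
idx <- match(lost, ann$Target)

# Fill the gaps by name, so nothing depends on positional alignment.
LL[lost] <- as.list(ann$Symbol[idx])

str(LL)
```

Targets that are missing from the Illumina table as well (T99 here) simply
stay NA, and duplicated targets (T2) resolve to their first row.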
>
>> Hi,
>> I am using the lumi package to analyse Illumina microarray data.
>> When it finally comes to getting the top 10 DE genes with topTable, I
>> get many hits with the geneSymbol <NA>. However, if I look up the
>> ProbeIDs corresponding to the nuIDs that give <NA>, I find that they do
>> correspond to genes. Why aren't they being displayed in the topTable?
>> thanks,
>> Sebastien
>>
>>                       ID geneSymbol     logFC         t      P.Value    adj.P.Val        B
>> 1917  fwfUovXT3rjAjqbpJU     S100A8 -5.307223 -50.43759 9.854174e-09 0.0001383625 8.724832
>> 12632 Qd_S7V4OkLjsX3jkt4      KRT6B -5.281406 -39.54237 3.896317e-08 0.0002735409 8.229157
>> 12149 BjSTT6BOqGLhpKKFGI       <NA> -3.118669 -30.01505 1.844180e-07 0.0008631377 7.451766
>> 7474  6ipCUUDxcp4ryIj6Uk       <NA> -3.155916 -24.45685 5.835502e-07 0.0013366890 6.716048
>> 3831  3nivfFfvk55Rd18lLk       <NA> -2.690362 -24.10891 6.324511e-07 0.0013366890 6.659617
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> I have looked into this problem a little more...
>
> I downloaded the Human6_v2_sequence spreadsheet from the Illumina
> website and found that many of the targets that provide NA as
> gene symbol have no symbol in the Illumina database either.
>
> For example:
>
>       ID          geneSymbol
> 5903  ILMN_21212  FAM43A
> 3103  ILMN_1425   FOXO4
> 11993 ILMN_6504   PPL
> 5153  ILMN_19390  ST3GAL4
> 1723  ILMN_12716  CREB3L2
> 4484  ILMN_17676  TNS3
> 2700  ILMN_138461 <NA>
> 1358  ILMN_12133  FSCN1
> 3507  ILMN_15271  CITED4
> 12401 ILMN_73087  <NA>
>
> ILMN_73087 provides NA as gene symbol and does not have a gene
> symbol in the Illumina DB either.
>
> However, ILMN_138461 provides NA as gene symbol but does have a
> gene symbol in the Illumina DB. It is APM-1.
>
> In addition, ILMN_73087 has no entries in either the
> Illumina or BioC DB, but when I search for ILMN_73087 in
> Ensembl I get a hit that has multiple EntrezGene listings.
>
> Is there any fix for the NA entries? Is this problem being addressed?
> thanks,
> Sebastien
>