[BioC] HapMap gene list

Wed Aug 4 22:49:56 CEST 2010

You are right! Sorry to bother you with this.
However, there is still something wrong. When I export the filtered
data frame again (write.table), the output still contains CDS and UTR
records, and when I run:

> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="\t")
> nrow(hapmap)
[1] 171701
> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
> nrow(hapmap2)
[1] 12718
> hapmap2[205,]
       V1     V2   V3       V4       V5 V6 V7 V8
2759 chr1 UCSC_1 mRNA 11840109 11841579  .  -  .
V9
2759 ID=NM_002521;Alias=NPPB;Note=natriuretic peptide precursor B
preproprotein;summary=This gene is a member of the natriuretic peptide
family and encodes a secreted protein which functions as a cardiac
hormone. The protein undergoes two cleavage events%2C one within the
cell and a second after secretion into the blood. The proteins
biological actions include natriuresis%2C diuresis%2C
vasorelaxation%2C inhibition of renin and aldosterone secretion%2C and
a key role in cardiovascular homeostasis. A high concentration of this
protein in the bloodstream is indicative of heart failure. Mutations
in this gene have been associated with postmenopausal osteoporosis.
Publication Note:  This RefSeq record includes a subset of the
publications that are available for this gene. Please see the Entrez
Gene record to access additional
publications.\nchr1\tUCSC_1\tthree_prime_UTR\t11840109\t11840298\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840299\t11840315\t.\t-\t1\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840858\t11841113\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11841346\t11841477\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tfive_prime_UTR\t11841478\t11841579\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tmRNA\t11902712\t11909067\t.\t-\t.\tID=NM_138346;Alias=KIAA2013;Note=hypothetical
protein LOC90231\nchr1\tUCSC_1\tthree_prime_UTR\t11902712\t11902958\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11902959\t11902976\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11905280\t11906133\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11907849\t11908881\t.\t-\t0\tParent=NM_138346\nchr1\tUCSC_1\tfive_prime_UTR\t11908882\t11909067\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tmRNA\t11917333\t11958180\t.\t+\t.\tID=NM_000302;Alias=PLOD1;Note=lysyl
hydroxylase precursor;summary=Lysyl hydroxylase is a membrane-bound
homodimeric protein localized to the cisternae of the endoplasmic
reticulum. The enzyme (cofactors iron and ascorbate) catalyzes the
hydroxylation of lysyl residues in collagen-like peptides. The
resultant hydroxylysyl groups are attachment sites for carbohydrates
in col
... (shortened here)

I have no idea where R gets these "\t.*" parts from, but I think they
somehow mess up the whole data frame. Any suggestions?
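
One thing that may be worth checking: read.table treats ' and " as quote
characters by default, so an apostrophe in the free-text V9 column could
open a quoted string that swallows the following tabs and newlines until
the next apostrophe. Below is a minimal sketch on a made-up three-line
file (the IDs NM_A and NM_B, the notes and the temporary file are
hypothetical, not taken from the HapMap file) that reads with quoting and
comment handling switched off and then compares the row count with the
number of lines in the file:

```r
# Hypothetical GFF-like lines; the apostrophes in column 9 mimic free-text
# summaries such as "The protein's biological actions ...".
gff_lines <- c(
  "chr1\tUCSC_1\tmRNA\t100\t900\t.\t-\t.\tID=NM_A;Note=The protein's actions",
  "chr1\tUCSC_1\tCDS\t200\t800\t.\t-\t0\tParent=NM_A",
  "chr1\tUCSC_1\tmRNA\t1000\t1900\t.\t+\t.\tID=NM_B;Note=The enzyme's cofactors"
)
tmp <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp)

# quote = "" keeps ' and " as ordinary characters;
# comment.char = "" keeps any # inside the text from truncating a line.
good <- read.table(tmp, header = FALSE, sep = "\t", quote = "",
                   comment.char = "", stringsAsFactors = FALSE)

# Sanity check: one data frame row per line of the file.
n_lines <- length(readLines(tmp))
nrow(good) == n_lines

# Select the mRNA records only.
mrna <- good[good$V3 == "mRNA", ]
```

If the same row-count check fails on the real file with the default
settings but passes with quote = "", the apostrophes are the likely cause.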

Thanks

On Wed, Aug 4, 2010 at 7:08 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
> On Wed, Aug 4, 2010 at 1:41 PM, noxyport at gmail.com <noxyport at gmail.com> wrote:
>> Hi,
>>
>> I have a problem with the gene list (gff version3 file) HapMap is
>> using (ftp://ftp.ncbi.nlm.nih.gov/hapmap/gbrowse/2009-02_phaseII+III/gff/refGene_hg18_tests_11Apr07.gff.gz).
>> I tried loading the file into R and selecting all "mRNA" entries but
>> something seems to go wrong with it:
>>
>>> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="\t")
>>> nrow(hapmap)
>> [1] 171701
>>> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
>>> nrow(hapmap2)
>> [1] 12718
>>> hapmap[(2210:2220), (1:3)]
>
> Here, you want to use hapmap2 and not hapmap.
>
> Kasper
>
>
>> 2210 chr1 UCSC_1           mRNA
>> 2211 chr1 UCSC_1 five_prime_UTR
>> 2212 chr1 UCSC_1 five_prime_UTR
>> 2213 chr1 UCSC_1            CDS
>> 2214 chr1 UCSC_1            CDS
>> 2215 chr1 UCSC_1            CDS
>> 2216 chr1 UCSC_1            CDS
>> 2217 chr1 UCSC_1            CDS
>> 2218 chr1 UCSC_1            CDS
>> 2219 chr1 UCSC_1            CDS
>> 2220 chr1 UCSC_1            CDS
>>>
>>
>> Can anyone explain why this happens? It is probably the large descriptive
>> column (V9), but I don't see where it fails.
>>
>> I have to admit that this is probably not the best way to use this file,
>> but I cannot find any other source (RefSeq, UCSC) that contains the
>> same genomic regions for the annotated genes as HapMap. Which NCBI
>> 36 build did they use, and where can I download a gene file with
>> chromosome, gene start and stop matching HapMap?
>>
>> Thanks for your help!