[Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package

Hervé Pagès hpages.on.github using gmail.com
Fri Apr 8 20:22:43 CEST 2022


On 08/04/2022 10:02, Yang Liao wrote:

> Thanks for the reply! We used the /flattenGTF/ function in Rsubread to 
> merge the overlapping exons in each gene; this procedure is documented 
> in the manual page of /featureCounts/. We also checked and tested 
> whether there are "tricky" genes in the annotation that need extra 
> care or treatment (e.g. some genes can span multiple chromosomes 
> and/or strands). It is hard to automate all the checks reliably.
>
> Also, I think it would help the reproducibility of DGE analyses if 
> we had a relatively stable version of the gene annotation, one that 
> does not change when the RefSeq annotation changes between builds.
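
The merging step described above, collapsing overlapping exons within a gene, can be sketched in base R. This is only an illustration on a made-up table: the gene IDs `g1`/`g2` and the helper `merge_group` are hypothetical, and this is not the flattenGTF implementation. Grouping by chromosome and strand is just one simple way to cope with genes that span several of them.

```r
# Toy exon table in the GeneID/Chr/Start/End/Strand layout used in this thread.
# All values here are invented for illustration.
exons <- data.frame(
  GeneID = c("g1", "g1", "g1", "g2"),
  Chr    = c("chr1", "chr1", "chr1", "chr1"),
  Start  = c(100L, 150L, 400L, 120L),
  End    = c(200L, 250L, 500L, 180L),
  Strand = c("+", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Merge overlapping intervals within one gene/chromosome/strand group.
# Exons that merely touch (Start == previous End + 1) are kept separate.
merge_group <- function(d) {
  d <- d[order(d$Start, d$End), ]
  # A new merged block begins whenever an exon starts after the
  # furthest end coordinate seen so far.
  furthest_end <- cummax(d$End)
  new_block <- c(TRUE, d$Start[-1] > furthest_end[-nrow(d)])
  block <- cumsum(new_block)
  data.frame(
    GeneID = d$GeneID[1],
    Chr    = d$Chr[1],
    Start  = as.integer(tapply(d$Start, block, min)),
    End    = as.integer(tapply(d$End, block, max)),
    Strand = d$Strand[1],
    stringsAsFactors = FALSE
  )
}

groups <- split(exons, list(exons$GeneID, exons$Chr, exons$Strand), drop = TRUE)
merged <- do.call(rbind, lapply(groups, merge_group))
rownames(merged) <- NULL
merged
```

In this toy input the first two `g1` exons overlap and collapse into one interval, while the third `g1` exon and the `g2` exon are left as they are.
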


I see. Thanks for clarifying.


Best,

H.


>
> All the best,
> Yang
> ------------------------------------------------------------------------
> *From:* Hervé Pagès <hpages.on.github using gmail.com>
> *Sent:* Saturday, 9 April 2022 2:45 AM
> *To:* Yang Liao <Yang.Liao using onjcri.org.au>; Kern, Lori 
> <Lori.Shepherd using RoswellPark.org>; bioc-devel using r-project.org 
> <bioc-devel using r-project.org>
> *Subject:* Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) 
> into the Bioconductor Rsubread package
> ------------------------------------------------------------------------
>
> On 08/04/2022 09:26, Yang Liao wrote:
>
>> Thank you, Hervé and Lori!
>>
>> Indeed, the RefSeq mm39 annotation is available in TxDb, but in our 
>> case, we built a special version that was specifically treated and 
>> tested for RNA-seq analysis,
>>
> Would be good to know what that means exactly. If Rsubread uses a 
> subset of RefSeq exons, the curation process should be documented 
> somewhere, for the sake of reproducibility.
>
> Best,
>
> H.
>
>> so we still hope to use the inbuilt version in Rsubread. Maybe we 
>> will use a public server to host the annotation file, or compress 
>> it further to fit the 5MB file limit.
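
As a rough illustration of the compression option, base R can write a tab-delimited table through a gzip connection and read it back. The table below is randomly generated toy data, so its compression ratio says nothing about the real mm39 file, and whether Rsubread can consume a compressed annotation directly is not addressed in this thread.

```r
# Randomly generated stand-in for an annotation table (NOT real RefSeq data).
set.seed(1)
n <- 5000L
starts <- sort(sample.int(1e8, n))
annot <- data.frame(
  GeneID = sample.int(30000L, n, replace = TRUE),
  Chr    = "chr1",
  Start  = starts,
  End    = starts + sample.int(500L, n, replace = TRUE),
  Strand = sample(c("+", "-"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Write the same table uncompressed and gzip-compressed.
plain <- tempfile(fileext = ".txt")
write.table(annot, file = plain, quote = FALSE, sep = "\t", row.names = FALSE)

gz <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz, "w")
write.table(annot, file = con, quote = FALSE, sep = "\t", row.names = FALSE)
close(con)

# Compare on-disk sizes in bytes.
c(plain = file.size(plain), gzip = file.size(gz))

# Read the compressed file back to confirm nothing was lost.
roundtrip <- read.delim(gzfile(gz), stringsAsFactors = FALSE)
```
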
>>
>> Thanks again for the very detailed and timely answers, and the 
>> example code!
>>
>> All the best,
>>
>> Yang
>>
>> *From:* Hervé Pagès <hpages.on.github using gmail.com>
>> *Sent:* Saturday, 9 April 2022 2:21 AM
>> *To:* Kern, Lori <Lori.Shepherd using RoswellPark.org>; Yang Liao 
>> <Yang.Liao using onjcri.org.au>; bioc-devel using r-project.org 
>> <bioc-devel using r-project.org>
>> *Subject:* Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB) 
>> into the Bioconductor Rsubread package
>>
>> Also just a reminder that RefSeq exons for mm39 are already available
>> thru the TxDb.Mmusculus.UCSC.mm39.refGene package:
>>
>>   library(TxDb.Mmusculus.UCSC.mm39.refGene)
>>
>>   txdb <- TxDb.Mmusculus.UCSC.mm39.refGene
>>
>>   mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))
>>
>>   mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))
>>
>>   names(mm39_exons) <- NULL
>>
>>   mm39_exons
>>   # GRanges object with 243976 ranges and 1 metadata column:
>>   #                    seqnames          ranges strand |      GeneID
>>   #                       <Rle>       <IRanges>  <Rle> | <character>
>>   #        [1]             chr1 4878046-4878205      + |       18777
>>   #        [2]             chr1 4878678-4878709      + |       18777
>>   #        [3]             chr1 4898807-4898872      + |       18777
>>   #        [4]             chr1 4900491-4900538      + |       18777
>>   #        [5]             chr1 4902534-4902604      + |       18777
>>   #        ...              ...             ...    ... .         ...
>>   #   [243972] chrUn_JH584304v1     55112-55248      - |       66776
>>   #   [243973] chrUn_JH584304v1     55465-55701      - |       66776
>>   #   [243974] chrUn_JH584304v1     56986-57151      - |       66776
>>   #   [243975] chrUn_JH584304v1     58564-58835      - |       66776
>>   #   [243976] chrUn_JH584304v1     59592-59689      - |       66776
>>   #   -------
>>   #   seqinfo: 61 sequences (1 circular) from mm39 genome
>>
>> so there should be no need to add anything to AnnotationHub or to
>> Rsubread itself.
>>
>> Dump the exons to a tab-delimited file similar to
>> Rsubread/inst/annot/mm10_RefSeq_exon.txt with:
>>
>>   df <- as.data.frame(mm39_exons)
>>
>>   df <- cbind(df[ , "GeneID", drop=FALSE],
>>               df[ , setdiff(colnames(df), c("GeneID", "width"))])
>>
>>   stopifnot(identical(colnames(df),
>>                       c("GeneID", "seqnames", "start", "end", "strand")))
>>
>>   colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
>>
>>   write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
>>               row.names=FALSE)
>>
>> The entire process of obtaining the exons and dumping them to the file
>> takes about 2 seconds on my laptop ;-)
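
Once the exons are in a table with this layout, one of the checks mentioned elsewhere in the thread (genes spanning more than one chromosome or strand) can be approximated in base R. The sketch below is an assumption-laden illustration, not part of any Rsubread workflow: the first two rows reuse coordinates from the output shown above, while the gene `g_multi` and its coordinates are hypothetical and exist only to trigger the check.

```r
# Small table in the GeneID/Chr/Start/End/Strand layout.
# Rows 1-2 come from the GRanges output above; "g_multi" is invented.
df <- data.frame(
  GeneID = c("18777", "18777", "g_multi", "g_multi"),
  Chr    = c("chr1", "chr1", "chrX", "chrY"),
  Start  = c(4878046L, 4878678L, 1000L, 2000L),
  End    = c(4878205L, 4878709L, 1500L, 2500L),
  Strand = c("+", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Count distinct chromosomes and strands per gene.
n_chr    <- tapply(df$Chr,    df$GeneID, function(x) length(unique(x)))
n_strand <- tapply(df$Strand, df$GeneID, function(x) length(unique(x)))

# Flag genes that need a closer look.
tricky <- names(n_chr)[n_chr > 1 | n_strand > 1]
tricky
```

A check like this only flags candidates. Deciding how to treat them still requires the kind of manual review described in the thread.
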
>>
>> H.
>>
>>
>> On 08/04/2022 06:00, Kern, Lori wrote:
>> > Exceptions to file size are not permitted. We would prefer the data 
>> > be downloaded and distributed through the AnnotationHub as we are 
>> > moving away from traditional data packages.
>> >
>> > Please see HubPub and the vignette on how to create a hub package 
>> > https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html 
>> >
>> > In this case, since it is a single file and creating an entirely 
>> > separate annotation package seems like overkill and unnecessary 
>> > overhead, we would advise using the annotationhub directory in 
>> > Rsubread.
>> > You may choose to host the data file yourself on a publicly 
>> > accessible and reliable server (institutional level, AWS bucket, 
>> > data lakes, Zenodo); private servers and hosting data on GitHub are 
>> > not allowed by Bioconductor standards. If you are not able to host 
>> > the data yourself, you may upload the data file to the Bioconductor 
>> > Azure Data Lake as described in the vignette linked above.
>> > Minimally, Rsubread would need to add the metadata.csv file that 
>> > provides the necessary metadata information in inst/extdata, and 
>> > add the biocViews term AnnotationHubSoftware.
>> >
>> > Please let us know when these files and changes are available and 
>> > we can further assist with adding the data officially to the 
>> > AnnotationHub.
>> >
>> > Cheers,
>> >
>> >
>> >
>> > Lori Shepherd - Kern
>> >
>> > Bioconductor Core Team
>> >
>> > Roswell Park Comprehensive Cancer Center
>> >
>> > Department of Biostatistics & Bioinformatics
>> >
>> > Elm & Carlton Streets
>> >
>> > Buffalo, New York 14263
>> >
>> > ________________________________
>> > From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf 
>> > of Yang Liao <Yang.Liao using onjcri.org.au>
>> > Sent: Friday, April 8, 2022 3:16 AM
>> > To: bioc-devel using r-project.org <bioc-devel using r-project.org>
>> > Subject: [Bioc-devel] Adding a new annotation file (size ~ 9MB) 
>> > into the Bioconductor Rsubread package
>> >
>> > Hi,
>> >
>> > We are the maintainers of the Bioconductor Rsubread package. We are 
>> > trying to add a new gene annotation file (RefSeq GRCm39/mm39) to 
>> > the Rsubread package for users who have switched to the mm39 
>> > reference genome for their analyses.
>> >
>> > We have built the annotation file, but we found that it is a 
>> > little too large (~ 9 MBytes), exceeding the 5MB limit. Hence the 
>> > Git command refused to submit the file to the Rsubread (devel) 
>> > repository, with the error message "Error: file larger than 5 Mb".
>> >
>> > Would it be possible for us to have an exemption to add the mm39 
>> > annotation file to the Rsubread package?
>> >
>> > All the best,
>> > Yang
>> >
>> > _______________________________________________
>> > Bioc-devel using r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> -- 
>> Hervé Pagès
>>
>> Bioconductor Core Team
>> hpages.on.github using gmail.com
>>
> -- 
> Hervé Pagès
>
> Bioconductor Core Team
> hpages.on.github using gmail.com

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github using gmail.com
