[Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package

Hervé Pagès hpages.on.github using gmail.com
Fri Apr 8 18:45:37 CEST 2022


On 08/04/2022 09:26, Yang Liao wrote:

> Thank you, Hervé and Lori!
>
> Indeed, the RefSeq mm39 annotation is available in TxDb, but in our
> case, we built a special version that was specifically treated and
> tested for RNA-seq analysis,
>
Would be good to know what that means exactly. If Rsubread uses a subset 
of RefSeq exons, the curation process should be documented somewhere, 
for the sake of reproducibility.

Best,

H.

> so we still hope to use the inbuilt version in Rsubread. Maybe we will
> use a public server to host the annotation files, or further compress
> them to fit the 5MB file limit.
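
As a rough illustration of the compression option mentioned above, the base-R sketch below writes a tab-delimited exon table through a gzip connection and compares both file sizes with the 5 MB limit. The data frame is synthetic and the file names are hypothetical. Real annotation files will compress differently, and whether Rsubread can read a compressed annotation directly is not addressed here.

```r
# Synthetic stand-in for an exon annotation table (not real RefSeq data).
set.seed(1)
n <- 50000
start <- sample.int(1e8, n)
exons <- data.frame(
  GeneID = sample.int(30000, n, replace = TRUE),
  Chr    = paste0("chr", sample(c(1:19, "X", "Y"), n, replace = TRUE)),
  Start  = start,
  End    = start + sample.int(2000, n, replace = TRUE),
  Strand = sample(c("+", "-"), n, replace = TRUE)
)

# Hypothetical output paths in a temporary directory.
plain_path <- file.path(tempdir(), "exons_demo.txt")
gz_path    <- paste0(plain_path, ".gz")

write.table(exons, file = plain_path, quote = FALSE, sep = "\t",
            row.names = FALSE)

# Writing through gzfile() compresses the same text on the fly.
con <- gzfile(gz_path, open = "wt")
write.table(exons, file = con, quote = FALSE, sep = "\t", row.names = FALSE)
close(con)

limit_bytes <- 5 * 1024^2   # the 5 MB limit discussed in this thread
sizes <- file.size(c(plain_path, gz_path))
names(sizes) <- c("plain", "gzip")
fits_limit <- sizes <= limit_bytes

# read.table() decompresses gzip files transparently, so the content
# can be checked after a round trip.
roundtrip <- read.table(gz_path, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)
```

Printing `sizes` and `fits_limit` shows whether either version stays under the limit for a given table.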
>
> Thanks again for the very detailed and timely answers, and the example 
> code!
>
> All the best,
>
> Yang
>
> From: Hervé Pagès <hpages.on.github using gmail.com>
> Sent: Saturday, 9 April 2022 2:21 AM
> To: Kern, Lori <Lori.Shepherd using RoswellPark.org>; Yang Liao
> <Yang.Liao using onjcri.org.au>; bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB)
> into the Bioconductor Rsubread package
>
> Also just a reminder that RefSeq exons for mm39 are already available
> thru the TxDb.Mmusculus.UCSC.mm39.refGene package:
>
>   library(TxDb.Mmusculus.UCSC.mm39.refGene)
>   txdb <- TxDb.Mmusculus.UCSC.mm39.refGene
>   mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))
>   mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))
>   names(mm39_exons) <- NULL
>   mm39_exons
>   # GRanges object with 243976 ranges and 1 metadata column:
>   #                    seqnames          ranges strand | GeneID
>   #                       <Rle>       <IRanges> <Rle> | <character>
>   #        [1]             chr1 4878046-4878205      + | 18777
>   #        [2]             chr1 4878678-4878709      + | 18777
>   #        [3]             chr1 4898807-4898872      + | 18777
>   #        [4]             chr1 4900491-4900538      + | 18777
>   #        [5]             chr1 4902534-4902604      + | 18777
>   #        ...              ...             ...    ... . ...
>   #   [243972] chrUn_JH584304v1     55112-55248      - | 66776
>   #   [243973] chrUn_JH584304v1     55465-55701      - | 66776
>   #   [243974] chrUn_JH584304v1     56986-57151      - | 66776
>   #   [243975] chrUn_JH584304v1     58564-58835      - | 66776
>   #   [243976] chrUn_JH584304v1     59592-59689      - | 66776
>   #   -------
>   #   seqinfo: 61 sequences (1 circular) from mm39 genome
>
> so there should be no need to add anything to AnnotationHub or to
> Rsubread itself.
>
> Dump the exons to a tab-delimited file similar to
> Rsubread/inst/annot/mm10_RefSeq_exon.txt with:
>
>   df <- as.data.frame(mm39_exons)
>   df <- cbind(df[ , "GeneID", drop=FALSE],
>               df[ , setdiff(colnames(df), c("GeneID", "width"))])
>   stopifnot(identical(colnames(df),
>                       c("GeneID", "seqnames", "start", "end", "strand")))
>   colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
>   write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
>               row.names=FALSE)
>
> The entire process of obtaining the exons and dumping them to the file
> takes about 2 seconds on my laptop ;-)
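
A quick sanity check on a file written this way could look like the sketch below. It uses base R only, with a tiny hand-written table (the first three exons from the printout above) in place of the real output. The helper name check_exon_table is hypothetical, and the checks are an assumption about what such a five-column table should satisfy, not Rsubread's own validation.

```r
# Hypothetical helper: reads a tab-delimited exon table and checks the
# five-column layout (GeneID, Chr, Start, End, Strand) used above.
check_exon_table <- function(path) {
  df <- read.table(path, header = TRUE, sep = "\t",
                   colClasses = c("character", "character",
                                  "integer", "integer", "character"))
  expected <- c("GeneID", "Chr", "Start", "End", "Strand")
  stopifnot(identical(colnames(df), expected))
  # Coordinates are 1-based and each exon must end at or after its start.
  stopifnot(all(df$Start >= 1L), all(df$End >= df$Start))
  stopifnot(all(df$Strand %in% c("+", "-", "*")))
  df
}

# Tiny example built from the first rows of the GRanges printout.
demo <- data.frame(
  GeneID = c("18777", "18777", "18777"),
  Chr    = "chr1",
  Start  = c(4878046L, 4878678L, 4898807L),
  End    = c(4878205L, 4878709L, 4898872L),
  Strand = "+"
)
demo_path <- tempfile(fileext = ".txt")
write.table(demo, file = demo_path, quote = FALSE, sep = "\t",
            row.names = FALSE)

checked <- check_exon_table(demo_path)
exons_per_gene <- table(checked$GeneID)
```

Reading GeneID as character avoids any numeric reformatting of the identifiers on the way back in.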
>
> H.
>
>
> On 08/04/2022 06:00, Kern, Lori wrote:
> > Exceptions to file size are not permitted. We would prefer the data
> > be downloaded and distributed through the AnnotationHub as we are
> > moving away from traditional data packages.
> >
> > Please see HubPub and the vignette on how to create a hub package:
> > https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html
> >
> > In this case, since it is a single file, creating an entirely
> > separate annotation package seems like overkill and unnecessary
> > overhead, so we would advise using the annotationhub directory in
> > Rsubread.
> > You may choose to host the data file yourself on a publicly accessible
> > and reliable server (institutional level, AWS bucket, data lakes,
> > Zenodo); private servers and hosting data on GitHub are not allowed by
> > Bioconductor standards. If you are not able to host the data yourself,
> > you may upload the data file to the Bioconductor Azure Data Lake as
> > described in the vignette linked above.
> > At a minimum, Rsubread would need to add a metadata.csv file in
> > inst/extdata that provides the necessary metadata, and add the
> > biocViews term AnnotationHubSoftware.
> >
> > Please let us know when these files and changes are available and we
> > can further assist in adding the data officially to the AnnotationHub.
> >
> > Cheers,
> >
> > Lori Shepherd - Kern
> > Bioconductor Core Team
> > Roswell Park Comprehensive Cancer Center
> > Department of Biostatistics & Bioinformatics
> > Elm & Carlton Streets
> > Buffalo, New York 14263
> >
> > ________________________________
> > From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of
> > Yang Liao <Yang.Liao using onjcri.org.au>
> > Sent: Friday, April 8, 2022 3:16 AM
> > To: bioc-devel using r-project.org <bioc-devel using r-project.org>
> > Subject: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into
> > the Bioconductor Rsubread package
> >
> > Hi,
> >
> > We are the maintainers of the Bioconductor Rsubread package. We are
> > trying to add a new gene annotation file (RefSeq GRCm39/mm39) to the
> > Rsubread package for users who have switched to the mm39 reference
> > genome for their analyses.
> >
> > We have built the annotation file, but we found that it is a little
> > too large (~ 9 MB), exceeding the 5MB limit. Hence the Git command
> > refused to submit the file to the Rsubread (devel) repository, with
> > the error message: "Error: file larger than 5 Mb".
> >
> > Would it be possible to have an exemption so that we can add the mm39
> > annotation file to the Rsubread package?
> >
> > All the best,
> > Yang
> >
> > _______________________________________________
> > Bioc-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> -- 
> Hervé Pagès
>
> Bioconductor Core Team
> hpages.on.github using gmail.com
>
-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github using gmail.com
