[Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package

Hervé Pagès hpages.on.github using gmail.com
Fri Apr 8 18:21:25 CEST 2022


Also just a reminder that RefSeq exons for mm39 are already available 
through the TxDb.Mmusculus.UCSC.mm39.refGene package:

   library(TxDb.Mmusculus.UCSC.mm39.refGene)

   txdb <- TxDb.Mmusculus.UCSC.mm39.refGene

   mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))

   mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))

   names(mm39_exons) <- NULL

   mm39_exons
   # GRanges object with 243976 ranges and 1 metadata column:
   #                    seqnames          ranges strand | GeneID
   #                       <Rle>       <IRanges> <Rle> | <character>
   #        [1]             chr1 4878046-4878205      + | 18777
   #        [2]             chr1 4878678-4878709      + | 18777
   #        [3]             chr1 4898807-4898872      + | 18777
   #        [4]             chr1 4900491-4900538      + | 18777
   #        [5]             chr1 4902534-4902604      + | 18777
   #        ...              ...             ...    ... . ...
   #   [243972] chrUn_JH584304v1     55112-55248      - | 66776
   #   [243973] chrUn_JH584304v1     55465-55701      - | 66776
   #   [243974] chrUn_JH584304v1     56986-57151      - | 66776
   #   [243975] chrUn_JH584304v1     58564-58835      - | 66776
   #   [243976] chrUn_JH584304v1     59592-59689      - | 66776
   #   -------
   #   seqinfo: 61 sequences (1 circular) from mm39 genome

so there should be no need to add anything to AnnotationHub or to 
Rsubread itself.

Dump the exons to a tab-delimited file similar to 
Rsubread/inst/annot/mm10_RefSeq_exon.txt with:

   df <- as.data.frame(mm39_exons)

   df <- cbind(df[ , "GeneID", drop=FALSE],
               df[ , setdiff(colnames(df), c("GeneID", "width"))])

   stopifnot(identical(colnames(df),
                       c("GeneID", "seqnames", "start", "end", "strand")))

   colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")

   write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
               row.names=FALSE)

The entire process of obtaining the exons and dumping them to the file 
takes about 2 seconds on my laptop ;-)
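
Before handing such a file to a downstream tool, a quick round-trip check 
can confirm that the layout survived the dump. What follows is only a 
minimal sketch in base R: the three-row `saf` data frame stands in for the 
real table, and `check_saf()` is a hypothetical helper, not something 
provided by Rsubread. The assumption that strands are only "+" or "-" is 
mine as well.

```r
# Toy stand-in for the real exon table (same five columns as above).
saf <- data.frame(
  GeneID = c("18777", "18777", "66776"),
  Chr    = c("chr1", "chr1", "chrUn_JH584304v1"),
  Start  = c(4878046L, 4878678L, 55112L),
  End    = c(4878205L, 4878709L, 55248L),
  Strand = c("+", "+", "-"),
  stringsAsFactors = FALSE
)

# Write it the same way as the real file, but to a temporary location.
saf_file <- tempfile(fileext = ".txt")
write.table(saf, file = saf_file, quote = FALSE, sep = "\t",
            row.names = FALSE)

# Read it back, forcing GeneID to stay character (gene ids look numeric).
saf2 <- read.delim(saf_file,
                   colClasses = c("character", "character",
                                  "integer", "integer", "character"))

# Hypothetical helper: stops with an error if the table looks malformed.
check_saf <- function(x) {
  stopifnot(
    identical(colnames(x), c("GeneID", "Chr", "Start", "End", "Strand")),
    !anyNA(x),
    all(x$Start <= x$End),
    all(x$Strand %in% c("+", "-"))  # assumption: no unstranded exons
  )
  invisible(TRUE)
}

check_saf(saf2)
```

If the check passes on the real mm39_RefSeq_exon.txt, the file should be 
usable as an external annotation, e.g. via the annot.ext argument of 
featureCounts() with isGTFAnnotationFile=FALSE.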

H.


On 08/04/2022 06:00, Kern, Lori wrote:
> Exceptions to file size are not permitted. We would prefer the data be downloaded and distributed through the AnnotationHub as we are moving away from traditional data packages.
>
> Please see HubPub and the vignette on how to create a hub package https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html
>
> In this case, since it is a single file and creating an entirely separate annotation package seems overkill and unnecessary overhead, we would advise using the annotationhub directory in Rsubread.
> You may choose to host the data file yourself on a publicly accessible and reliable server (institutional level, AWS bucket, data lakes, zenodo); private servers and hosting data on github are not allowed by Bioconductor standards.  If you are not able to host the data yourself, you may upload the data file to the Bioconductor Azure Data Lake as described in the vignette link above.
> Minimally, Rsubread would need to add the metadata.csv file that provides the necessary metadata information in inst/extdata, and add the biocViews term AnnotationHubSoftware.
>
> Please let us know when these files and changes are available and we can further assist adding the data officially to the AnnotationHub.
>
> Cheers,
>
>
>
> Lori Shepherd - Kern
>
> Bioconductor Core Team
>
> Roswell Park Comprehensive Cancer Center
>
> Department of Biostatistics & Bioinformatics
>
> Elm & Carlton Streets
>
> Buffalo, New York 14263
>
> ________________________________
> From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Yang Liao <Yang.Liao using onjcri.org.au>
> Sent: Friday, April 8, 2022 3:16 AM
> To: bioc-devel using r-project.org <bioc-devel using r-project.org>
> Subject: [Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
>
> Hi,
>
> We are the maintainers of the Bioconductor Rsubread package. We are trying to add a new gene annotation file (RefSeq GRCm39/mm39) into the Rsubread package for the users who have switched to the mm39 reference genome for their analyses.
>
> We have built the annotation file, but we found that it was a little too large (~ 9 MBytes), larger than the 5MB limit. Hence the Git command refused to submit the file to the Rsubread (devel) repository, with the error message: "Error: file larger than 5 Mb".
>
> Would it be possible to have an exemption to add the mm39 annotation file into the Rsubread package?
>
> All the best,
> Yang
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github using gmail.com


