[Bioc-devel] Adding a new annotation file (size ~ 9MB) into the Bioconductor Rsubread package
Hervé Pagès
hpages.on.github using gmail.com
Fri Apr 8 20:22:43 CEST 2022
On 08/04/2022 10:02, Yang Liao wrote:
> Thanks for the reply! We used the /flattenGTF/ function in Rsubread to
> merge the overlapping exons in each gene; this procedure is documented
> in the manual page of /featureCounts/. We also checked and tested
> whether there are "tricky" genes in the annotation that need extra
> care (e.g. some genes can span multiple chromosomes and/or strands).
> It is hard to automate all the checks reliably.
>
> Also, I think it helps the reproducibility of DGE analyses if we have
> a relatively stable version of the gene annotation, one that does not
> change when the RefSeq annotation changes between builds.
I see. Thanks for clarifying.
Best,
H.
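
A note on the merging step mentioned in the quoted message above: the following is a minimal base-R sketch of what merging overlapping exons per gene could look like. It is not Rsubread's flattenGTF implementation; the function name merge_exons and the toy coordinates are made up for illustration, and only the column names (GeneID, Chr, Start, End, Strand) come from this thread. Exons are grouped by gene, chromosome and strand, so a gene that spans several chromosomes or strands is kept as separate groups rather than merged across them.

```r
# Hypothetical helper (not part of Rsubread): merge overlapping exons
# within each GeneID / Chr / Strand group of a five-column exon table.
merge_exons <- function(exons) {
  key <- paste(exons$GeneID, exons$Chr, exons$Strand, sep = "|")
  pieces <- lapply(split(exons, key), function(g) {
    g <- g[order(g$Start, g$End), ]
    # Furthest end seen before each exon; the first exon has no predecessor.
    prev_max_end <- c(-Inf, head(cummax(g$End), -1))
    # A new block starts whenever an exon begins past everything seen so far.
    block <- cumsum(g$Start > prev_max_end)
    data.frame(
      GeneID = g$GeneID[1],
      Chr    = g$Chr[1],
      Start  = as.integer(tapply(g$Start, block, min)),
      End    = as.integer(tapply(g$End, block, max)),
      Strand = g$Strand[1],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy input (made-up coordinates): two overlapping exons and one separate
# exon in gene g1, plus a single exon in gene g2.
toy <- data.frame(
  GeneID = c("g1", "g1", "g1", "g2"),
  Chr    = c("chr1", "chr1", "chr1", "chr2"),
  Start  = c(100L, 150L, 400L, 10L),
  End    = c(200L, 250L, 500L, 60L),
  Strand = c("+", "+", "+", "-"),
  stringsAsFactors = FALSE
)

merged <- merge_exons(toy)
print(merged)
```

On the toy input, the two overlapping g1 exons collapse into a single range (100-250), while the third g1 exon and the g2 exon are left unchanged.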
>
> All the best,
> Yang
> ------------------------------------------------------------------------
> *From:* Hervé Pagès <hpages.on.github using gmail.com>
> *Sent:* Saturday, 9 April 2022 2:45 AM
> *To:* Yang Liao <Yang.Liao using onjcri.org.au>; Kern, Lori
> <Lori.Shepherd using RoswellPark.org>; bioc-devel using r-project.org
> <bioc-devel using r-project.org>
> *Subject:* Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB)
> into the Bioconductor Rsubread package
> ------------------------------------------------------------------------
>
> On 08/04/2022 09:26, Yang Liao wrote:
>
>> Thank you, Hervé and Lori!
>>
>> Indeed, the RefSeq mm39 annotation is available in TxDB, but in our
>> case, we built a special version that was specifically treated and
>> tested for RNA-seq analysis,
>>
> Would be good to know what that means exactly. If Rsubread uses a
> subset of RefSeq exons, the curation process should be documented
> somewhere, for the sake of reproducibility.
>
> Best,
>
> H.
>
>> so we still hope to use the inbuilt version in Rsubread. Maybe we
>> will use a public server to host the annotation files, or further
>> compress them to fit the 5MB file limit.
>>
>> Thanks again for the very detailed and timely answers, and the
>> example code!
>>
>> All the best,
>>
>> Yang
>>
>> *From: *Hervé Pagès <hpages.on.github using gmail.com>
>> *Sent: *Saturday, 9 April 2022 2:21 AM
>> *To: *Kern, Lori <Lori.Shepherd using RoswellPark.org>; Yang Liao
>> <Yang.Liao using onjcri.org.au>; bioc-devel using r-project.org
>> *Subject: *Re: [Bioc-devel] Adding a new annotation file (size ~ 9MB)
>> into the Bioconductor Rsubread package
>>
>> Also just a reminder that RefSeq exons for mm39 are already available
>> thru the TxDb.Mmusculus.UCSC.mm39.refGene package:
>>
>> library(TxDb.Mmusculus.UCSC.mm39.refGene)
>>
>> txdb <- TxDb.Mmusculus.UCSC.mm39.refGene
>>
>> mm39_exons <- sort(unlist(exonsBy(txdb, by="gene")))
>>
>> mcols(mm39_exons) <- DataFrame(GeneID=names(mm39_exons))
>>
>> names(mm39_exons) <- NULL
>>
>> mm39_exons
>> # GRanges object with 243976 ranges and 1 metadata column:
>> # seqnames ranges strand | GeneID
>> # <Rle> <IRanges> <Rle> | <character>
>> # [1] chr1 4878046-4878205 + | 18777
>> # [2] chr1 4878678-4878709 + | 18777
>> # [3] chr1 4898807-4898872 + | 18777
>> # [4] chr1 4900491-4900538 + | 18777
>> # [5] chr1 4902534-4902604 + | 18777
>> # ... ... ... ... . ...
>> # [243972] chrUn_JH584304v1 55112-55248 - | 66776
>> # [243973] chrUn_JH584304v1 55465-55701 - | 66776
>> # [243974] chrUn_JH584304v1 56986-57151 - | 66776
>> # [243975] chrUn_JH584304v1 58564-58835 - | 66776
>> # [243976] chrUn_JH584304v1 59592-59689 - | 66776
>> # -------
>> # seqinfo: 61 sequences (1 circular) from mm39 genome
>>
>> so there should be no need to add anything to AnnotationHub or to
>> Rsubread itself.
>>
>> Dump the exons to a tab-delimited file similar to
>> Rsubread/inst/annot/mm10_RefSeq_exon.txt with:
>>
>> df <- as.data.frame(mm39_exons)
>>
>> df <- cbind(df[ , "GeneID", drop=FALSE], df[ , setdiff(colnames(df),
>> c("GeneID", "width"))])
>>
>> stopifnot(identical(colnames(df), c("GeneID", "seqnames", "start",
>> "end", "strand")))
>>
>> colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
>>
>> write.table(df, file="mm39_RefSeq_exon.txt", quote=FALSE, sep="\t",
>> row.names=FALSE)
>>
>> The entire process of obtaining the exons and dumping them to the file
>> takes about 2 seconds on my laptop ;-)
>>
>> H.
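
On the idea of compressing the file further to stay under the 5MB limit: a minimal sketch, assuming base R only, of writing a table in the same five-column layout through a gzip connection and checking the resulting size. The toy rows reuse three exons from the printed output above; the output file name is made up, and whether Rsubread can read a gzipped annotation file directly is not something this thread establishes.

```r
# Toy table in the GeneID/Chr/Start/End/Strand layout used above.
df <- data.frame(
  GeneID = c("18777", "18777", "66776"),
  Chr    = c("chr1", "chr1", "chrUn_JH584304v1"),
  Start  = c(4878046L, 4878678L, 55112L),
  End    = c(4878205L, 4878709L, 55248L),
  Strand = c("+", "+", "-"),
  stringsAsFactors = FALSE
)

# Write the tab-delimited table through a gzip connection
# (the file name is hypothetical).
out <- file.path(tempdir(), "toy_RefSeq_exon.txt.gz")
con <- gzfile(out, open = "wt")
write.table(df, file = con, quote = FALSE, sep = "\t", row.names = FALSE)
close(con)

# Size of the compressed file in MB, to compare against the 5MB limit.
size_mb <- file.size(out) / 1024^2
print(size_mb < 5)

# Read the file back (base R detects gzip compression when reading) and
# confirm that the round trip preserved the table.
back <- read.delim(
  out,
  colClasses = c("character", "character", "integer", "integer", "character")
)
print(all.equal(back, df))
```

The same size check applied to the real compressed file would show whether compression alone is enough to get under the limit.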
>>
>>
>> On 08/04/2022 06:00, Kern, Lori wrote:
>> > Exceptions to file size are not permitted. We would prefer the data
>> be downloaded and distributed through the AnnotationHub as we are
>> moving away from traditional data packages.
>> >
>> > Please see HubPub and the vignette on how to create a hub package
>> https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html
>> >
>> > In this case, since it is a single file and creating an entirely
>> separate annotation package seems overkill and unnecessary overhead,
>> we would advise using the annotationhub directory in Rsubread.
>> > You may choose to host the data file yourself on a publicly
>> accessible and reliable server (institutional level, AWS bucket, data
>> lakes, Zenodo); private servers and hosting data on GitHub are not
>> allowed by Bioconductor standards. If you are not able to host the
>> data yourself, you may upload the data file to the Bioconductor Azure
>> Data Lake as described in the vignette link above.
>> > Minimally, Rsubread would need to add the metadata.csv file that
>> provides the necessary metadata information in inst/extdata. And add
>> the biocViews term AnnotationHubSoftware.
>> >
>> > Please let us know when these files and changes are available and
>> we can further assist adding the data officially to the AnnotationHub.
>> >
>> > Cheers,
>> >
>> >
>> >
>> > Lori Shepherd - Kern
>> >
>> > Bioconductor Core Team
>> >
>> > Roswell Park Comprehensive Cancer Center
>> >
>> > Department of Biostatistics & Bioinformatics
>> >
>> > Elm & Carlton Streets
>> >
>> > Buffalo, New York 14263
>> >
>> > ________________________________
>> > From: Bioc-devel <bioc-devel-bounces using r-project.org>
>> on behalf of Yang Liao <Yang.Liao using onjcri.org.au>
>> > Sent: Friday, April 8, 2022 3:16 AM
>> > To: bioc-devel using r-project.org
>> > Subject: [Bioc-devel] Adding a new annotation file (size ~ 9MB)
>> into the Bioconductor Rsubread package
>> >
>> > Hi,
>> >
>> > We are the maintainers of the Bioconductor Rsubread package. We are
>> trying to add a new gene annotation file (RefSeq GRCm39/mm39) into
>> the Rsubread package for the users who have switched to the mm39
>> reference genome for their analyses.
>> >
>> > We have built the annotation file, but we found that it was a
>> little too large (~ 9 MBytes), larger than the 5MB limit. Hence the
>> Git command refused to submit the file to the Rsubread (devel)
>> repository, with an error message : "Error: file larger than 5 Mb".
>> >
>> > Would it be possible to have an exemption so that we can add the
>> mm39 annotation file to the Rsubread package?
>> >
>> > All the best,
>> > Yang
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioc-devel using r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>> >
>> >
>>
>> --
>> Hervé Pagès
>>
>> Bioconductor Core Team
>> hpages.on.github using gmail.com
>>
> --
> Hervé Pagès
>
> Bioconductor Core Team
> hpages.on.github using gmail.com
--
Hervé Pagès
Bioconductor Core Team
hpages.on.github using gmail.com