[Bioc-devel] read_bed()

Michael Lawrence |@wrence@m|ch@e| @end|ng |rom gene@com
Wed Sep 18 13:33:05 CEST 2019


I'm not sure if a function called read_bed() should be plotting or
printing. Is your BED file a known BED variant, i.e., maybe there is a
better name for the file type than "bed"?


On Wed, Sep 18, 2019 at 3:17 AM Bhagwat, Aditya
<Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
>
> Actually,
>
> I will keep multicrispr::read_bed(), but wrap it around rtracklayer::import.bed, and additionally plot and print range summaries.
>
> Aditya
>
> ________________________________________
> From: Bioc-devel [bioc-devel-bounces using r-project.org] on behalf of Bhagwat, Aditya [Aditya.Bhagwat using mpi-bn.mpg.de]
> Sent: Wednesday, September 18, 2019 11:31 AM
> To: Michael Lawrence
> Cc: bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> (Typo corrected to avoid confusion)
>
> Michael,
>
> rtracklayer::import.bed() indeed works perfectly for me, so I am dropping multicrispr::read_bed().
>
> In order to avoid the overkill of `require(tracklayer)` for multicrispr <https://gitlab.gwdg.de/loosolab/software/multicrispr> users, does it make sense to import/re-export import.bed() in multicrispr? What is BioC convention/best practice in such cases?
>
> Aditya
>
>
>
> ________________________________________
> From: Bioc-devel [bioc-devel-bounces using r-project.org] on behalf of Bhagwat, Aditya [Aditya.Bhagwat using mpi-bn.mpg.de]
> Sent: Wednesday, September 18, 2019 8:35 AM
> To: Michael Lawrence
> Cc: bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> Thank you Michael :-)
>
> Aditya
> ________________________________________
> From: Michael Lawrence [lawrence.michael using gene.com]
> Sent: Tuesday, September 17, 2019 8:49 PM
> To: Bhagwat, Aditya
> Cc: Michael Lawrence; bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] read_bed()
>
> I think you probably made a mistake when dropping the columns. When I
> provide the extraCols= argument (inventing my own names for things),
> it almost works, but it breaks due to NAs in the extra columns. The
> "." character is the standard way to express NA in BED files. I've
> added support for extra na.strings to version 1.45.6.
>
> For reference, the call is like:
>
> import("SRF.bed", extraCols=c(chr2="character", start2="integer",
> end2="integer", mDux="factor", type="factor", pos1="integer",
> pos2="integer", strand2="factor", from="factor", n="integer",
> code="character", anno="factor", id="character", biotype="character",
> score2="numeric" ), na.strings="NA")
>
>
> On Tue, Sep 17, 2019 at 7:23 AM Bhagwat, Aditya
> <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
> >
> > Hi Michael,
> >
> > I removed the additional metadata columns in SRF.bed
> > https://gitlab.gwdg.de/loosolab/software/multicrispr/blob/master/inst/extdata/SRF.bed
> >
> > But still can't get rtracklayer::import.bed working:
> >
> > > rtracklayer::import.bed(bedfile)
> > Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : scan() expected 'a real', got '1.168.595'
> > > bedfile
> > [1] "C:/Users/abhagwa/Documents/R/R-3.6.1/library/multicrispr/extdata/SRF.bed"
> >
> > Never mind, multicrispr function read_bed, based on data.table::fread is doing the job, so I will stick to that .
> >
> > Thank you for all feedback,
> >
> > Cheers,
> >
> > Aditya
> >
> >
> > ________________________________________
> > From: Bioc-devel [bioc-devel-bounces using r-project.org] on behalf of Bhagwat, Aditya [Aditya.Bhagwat using mpi-bn.mpg.de]
> > Sent: Tuesday, September 17, 2019 2:48 PM
> > To: Michael Lawrence
> > Cc: bioc-devel using r-project.org
> > Subject: Re: [Bioc-devel] read_bed()
> >
> > Oh :-) - Thankyou for explaining!
> > ________________________________________
> > From: Michael Lawrence [lawrence.michael using gene.com]
> > Sent: Tuesday, September 17, 2019 2:40 PM
> > To: Bhagwat, Aditya
> > Cc: Michael Lawrence; Shepherd, Lori; bioc-devel using r-project.org
> > Subject: Re: [Bioc-devel] read_bed()
> >
> > Having a "." in the function name does not make something "S3".
> > There's no dispatch from import() to import.bed(). Had I not been a
> > total newb when I created rtracklayer, I would have called the
> > function importBed() or something like that. Sorry for the confusion.
> >
> > On Tue, Sep 17, 2019 at 5:34 AM Bhagwat, Aditya
> > <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
> > >
> > > Oh, superb, thx!
> > >
> > > Interesting ... here you use S3 rather than S4 - I wonder the design intention underlying these choices (I'm asking because I am trying to figure out myself when to use S3 and when to use S4 and whether to mix the two).
> > >
> > > Aditya
> > >
> > > ________________________________________
> > > From: Michael Lawrence [lawrence.michael using gene.com]
> > > Sent: Tuesday, September 17, 2019 2:23 PM
> > > To: Bhagwat, Aditya
> > > Cc: Michael Lawrence; Shepherd, Lori; bioc-devel using r-project.org
> > > Subject: Re: [Bioc-devel] read_bed()
> > >
> > > The generic documentation does not mention it, but see ?import.bed.
> > > It's similar to colClasses on read.table().
> > >
> > > On Tue, Sep 17, 2019 at 5:15 AM Bhagwat, Aditya
> > > <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
> > > >
> > > > Thankyou Michael,
> > > >
> > > > How do I use the extraCols argument? The documentation does not mention an `extraCols` argument explicitly, so it must be one of the ellipsis arguments, but `?rtracklayer::import` does not mention it. Should I say extraCols = 10 (ten extra columns) or so?
> > > >
> > > > Aditya
> > > >
> > > > ________________________________________
> > > > From: Michael Lawrence [lawrence.michael using gene.com]
> > > > Sent: Tuesday, September 17, 2019 2:05 PM
> > > > To: Bhagwat, Aditya
> > > > Cc: Michael Lawrence; Shepherd, Lori; bioc-devel using r-project.org
> > > > Subject: Re: [Bioc-devel] read_bed()
> > > >
> > > > It breaks it because it's not standard BED; however, using the
> > > > extraCols= argument should work in this case. Requiring an explicit
> > > > format specification is intentional, because it provides validation
> > > > and type safety, and it communicates the format to a future reader.
> > > > This also looks a bit like a bedPE file, so you might consider using
> > > > the Pairs data structure.
> > > >
> > > > Michael
> > > >
> > > > On Tue, Sep 17, 2019 at 4:51 AM Bhagwat, Aditya
> > > > <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
> > > > >
> > > > > Hi Michael,
> > > > >
> > > > > Yeah, I also noticed that the attachment was eaten when it entered the bio-devel list.
> > > > >
> > > > > The file is also accessible in the extdata of the multicrispr:
> > > > > https://gitlab.gwdg.de/loosolab/software/multicrispr/blob/master/inst/extdata/SRF.bed
> > > > >
> > > > > A bedfile to GRanges importer requires columns 1 (chrom), 2 (chromStart), 3 (chromEnd), and column 6 (strand). All of these are present in SRF.bed.
> > > > >
> > > > > I am curious as to why you feel that having additional columns in a bedfile would break it?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Aditya
> > > > >
> > > > > ________________________________________
> > > > > From: Michael Lawrence [lawrence.michael using gene.com]
> > > > > Sent: Tuesday, September 17, 2019 1:41 PM
> > > > > To: Bhagwat, Aditya
> > > > > Cc: Shepherd, Lori; bioc-devel using r-project.org
> > > > > Subject: Re: [Bioc-devel] read_bed()
> > > > >
> > > > > I don't see an attachment, nor can I find the multicrispr package
> > > > > anywhere. The "addressed soon" was referring to the BEDX+Y formats,
> > > > > which was addressed many years ago, so I've updated the documentation.
> > > > > Broken BED files will never be supported.
> > > > >
> > > > > Michael
> > > > >
> > > > > On Tue, Sep 17, 2019 at 4:17 AM Bhagwat, Aditya
> > > > > <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
> > > > > >
> > > > > > Hi Lori,
> > > > > >
> > > > > > I remember now - I tried this function earlier, but it does not work for my bedfiles, like the one in attach.
> > > > > >
> > > > > > > bedfile <- system.file('extdata/SRF.bed', package = 'multicrispr')
> > > > > > >
> > > > > > > targetranges <- rtracklayer::import(bedfile, format = 'BED', genome = 'mm10')
> > > > > > Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : scan() expected 'an integer', got 'chr2'
> > > > > > >
> > > > > >
> > > > > > Perhaps this sentence in `?rtracklayer::import` points to the source of the error?
> > > > > >
> > > > > > many tools and organizations have extended BED with additional columns. These are not officially valid BED files, and as such rtracklayer does not yet support them (this will be addressed soon).
> > > > > >
> > > > > > Which brings the question: how soon is soon :-D ?
> > > > > >
> > > > > > Aditya
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Shepherd, Lori [Lori.Shepherd using RoswellPark.org]
> > > > > > Sent: Tuesday, September 17, 2019 1:02 PM
> > > > > > To: Bhagwat, Aditya; bioc-devel using r-project.org
> > > > > > Subject: Re: read_bed()
> > > > > >
> > > > > > Please look at rtracklayer::import() function that we recommend for reading of BAM files along with other common formats.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > >
> > > > > > Lori Shepherd
> > > > > >
> > > > > > Bioconductor Core Team
> > > > > >
> > > > > > Roswell Park Cancer Institute
> > > > > >
> > > > > > Department of Biostatistics & Bioinformatics
> > > > > >
> > > > > > Elm & Carlton Streets
> > > > > >
> > > > > > Buffalo, New York 14263
> > > > > >
> > > > > > ________________________________
> > > > > > From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Bhagwat, Aditya <Aditya.Bhagwat using mpi-bn.mpg.de>
> > > > > > Sent: Tuesday, September 17, 2019 6:58 AM
> > > > > > To: bioc-devel using r-project.org <bioc-devel using r-project.org>
> > > > > > Subject: [Bioc-devel] read_bed()
> > > > > >
> > > > > > Dear bioc-devel,
> > > > > >
> > > > > > I had two feedback requests regarding the function read_bed().
> > > > > >
> > > > > > 1) Did I overlook, and therefore, re-invent existing functionality?
> > > > > > 2) If not, would `read_bed` be suited for existence in a more foundational package, e.g. `GenomicRanges`, given the rather basal nature of this functionality?
> > > > > >
> > > > > > It reads a bedfile into a GRanges, converts the coordinates from 0-based (bedfile) to 1-based (GRanges)<https://www.biostars.org/p/84686>, adds BSgenome info (to allow for implicit range validity checking<https://support.bioconductor.org/p/124250>) and plots the karyogram<https://support.bioconductor.org/p/124328>.
> > > > > >
> > > > > > Thank you for your feedback.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Aditya
> > > > > >
> > > > > >
> > > > > > #' Read bedfile into GRanges
> > > > > > #'
> > > > > > #' @param bedfile file path
> > > > > > #' @param bsgenome BSgenome, e.g. BSgenome.Mmusculus.UCSC.mm10::Mmusculus
> > > > > > #' @param zero_based logical(1): whether bedfile GRanges are 0-based
> > > > > > #' @param rm_duplicates logical(1)
> > > > > > #' @param plot logical(1)
> > > > > > #' @param verbose logical(1)
> > > > > > #' @return \code{\link[GenomicRanges]{GRanges-class}}
> > > > > > #' @note By convention BED files are 0-based. GRanges are always 1-based.
> > > > > > #' A good discussion on these two alternative codings is given
> > > > > > #' by Obi Griffith on Biostars: https://www.biostars.org/p/84686/
> > > > > > #' @examples
> > > > > > #' bedfile <- system.file('extdata/SRF.bed', package = 'multicrispr')
> > > > > > #' bsgenome <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus
> > > > > > #' (gr <- read_bed(bedfile, bsgenome))
> > > > > > #' @importFrom data.table :=
> > > > > > #' @export
> > > > > > read_bed <- function(
> > > > > > bedfile,
> > > > > > bsgenome,
> > > > > > zero_based = TRUE,
> > > > > > rm_duplicates = TRUE,
> > > > > > plot = TRUE,
> > > > > > verbose = TRUE
> > > > > > ){
> > > > > > # Assert
> > > > > > assert_all_are_existing_files(bedfile)
> > > > > > assert_is_a_bool(verbose)
> > > > > > assert_is_a_bool(rm_duplicates)
> > > > > > assert_is_a_bool(zero_based)
> > > > > >
> > > > > > # Comply
> > > > > > seqnames <- start <- end <- strand <- .N <- gap <- width <- NULL
> > > > > >
> > > > > > # Read
> > > > > > if (verbose) cmessage('\tRead %s', bedfile)
> > > > > > dt <- data.table::fread(bedfile, select = c(seq_len(3), 6),
> > > > > > col.names = c('seqnames', 'start', 'end', 'strand'))
> > > > > > data.table::setorderv(dt, c('seqnames', 'start', 'end', 'strand'))
> > > > > >
> > > > > > # Transform coordinates: 0-based -> 1-based
> > > > > > if (zero_based){
> > > > > > if (verbose) cmessage('\t\tConvert 0 -> 1-based')
> > > > > > dt[, start := start + 1]
> > > > > > }
> > > > > >
> > > > > > if (verbose) cmessage('\t\tRanges: %d ranges on %d chromosomes',
> > > > > > nrow(dt), length(unique(dt$seqnames)))
> > > > > >
> > > > > > # Drop duplicates
> > > > > > if (rm_duplicates){
> > > > > > is_duplicated <- cduplicated(dt)
> > > > > > if (any(is_duplicated)){
> > > > > > if (verbose) cmessage('\t\t %d after removing duplicates')
> > > > > > dt %<>% extract(!duplicated)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > # Turn into GRanges
> > > > > > gr <- add_seqinfo(as(dt, 'GRanges'), bsgenome)
> > > > > >
> > > > > > # Plot and return
> > > > > > title <- paste0(providerVersion(bsgenome), ': ', basename(bedfile))
> > > > > > if (plot) plot_karyogram(gr, title)
> > > > > > gr
> > > > > > }
> > > > > >
> > > > > >
> > > > > > [[alternative HTML version deleted]]
> > > > > >
> > > > > > _______________________________________________
> > > > > > Bioc-devel using r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > > > >
> > > > > > This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
> > > > > > _______________________________________________
> > > > > > Bioc-devel using r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Michael Lawrence
> > > > > Scientist, Bioinformatics and Computational Biology
> > > > > Genentech, A Member of the Roche Group
> > > > > Office +1 (650) 225-7760
> > > > > michafla using gene.com
> > > > >
> > > > > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> > > >
> > > >
> > > >
> > > > --
> > > > Michael Lawrence
> > > > Scientist, Bioinformatics and Computational Biology
> > > > Genentech, A Member of the Roche Group
> > > > Office +1 (650) 225-7760
> > > > michafla using gene.com
> > > >
> > > > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> > >
> > >
> > >
> > > --
> > > Michael Lawrence
> > > Scientist, Bioinformatics and Computational Biology
> > > Genentech, A Member of the Roche Group
> > > Office +1 (650) 225-7760
> > > michafla using gene.com
> > >
> > > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> >
> >
> >
> > --
> > Michael Lawrence
> > Scientist, Bioinformatics and Computational Biology
> > Genentech, A Member of the Roche Group
> > Office +1 (650) 225-7760
> > michafla using gene.com
> >
> > Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
> >
> > _______________________________________________
> > Bioc-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
> --
> Michael Lawrence
> Scientist, Bioinformatics and Computational Biology
> Genentech, A Member of the Roche Group
> Office +1 (650) 225-7760
> michafla using gene.com
>
> Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



-- 
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com

Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube



More information about the Bioc-devel mailing list