[BioC] conditional merge of duplicated rows in data.frame

Martin Morgan mtmorgan at fhcrc.org
Mon Jan 20 21:03:27 CET 2014


On 01/20/2014 08:54 AM, Ninni Nahm [guest] wrote:
>
> Hi all!
>
> I have the following problem.
>
>    chr13  1260 1275   chr13_38134720_38136919
>    chr13  1261 1276   chr13_38134720_38136919
>    chr15   839  854   chr15_63332831_63335030
>    chr15   840  856   chr15_63332831_63335030
>    chr15   837  852   chr15_63332831_63335030
>    chr15   842  857   chr15_63332831_63335030
>
> Columns 2 and 3 contain positions that I want to combine whenever the value in column 4 is the same. For example, I would want:
>
>    chr13  1260 1276   chr13_38134720_38136919
>    chr15   837  857   chr15_63332831_63335030
> Any help is highly appreciated!!!

Hi -- Once you've read in the data

 > df = read.table(stdin())
0:   chr13  1260 1275   chr13_38134720_38136919
1:   chr13  1261 1276   chr13_38134720_38136919
2:   chr15   839  854   chr15_63332831_63335030
3:   chr15   840  856   chr15_63332831_63335030
4:   chr15   837  852   chr15_63332831_63335030
6:   chr15   842  857   chr15_63332831_63335030
7:


you could use the GenomicRanges package to make a 'GRanges' object with the 
chromosome coordinates

 > library(GenomicRanges)
 > gr = with(df, GRanges(V1, IRanges(V2, V3)))
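A possible variation, sketched under the assumption that the data frame carries the default read.table column names V1..V4: extra named arguments to GRanges() are stored as metadata columns, so the fourth column can travel with the ranges instead of being looked up in df later. The column name 'region' is made up for this illustration, and the three-row data frame is only a stand-in for the data read above.

```r
library(GenomicRanges)

# Small stand-in for the data frame read above (default read.table column names).
df <- data.frame(
    V1 = c("chr13", "chr13", "chr15"),
    V2 = c(1260L, 1261L, 839L),
    V3 = c(1275L, 1276L, 854L),
    V4 = c("chr13_38134720_38136919", "chr13_38134720_38136919",
           "chr15_63332831_63335030"),
    stringsAsFactors = FALSE
)

# Extra named arguments to GRanges() become metadata columns;
# 'region' is a hypothetical name chosen for this sketch.
gr <- with(df, GRanges(V1, IRanges(V2, V3), region = V4))

# The grouping variable can then be taken from the object itself.
grl <- split(gr, gr$region)
```

Splitting on gr$region gives the same grouping as split(gr, df$V4), but it no longer depends on df staying in the same row order as gr.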

then split gr by the fourth column, reduce() the overlapping or adjacent ranges 
within each group, and (if there is one range per group) unlist to a GRanges. 
Optionally, you might wish to coerce back to a data.frame (though it will often 
make sense to continue your analysis with GRanges)

 > as.data.frame(unlist(reduce(split(gr, df$V4))))
                         seqnames start  end width strand
chr13_38134720_38136919    chr13  1260 1276    17      *
chr15_63332831_63335030    chr15   837  857    21      *
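The one-liner above can also be written step by step, which makes it easy to check the "one range per group" condition before unlisting. This is only a sketch: it reads the same six rows from an inline string rather than stdin(), and it uses elementNROWS() from current Bioconductor releases (older releases, including those from the time of this thread, called it elementLengths()). The variable names are illustrative.

```r
library(GenomicRanges)

# Same six rows as in the question, read from an inline string instead of stdin().
txt <- "chr13  1260 1275   chr13_38134720_38136919
chr13  1261 1276   chr13_38134720_38136919
chr15   839  854   chr15_63332831_63335030
chr15   840  856   chr15_63332831_63335030
chr15   837  852   chr15_63332831_63335030
chr15   842  857   chr15_63332831_63335030"
df <- read.table(text = txt, stringsAsFactors = FALSE)

gr <- with(df, GRanges(V1, IRanges(V2, V3)))

# GRangesList with one element per distinct V4 value; reduce() merges
# overlapping or adjacent ranges within each element.
merged <- reduce(split(gr, df$V4))

# Number of merged ranges per group. A value above 1 would mean the group
# contains ranges separated by a gap, so it does not collapse to one row.
n_per_group <- elementNROWS(merged)
stopifnot(all(n_per_group == 1L))

result <- as.data.frame(unlist(merged))
```

If the stopifnot() check fails for other data, the groups with gaps can be listed with names(n_per_group)[n_per_group > 1] and handled separately.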

Hope that helps,

Martin

>
>   -- output of sessionInfo():
>
> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
