[BioC] conditional merge of duplicated rows in data.frame
Martin Morgan
mtmorgan at fhcrc.org
Mon Jan 20 21:03:27 CET 2014
On 01/20/2014 08:54 AM, Ninni Nahm [guest] wrote:
>
> Hi all!
>
> I have the following problem.
>
> chr13 1260 1275 chr13_38134720_38136919
> chr13 1261 1276 chr13_38134720_38136919
> chr15 839 854 chr15_63332831_63335030
> chr15 840 856 chr15_63332831_63335030
> chr15 837 852 chr15_63332831_63335030
> chr15 842 857 chr15_63332831_63335030
> Columns 2 and 3 hold positions that I want to combine whenever the value in column 4 is the same. For example, I would want:
>
> chr13 1260 1276 chr13_38134720_38136919
> chr15 837 857 chr15_63332831_63335030
> Any help is highly appreciated!!!
Hi -- Once you've read in the data
> df = read.table(stdin())
0: chr13 1260 1275 chr13_38134720_38136919
1: chr13 1261 1276 chr13_38134720_38136919
2: chr15 839 854 chr15_63332831_63335030
3: chr15 840 856 chr15_63332831_63335030
4: chr15 837 852 chr15_63332831_63335030
6: chr15 842 857 chr15_63332831_63335030
7:
you could use the GenomicRanges package to make a 'GRanges' object with the
chromosome coordinates
> library(GenomicRanges)
> gr = with(df, GRanges(V1, IRanges(V2, V3)))
then split gr by the fourth column, reduce() the overlapping or adjacent ranges
within each group, and (if each group reduces to a single range) unlist to a
GRanges. Optionally, you might wish to coerce back to a data.frame (though it
will often make sense to continue your analysis with GRanges)
> as.data.frame(unlist(reduce(split(gr, df$V4))))
                        seqnames start  end width strand
chr13_38134720_38136919    chr13  1260 1276    17      *
chr15_63332831_63335030    chr15   837  857    21      *
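For comparison, here is a minimal base-R sketch of the same merge that does not need GenomicRanges. It is a different technique from the reduce() call above, not a drop-in replacement: it takes the smallest start and largest end within each column-4 group, so it assumes each group lies on one chromosome and should collapse to a single span even if its ranges do not overlap. The names starts, ends, first and merged are illustrative only.

```r
# Same six rows as in the question, read non-interactively.
df <- read.table(text = "
chr13 1260 1275 chr13_38134720_38136919
chr13 1261 1276 chr13_38134720_38136919
chr15 839 854 chr15_63332831_63335030
chr15 840 856 chr15_63332831_63335030
chr15 837 852 chr15_63332831_63335030
chr15 842 857 chr15_63332831_63335030
", stringsAsFactors = FALSE)

# Smallest start (V2) and largest end (V3) within each V4 group.
starts <- tapply(df$V2, df$V4, min)
ends   <- tapply(df$V3, df$V4, max)

# Row index of the first occurrence of each group, used to look up its
# chromosome (assumption: every row in a group shares one chromosome).
first <- match(names(starts), df$V4)

merged <- data.frame(seqnames = df$V1[first],
                     start    = as.vector(starts),
                     end      = as.vector(ends),
                     id       = names(starts),
                     stringsAsFactors = FALSE)
print(merged)
```

Unlike reduce(), this would also bridge gaps between non-overlapping ranges in a group, so the GRanges approach is the safer choice when a group may contain several separate intervals.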
Hope that helps,
Martin
>
> -- output of sessionInfo():
>
> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793