[R] unique/subset problem

Fri Jan 26 21:53:40 CET 2007

Oh, I forgot: you can also convert a factor into a character vector, like
dataset$genome1 <- as.character(dataset$genome1)

So you don't have to use
as.numeric(dataset$score) if you use "as.is=TRUE" when you call read.table.
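
A minimal sketch of why levels() keeps reporting every genome after subsetting, and how as.character() avoids it. The rows are made-up toy data (only the genome names are taken from this thread), and stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later no longer create factors by default:

```r
# Toy data: hypothetical rows, not the poster's real file.
# stringsAsFactors = TRUE mimics the old default that turns strings into factors.
dataset <- data.frame(
  genome1 = c("sent", "paer", "aero", "bsub", "paer"),
  score   = c(-200.6, -60.2, -1.3, -0.4, -90.1),
  stringsAsFactors = TRUE
)

pruned <- subset(dataset, score < -5)

# levels() reports every level the factor was created with,
# including levels that no longer occur after subsetting.
all_levels <- levels(pruned$genome1)

# Converting to character first gives only the values actually present.
present <- unique(as.character(pruned$genome1))

# Re-creating the factor drops the unused levels.
relevelled <- levels(factor(pruned$genome1))

print(all_levels)
print(present)
print(relevelled)
```

In this sketch all_levels still lists all four genomes, while present and relevelled contain only the two that survive the subset.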

HTH,

weiwei
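
P.S. A small sketch of the as.is option, using inlined text instead of a file. The input below is hypothetical, and header = TRUE is an assumption about the file layout (in the quoted str() output further down, the header row was read as data):

```r
# Hypothetical tab-separated input, inlined via text= instead of a file path.
raw <- "genome1\tgenome2\tscore\nsent\tpaer\t-100.2\npaer\thinf\t-90.1\n"

# Strings converted to factors (the default behaviour before R 4.0.0).
as_factor <- read.table(text = raw, sep = "\t", header = TRUE,
                        stringsAsFactors = TRUE)

# as.is = TRUE keeps strings as character vectors.
as_char <- read.table(text = raw, sep = "\t", header = TRUE, as.is = TRUE)

str(as_factor)
str(as_char)
```

The score column is read as numeric in both cases; as.is only changes how the string columns are stored.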

On 1/26/07, Weiwei Shi <helprhelp at gmail.com> wrote:
> Check
> ?read.table
>
> and add "as.is=TRUE" to the options. Strings are then read as character
> instead of being converted to factors.
>
> Then repeat your work.
>
> For example
> > x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10)
> > str(x0,1)
> `data.frame':   10 obs. of  7 variables:
>  $ V1: Factor w/ 10 levels "-4086733916",..: 10 9 8 7 6 5 4 3 2 1
>  $ V2: Factor w/ 10 levels "-1963744741",..: 10 8 7 4 5 6 3 9 1 2
>  $ V3: Factor w/ 7 levels "-1687428658",..: 7 4 4 2 5 1 6 6 3 4
>  $ V4: Factor w/ 2 levels "5","MECHANISM": 2 1 1 1 1 1 1 1 1 1
>  $ V5: Factor w/ 2 levels "0","TYPE": 2 1 1 1 1 1 1 1 1 1
>  $ V6: Factor w/ 2 levels "USER_","alexey": 1 2 2 2 2 2 2 2 2 2
>  $ V7: Factor w/ 2 levels "3","TRUST": 2 1 1 1 1 1 1 1 1 1
> > x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10, as.is=T)
> > str(x0,1)
> `data.frame':   10 obs. of  7 variables:
>  $ V1: chr  "LINK_ID" "-4293537751" "-4247422653" "-4223137153" ...
>  $ V2: chr  "ID1" "65259" "1020286" "-518245428" ...
>  $ V3: chr  "ID2" "6436" "6436" "-2099509019" ...
>  $ V4: chr  "MECHANISM" "5" "5" "5" ...
>  $ V5: chr  "TYPE" "0" "0" "0" ...
>  $ V6: chr  "USER_" "alexey" "alexey" "alexey" ...
>  $ V7: chr  "TRUST" "3" "3" "3" ...
>
> HTH,
>
> weiwei
>
> On 1/26/07, lalitha viswanath <lalithaviswanath at yahoo.com> wrote:
> > Hi
> > I read in my dataset using
> > dt <- read.table("filename")
> > calling unique(levels(dt$genome1))  yields the
> > following
> >
> >  [1] "aero"      "aful"      "aquae"     "atum_D"
> >  [5] "bbur"      "bhal"      "bmel"      "bsub"
> >  [9] "buch"      "cace"      "ccre"      "cglu"
> > [13] "cjej"      "cper"      "cpneuA"    "cpneuC"
> > [17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"
> > [21] "hinf"      "hpyl"      "linn"      "llact"
> > [25] "lmon"      "mgen"      "mjan"      "mlep"
> > [29] "mlot"      "mpneu"     "mpul"      "mthe"
> > [33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"
> > [37] "paer"      "paero"     "pmul"      "pyro"
> > [41] "rcon"      "rpxx"      "saur_mu50" "saur_n315"
> > [45] "sent"      "smel"      "spneu"     "spyo"
> > [49] "ssol"      "stok"      "styp"      "synecho"
> > [53] "tacid"     "tmar"      "tpal"      "tvol"
> > [57] "uure"      "vcho"      "xfas"      "ypes"
> >
> > It shows 60 genomes, which is correct.
> >
> > I extracted a subset as follows
> > possible_relatives_subset <- subset(dt, Y < -5)
> > I am pasting the results below
> >      genome1   genome2 parameterX          Y
> > 21       sent ecoliO157  0.00590 -200.633493
> > 22       sent      paer  0.18603 -100.200570
> > 27       styp ecoliO157  0.00484 -240.708645
> > 28       styp      paer  0.18497 -30.250127
> > 41       paer      sent  0.18603 -60.200570
> > 44       paer      styp  0.18497 -80.250127
> > 49       paer      hinf  0.18913 -90.056333
> > 53       paer      vcho  0.18703 -10.153929
> > 55       paer      pmul  0.18587 -100.208042
> > 67       paer      buch  0.21485  -80.898667
> > 70       paer      ypes  0.18460 -107.267454
> > 82       paer      xfas  0.26268  -61.920552
> > 95       hinf ecoliO157  0.07654 -163.018417
> > 96       hinf      paer  0.18913 -10.056333
> > 103      vcho ecoliO157  0.09518 -140.921153
> > 104      vcho      paer  0.18703 -10.153929
> > 107      pmul ecoliO157  0.07328 -165.215225
> > 108      pmul      paer  0.18587 -10.208042
> > 131      buch ecoliO157  0.15412 -11.746939
> > 132      buch      paer  0.21485  -8.898667
> > 137      ypes ecoliO157  0.02705 -19.171851
> > 138      ypes      paer  0.18460 -10.267454
> > 171 ecoliO157      sent  0.00590 -20.633493
> > 174 ecoliO157      styp  0.00484 -20.708645
> > 179 ecoliO157      hinf  0.07654 -6.018417
> > 183 ecoliO157      vcho  0.09518 -14.921153
> > 185 ecoliO157      pmul  0.07328 -6.215225
> > 197 ecoliO157      buch  0.15412 -11.746939
> > 200 ecoliO157      ypes  0.02705 -9.171851
> > 211 ecoliO157      xfas  0.25833  -71.091552
> > 217      xfas ecoliO157  0.25833  -75.091552
> > 218      xfas      paer  0.26268  -64.920552
> >
> > I think  even a cursory look will tell us that there
> > are not as many unique genomes in the subset results.
> > (around 8/10).
> > However when I do
> > unique(levels(possible_relatives_subset$genome1)), I
> > get
> >
> >  [1] "aero"      "aful"      "aquae"     "atum_D"
> >  [5] "bbur"      "bhal"      "bmel"      "bsub"
> >  [9] "buch"      "cace"      "ccre"      "cglu"
> > [13] "cjej"      "cper"      "cpneuA"    "cpneuC"
> > [17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"
> > [21] "hinf"      "hpyl"      "linn"      "llact"
> > [25] "lmon"      "mgen"      "mjan"      "mlep"
> > [29] "mlot"      "mpneu"     "mpul"      "mthe"
> > [33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"
> > [37] "paer"      "paero"     "pmul"      "pyro"
> > [41] "rcon"      "rpxx"      "saur_mu50" "saur_n315"
> > [45] "sent"      "smel"      "spneu"     "spyo"
> > [49] "ssol"      "stok"      "styp"      "synecho"
> > [53] "tacid"     "tmar"      "tpal"      "tvol"
> > [57] "uure"      "vcho"      "xfas"      "ypes"
> >
> > Where am I going wrong?
> > I tried calling unique without the levels too, which
> > gives me the following response
> >
> >  [1] sent      styp      paer      hinf      vcho
> >  [6] pmul      buch      ypes      ecoliO157 xfas
> > 60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
> >
> > --- Weiwei Shi <helprhelp at gmail.com> wrote:
> >
> > > Then you need to provide more details about the
> > > calls you made and your dataset.
> > > For example, you can tell us by
> > > str(prunedrelatives, 1)
> > >
> > > how did you call unique on prunedrelative and so on?
> > > I made a test
> > > data it gave me what you wanted (omitted here).
> > >
> > > On 1/26/07, lalitha viswanath
> > > <lalithaviswanath at yahoo.com> wrote:
> > > > Hi
> > > > > The pruned dataset has 8 unique genomes in it, while
> > > > > the dataset before pruning has 65 unique genomes in it.
> > > > > However, calling unique on the pruned dataset seems to
> > > > > return 65 no matter what.
> > > > >
> > > > > Any assistance in this matter would be appreciated.
> > > >
> > > > Thanks
> > > > Lalitha
> > > > --- Weiwei Shi <helprhelp at gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > Even if you removed "many" genome1 rows by setting score < -5,
> > > > > > that does not necessarily change the set of unique values.
> > > > > >
> > > > > > To check this, you can do something like
> > > > > > p0 <- unique(dataset[dataset$score < -5, "genome1"])   # same as subset
> > > > > > p1 <- unique(dataset[dataset$score >= -5, "genome1"])
> > > > > >
> > > > > > setdiff(p1, p0)
> > > > > >
> > > > > > If the output above is empty (length 0), then every genome1 value
> > > > > > in the removed rows also occurs in the rows you kept, so removing
> > > > > > those rows does not change the unique values.
> > > > >
> > > > > HTH,
> > > > >
> > > > > weiwei
> > > > >
> > > > >
> > > > >
> > > > > On 1/25/07, lalitha viswanath
> > > > > <lalithaviswanath at yahoo.com> wrote:
> > > > > > Hi
> > > > > > > I am new to R programming and am using subset to
> > > > > > > extract part of a data set as follows
> > > > > > >
> > > > > > > names(dataset) = c("genome1","genome2","dist","score");
> > > > > > > prunedrelatives <- subset(dataset, score < -5);
> > > > > > >
> > > > > > > However when I use unique to find the number of unique
> > > > > > > genomes now present in prunedrelatives I get results
> > > > > > > identical to calling unique(dataset$genome1) although
> > > > > > > subset has eliminated many genomes and records.
> > > > > > >
> > > > > > > I would greatly appreciate your input about using
> > > > > > > "unique" correctly in this regard.
> > > > > >
> > > > > > Thanks
> > > > > > Lalitha
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > ____________________________________________________________________________________
> > > > > > TV dinner still cooling?
> > > > > > Check out "Tonight's Picks" on Yahoo! TV.
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Weiwei Shi, Ph.D
> > > > > Research Scientist
> > > > > GeneGO, Inc.
> > > > >
> > > > > "Did you always know?"
> > > > > "No, I did not. But I believed..."
> > > > > ---Matrix III
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > ____________________________________________________________________________________
> > > > Bored stiff? Loosen up...
> > > > Download and play hundreds of games for free on
> > > Yahoo! Games.
> > > > http://games.yahoo.com/games/front
> > > >
> > >
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > > Research Scientist
> > > GeneGO, Inc.
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
>


-- 
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III