[R] unique/subset problem
lalitha viswanath
lalithaviswanath at yahoo.com
Fri Jan 26 21:43:13 CET 2007
Hi
I read in my dataset using
dt <- read.table("filename")
Calling unique(levels(dt$genome1)) yields the
following:
 [1] "aero" "aful" "aquae" "atum_D" "bbur" "bhal" "bmel" "bsub"
 [9] "buch" "cace" "ccre" "cglu" "cjej" "cper" "cpneuA" "cpneuC"
[17] "cpneuJ" "ctraM" "ecoliO157" "hbsp" "hinf" "hpyl" "linn" "llact"
[25] "lmon" "mgen" "mjan" "mlep" "mlot" "mpneu" "mpul" "mthe"
[33] "mtub" "mtub_cdc" "nost" "pabyssi" "paer" "paero" "pmul" "pyro"
[41] "rcon" "rpxx" "saur_mu50" "saur_n315" "sent" "smel" "spneu" "spyo"
[49] "ssol" "stok" "styp" "synecho" "tacid" "tmar" "tpal" "tvol"
[57] "uure" "vcho" "xfas" "ypes"
It shows 60 genomes, which is correct.
I extracted a subset as follows
possible_relatives_subset <- subset(dt, Y < -5)
I am pasting the results below
genome1 genome2 parameterX Y
21 sent ecoliO157 0.00590 -200.633493
22 sent paer 0.18603 -100.200570
27 styp ecoliO157 0.00484 -240.708645
28 styp paer 0.18497 -30.250127
41 paer sent 0.18603 -60.200570
44 paer styp 0.18497 -80.250127
49 paer hinf 0.18913 -90.056333
53 paer vcho 0.18703 -10.153929
55 paer pmul 0.18587 -100.208042
67 paer buch 0.21485 -80.898667
70 paer ypes 0.18460 -107.267454
82 paer xfas 0.26268 -61.920552
95 hinf ecoliO157 0.07654 -163.018417
96 hinf paer 0.18913 -10.056333
103 vcho ecoliO157 0.09518 -140.921153
104 vcho paer 0.18703 -10.153929
107 pmul ecoliO157 0.07328 -165.215225
108 pmul paer 0.18587 -10.208042
131 buch ecoliO157 0.15412 -11.746939
132 buch paer 0.21485 -8.898667
137 ypes ecoliO157 0.02705 -19.171851
138 ypes paer 0.18460 -10.267454
171 ecoliO157 sent 0.00590 -20.633493
174 ecoliO157 styp 0.00484 -20.708645
179 ecoliO157 hinf 0.07654 -6.018417
183 ecoliO157 vcho 0.09518 -14.921153
185 ecoliO157 pmul 0.07328 -6.215225
197 ecoliO157 buch 0.15412 -11.746939
200 ecoliO157 ypes 0.02705 -9.171851
211 ecoliO157 xfas 0.25833 -71.091552
217 xfas ecoliO157 0.25833 -75.091552
218 xfas paer 0.26268 -64.920552
Even a cursory look shows that the subset results
contain far fewer unique genomes (around 8 to 10).
However when I do
unique(levels(possible_relatives_subset$genome1)), I
get
 [1] "aero" "aful" "aquae" "atum_D" "bbur" "bhal" "bmel" "bsub"
 [9] "buch" "cace" "ccre" "cglu" "cjej" "cper" "cpneuA" "cpneuC"
[17] "cpneuJ" "ctraM" "ecoliO157" "hbsp" "hinf" "hpyl" "linn" "llact"
[25] "lmon" "mgen" "mjan" "mlep" "mlot" "mpneu" "mpul" "mthe"
[33] "mtub" "mtub_cdc" "nost" "pabyssi" "paer" "paero" "pmul" "pyro"
[41] "rcon" "rpxx" "saur_mu50" "saur_n315" "sent" "smel" "spneu" "spyo"
[49] "ssol" "stok" "styp" "synecho" "tacid" "tmar" "tpal" "tvol"
[57] "uure" "vcho" "xfas" "ypes"
Where am I going wrong?
I tried calling unique without the levels too, which
gives me the following response
[1] sent styp paer hinf vcho pmul buch ypes ecoliO157 xfas
60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
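
A possible explanation, sketched under assumptions: subset() keeps every level of a factor column, including levels that no longer occur in any row, so levels() on the subset still lists all 60 genomes while unique() on the column itself shows only the 10 values present. The toy data frame below is hypothetical (only the column names genome1 and Y come from the message above). It compares the two counts and shows one way to drop the unused levels.

```r
# Hypothetical toy data standing in for the real file; only the column
# names genome1 and Y are taken from the post.
dt <- data.frame(
  genome1 = factor(c("aero", "sent", "styp", "paer", "paer")),
  Y       = c(-1, -200.6, -240.7, -60.2, -80.3)
)

sub <- subset(dt, Y < -5)

# levels() reports the factor's full level set, which subsetting keeps,
# so the unused level "aero" is still listed here.
levels(sub$genome1)
nlevels(sub$genome1)

# Values actually present in the subset.
present <- unique(as.character(sub$genome1))
length(present)

# Re-creating the factor drops the levels that are no longer used.
sub$genome1 <- factor(sub$genome1)
levels(sub$genome1)
```

In later versions of R (2.12.0 onward), droplevels(sub) should do the same for every factor column of the data frame at once.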
--- Weiwei Shi <helprhelp at gmail.com> wrote:
> Then you need to provide more details about the calls you made
> and about your dataset. For example, you can show us the output of
> str(prunedrelatives, 1)
>
> How did you call unique on prunedrelatives, and so on? I made a
> test dataset and it gave me what you wanted (omitted here).
>
> On 1/26/07, lalitha viswanath
> <lalithaviswanath at yahoo.com> wrote:
> > Hi
> > The pruned dataset has 8 unique genomes in it while
> > the dataset before pruning has 65 unique genomes in it.
> > However calling unique on the pruned dataset seems to
> > return 65 no matter what.
> >
> > Any assistance in this matter would be appreciated.
> >
> > Thanks
> > Lalitha
> > --- Weiwei Shi <helprhelp at gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Even if you removed "many" genome1 values by setting
> > > score < -5, that does not necessarily change the set of
> > > unique values.
> > >
> > > To check this, you can do something like
> > > p0 <- unique(dataset[dataset$score < -5, "genome1"])
> > > # same as subset
> > > p1 <- unique(dataset[dataset$score >= -5, "genome1"])
> > >
> > > setdiff(p1, p0)
> > >
> > > If the output above is empty, it means that even though
> > > you removed many genome1 rows, doing so did not change
> > > the uniqueness.
> > >
> > > HTH,
> > >
> > > weiwei
> > >
> > >
> > >
> > > On 1/25/07, lalitha viswanath
> > > <lalithaviswanath at yahoo.com> wrote:
> > > > Hi
> > > > I am new to R programming and am using subset to
> > > > extract part of a data as follows
> > > >
> > > > names(dataset) = c("genome1","genome2","dist","score");
> > > > prunedrelatives <- subset(dataset, score < -5);
> > > >
> > > > However when I use unique to find the number of unique
> > > > genomes now present in prunedrelatives I get results
> > > > identical to calling unique(dataset$genome1) although
> > > > subset has eliminated many genomes and records.
> > > >
> > > > I would greatly appreciate your input about using
> > > > "unique" correctly in this regard.
> > > >
> > > > Thanks
> > > > Lalitha
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal,
> self-contained,
> > > reproducible code.
> > > >
> > >
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > > Research Scientist
> > > GeneGO, Inc.
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> >
>
>
> --
> Weiwei Shi, Ph.D
> Research Scientist
> GeneGO, Inc.
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
>