[R] unique/subset problem
Weiwei Shi
helprhelp at gmail.com
Fri Jan 26 21:51:27 CET 2007
Check
?read.table
and add as.is=TRUE to the call, so that strings are read as character
vectors instead of factors. A factor keeps all of its levels after
subsetting, which is why levels() still lists every genome.
Then repeat your work.
For example
> x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10)
> str(x0,1)
`data.frame': 10 obs. of 7 variables:
$ V1: Factor w/ 10 levels "-4086733916",..: 10 9 8 7 6 5 4 3 2 1
$ V2: Factor w/ 10 levels "-1963744741",..: 10 8 7 4 5 6 3 9 1 2
$ V3: Factor w/ 7 levels "-1687428658",..: 7 4 4 2 5 1 6 6 3 4
$ V4: Factor w/ 2 levels "5","MECHANISM": 2 1 1 1 1 1 1 1 1 1
$ V5: Factor w/ 2 levels "0","TYPE": 2 1 1 1 1 1 1 1 1 1
$ V6: Factor w/ 2 levels "USER_","alexey": 1 2 2 2 2 2 2 2 2 2
$ V7: Factor w/ 2 levels "3","TRUST": 2 1 1 1 1 1 1 1 1 1
> x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10, as.is=T)
> str(x0,1)
`data.frame': 10 obs. of 7 variables:
$ V1: chr "LINK_ID" "-4293537751" "-4247422653" "-4223137153" ...
$ V2: chr "ID1" "65259" "1020286" "-518245428" ...
$ V3: chr "ID2" "6436" "6436" "-2099509019" ...
$ V4: chr "MECHANISM" "5" "5" "5" ...
$ V5: chr "TYPE" "0" "0" "0" ...
$ V6: chr "USER_" "alexey" "alexey" "alexey" ...
$ V7: chr "TRUST" "3" "3" "3" ...
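A minimal sketch of the effect on a subset, assuming a toy data frame (the names dt_toy and sub_toy and all values below are invented for illustration and are not taken from the poster's file):

```r
# Toy data standing in for the poster's table (values are made up).
dt_toy <- data.frame(
  genome1 = factor(c("aero", "paer", "sent", "paer", "xfas")),
  Y       = c(-1, -100, -200, -60, -2)
)

sub_toy <- subset(dt_toy, Y < -5)

# A factor keeps its full set of levels after subsetting, so levels()
# still lists genomes that no longer occur in any row of the subset.
all_levels <- levels(sub_toy$genome1)
print(all_levels)   # "aero" "paer" "sent" "xfas"

# Values actually present in the subset:
present <- unique(as.character(sub_toy$genome1))
print(present)      # "paer" "sent"

# Re-creating the factor drops the unused levels.
sub_toy$genome1 <- factor(sub_toy$genome1)
print(levels(sub_toy$genome1))   # "paer" "sent"
```

If the column had been read with as.is=TRUE, it would be a character vector and unique() on it would give the second result directly.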
HTH,
weiwei
On 1/26/07, lalitha viswanath <lalithaviswanath at yahoo.com> wrote:
> Hi
> I read in my dataset using
> dt <- read.table("filename")
> calling unique(levels(dt$genome1)) yields the
> following
>
> [1] "aero" "aful" "aquae" "atum_D"
> "bbur" "bhal" "bmel" "bsub"
> [9] "buch" "cace" "ccre" "cglu"
> "cjej" "cper" "cpneuA" "cpneuC"
> [17] "cpneuJ" "ctraM" "ecoliO157" "hbsp"
> "hinf" "hpyl" "linn" "llact"
> [25] "lmon" "mgen" "mjan" "mlep"
> "mlot" "mpneu" "mpul" "mthe"
> [33] "mtub" "mtub_cdc" "nost" "pabyssi"
> "paer" "paero" "pmul" "pyro"
> [41] "rcon" "rpxx" "saur_mu50" "saur_n315"
> "sent" "smel" "spneu" "spyo"
> [49] "ssol" "stok" "styp" "synecho"
> "tacid" "tmar" "tpal" "tvol"
> [57] "uure" "vcho" "xfas" "ypes"
>
> It shows 60 genomes, which is correct.
>
> I extracted a subset as follows
> possible_relatives_subset <- subset(dt, Y < -5)
> I am pasting the results below
> genome1 genome2 parameterX Y
> 21 sent ecoliO157 0.00590 -200.633493
> 22 sent paer 0.18603 -100.200570
> 27 styp ecoliO157 0.00484 -240.708645
> 28 styp paer 0.18497 -30.250127
> 41 paer sent 0.18603 -60.200570
> 44 paer styp 0.18497 -80.250127
> 49 paer hinf 0.18913 -90.056333
> 53 paer vcho 0.18703 -10.153929
> 55 paer pmul 0.18587 -100.208042
> 67 paer buch 0.21485 -80.898667
> 70 paer ypes 0.18460 -107.267454
> 82 paer xfas 0.26268 -61.920552
> 95 hinf ecoliO157 0.07654 -163.018417
> 96 hinf paer 0.18913 -10.056333
> 103 vcho ecoliO157 0.09518 -140.921153
> 104 vcho paer 0.18703 -10.153929
> 107 pmul ecoliO157 0.07328 -165.215225
> 108 pmul paer 0.18587 -10.208042
> 131 buch ecoliO157 0.15412 -11.746939
> 132 buch paer 0.21485 -8.898667
> 137 ypes ecoliO157 0.02705 -19.171851
> 138 ypes paer 0.18460 -10.267454
> 171 ecoliO157 sent 0.00590 -20.633493
> 174 ecoliO157 styp 0.00484 -20.708645
> 179 ecoliO157 hinf 0.07654 -6.018417
> 183 ecoliO157 vcho 0.09518 -14.921153
> 185 ecoliO157 pmul 0.07328 -6.215225
> 197 ecoliO157 buch 0.15412 -11.746939
> 200 ecoliO157 ypes 0.02705 -9.171851
> 211 ecoliO157 xfas 0.25833 -71.091552
> 217 xfas ecoliO157 0.25833 -75.091552
> 218 xfas paer 0.26268 -64.920552
>
> I think even a cursory look will tell us that there
> are not as many unique genomes in the subset results.
> (around 8/10).
> However when I do
> unique(levels(possible_relatives_subset$genome1)), I
> get
>
> [1] "aero" "aful" "aquae" "atum_D"
> "bbur" "bhal" "bmel" "bsub"
> [9] "buch" "cace" "ccre" "cglu"
> "cjej" "cper" "cpneuA" "cpneuC"
> [17] "cpneuJ" "ctraM" "ecoliO157" "hbsp"
> "hinf" "hpyl" "linn" "llact"
> [25] "lmon" "mgen" "mjan" "mlep"
> "mlot" "mpneu" "mpul" "mthe"
> [33] "mtub" "mtub_cdc" "nost" "pabyssi"
> "paer" "paero" "pmul" "pyro"
> [41] "rcon" "rpxx" "saur_mu50" "saur_n315"
> "sent" "smel" "spneu" "spyo"
> [49] "ssol" "stok" "styp" "synecho"
> "tacid" "tmar" "tpal" "tvol"
> [57] "uure" "vcho" "xfas" "ypes"
>
> Where am I going wrong?
> I tried calling unique without the levels too, which
> gives me the following response
>
> [1] sent styp paer hinf vcho
> pmul buch ypes ecoliO157 xfas
> 60 Levels: aero aful aquae atum_D bbur bhal bmel bsub
> buch cace ccre cglu cjej cper cpneuA ... ypes
>
> --- Weiwei Shi <helprhelp at gmail.com> wrote:
>
> > Then you need to provide more details about the calls you made and
> > your dataset. For example, you can show us the output of
> > str(prunedrelatives, 1)
> >
> > How did you call unique on prunedrelatives, and so on?
> > I made a test dataset and it gave me what you wanted (omitted here).
> >
> > On 1/26/07, lalitha viswanath
> > <lalithaviswanath at yahoo.com> wrote:
> > > Hi
> > > The pruned dataset has 8 unique genomes in it, while the dataset
> > > before pruning has 65 unique genomes in it.
> > > However, calling unique on the pruned dataset seems to return 65
> > > no matter what.
> > >
> > > Any assistance in this matter would be appreciated.
> > >
> > > Thanks
> > > Lalitha
> > > --- Weiwei Shi <helprhelp at gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Even if you removed many genome1 values by setting score < -5,
> > > > that does not necessarily change the set of unique values.
> > > >
> > > > To check this, you can do something like
> > > > p0 <- unique(dataset[dataset$score < -5, "genome1"])
> > > > # same as subset
> > > > p1 <- unique(dataset[dataset$score >= -5, "genome1"])
> > > >
> > > > setdiff(p1, p0)
> > > >
> > > > If the output above is empty, then every genome1 value in the
> > > > removed rows also occurs in the rows you kept, so removing those
> > > > rows did not change the set of unique genomes.
> > > >
> > > > HTH,
> > > >
> > > > weiwei
> > > >
> > > >
> > > >
> > > > On 1/25/07, lalitha viswanath
> > > > <lalithaviswanath at yahoo.com> wrote:
> > > > > Hi
> > > > > I am new to R programming and am using subset to extract
> > > > > part of a data set as follows:
> > > > >
> > > > > names(dataset) = c("genome1","genome2","dist","score");
> > > > > prunedrelatives <- subset(dataset, score < -5);
> > > > >
> > > > > However, when I use unique to find the number of unique
> > > > > genomes now present in prunedrelatives, I get results
> > > > > identical to calling unique(dataset$genome1), although
> > > > > subset has eliminated many genomes and records.
> > > > >
> > > > > I would greatly appreciate your input about using
> > > > > "unique" correctly in this regard.
> > > > >
> > > > > Thanks
> > > > > Lalitha
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> > > >
> > > >
> > > > --
> > > > Weiwei Shi, Ph.D
> > > > Research Scientist
> > > > GeneGO, Inc.
> > > >
> > > > "Did you always know?"
> > > > "No, I did not. But I believed..."
> > > > ---Matrix III
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> > >
> >
> >
> > --
> > Weiwei Shi, Ph.D
> > Research Scientist
> > GeneGO, Inc.
> >
> > "Did you always know?"
> > "No, I did not. But I believed..."
> > ---Matrix III
> >
>
>
>
>
>
--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III