[R] Cleaning

Boris Steipe boris.steipe at utoronto.ca
Thu Nov 12 05:33:44 CET 2015


If what you posted here is what you typed, your syntax is wrong.
I strongly advise you to consult the two links here:

http://adv-r.had.co.nz/Reproducibility.html
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
... and please read the posting guide and don't post in HTML.
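
For what it is worth, here is a minimal sketch of what seems to go wrong in the code quoted below. It assumes that "BAA" and "FAG" are the values you want to keep (the opposite reading is possible), that the column is called Var1 as in your earlier messages (R is case-sensitive, so var1 is a different name), and it uses a small made-up data frame because your data were not posted. Apart from the "<" that should be "<-", the condition x != "BAA" | x != "FAG" is TRUE for every non-missing value, since no value can equal both strings at once. which() then returns every row, and dat[-test, ] drops them all.

```r
# Toy data, made up for illustration; the real data were not posted.
dat <- data.frame(Var1 = c("BAA", "FAG", ":", "]", "BAA"),
                  Freq = c(10L, 20L, 3L, 6L, 5L),
                  stringsAsFactors = FALSE)

# Assumption: these are the valid values to keep.
keep <- c("BAA", "FAG")

# The condition from the quoted code flags every row, so nothing survives.
always <- dat$Var1 != "BAA" | dat$Var1 != "FAG"
all(always)                    # is every row flagged?
nrow(dat[-which(always), ])    # how many rows are left?

# Select the rows to keep with a logical index instead.
clean <- dat[dat$Var1 %in% keep, ]

# Or, to drop a list of known bad values, negate the test.
bad <- c(":", "]")
clean2 <- dat[!(dat$Var1 %in% bad), ]
```

A logical index is also safer than dat[-which(...), ]: when which() finds nothing it returns integer(0), and dat[-integer(0), ] selects zero rows rather than all of them.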


B.


On Nov 11, 2015, at 10:03 PM, Ashta <sewashm at gmail.com> wrote:

> Sarah,
> 
> Thank you very much. For the other variables, I was trying to do the same
> job in a different way because it is easier to list them.
> 
> Example
> 
> test < which(dat$var1  !="BAA" | dat$var1 !="FAG" )
> {
>    dat <- dat[-test,]}
> 
> and I did not get the right result. What am I missing here?
> 
> 
> 
> 
> 
> On Wed, Nov 11, 2015 at 7:54 PM, Sarah Goslee <sarah.goslee at gmail.com>
> wrote:
> 
>> On Wed, Nov 11, 2015 at 8:44 PM, Ashta <sewashm at gmail.com> wrote:
>>> Hi Sarah,
>>> 
>>> I used the following to clean my data, but the program crashed several times.
>>> 
>>> test <- dat[dat$Var1 == "YYZ" | dat$Var1 =="MSN" ,]
>>> 
>>> What is the difference between these two
>>> 
>>> test <- dat[dat$Var1  %in% "YYZ" | dat$Var1 %in% "MSN" ,]
>> 
>> Besides that you're using %in% wrong? I told you how to proceed.
>> 
>> myvalues <- c("YYZ", "MSN")
>> 
>> test <- subset(dat, Var1 %in% myvalues)
>> 
>> 
>> > subset(dat, Var1 %in% myvalues)
>>   X Var1 Freq
>> 3 3  MSN 1040
>> 4 4  YYZ  300
>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Nov 11, 2015 at 6:38 PM, Sarah Goslee <sarah.goslee at gmail.com>
>>> wrote:
>>>> 
>>>> Please keep replies on the list so others may participate in the
>>>> conversation.
>>>> 
>>>> If you have a character vector containing the potential values, you
>>>> might look at %in% for one approach to subsetting your data.
>>>> 
>>>> Var1 %in% myvalues
>>>> 
>>>> Sarah
>>>> 
>>>> On Wed, Nov 11, 2015 at 7:10 PM, Ashta <sewashm at gmail.com> wrote:
>>>>> Thank you Sarah for your prompt response!
>>>>> 
>>>>> I have the list of valid values of the variable Var1; there are around 20.
>>>>> How can I modify this one to include all 20 valid values?
>>>>> 
>>>>> test <- dat[dat$Var1 == "YYZ" | dat$Var1 =="MSN" ,]
>>>>> 
>>>>> Is there an efficient way of doing it?
>>>>> 
>>>>> Thank you again
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Nov 11, 2015 at 6:02 PM, Sarah Goslee <sarah.goslee at gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On Wed, Nov 11, 2015 at 6:51 PM, Ashta <sewashm at gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I have a data frame with a huge number of rows and columns.
>>>>>>> 
>>>>>>> When I looked at the data, it has several garbage values that need to be
>>>>>>> cleaned. As a sample, I am showing you the frequency distribution
>>>>>>> of one variable
>>>>>>> 
>>>>>>>   Var1 Freq
>>>>>>> 1    :    3
>>>>>>> 2    ]    6
>>>>>>> 3  MSN 1040
>>>>>>> 4  YYZ  300
>>>>>>> 5   \\    4
>>>>>>> 6    +    3
>>>>>>> 7   ?>   15
>>>>>> 
>>>>>> Please use dput() to provide your data. I made a guess at what you had
>>>>>> in R, but could be wrong.
>>>>>> 
>>>>>> 
>>>>>>> and continues.
>>>>>>> 
>>>>>>> I want to keep only the rows that contain a valid value of the variable,
>>>>>>> in this case MSN and YYZ. I tried the following:
>>>>>>> 
>>>>>>> test <- dat[dat$Var1 == "YYZ" | dat$Var1 =="MSN" ,]
>>>>>>> 
>>>>>>> but I am not getting the desired result.
>>>>>> 
>>>>>> What are you getting? How does it differ from the desired result?
>>>>>> 
>>>>>>> I have
>>>>>>> 
>>>>>>> Any help or idea?
>>>>>> 
>>>>>> I get:
>>>>>> 
>>>>>> > dat <- structure(list(X = 1:7, Var1 = c(":", "]", "MSN", "YYZ", "\\\\",
>>>>>> + "+", "?>"), Freq = c(3L, 6L, 1040L, 300L, 4L, 3L, 15L)), .Names = c("X",
>>>>>> + "Var1", "Freq"), class = "data.frame", row.names = c(NA, -7L))
>>>>>> >
>>>>>> > test <- dat[dat$Var1 == "YYZ" | dat$Var1 =="MSN" ,]
>>>>>> > test
>>>>>>   X Var1 Freq
>>>>>> 3 3  MSN 1040
>>>>>> 4 4  YYZ  300
>>>>>> 
>>>>>> Which seems reasonable to me.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>        [[alternative HTML version deleted]]
>>>>>> 
>>>>>> Please don't post in HTML either: it introduces all sorts of errors to
>>>>>> your message.
>>>>>> 
>>>>>> Sarah
>>>>>> 
>>> 
>>> 
>> 
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


