[R] detecting noise in data?

Bert Gunter gunter.berton at gene.com
Wed Jan 25 00:12:02 CET 2012


Statistical inference for differences between groups that were themselves determined from the data yields incorrect results. The groups must be prespecified.
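
A small simulation illustrates the point. It is only a sketch in base R under assumed settings (standard normal data, k-means with two centers, a Welch t-test, 200 repetitions); none of these choices come from the original question. Each sample is drawn from a single source, split into two clusters chosen from the data, and then tested for a difference in means on those same data:

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n_sims <- 200  # assumed number of repetitions

# Each replicate: one homogeneous sample, no real group structure.
p_values <- replicate(n_sims, {
  x  <- rnorm(100)                                  # single source, mean 0
  cl <- kmeans(x, centers = 2, nstart = 5)$cluster  # groups chosen from the data
  t.test(x[cl == 1], x[cl == 2])$p.value            # tested on the same data
})

# Fraction of replicates declared "significant" at the 5% level.
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

Even though there is only one source, the rejection rate is far above the nominal 5%, because the clustering step has already separated the low values from the high ones before the test is run.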

Bert

On Jan 24, 2012, at 2:55 PM, "HARROLD, Tim" <THARR at doh.health.nsw.gov.au> wrote:

> You might want to provide an example? It's a pretty vague problem at the moment.
> 
> If the data can be easily picked out by human eyes, you might want to think about the criteria you're using to pick out a contaminated result. If you can express them in such a way that you don't need to scan each observation (e.g. if a snapper weighs >= 300000 kg then somebody entered that data incorrectly), then you can create an indicator variable and continue with your analysis.
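
The indicator-variable idea in the paragraph just quoted could look like the following in base R. This is a minimal sketch: the data frame `fish`, its column `weight_kg`, and the values in it are hypothetical, made up for illustration; only the 300000 kg rule comes from the quoted text.

```r
# Hypothetical data: snapper weights in kg, with one obvious entry error.
fish <- data.frame(weight_kg = c(2.1, 3.4, 1.8, 300000, 2.9))

# Indicator variable from the stated rule: TRUE marks a suspect record.
fish$suspect <- fish$weight_kg >= 300000

# Continue the analysis on the records that pass the rule.
clean <- fish[!fish$suspect, ]
mean(clean$weight_kg)
```

Because the rule is fixed in advance rather than derived from the sample, the flag can be applied without inspecting each observation.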
> 
> Other than that, some sort of cluster analysis might be able to pick up on 2 distinct groups, provided there's a reasonable level of homogeneity within each group. From there, you can do a basic inference test on the group means to check whether the groups differ significantly.
> 
> Cheers,
> Tim
> 
> 
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Michael
> Sent: Wednesday, 25 January 2012 9:31 AM
> To: r-help
> Subject: Re: [R] detecting noise in data?
> 
> Hi all,
> 
> I just wanted to add that I am looking for a solution that's in R ... to
> handle this...
> 
> And also, in a given sample, the correct data are the majority and the
> noise is the minority.
> 
> Thank you!
> 
> On Tue, Jan 24, 2012 at 4:09 PM, Michael <comtech.usa at gmail.com> wrote:
> 
>> Hi all,
>> 
>> I have data which are unfortunately contaminated by noise.
>> 
>> We know that the noise is at a different level than the correct data, i.e.
>> the noise data can be easily picked out by human eyes.
>> 
>> It looks as if two people generated two very different data sets with
>> different mean levels, and these got mixed together.
>> 
>> i.e. assuming the two data sets follow an unknown distribution DF,
>> 
>> and the two mean levels are u1 and u2... (unknown)
>> 
>> Then the correct data are generated by DF(u1)
>> 
>> and the noise are generated by DF(u2),
>> 
>> and they got mixed...
>> 
>> Now, how do I flag those suspicious data? At least, is there a way I could
>> answer the question:
>> 
>> Given a sample of mixed data, were these data generated from the
>> above-mentioned two sources, or were they in fact generated from one
>> source only?
>> 
>> i.e. are there two substantially distinct species in the given data?
>> 
>> Thanks a lot!
>> 
>> 
> 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
