[R] Character SNP data to binary MAF data

Thomas Lumley tlumley at u.washington.edu
Thu Jan 29 09:33:31 CET 2009


The first step is to convert your data to all uppercase with toupper().
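
A minimal sketch of this step, assuming the 3x6 example matrix quoted below has been entered as a character matrix called the_data. toupper() keeps the attributes of its argument, so the result is still a matrix:

```r
# Example matrix from the question, entered row by row
the_data <- matrix(c("a", "A", "a", "T", "A", "t",
                     "G", "g", "t", "T", "T", "t",
                     "A", "a", "C", "C", "c", "c"),
                   nrow = 3, byrow = TRUE)

# Fold everything to uppercase; the dim attribute is preserved
the_data <- toupper(the_data)
dim(the_data)
```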

Then it depends on how tidy the data are: are there missing data, are some SNPs monomorphic in your sample, etc.
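
One way to check this, as a rough sketch: count missing calls and distinct alleles per SNP. The small matrix example_data is made up for illustration, and the code assumes missing genotypes are coded as NA, which may not match your files.

```r
# Hypothetical 2x4 matrix: row 1 has a missing call, row 2 is monomorphic
example_data <- toupper(matrix(c("a", "A", "t", NA,
                                 "C", "c", "C", "C"),
                               nrow = 2, byrow = TRUE))

# Number of missing calls per SNP (row); assumes missing values are NA
n_missing <- rowSums(is.na(example_data))

# Number of distinct alleles observed per SNP, ignoring NA
n_alleles <- apply(example_data, 1,
                   function(arow) length(unique(arow[!is.na(arow)])))

# SNPs with exactly one observed allele are monomorphic in this sample
monomorphic <- n_alleles == 1
```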

If there are no missing data you can use

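N <- ncol(the_data)
halfN <- N/2

## For one SNP (row), return 1 for the allele carried by more than
## half of the samples and 0 otherwise. An exact tie gives all zeros.
maf_one_row <- function(arow) {
    rval <- numeric(N)
    if (sum(i <- arow=="A") > halfN) {
         rval[i] <- 1
    } else if (sum(i <- arow=="C") > halfN) {
         rval[i] <- 1
    } else if (sum(i <- arow=="T") > halfN) {
         rval[i] <- 1
    } else if (sum(i <- arow=="G") > halfN) {
         rval[i] <- 1
    }
    rval
}

## apply() over rows returns one column per SNP, so transpose the result
t(apply(the_data, 1, maf_one_row))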

You could also use table() to find the two alleles, but you have to make sure that the code still works when there is only one allele.
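
A sketch of what the table() route might look like. The function name maf_one_row_table is made up here, missing values are assumed to be coded as NA, and ties are broken by taking the first allele in alphabetical order, which is an arbitrary choice:

```r
# Hypothetical helper: code the most frequent allele in one row as 1
maf_one_row_table <- function(arow) {
    arow <- toupper(arow)
    counts <- table(arow)                  # NA is dropped by default
    if (length(counts) == 0) {
        return(rep(NA_real_, length(arow)))  # every call is missing
    }
    # which.max() takes the first maximum, so ties go to the
    # alphabetically first allele
    major <- names(counts)[which.max(counts)]
    as.numeric(arow == major)              # NA stays NA
}

# Example matrix from the question
the_data <- matrix(c("a", "A", "a", "T", "A", "t",
                     "G", "g", "t", "T", "T", "t",
                     "A", "a", "C", "C", "c", "c"),
                   nrow = 3, byrow = TRUE)

# apply() returns one column per row of the_data, hence the transpose
binary_data <- t(apply(the_data, 1, maf_one_row_table))
```

On the example matrix this gives the 0/1 matrix asked for in the question. A monomorphic row comes out as all 1s, since its single allele is trivially the most frequent.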

      -thomas

On Thu, 29 Jan 2009, Hadassa Brunschwig wrote:

> Hi
>
> An example is as follows. Consider the character 3x6 matrix:
>
> a A a T A t
> G g t T T t
> A a C C c c
>
> For each row I would like to identify the most frequent letter and
> assign a 1 to it and 0
> to the less frequent character. That is, in row 1 the most frequent
> letter is A (I do not differentiate between capital and non-capital
> letters), in row 2 T and in row 3 C. After the binary conversion
> the resulting matrix would look like that:
>
> 1 1 1 0 1 0
> 0 0 1 1 1 1
> 0 0 1 1 1 1
>
> Any suggestions on how to do that (and I am sure I am not the first
> one to try this).
>
> Thanks
> Hadassa
>
>
> On Thu, Jan 29, 2009 at 1:50 AM, Jorge Ivan Velez
> <jorgeivanvelez at gmail.com> wrote:
>>
>> Hi Hadassa,
>> Do you have a sample of your data and the output you want? It might be
>> useful for us in order to provide any help to you.
>> Regards,
>>
>> Jorge
>>
>>
>> On Wed, Jan 28, 2009 at 8:36 AM, Hadassa Brunschwig
>> <hadassa.brunschwig at mail.huji.ac.il> wrote:
>>>
>>> Hi
>>>
>>> I am sure there is a function out there already but I couldn't find it.
>>> I have SNP data, that is, a matrix which contains in each row two
>>> characters (they are different in each row) and I would like to
>>> convert this matrix to a binary one according to the minor allele
>>> frequency. For non-geneticists: I want to have a binary matrix
>>> for which in each row the 0 stands for the less frequent character
>>> and 1 for the more frequent character.
>>>
>>> Thanks for any suggestions.
>>> Hadassa
>>>
>>> --
>>> Hadassa Brunschwig
>>> PhD Student
>>> Department of Statistics
>>> The Hebrew University of Jerusalem
>>> http://www.stat.huji.ac.il
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
>
> --
> Hadassa Brunschwig
> PhD Student
> Department of Statistics
> The Hebrew University of Jerusalem
> http://www.stat.huji.ac.il
>
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



