[R] Obtaining summary of frequencies of value occurrences for a variable in a multivariate dataset.

Allan Kamau kamauallan at yahoo.com
Sat Jul 28 14:48:47 CEST 2007


Hi Jim,
The problem description:
I am trying to identify mutations in a given gene from
a particular genome (biological genome sequence).
I have two CSV files consisting of sequences. One file
consists of reference (documented, curated, accepted as
standard) sequences. The other consists of sample
sequences in which I am trying to identify mutations. In
both files an individual sequence is contained in
a single record, and its amino acid residues (the actual
sequence of letters, each representing a given amino
acid, for example “A” stands for “Alanine”, “C” for
“Cysteine” and so on) are each allocated a single field
in the CSV file.
The sequences in both files have been well aligned;
each contains 115 residues, with the first residue
in field 5. Fields 1 to 4 are
allocated for metadata (name of sequence and so on).
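
A note on reading files laid out like this. With one residue letter per
field, read.csv/read.table will by default convert a column that holds
only "T" (Threonine) or only "F" (Phenylalanine) to logical TRUE/FALSE.
A minimal sketch of one way to avoid that, assuming every field can be
read as text; the column names and the two records below are made up
for illustration and stand in for the real files:

```r
# Hypothetical two-record file: four metadata fields, then one residue per field.
csv_text <- "name,src,date,note,P1,P2,P3
aa,x,2007,ok,T,F,V
bb,x,2007,ok,T,F,I"

# colClasses = "character" is recycled over all columns, so no column is
# converted to logical, numeric or factor.
seqs <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Residues start in field 5; fields 1 to 4 are metadata.
residues <- seqs[, 5:ncol(seqs)]
str(residues)
```

For the real data, text = csv_text would be replaced by the path to the
CSV file.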
My task is to compile a residue occurrence count for
each residue present in a given field in the reference
sequence dataset, and to use this information when reading
each sequence in the sample dataset to identify
mutations. For example, suppose a “P” is found at position 9
of the sample sequence “bb”. If, according to our
summaries of the reference dataset, “P” does not occur
at position 9 at all, or occurs there in only 10% or so of
the reference sequences, it will be classified as a mutation
(I could employ a cutoff parameter for mutation
classification).
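
The classification described above could be sketched roughly as follows.
This is only an illustration under assumptions: the toy data frames, the
position names P1 and P2, the helper residue_freq and the cutoff of 0.10
are all hypothetical, and "at or below the cutoff" is just one possible
reading of the rule.

```r
# Hypothetical toy data: rows are sequences, columns are aligned positions.
reference <- data.frame(
  P1 = c("V", "V", "V", "V", "V", "V", "V", "V", "V", "V"),
  P2 = c("T", "S", "T", "S", "S", "S", "T", "S", "S", "S"),
  stringsAsFactors = FALSE
)
sample_seqs <- data.frame(
  name = c("aa", "bb"),
  P1 = c("V", "P"),
  P2 = c("T", "S"),
  stringsAsFactors = FALSE
)

cutoff <- 0.10                 # assumed threshold, not taken from the thread
positions <- names(reference)

# Relative frequency of each residue at each position (a named list of tables).
# lapply always returns a list, whatever the number of distinct values.
ref_freq <- lapply(reference, function(col) prop.table(table(col)))

# Reference frequency of a residue at a position; 0 if it never occurs there.
residue_freq <- function(position, residue) {
  f <- ref_freq[[position]]
  if (residue %in% names(f)) as.numeric(f[residue]) else 0
}

# Logical matrix: TRUE where a sample residue is at or below the cutoff.
flags <- vapply(positions, function(p) {
  vapply(sample_seqs[[p]],
         function(r) residue_freq(p, r) <= cutoff,
         logical(1), USE.NAMES = FALSE)
}, logical(nrow(sample_seqs)))
is_mutation <- matrix(flags, nrow = nrow(sample_seqs),
                      dimnames = list(sample_seqs$name, positions))
print(is_mutation)
```

Here only the "P" at the first position of sequence "bb" is flagged,
because "P" never occurs at that position in the toy reference set.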


Allan.

--- jim holtman <jholtman at gmail.com> wrote:

> results <- list()
> myVariableNames <- names(x.val)
>
> for (i in myVariableNames) {
>     # x.val$i looks for an element literally named "i" and returns NULL;
>     # x.val[[i]] uses the value of i as the element name.
>     results[[i]] <- names(x.val[[i]])
> }
> 
> 
> 
> On 7/27/07, Allan Kamau <kamauallan at yahoo.com> wrote:
> > Hi All,
> > I am having difficulties finding a substitute for the
> > command "names(x.val$PR14)" so that I could generate the
> > command on the fly for all variables PR14 to PR200 (please
> > see the previous discussion below to understand what the
> > object x.val contains). I have tried the following:
> >
> > >results=()#character()
> > >myVariableNames=names(x.val)
> > >results[length(myVariableNames)]<-NA
> >
> > >for as.vector(unlist(strsplit(str,",")),mode="list")
> > +    results[i]<-names(x.val$i)    # this does not work it returns a NULL (how can i convert this to x.val$"somevalue" ? )
> > >}
> >
> > Allan.
> >
> >
> > ----- Original Message ----
> > From: Allan Kamau <kamauallan at yahoo.com>
> > To: r-help at stat.math.ethz.ch
> > Sent: Thursday, July 26, 2007 10:03:17 AM
> > Subject: Re: [R] Obtaining summary of frequencies of value occurrences for a variable in a multivariate dataset.
> >
> > Thanks so much Jim, Adaikalavan, Gabor and others for the
> > help and suggestions.
> > The solution will result in a matrix containing nested
> > matrices, so that each variable name, each of the variable's
> > distinct values, and the count of each distinct value are
> > accessible individually.
> > The main matrix will contain the variable names, the
> > first-level nested matrices will consist of the variables'
> > unique values, and each such entry will contain a
> > one-element vector holding the count or occurrence frequency.
> > This matrix can then be used to compare other similar
> > datasets for variable values and their frequencies.
> >
> > Building on the input received so far, a probable solution
> > for building the matrix will include the following.
> >
> >
> > 1) I read the CSV file (containing column headers)
> > >my_data=read.table("<path/to/my/data.csv>",header=TRUE,sep=",",dec=".",fill=TRUE)
> >
> > 2) I group the values in each variable, producing an
> > occurrence count (frequency)
> > >x.val<-apply(my_data,2,table)
> >
> > 3) I obtain a vector of the names of the variables in the table
> > >names(x.val)
> >
> > 4) Now I make use of the names (obtained in step 3) to obtain
> > a vector of distinct values in a given variable (in the
> > example below the variable name is PR14)
> > >names(x.val$PR14)
> >
> > 5) I obtain a vector (with one element) of the frequency of a
> > value obtained from the step above (in our example the
> > value is "V")
> > >as.vector(x.val$PR14["V"])
> >
> > Todo:
> > Now I will need to place the steps above in a script
> > (consisting of loops) to build the matrix; steps 4 and 5
> > seem tricky to do programmatically.
> >
> > Allan.
> >
> >
> > ----- Original Message ----
> > From: jim holtman <jholtman at gmail.com>
> > To: Allan Kamau <kamauallan at yahoo.com>
> > Cc: Adaikalavan Ramasamy <ramasamy at cancer.org.uk>; r-help at stat.math.ethz.ch
> > Sent: Wednesday, July 25, 2007 1:50:55 PM
> > Subject: Re: [R] Obtaining summary of frequencies of value occurrences for a variable in a multivariate dataset.
> >
> > Also if you want to access the individual values, you can just leave
> > it as a list:
> >
> > > x.val <- apply(x, 2, table)
> > > # access each value
> > > x.val$PR14["V"]
> > V
> > 8
> >
> >
> >
> > On 7/25/07, Allan Kamau <kamauallan at yahoo.com> wrote:
> > > A subset of the data looks as follows
> > >
> > > > df[1:10,14:20]
> > >   PR10 PR11 PR12 PR13 PR14 PR15 PR16
> > > 1     V    T    I    K    V    G    D
> > > 2     V    S    I    K    V    G    G
> > > 3     V    T    I    R    V    G    G
> > > 4     V    S    I    K    I    G    G
> > > 5     V    S    I    K    V    G    G
> > > 6     V    S    I    R    V    G    G
> > > 7     V    T    I    K    I    G    G
> > > 8     V    S    I    K    V    E    G
> > > 9     V    S    I    K    V    G    G
> > > 10    V    S    I    K    V    G    G
> > >
> > > The result I would like is as follows
> > >
> > > PR10        PR11          PR12   ...
> > > [V:10]    [S:7,T:3]    [I:10]
> > >
> > > The result can be in a matrix or a vector, and each
> > > variable name, value and frequency should be accessible
> > > so that they can be used for comparisons with another
> > > dataset later.
> > > The frequency can be a count or a percentage.
> > >
> > >
> > > Allan.
> > >
> > >
> > > ----- Original Message ----
> > > From: Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
> > > To: Allan Kamau <kamauallan at yahoo.com>
> > > Cc: r-help at stat.math.ethz.ch
> > > Sent: Tuesday, July 24, 2007 10:21:51 PM
> > > Subject: Re: [R] Obtaining summary of frequencies of value occurrences for a variable in a multivariate dataset.
> > >
> > > The name of the table should give you the "value". And if you have a
> > > matrix, you just need to convert it into a vector first.
> > >
> > >  > m <- matrix( LETTERS[ c(1:3, 3:5, 2:4) ], nc=3 )
> > >  > m
> > >      [,1] [,2] [,3]
> > > [1,] "A"  "C"  "B"
> > > [2,] "B"  "D"  "C"
> > > [3,] "C"  "E"  "D"
> > >  > tb <- table( as.vector(m) )
> > >  > tb
> > >
> > > A B C D E
> > > 1 2 3 2 1
> > >  > paste( names(tb), ":", tb, sep="" )
> > > [1] "A:1" "B:2" "C:3" "D:2" "E:1"
> > >
> > > If this is not what you want, then please give a simple example.
> > >
> > > Regards, Adai
> > >
> > >
> > >
> > > Allan Kamau wrote:
> > > > Hi all,
> > > > If the question below has been answered before I
> > > > apologize for the posting.
> > > > I would like to get the frequencies of occurrence of
> > > > all values in a given variable in a multivariate
> > > > dataset. In short for each variable (or field) a
> > > > summary of values contained within a value:frequency
=== message truncated ===


