[R] Recoding multiple columns consistently
Jim Lemon
jim at bitwrit.com.au
Wed Aug 29 13:01:00 CEST 2007
Ron Crump wrote:
> Hi,
>
> I have a dataframe that contains pedigree information;
> that is individual, sire and dam identities as separate
> columns. It also has date of birth.
>
> These identifiers are not numeric, or not sequential.
>
> Obviously, an identifier can appear in one or two columns,
> depending on whether it was a parent or not. These should
> be consistent.
>
> Not all identifiers appear in the individual column - it
> is possible for a parent not to have its own record if its
> parents were not known.
>
> Missing parental (sire and/or dam) identifiers can occur.
>
> I need to export the data for use in another program that
> requires the pedigree to be coded as integers, increasing
> with date of birth (therefore sire and dam always have
> lower identifiers than their offspring) and with missing
> values coded as 0.
>
> How would I go about doing this?
>
Hi Ron,
Without the genealogical coding system for the output, I can only make a
guess. It seems as though you are going from a series of records for
which the index is the individual, followed by fields containing sire,
dam and date of birth (perhaps not in that order).
I think you want to transform this into a network (maybe hierarchical
unless consanguinity intervenes) with individuals coded as positive
integers (and maybe some or all of the original information attached to
those identifiers). At a guess, I would recode the birthdates as
integers, preserving the order and including a rule for breaking ties.
Assuming that you want an inverted tree for each individual, construct a
linked list beginning with the individual with two pointers to the
parents (their integer identifiers). Each parent has two links pointing
to their parents, and so on. Whenever a pointer is zero, the linking
stops. I don't know whether this can be represented in any of the tree
diagrams in R, but it certainly could be coded.
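As a rough sketch of the recoding and linking described above, assuming a data frame with hypothetical columns id, sire, dam and dob (these names and the example animals are made up, not taken from Ron's data), something like the following might work. It also assumes that parents without their own record can be numbered first, and that recorded birth dates are consistent (parents born before offspring).

```r
# Hypothetical pedigree; column names and values are illustrative assumptions.
ped <- data.frame(
  id   = c("C9", "A7", "B2", "D4"),
  sire = c("A7", NA,   NA,   "A7"),
  dam  = c("B2", NA,   "Z1", "C9"),
  dob  = as.Date(c("2003-05-01", "2001-03-10", "2001-03-10", "2005-08-20")),
  stringsAsFactors = FALSE
)

# Parents with no record of their own have no birth date; they are placed
# first, which assumes they predate every recorded animal.
parents  <- c(ped$sire, ped$dam)
founders <- setdiff(parents[!is.na(parents)], ped$id)

# Order recorded animals by date of birth, breaking ties by original identifier.
recorded <- ped$id[order(ped$dob, ped$id)]

# An identifier's position in this vector is its new integer code.
key <- c(sort(founders), recorded)

recode <- function(x) {
  code <- match(x, key)    # NA where the parent is missing
  code[is.na(code)] <- 0L  # missing values coded as 0
  code
}

ped_int <- data.frame(
  id   = recode(ped$id),
  sire = recode(ped$sire),
  dam  = recode(ped$dam)
)
ped_int <- ped_int[order(ped_int$id), ]

# Follow the parent "pointers" upward; a zero pointer, or an animal with no
# record of its own, stops the walk.
ancestors <- function(i, ped_int) {
  row <- ped_int[ped_int$id == i, ]
  if (i == 0L || nrow(row) == 0L) return(integer(0))
  p <- c(row$sire, row$dam)
  p <- p[p > 0L]
  unique(c(p, unlist(lapply(p, ancestors, ped_int = ped_int))))
}
```

Applying the same recode() to all three columns is what keeps the identifiers consistent across them. Calling ancestors(5L, ped_int) walks back from the youngest animal in this toy pedigree.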
I think a bit more information for non-genealogists about the formats
might elicit a more specific answer.
> And a second, simpler related question, if I have a column with
> n different values (may be strings or non-sequential integers)
> identifying levels (possibly with repeated occurrences), how
> can I recode them to be sequential from 1 to n?
>
> I can solve both problems in fortran, so could use loops to
> do it in R, but feel there should be a quicker, more elegant,
> "more R" solution.
>
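Sounds like "sort" combined with "match": match(x, sort(unique(x)))
recodes the n distinct values of x as the integers 1 to n.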
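For the simpler question, a minimal sketch with a made-up vector x (the values are only for illustration) could look like this. Both match() and factor() are base R; which ordering of the codes is wanted is an assumption on my part.

```r
# Hypothetical column with repeated, non-sequential labels.
x <- c("pig", "cow", "pig", "ewe", "cow")

# Codes follow the sorted order of the distinct values.
codes_sorted <- match(x, sort(unique(x)))

# Equivalent route through a factor, whose levels are sorted by default.
codes_factor <- as.integer(factor(x))

# Codes follow the order of first appearance instead.
codes_first <- match(x, unique(x))
```

No loop is needed in any of these; the original labels can be recovered from sort(unique(x))[codes_sorted].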
Jim