[R] anonymizing subject identifiers for survival analysis

Christopher W. Ryan cryan at binghamton.edu
Fri May 13 17:55:13 CEST 2016


Excellent, thanks. Much simpler.

--Chris

Christopher W. Ryan, MD, MS
cryanatbinghamtondotedu
https://www.linkedin.com/in/ryancw

Early success is a terrible teacher. You’re essentially being rewarded
for a lack of preparation, so when you find yourself in a situation
where you must prepare, you can’t do it. You don’t know how.
--Chris Hadfield, An Astronaut's Guide to Life on Earth

William Dunlap wrote:
> You can also use match(code, unique(code)), as in
>   transform(dd.2, codex2 = paste0("Person", match(code, unique(code))))
> It is not guaranteed that x!=y implies digest(x)!=digest(y), but it is
> extremely unlikely to fail.  This match idiom guarantees that.
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
> 
> On Thu, May 12, 2016 at 1:06 PM, Christopher W Ryan
> <cryan at binghamton.edu> wrote:
> 
>     I would like to conduct a survival analysis, examining a subject's
>     time to *next* appearance in a database, after their first appearance.
>     It is a database of dated events.
> 
>     I need to obfuscate or anonymize or mask the subject identifiers (a
>     combination of name and birthdate). And obviously any given subject
>     should have the same anonymous code every time he/she appears in the
>     database.  I'm not talking "safe from the NSA" here. And I won't be
>     releasing it. It's just sensitive data and I don't want to be working
>     every day with cleartext versions of it.
> 
>     I've looked at packages digest, anonymizer, and anonymize.  What do
>     you think of this approach:
> 
>     # running R 3.1.1 on Windows 7 Enterprise
>     library(digest)
>     dd <- data.frame(id = 1:6,
>       name = c("Harry", "Ron", "Hermione", "Luna", "Ginny", "Harry"),
>       dob = c("1990-01-01", "1990-06-15", "1990-04-08",
>               "1999-11-26", "1990-07-21", "1990-01-01"))
>     # build the cleartext key (paste0 takes no sep argument)
>     dd.2 <- transform(dd, code = paste0(tolower(name), tolower(dob)))
>     anonymize <- function(x, algo = "sha256") {
>       x <- as.character(x)  # a factor would index by integer code below
>       unq <- unique(x)      # hash each distinct key only once
>       unq_hashes <- vapply(unq, function(object) digest(object, algo = algo),
>                            FUN.VALUE = "", USE.NAMES = TRUE)
>       unname(unq_hashes[x]) # look up each row's hash by name
>     }
>     dd.2$codex <- anonymize(dd.2$code)
>     dd.2
>     table(duplicated(dd.2$codex))
> 
>     Thanks.
> 
>     --Chris Ryan
>     Broome County Health Department
> 
> 
>
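
For reference, the match idiom suggested above can be sketched end to end
in base R, with no packages. This is only an illustration on the toy data
from the thread: the variable names key and dd_anon, and the final step of
dropping the identifying columns, are assumptions and not part of the
original posts.

```r
# Toy data from the thread; stringsAsFactors = FALSE keeps name and dob
# as character vectors regardless of R version.
dd <- data.frame(
  id   = 1:6,
  name = c("Harry", "Ron", "Hermione", "Luna", "Ginny", "Harry"),
  dob  = c("1990-01-01", "1990-06-15", "1990-04-08",
           "1999-11-26", "1990-07-21", "1990-01-01"),
  stringsAsFactors = FALSE
)

# Cleartext key: lower-cased name pasted to the date of birth.
key <- paste0(tolower(dd$name), dd$dob)

# match() gives, for each key, the position of its first occurrence in
# unique(key): equal keys share a number, distinct keys never collide.
dd$codex2 <- paste0("Person", match(key, unique(key)))

# Hypothetical last step (not in the thread): keep only the masked code.
dd_anon <- dd[, c("id", "codex2")]
dd_anon
```

The codes are assigned in order of first appearance, so re-sorting the
data or adding rows and re-running can change which subject gets which
number. If the codes must stay stable across runs, it is probably worth
storing unique(key) somewhere safe and matching against that stored
vector.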
