[R] anonymizing subject identifiers for survival analysis
William Dunlap
wdunlap at tibco.com
Fri May 13 16:27:46 CEST 2016
You can also use match(code, unique(code)), as in
transform(dd.2, codex2 = paste0("Person", match(code, unique(code))))
It is not guaranteed that x != y implies digest(x) != digest(y), although
a collision is extremely unlikely. The match idiom does guarantee it:
distinct codes always get distinct labels, and equal codes get the same one.
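
For illustration, here is a minimal sketch of the match idiom applied to the
sample data from the original post. The stringsAsFactors = FALSE argument and
the final stopifnot() check are my additions, not part of the original code;
the rest follows the post.

```r
# Sample data from the original post. stringsAsFactors = FALSE is an added
# assumption so that name and dob stay character in R versions before 4.0.
dd <- data.frame(id = 1:6,
                 name = c("Harry", "Ron", "Hermione", "Luna", "Ginny", "Harry"),
                 dob = c("1990-01-01", "1990-06-15", "1990-04-08",
                         "1999-11-26", "1990-07-21", "1990-01-01"),
                 stringsAsFactors = FALSE)
dd.2 <- transform(dd, code = paste0(tolower(name), tolower(dob)))

# match() gives, for each code, the position of its first occurrence in
# unique(code). Equal codes share a position; distinct codes never do.
dd.2 <- transform(dd.2, codex2 = paste0("Person", match(code, unique(code))))

as.character(dd.2$codex2)
# "Person1" "Person2" "Person3" "Person4" "Person5" "Person1"

# The labels repeat exactly where the cleartext codes repeat.
stopifnot(identical(duplicated(dd.2$code), duplicated(dd.2$codex2)))
```

Note that the labels are numbered in order of first appearance, so they are
only stable for a given data set in a given row order. If new records are
added later, you would need to keep the lookup (unique(code)) somewhere safe
and extend it, rather than recompute it from scratch.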
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Thu, May 12, 2016 at 1:06 PM, Christopher W Ryan <cryan at binghamton.edu>
wrote:
> I would like to conduct a survival analysis, examining a subject's
> time to *next* appearance in a database, after their first appearance.
> It is a database of dated events.
>
> I need to obfuscate or anonymize or mask the subject identifiers (a
> combination of name and birthdate). And obviously any given subject
> should have the same anonymous code every time he/she appears in the
> database. I'm not talking "safe from the NSA" here. And I won't be
> releasing it. It's just sensitive data and I don't want to be working
> every day with cleartext versions of it.
>
> I've looked at packages digest, anonymizer, and anonymize. What do
> you think of this approach:
>
> # running R 3.1.1 on Windows 7 Enterprise
> library(digest)
> dd <- data.frame(id=1:6, name = c("Harry", "Ron", "Hermione", "Luna",
> "Ginny", "Harry"), dob = c("1990-01-01", "1990-06-15", "1990-04-08",
> "1999-11-26", "1990-07-21", "1990-01-01"))
> dd.2 <- transform(dd, code=paste0(tolower(name), tolower(dob)))
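> anonymize <- function(x, algo="sha256"){
>   x <- as.character(x)  # index by value below, not by factor level code
>   # hash each distinct value once; the result is named by the value itself
>   unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo),
>                        FUN.VALUE="", USE.NAMES=TRUE)
>   unname(unq_hashes[x])
> }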
> dd.2$codex <- anonymize(dd.2$code)
> dd.2
> table(duplicated(dd.2$codex))
>
> Thanks.
>
> --Chris Ryan
> Broome County Health Department
>