[R] Advice on obscuring unique IDs in R
Marc Schwartz
marc_schwartz at me.com
Wed Jan 5 22:43:33 CET 2011
On Jan 5, 2011, at 3:19 PM, Anthony Staines wrote:
> Dear colleagues,
>
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
>
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!
>
> There is the 'digest' function, in the digest package, but
> this seems to work on the whole vector of IDs, producing, in
> my case, a vector with 60,000 identical entries.
>
> H.Out$P_ID = digest(H.In$MRNr,serialize=FALSE, algo='md5')
>
> I could do this in Perl, but I'd have to do quite a bit of
> work to get it installed.
>
> Any quick suggestions?
> Anthony Staines
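digest() returns a single hash for whatever object it is handed, so
calling it on the whole column yields one value that gets recycled
across all 60,000 rows. Try using sapply() to hash each ID separately: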
> library(digest)
# Create 60,000 random 10-letter strings to stand in for the real IDs
> L <- replicate(60000, paste(sample(letters, 10, replace = TRUE), collapse = ""))
> str(L)
chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" ...
> head(L)
[1] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" "kmlcxydygl"
[6] "wpftnyrzwe"
# Use sapply() to run digest() over each element of L
> system.time(L.Digest <- sapply(L, digest))
   user  system elapsed
  6.920   0.031   7.361
> str(L.Digest)
Named chr [1:60000] "6d5861904ee004d251504cb0f731a69a" ...
- attr(*, "names")= chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" ...
> head(L.Digest)
                        dfederergw                         nwphehurvb
"6d5861904ee004d251504cb0f731a69a" "bf8ee61f69c83468988cad681a9f7ad0"
                        avzmvltrhn                         ecmeiasmbk
"ba1c66af41359cf1a3f5e91f22c6dfe5" "95ca2deaa6c1118852c9ffed71994a7f"
                        kmlcxydygl                         wpftnyrzwe
"f3647a7937a2c484123ef33bb52a27ac" "e84f17180703e4805493d88a760be682"
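For the data frame in the original question, the same per-element idea
might look roughly like the sketch below. It is only an illustration
under some assumptions: hash_ids() is a hypothetical helper, not part of
the digest package; the IDs and the salt value are made up; and the
choice of "sha256" with serialize = FALSE (which hashes the string
itself rather than R's serialized form of it) is one reasonable option,
not the only one. Prepending a secret salt is a simple way to make the
codes harder to guess from a list of candidate IDs.

```r
library(digest)

# Hypothetical helper (an assumption, not from the digest package):
# returns one hash per element of 'ids', in the same order.
hash_ids <- function(ids, salt = "", algo = "sha256") {
  ids <- as.character(ids)
  # Hash each distinct ID once, then map the results back to every row
  uniq <- unique(ids)
  hashed <- vapply(
    uniq,
    function(id) digest(paste0(salt, id), algo = algo, serialize = FALSE),
    character(1),
    USE.NAMES = FALSE  # keep the real IDs out of the names attribute
  )
  hashed[match(ids, uniq)]
}

# Made-up IDs standing in for the real record numbers
H.In <- data.frame(MRNr = c("A123", "B456", "A123", "C789"),
                   stringsAsFactors = FALSE)

# The salt shown here is a placeholder; a real one would be kept secret
H.Out <- data.frame(P_ID = hash_ids(H.In$MRNr, salt = "replace-with-a-secret"),
                    stringsAsFactors = FALSE)

# Events for the same person still link up, and distinct IDs stay distinct
stopifnot(H.Out$P_ID[1] == H.Out$P_ID[3])
stopifnot(length(unique(H.Out$P_ID)) == length(unique(H.In$MRNr)))
```

Note that sapply() keeps the original strings as names on the result, as
in the str() output above, so with real IDs the names should be dropped
(USE.NAMES = FALSE or unname()) before the hashed vector goes anywhere.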
HTH,
Marc Schwartz