[R] Advice on obscuring unique IDs in R

Marc Schwartz marc_schwartz at me.com
Wed Jan 5 22:43:33 CET 2011


On Jan 5, 2011, at 3:19 PM, Anthony Staines wrote:

> Dear colleagues,
> 
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
> 
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!
> 
> There is the 'digest' function, in the digest package, but
> this seems to work on the whole vector of IDs, producing, in
> my case, a vector with 60,000 identical entries.
> 
> H.Out$P_ID = digest(H.In$MRNr,serialize=FALSE, algo='md5')
> 
> I could do this in Perl, but I'd have to do quite a bit of
> work to get it installed.
> 
> Any quick suggestions?
> Anthony Staines


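digest() is not vectorized: it returns a single hash for whatever it is given, and that one value is then recycled across all 60,000 rows. Try using sapply() to call it once per ID: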


# Generate 60,000 random 10-letter strings as stand-in IDs

> L <- replicate(60000, paste(sample(letters, 10, replace = TRUE), collapse = ""))

> str(L)
 chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" ...

> head(L)
[1] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" "kmlcxydygl"
[6] "wpftnyrzwe"


# Load the digest package, then use sapply() to run digest() over each element of L

> library(digest)

> system.time(L.Digest <- sapply(L, digest))
   user  system elapsed 
  6.920   0.031   7.361 


> str(L.Digest)
 Named chr [1:60000] "6d5861904ee004d251504cb0f731a69a" ...
 - attr(*, "names")= chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" ...
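One thing to watch in the str() output above: when its input is a character vector, sapply() names each result after the corresponding input, so L.Digest still carries the original strings as names. With real record numbers you would presumably want those dropped before the result goes anywhere. A minimal sketch, assuming the digest package is installed; `ids`, `hashed` and `hashed2` are illustrative names, not part of the original post:

```r
library(digest)

# Illustrative stand-in IDs (not real record numbers)
ids <- c("dfederergw", "nwphehurvb", "avzmvltrhn")

# USE.NAMES = FALSE stops sapply() from attaching the inputs as names
hashed <- sapply(ids, digest, USE.NAMES = FALSE)

# Equivalent fix for a vector that was already built with names
hashed2 <- unname(sapply(ids, digest))

# Check that no original IDs remain attached to the result
is.null(names(hashed))
```

Either form leaves a plain character vector of hashes in the same order as the input.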


> head(L.Digest)
                        dfederergw                         nwphehurvb 
"6d5861904ee004d251504cb0f731a69a" "bf8ee61f69c83468988cad681a9f7ad0" 
                        avzmvltrhn                         ecmeiasmbk 
"ba1c66af41359cf1a3f5e91f22c6dfe5" "95ca2deaa6c1118852c9ffed71994a7f" 
                        kmlcxydygl                         wpftnyrzwe 
"f3647a7937a2c484123ef33bb52a27ac" "e84f17180703e4805493d88a760be682" 
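Applied to the data frame from the original question, the same idea might look like the sketch below. It assumes H.In has a column MRNr as in the question; the toy values here are made up. It uses vapply() with USE.NAMES = FALSE so the result is an unnamed character vector, and it hashes each distinct ID only once before mapping the hashes back to the rows. Passing serialize = FALSE, as in the original call, hashes the string itself rather than R's serialized form of it.

```r
library(digest)

# Made-up stand-in for the real input; MRNr holds the record identifiers
H.In <- data.frame(MRNr = c("A123", "B456", "A123", "C789"),
                   stringsAsFactors = FALSE)

# Work on character values in case MRNr was read in as a factor
mrn <- as.character(H.In$MRNr)

# Hash each distinct ID once, then map the hashes back to every row
ids <- unique(mrn)
hashes <- vapply(ids, digest, character(1),
                 algo = "md5", serialize = FALSE, USE.NAMES = FALSE)

H.Out <- data.frame(P_ID = hashes[match(mrn, ids)],
                    stringsAsFactors = FALSE)

# Rows that shared an ID still share a hash, and distinct IDs stay distinct
stopifnot(H.Out$P_ID[1] == H.Out$P_ID[3],
          length(unique(H.Out$P_ID)) == length(ids))
```

Note that a plain md5 of a short identifier can be reversed by hashing every plausible identifier and comparing the results, so whether this counts as "strongly obscured" depends on how guessable the record numbers are.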


HTH,

Marc Schwartz


