[R] Frequency of Combinations
Boris Steipe
boris.steipe at utoronto.ca
Mon Apr 24 18:47:34 CEST 2017
This can be easily done in base R.
The solution below is pedestrian, transparent and explicit. Thus it's easy to
debug and validate(!) each step.
Prepare:
(1) Save your Excel spreadsheet as a text file with tab-separated values.
(2) Initialize a 128 * 128 matrix to hold your results. It should have
your drug-codes as row names and as column names.
(3) Create a vector of drug codes as character(!) (not as integers). Treating them
as strings avoids being bitten by whatever unexpected stuff you'll find in your data.
Compile:
(4) read your data line by line. For each line:
use strsplit() to get a vector of contents of your spreadsheet cells,
drop the patient ID while you are doing this (IDs could overlap drug codes).
use match() to compare your drug codes with the vector of contents. match()
returns a vector of positions.
if the length of this vector is > 1:
use a nested loop to consider all combinations of matched drugs. For each
combination of drug codes, increment the corresponding value in your matrix.
(5) Now all counts are in the matrix, but the same combination may appear
in different order: add the value of each element below the diagonal
of your matrix to the corresponding value above the diagonal.
(The values _in_ the diagonal would correspond to drugs that have been
entered more than once.)
(6) Finally, replace the matrix row- and column names with the actual drug names.
Analyze:
There are many ways. Myself, I would create a data frame with the counts in one column,
and the two drug names in the second and third column, then order() the rows
by count.
All in all, that's about 20 lines of code to prepare and compile, 10 more for the
result. Post again if this wasn't clear, or if you need help with the actual syntax.
Cheers,
B.
> On Apr 24, 2017, at 1:07 AM, abo dalash <abo_dlsh at hotmail.com> wrote:
>
> Hi there
>
> I have data set with 500,000 patients (rows) and the first column is Patient I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug 20 so each row starts with the patient ID and the remaining of cells in the row are codes for names of the treatments taken by the patient. Number of treatments differ between patients. For example, there are patients with 3 treatments only and patients with 20 drugs. The unique number of treatments in the entire data set is about 6700 drugs. However, I'm interested in studying only 128 drugs, these drugs are listed in a second sheet as code numbers associated with their meanings (names of drugs representing the code). I'm interested in identifying the most frequently used DRUG COMBINATIONS between only the 128 drugs among the 6700 drugs. The structure of the Excell file to be used in analysis is like this:
>
> -Sheet 1( the entire data set):-
> 1 Patient ID Drug1 Drug2 .... Drug 20.
> 2 1125 45 46 55
> 3 1126 60 55 45
> .
> .
> 500,000
>
> -Sheet 2 (list of codes meanings for only the drugs of interest):
>
> 1 Drug code meaning
> 2 45 Simvastatin
> 3 55 Aspirin
> 4 60 Paracetamol
> .
> 128
>
> The desired output I'm looking for :
>
> Drug codes Meaning Frequency
> 45+55 Simvastatin 2
> +Aspirin
> 60+55 Aspirin+ 1
> Paracetamol
> 60+45 Simvastatin+ 1
> Paracetamol
>
> Please note the the final output does not include any combination containing drug 46 as this is not in the list of drugs preferred to be studied which are mentioned in sheet 2.
>
> Could you please help me which R codes and packages should be used to run this analyisis?
>
> Regards
> Mustafa
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list