[R] Any method to speed up this problem?

Thu Jun 18 16:36:27 CEST 2009

On Jun 18, 2009, at 9:28 AM, njhuang86 wrote:

>
> Hi all,
>
> Suppose I have a vector like this:
>
>  [1] "STAT1"  "STAT1"  "STAT1"  "STAT1"  "GAPDH"  "GAPDH"  "GAPDH"  "ACTB"   "ACTB"
> [10] "ACTB"   "DDR1"   "RFC2"   "HSPA6"  "PAX8"   "GUCA1A" "UBE1L"  "THRA"   "PTPN21"
> [19] "CCL5"   "CYP2E1" "STAT1"  "THRA"   "PAX8"
>
> I would like to produce a vector such that it has the same length as the one
> above but it tells me where the duplicates are. So essentially, if I could
> represent each gene symbol as a specific number, and have the duplicates be
> the same number, that would be ideal. Right now, I'm using the unique
> command along with two nested for loops to do the job... But it's really
> taking too long... Any suggestions would be greatly appreciated. Thank you!

Is this what you want?

 > Vec
  [1] "STAT1"  "STAT1"  "STAT1"  "STAT1"  "GAPDH"  "GAPDH"  "GAPDH"
  [8] "ACTB"   "ACTB"   "ACTB"   "DDR1"   "RFC2"   "HSPA6"  "PAX8"
[15] "GUCA1A" "UBE1L"  "THRA"   "PTPN21" "CCL5"   "CYP2E1" "STAT1"
[22] "THRA"   "PAX8"

 > as.numeric(factor(Vec))
  [1] 11 11 11 11  5  5  5  1  1  1  4 10  7  8  6 13 12  9  2  3 11 12
[23]  8

?
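
A minimal, self-contained sketch of the approach above, in case it helps. Note that factor() sorts its levels, so the codes follow alphabetical order of the gene symbols rather than the order in which they first appear. If order of first appearance is preferred, match() against unique() is one alternative, and duplicated() can flag the repeated entries directly. The names codes_sorted, codes_first and is_dup are illustrative only and are not from the original question.

```r
# The example vector from the question
Vec <- c("STAT1", "STAT1", "STAT1", "STAT1", "GAPDH", "GAPDH", "GAPDH",
         "ACTB", "ACTB", "ACTB", "DDR1", "RFC2", "HSPA6", "PAX8",
         "GUCA1A", "UBE1L", "THRA", "PTPN21", "CCL5", "CYP2E1", "STAT1",
         "THRA", "PAX8")

# One integer code per symbol; codes follow the sorted factor levels
codes_sorted <- as.integer(factor(Vec))

# Alternative: codes follow the order of first appearance in Vec
codes_first <- match(Vec, unique(Vec))

# TRUE for every element whose symbol occurs more than once in Vec
is_dup <- duplicated(Vec) | duplicated(Vec, fromLast = TRUE)
```

All three calls are vectorized, so no explicit for loops are needed. Either set of codes has the same length as Vec and gives identical numbers to identical symbols.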

HTH,

Marc Schwartz