[R] [datatable-help] R beginner

jim holtman jholtman at gmail.com
Mon Mar 16 00:06:55 CET 2015


Here is how I processed the files you sent.  I used XLConnect to access
EXCEL.

> require(XLConnect)
> # I happen to use XLConnect for accessing EXCEL
> miRNA <- readWorksheetFromFile("/temp/miRNA.xls", 1)  # read in the
spreadsheets
> genes <- readWorksheetFromFile("/temp/gene.xls", 1)
> str(miRNA)  # see what the structures are so I know the names
'data.frame':   8 obs. of  2 variables:
 $ MiRNA.name    : chr  "mir256" "mir785" "mir415" "mir1245" ...
 $ position.start: num  458965 896 9689877 11525478 11689689 ...
> str(genes)
'data.frame':   10 obs. of  4 variables:
 $ gene.name    : chr  "AA" "AB" "AC" "AD" ...
 $ chromosome   : num  1 1 2 4 8 9 8 11 12 13
 $ positon.Start: num  251 4256 245226 449623 8214587 ...
 $ position.end : num  2586 8922 247899 490256 8358964 ...
>
> require(sqldf)
> # the use of backquotes (`) is because the names include a period (.)
that is special to SQL
> matches <- sqldf("
+     select m.*, g.*
+     from miRNA as m
+     join genes as g
+         on m.`position.start` between g.`positon.Start` and
g.`position.end`
+ ")
>
> matches
  MiRNA.name position.start gene.name chromosome positon.Start position.end
1     mir256         458965        AD          4        449623       490256
2     mir785            896        AA          1           251         2586
3     mir415        9689877        AF          9       9545878      9896698
4    mir1245       11525478        AH         11      11458789     11895251
5      mir96       11689689        AH         11      11458789     11895251
6     mir145           5890        AB          1          4256         8922
7     mir785       12247847        AI         12      12047896     12365478
8     mir895         246789        AC          2        245226       247899
>



Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Sun, Mar 15, 2015 at 6:29 PM, cyrille sage <
cyrille.laurent.sage at gmail.com> wrote:

> HI Jim
>
> thanks a lot for your answer
>
> I have some question
>
> I attached two excel files name gene and miRNA as an example of what kind
> of data I will have to deal with. this two files are just example.
>
> I was thinking that I will have first to load the file ine R using
>
> Library(gdata)
> gene<-read.xls("path:/gene.xls")
> mir<-read.xls("path:/miRNA.xls")
>
> so now I have two list (gene and miRNA) in R
>
> but I do not know how to call specifically the column (3 and 4) in the
> gene file and the column 2 for the MiRNA list for the comparison and to do
> the comparison according to the number of miRNA in my list.
>
> for example can I do this
>
> start<-sample(gene[3, ],length (gene [3]))
>
> I told you i am a biginner
>
> Papy
>
> On Sun, Mar 15, 2015 at 5:16 PM, jim holtman <jholtman at gmail.com> wrote:
>
>> You didn't provide any test data, so I made some up with the sizes you
>> gave.  This uses the 'sqldf' package and took about 2 minutes to come up
>> with the matches.
>>
>> > n <- 200000
>> > mi <- 4500
>> > start <- sample(n * 10, n)  # start times
>> > int <- sample(1000, n, TRUE)  # interval between start and end
>> > genes <- data.frame(gene = paste0('gene', 1:n)
>> +                 , start = start
>> +                 , end = start + int
>> +                 , stringsAsFactors = FALSE
>> +                 )
>> > miRNA <- data.frame(name = paste0('mi', 1:mi)
>> +                 , pos = sample(n * 9, mi)
>> +                 , stringsAsFactors = FALSE
>> +                 )
>> > require(sqldf)
>> Loading required package: sqldf
>> Loading required package: gsubfn
>> Loading required package: proto
>> Loading required package: RSQLite
>> Loading required package: DBI
>> > matches <- sqldf("
>> +     select m.*, g.*
>> +     from miRNA as m
>> +     join genes as g
>> +         on m.pos between g.start and g.end
>> + ")
>> Loading required package: tcltk
>> >
>> > str(matches)
>> 'data.frame':   225045 obs. of  5 variables:
>>  $ name : chr  "mi1" "mi1" "mi1" "mi1" ...
>>  $ pos  : int  279341 279341 279341 279341 279341 279341 279341 279341
>> 279341 279341 ...
>>  $ gene : chr  "gene3133" "gene14326" "gene14997" "gene17652" ...
>>  $ start: int  279000 278623 279157 279296 278379 279055 279180 279273
>> 278938 278960 ...
>>  $ end  : int  279924 279444 280150 279930 279347 279861 279782 280268
>> 279791 279796 ...
>> > head(matches)
>>   name    pos      gene  start    end
>> 1  mi1 279341  gene3133 279000 279924
>> 2  mi1 279341 gene14326 278623 279444
>> 3  mi1 279341 gene14997 279157 280150
>> 4  mi1 279341 gene17652 279296 279930
>> 5  mi1 279341 gene21208 278379 279347
>> 6  mi1 279341 gene30889 279055 279861
>>
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Sun, Mar 15, 2015 at 3:41 PM, Papysounours <
>> Cyrille.laurent.sage at gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am just starting R programming because i need it to analyse new
>>> sequencing
>>> data. I got two list of data (excel table) one is gene list with
>>> chromosomal
>>> position (like start:123456 end:124567), the other is miRNA list with
>>> only
>>> one position (like 123789).
>>>  In the first liste i have around 20000 row (meaning 20000 gene name to
>>> compare to) and for the second around 4500 row (4500 miRNA).
>>> I want to compare the position of each individual miRNA position (
>>> genestart<=miRNA<=geneend ) to the entire list of gene in order to get
>>> in a
>>> new table the name of the miRNA (first colum of the miRNA list) and the
>>> name
>>> of the gene  (first colum of the gene list) related to the miRNA.
>>> Hope thisis not to much to ask.
>>> Papy
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/R-beginner-tp4704684.html
>>> Sent from the datatable-help mailing list archive at Nabble.com.
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list