[R] help for fine mappting
greg holly
mak.hholly at gmail.com
Wed Jun 15 17:20:40 CEST 2016
dear all;
I am sorry for this posting. I have got help from Jim, Bert, Jeff and PIKAL
on similar issue before. I tried to modify Jim`s code to the real data but
it did not work. Now I am posting first two rows the imitation of real data
using dput() format (please see at the bottom). I have two data sets,
data=map and data=ref. The first to rows of each data set are given below.
Data map has more than 27 million and data ref has about 560 rows.
Basically I need run two different tasks. My R codes for these task are
given below but they do not work properly. I sincerely do appreciate your
helps.
Regards,
Greg
Task 1)
For example, the first and second columns for row 1 in data ref are chr1,
6457839 and 6638389. So I need write an R code normally first look the
first row in ref (which they are chre1 6457839 and 6638389) than summing
the column of "map$post_prob" and give the number of map$snp falls between
6457839 and 6638389 that their cumulative sum is >0.85. Then do the same
for the second, third....in ref. At the end I would like a table gave below
(need_ouput). Please notice the all value specified info in ref data file
are exist in map$CHR and map$POS columns.
Task2)
Again example, the first and second columns for row 1 in data ref are chr1,
6457839 and 6638389. So I need that R gives me the minimum map$p for the 2
chr1, 6457839 and 6638389 (as there are many snps between these regions and
would like choose the smallest one in those regions. Than do the same for
the second, third....rows in ref.
Then put the results of Task1 and Task2 into need_ouput file
#R codes modified from Jim
map2<-map[order(map$CHR, map$POS, -map$post_prob),]
# get a field for the counts
ref$n<-NA
# and a field for the minimum p values
ref$min_p<-NA
# get the number of rows in "ref"
nref<-dim(ref)[1]
for(i in 1:nref) {
CHR<- which(map2$CHR==ref$CHR[i])
POS_start<-which(map2$POS==ref$POS_start[i])
POS_end<-which(map2$POS==ref$POS_end[i])
cat("CHR", "CHR"," POS_start",POS_start,"POS_end",POS_end,"\n")
# get the range of matches
POSrange<-range(c(CHR,POS_start,POS_end))
# convert this to a sequence spanning all matches
allPOS<-POSrange[1]:POSrange[2]
ref$n[i]<-sum(map2$post_prob[allPOS] > 0.99)
ref$min_p[i]<-min(map2$p[allPOS])
}
dput(map)
structure(list(CHR = structure(c(1L, 1L), .Label = "chr1", class =
"factor"),
snp = structure(1:2, .Label = c("rs4747841", "rs4749917"), class
= "factor"),
Allel1 = structure(1:2, .Label = c("A", "T"), class = "factor"),
Allel2 = structure(c(2L, 1L), .Label = c("C", "G"), class =
"factor"),
fr = c(0.551, 0.436), effec = c(-0.0011, 0.0011), SE = c(0.0029,
0.0029), p = c(0.7, 0.7), POS = c(9960129L, 9960259L), post_prob
= c(1.248817e-158,
1.248817e-158)), .Names = c("CHR", "snp", "Allel1", "Allel2",
"fr", "effec", "SE", "p", "POS", "post_prob"), class = "data.frame",
row.names = c(NA,
-2L))
dput(ref)
structure(list(CHR = structure(1:2, .Label = c("chr10", "chr14"
), class = "factor"), POS_start = c(6457839L, 21005246L), POS_end =
c(6638389L,
21550658L)), .Names = c("CHR", "POS_start", "POS_end"), class =
"data.frame", row.names = c(NA,
-2L))
dput(need_output)
structure(list(CHR = structure(1:2, .Label = c("chr1", "chr22"
), class = "factor"), POS = c(312127953L, 46487552L), POS_start =
c(32036927L,
45766451L), POS_end = c(3232240262, 46801601), snp = structure(1:2, .Label
= c("rs1143427",
"rs55958907"), class = "factor"), alle1l = structure(1:2, .Label = c("G",
"T"), class = "factor"), allel2 = structure(1:2, .Label = c("A",
"G"), class = "factor"), fr = c(0.278, 0.974), effec = c(0.6,
0.106), SE = c(0.015, 0.027), P = c(0.000156, 7.63e-05), post_prob =
c(0.229,
0.125), n = c(612L, 4218L)), .Names = c("CHR", "POS", "POS_start",
"POS_end", "snp", "alle1l", "allel2", "fr", "effec", "SE", "P",
"post_prob", "n"), class = "data.frame", row.names = c(NA, -2L
))
[[alternative HTML version deleted]]
More information about the R-help
mailing list