[BioC] paper - download - pubmed
Nooshin Omranian
n_omranian at yahoo.com
Wed Jan 16 22:25:30 CET 2013
Hi Chris,
I'm very thankful for your clear explanation and pointers. I will come
back to you if I have more questions or run into problems with the code.
Many thanks!
Nooshin
On 1/16/2013 7:24 PM, Chris Stubben wrote:
> Sorry, not sure why references[1] were added automatically to html
> links within the code, but this reply should work if you copy and
> paste (I hope).
> Chris
>
> >>
> >> So, the problem is not that, for each paper I have to download the
> >> pdfs (which are available if I go to the pubmed and search directly
> >> there) and the corresponding supplementary files.
> >>
> Nooshin,
> You can download a pdf from PubMed Central if you have its PMC id:
> download.file(
> "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3446303/pdf",
> "PMC3446303.pdf", mode="wb")  # binary mode keeps the pdf intact on Windows
>
> However, NCBI clearly states that you may NOT use any kind of
> automated process to download articles in bulk from the main PMC site,
> so I would use the ftp site for Open Access articles (see
> http://www.ncbi.nlm.nih.gov/pmc/tools/ftp ). The ftp site also has
> the supplemental files included. First, read the list of available files:
>
> pmcftp <- read.delim(
> "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt" , skip=1,
> header=FALSE, stringsAsFactors=FALSE)
> nrow(pmcftp)
> [1] 552677
> names(pmcftp)<-c("dir", "citation", "id")
>
> Then match PMC ids and loop through the results to download and untar
> the files
>
> y <- subset(pmcftp, id %in% c("PMC3446303", "PMC3463124") )
> y
> 509377 75/e9/Genome_Biol_2012_Apr_24_13(4)_R29.tar.gz  Genome Biol. 2012 Apr 24; 13(4):R29  PMC3446303
> 514389 04/0f/Bioinformatics_2012_Oct_1_28(19)_2532-2533.tar.gz  Bioinformatics. 2012 Oct 1; 28(19):2532-2533  PMC3463124
>
> for( i in seq_len(nrow(y)) ){   # seq_len() skips the loop if y has no rows
>   destfile <- paste(y$id[i], ".tar.gz", sep="")
>   download.file( paste("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc", y$dir[i],
>     sep="/"), destfile, mode="wb" )   # binary transfer for the archive
>   untar( destfile, compressed=TRUE)
> }
>
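The download loop above can also be split into a "plan" step and a "fetch" step, which makes it easy to inspect the URLs before touching the network. This is only a minimal sketch: `pmc_download_plan` and `fetch_all` are hypothetical helper names that are not part of the original post or of any package, and the base URL is simply the one quoted above.

```r
# Hypothetical helper (not from the original post): build the FTP URL and the
# local file name for each matched row, without downloading anything.
pmc_download_plan <- function(files,
                              base = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc") {
  data.frame(
    url      = paste(base, files$dir, sep = "/"),
    destfile = paste(files$id, ".tar.gz", sep = ""),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: download and unpack each archive, skipping files that
# already exist locally and reporting errors instead of aborting the loop.
fetch_all <- function(plan) {
  for (i in seq_len(nrow(plan))) {
    if (file.exists(plan$destfile[i])) next
    tryCatch({
      download.file(plan$url[i], plan$destfile[i], mode = "wb")
      untar(plan$destfile[i])
    }, error = function(e) {
      message("Failed: ", plan$url[i], " (", conditionMessage(e), ")")
    })
  }
}

# The two rows shown in the post, rebuilt by hand so the example is
# self-contained (normally this would be the subset of pmcftp).
y <- data.frame(
  dir = c("75/e9/Genome_Biol_2012_Apr_24_13(4)_R29.tar.gz",
          "04/0f/Bioinformatics_2012_Oct_1_28(19)_2532-2533.tar.gz"),
  id  = c("PMC3446303", "PMC3463124"),
  stringsAsFactors = FALSE
)
plan <- pmc_download_plan(y)
# fetch_all(plan)  # uncomment to actually download and untar
```

Printing `plan` before calling `fetch_all(plan)` gives a quick check that the directory and id columns were matched as expected.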
> Also, if you need to get a list of PMC ids in R, I have a package
> called genomes on BioC that includes E-utility scripts. So something
> like this query would get the 49 pmc ids for articles with
> Bioconductor in the title.
>
> x2 <- esummary(esearch("bioconductor[TITLE] AND open access[FILTER]",
> db="pmc"), version="2.0")
>
> esummary uses a generic parser by default, so PMC ids are mashed
> together in a column with other ids:
> ids <- gsub(".*(PMC[0-9]*)", "\\1", x2$ArticleIds)
> y <- subset(pmcftp, id %in% ids)
>
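One thing to watch with the `gsub` approach above: it only works cleanly when the PMC id is the last item in each string, because any text after the id is kept, and strings without a PMC id are returned unchanged. A slightly more defensive sketch uses `regexpr` and `regmatches` from base R. The input strings below are made up for illustration, since the real layout of `x2$ArticleIds` is not shown in the post, and `extract_pmcid` is a hypothetical helper name.

```r
# Made-up examples of "mashed together" id strings; the actual format
# returned by esummary may differ.
article_ids <- c("12345678, 10.0000/example, PMC3446303",
                 "PMC3463124, 12345678",
                 "no pmc id here")

# Hypothetical helper: return the first PMC id found in each string,
# or NA when a string contains none.
extract_pmcid <- function(x) {
  out <- rep(NA_character_, length(x))
  m   <- regexpr("PMC[0-9]+", x)
  hit <- as.vector(m) != -1L        # regexpr returns -1 for no match
  out[hit] <- regmatches(x, m)      # regmatches returns matched elements only
  out
}

ids <- extract_pmcid(article_ids)
```

The resulting vector can be passed to `subset(pmcftp, id %in% ids)` as before, since `%in%` simply never matches the `NA` entries.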
> You could run esummary with parse=FALSE to get the XML results and
> parse them any way you like. Or even use esearch and set usehistory="n":
> ids2 <- paste("PMC", esearch("bioconductor[TITLE] AND open access[FILTER]",
> db="pmc", usehistory="n", retmax=100), sep="")
>
>