[R] Downloading a directory of text files into R
Rui Barradas
ruipbarradas at sapo.pt
Wed Jul 26 07:52:20 CEST 2023
On 25/07/2023 at 23:06, Bob Green wrote:
> Hello,
>
> I am seeking advice as to how I can download the 833 files from this
> site: "http://home.brisnet.org.au/~bgreen/Data/"
>
> I want to be able to download them to perform a textual analysis.
>
> If the 833 files, which are in a directory with two subfolders, were on
> my computer, I could read them with readtext. Using readtext I get the
> error:
>
> > x = readtext("http://home.brisnet.org.au/~bgreen/Data/*")
> Error in download_remote(file, ignore_missing, cache, verbosity) :
> Remote URL does not end in known extension. Please download the file
> manually.
>
> > x = readtext("http://home.brisnet.org.au/~bgreen/Data/Dir/()")
> Error in download_remote(file, ignore_missing, cache, verbosity) :
> Remote URL does not end in known extension. Please download the file
> manually.
>
> Any suggestions are appreciated.
>
> Bob
>
Hello,
The following code downloads all the files from the posted link.
suppressPackageStartupMessages({
  library(rvest)
})

# destination directory, change this at will
dest_dir <- "~/Temp"

# first get the two subfolders from the Data webpage
link <- "http://home.brisnet.org.au/~bgreen/Data/"
page <- read_html(link)

page %>%
  html_elements("a") %>%
  html_text() %>%
  grep("/$", ., value = TRUE) -> sub_folder
# create the relevant disk sub-directories,
# if they do not exist yet
for(subf in sub_folder) {
  d <- file.path(dest_dir, subf)
  if(!dir.exists(d)) {
    success <- dir.create(d)
    msg <- paste("created directory", d, "-", success)
    message(msg)
  }
}
# prepare to download the files
dest_dir <- file.path(dest_dir, sub_folder)
source_url <- paste0(link, sub_folder)

success <- mapply(\(src, dest) {
  # read each Data subfolder and get the file names therein,
  # then lapply 'download.file' to each filename
  pg <- read_html(src)
  pg %>%
    html_elements("a") %>%
    html_text() %>%
    grep("\\.txt$", ., value = TRUE) %>%
    lapply(\(x) {
      s <- paste0(src, x)
      d <- file.path(dest, x)
      tryCatch(
        download.file(url = s, destfile = d),
        warning = function(w) w,
        error = function(e) e
      )
    })
}, source_url, dest_dir)
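# note: because the per-folder results have different lengths, mapply()
# cannot simplify them and returns a list with one element per subfolder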
lengths(success)
# http://home.brisnet.org.au/~bgreen/Data/Hanson1/
# 84
# http://home.brisnet.org.au/~bgreen/Data/Hanson2/
# 749
# the total matches the number of files given in the question
sum(lengths(success))
# [1] 833
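
Once the files are on disk, they can be read with readtext as intended in
the original question. A minimal sketch, assuming the readtext package is
installed and the same dest_dir ("~/Temp") as above:

library(readtext)

# list all downloaded .txt files under both subfolders
txt_files <- list.files("~/Temp", pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)

# readtext() accepts a character vector of local file paths
x <- readtext(txt_files)
nrow(x)   # should be 833 if every download succeeded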
Hope this helps,
Rui Barradas