[Rd] read.csv

Reed A. Cartwright r@c@rtwr|ght @end|ng |rom gm@||@com
Tue Apr 16 20:21:49 CEST 2024


Spreadsheet software misinterpreting gene names is a classic issue in
bioinformatics, and read.csv is no different. It seems that every
practitioner runs into it sooner or later. For example:

https://pubmed.ncbi.nlm.nih.gov/15214961/

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

https://www.nature.com/articles/d41586-021-02211-4

https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
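
As a minimal sketch of one way to sidestep the guessing in base R, the
column can be declared as character through the colClasses argument of
read.csv(). The data below are the two rows from the quoted message,
inlined through the text argument; the names csv_text, guessed and
fixed are only illustrative. What the default guess returns for prot
may depend on the R version, so no output is claimed for it here.

```r
# Rows taken from the quoted message, inlined so the sketch is self-contained.
csv_text <- "Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73"

# Default behaviour: every column type is guessed from its contents.
guessed <- read.csv(text = csv_text)
class(guessed$prot)

# Naming the column in colClasses keeps it as text; the remaining
# columns are still guessed as before.
fixed <- read.csv(text = csv_text, colClasses = c(prot = "character"))
fixed$prot
```

With colClasses set this way, fixed$prot holds the strings "1433E" and
"1433E", while log10p is still read as numeric.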


On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao <jinghuazhao using hotmail.com> wrote:
>
> Dear R-developers,
>
> I came across a somewhat unexpected behaviour of read.csv() which is trivial but worth noting: my data involve a protein named "1433E", but to save space I dropped the quotes, so it becomes
>
> Gene,SNP,prot,log10p
> YWHAE,13:62129097_C_T,1433E,7.35
> YWHAE,4:72617557_T_TA,1433E,7.73
>
> Both read.csv() and readr::read_csv() treat the prot(ein) name as the number 1433 (possibly mistaking the trailing E for scientific notation), which I only noticed when I tried to combine the data:
>
> all_data <- data.frame()
> for (protein in proteins[1:7])
> {
>    cat(protein, ":\n")
>    f <- paste0(protein, ".csv")
>    if (file.exists(f))
>    {
>      p <- read.csv(f)  # column types are guessed here
>      print(p)
>      if (nrow(p) > 0) all_data <- dplyr::bind_rows(all_data, p)
>    }
> }
>
> proteins[1:7]
> [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
>
> dplyr::bind_rows() failed because of incompatible column types, whereas rbind() went ahead without warnings.
>
> Best wishes,
>
>
> Jing Hua
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
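
For the loop in the quoted message, a rough sketch of the same idea:
read every existing <protein>.csv with prot forced to character, then
stack the results in one step. The helper read_proteins() and its dir
argument are hypothetical names introduced for illustration, and the
1433B row written to the temporary directory is a made-up placeholder,
not real data.

```r
# Hypothetical helper (not from the thread): read each existing
# <protein>.csv with prot kept as character, then combine the tables.
read_proteins <- function(proteins, dir = ".") {
  files <- file.path(dir, paste0(proteins, ".csv"))
  files <- files[file.exists(files)]          # skip proteins without a file
  tables <- lapply(files, read.csv, colClasses = c(prot = "character"))
  tables <- Filter(function(p) nrow(p) > 0, tables)  # drop empty tables
  if (length(tables) == 0) return(data.frame())
  do.call(rbind, tables)
}

# Small demonstration in a temporary directory.
tmp <- tempfile("proteins")
dir.create(tmp)
header <- "Gene,SNP,prot,log10p"
# Rows from the quoted message.
writeLines(c(header,
             "YWHAE,13:62129097_C_T,1433E,7.35",
             "YWHAE,4:72617557_T_TA,1433E,7.73"),
           file.path(tmp, "1433E.csv"))
# Made-up placeholder row, for illustration only.
writeLines(c(header, "GENEX,1:1_A_G,1433B,8.00"),
           file.path(tmp, "1433B.csv"))

# "1433F" has no file here and is skipped.
all_data <- read_proteins(c("1433B", "1433E", "1433F"), dir = tmp)
str(all_data)
```

Because every table then has the same column types, rbind() and
dplyr::bind_rows() should both combine them without silent coercion
or a type error.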


