[R] dealing with a messy dataset

jim holtman jholtman at gmail.com
Thu Oct 5 16:49:35 CEST 2017


It looks like a fixed-width file.  I noted the column where each field
starts (plus one past the end of the last field), took the differences
to get the field widths, and read the data with the 'readr' package:

    > input <- "And XVIII          000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 6.65  -44  0.5 MESSIER031               0.6  1.54
    + PAndAS-03          000356.4+405319  0.10 17     0.00  -3.6 27.8 4.38                    2.8 MESSIER031               2.8  1.75
    + PAndAS-04          000442.9+472142  0.05 22     0.00  -6.6 23.1 5.59              -108  2.5 MESSIER031               2.5  1.75
    + PAndAS-05          000524.1+435535  0.06 31     0.00  -4.5 25.6 4.75               103  2.8 MESSIER031               2.8  1.75
    + ESO409-015         000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25   8.10  769 -2.0 NGC0024                 -1.5 -2.05
    + AGC748778          000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70   6.64  486 -1.9 NGC0253                 -1.5 -2.72
    + And XX             000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70        -182  2.4 MESSIER031               2.4  1.75"
    >
    > start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
    +            92, 114, 121, 127)
    > read_fwf(input, fwf_widths(diff(start)))
    # A tibble: 7 x 17
              X1              X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13   X14
           <chr>           <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int> <dbl>
    1  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8  6.44  6.78     <  6.65   -44   0.5
    2  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8  4.38    NA  <NA>    NA    NA   2.8
    3  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1  5.59    NA  <NA>    NA  -108   2.5
    4  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6  4.75    NA  <NA>    NA   103   2.8
    5 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1  8.10  8.25  <NA>  8.10   769  -2.0
    6  AGC748778 000634.4+153039  0.61    70     3     0 -10.4  24.9  6.39  5.70  <NA>  6.64   486  -1.9
    7     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1  5.26  5.70  <NA>    NA  -182   2.4
    # ... with 3 more variables: X15 <chr>, X16 <dbl>, X17 <dbl>
    >
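
For anyone without readr, here is a minimal base R sketch of the same
idea.  It uses two of the sample rows rebuilt at full width; the spacing
is reconstructed from the field widths above, so treat it as an
assumption about the original file.  count.fields() shows why splitting
on whitespace cannot work here, and read.fwf() then cuts by position
instead.  The names 'rows', 'tmp', 'n_fields' and 'galaxies' are only
illustrative.

```r
# Two sample rows at full width (no e-mail wrapping); spacing is an
# assumption reconstructed from the field widths used in this thread.
rows <- c(
  "And XVIII          000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 6.65  -44  0.5 MESSIER031               0.6  1.54",
  "PAndAS-03          000356.4+405319  0.10 17     0.00  -3.6 27.8 4.38                    2.8 MESSIER031               2.8  1.75"
)
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

# Whitespace splitting: the name "And XVIII" becomes two fields and
# blank fields disappear, so the rows end up with different lengths.
n_fields <- count.fields(tmp, sep = "")
print(n_fields)   # 18 12

# Column where each field starts, plus one past the end of the last field.
start  <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
            92, 114, 121, 127)
widths <- diff(start)   # 17 field widths

# read.fwf() is in package 'utils' (attached by default) and splits by
# position; strip.white trims the padding around each field.
galaxies <- read.fwf(tmp, widths = widths, strip.white = TRUE,
                     stringsAsFactors = FALSE, comment.char = "")

print(galaxies$V1)   # names kept whole, embedded space included
print(galaxies$V5)   # the blank field in the second row is read as NA
unlink(tmp)
```

Columns come back as V1 to V17 because the sample has no header; pass
col.names= to read.fwf() once you know what each field means.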


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Thu, Oct 5, 2017 at 10:12 AM, jean-philippe
<jeanphilippe.fontaine at gssi.infn.it> wrote:
> Dear R-users,
>
>
> I am facing a common, basic problem in dealing with datasets, but I have
> not found a satisfying answer so far.
> I have a messy dataset of galaxies like this:
>
> And XVIII          000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 6.65  -44  0.5 MESSIER031               0.6  1.54
> PAndAS-03          000356.4+405319  0.10 17     0.00  -3.6 27.8 4.38                    2.8 MESSIER031               2.8  1.75
> PAndAS-04          000442.9+472142  0.05 22     0.00  -6.6 23.1 5.59              -108  2.5 MESSIER031               2.5  1.75
> PAndAS-05          000524.1+435535  0.06 31     0.00  -4.5 25.6 4.75               103  2.8 MESSIER031               2.8  1.75
> ESO409-015         000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25   8.10  769 -2.0 NGC0024                 -1.5 -2.05
> AGC748778          000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70   6.64  486 -1.9 NGC0253                 -1.5 -2.72
> And XX             000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70        -182  2.4 MESSIER031               2.4  1.75
>
> I would like to read this dataset so that the space between "And" and
> "XVIII" is not treated as a column separator and the galaxy name stays in
> a single column.
> How can I do that?
>
> For instance, I tried
> data1 <- read.table("lvg_table2.txt", skip=70, fill=TRUE)
> where I used fill=TRUE because the rows don't have the same number of
> fields once R splits the galaxy names into 2 columns at the space.
>
>
> Best Regards, thanks in advance
>
>
> Jean-Philippe Fontaine
>
> --
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774
