[R] dealing with a messy dataset

Boris Steipe boris.steipe at utoronto.ca
Thu Oct 5 19:54:56 CEST 2017


Just for the record - and posterity: this is the Wrong way to go about defining a fixed width format and the strategy has a significant probability of corrupting data in ways that are hard to spot and hard to debug. If you _have_ the specification, then _use_ the specification.

Consider what you are actually doing here:

start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
           93, 114, 122, 127)
diff(start)  # produces a vector of assumed column widths
# [1] 19 15  6  3  4  5  6  5  6  6  2  5  5  5 21  8  5

First:
Jim's solution uses fwf_widths(), which is not correct in the first place since not all positions of the record are actually specified: position 19, 35, 41, 44, 48, 53, 59 etc. etc. are columns that are intentionally left blank to visually separate numbers in the file. fwf_widths() assumes there are no gaps in the data. Can you assume these blank columns are always " "? Maybe in your case you can get away with this, but in the general case that's a risky assumption to make. As you said, you only care about two columns anyway, and those happen to be correct - but that's all the more reason not to even touch the others.

Second:
Jim's solution is a visual approximation, based on a small number of sample rows you have provided. As the spec shows, this gets it wrong because your sample does not exhaust all possibilities and he has no way of knowing the semantics of the data. The spec has 23 columns of data, you are reading only 17. For example the second column in that approach reads 15 characters, and now contains RAh, RAm, RAs, DE-, DEd, DEm and DEs all rolled into one string.

Third:
When writing code like this, you end up with a whole line of "magic numbers". This is unmaintainable. I would require my students to write something like the following, which comments what the columns _mean_:

cNames <- character()
cStarts <- integer()
cEnds <- integer()

cNames[1]  <- "Name" # Galaxy name in well-known catalogs
cStarts[1] <- 1L
cEnds[1]   <- 18L

cNames[2]  <- "logMHI" # Log hydrogen mass
cStarts[2] <- 78L
cEnds[2]   <- 82L


and then read

myCatalogue <- read_fwf(input,
                        col_positions = fwf_positions(cStart, cEnd),
                        col_names = cNames)

It may take you longer to type this up, but the chance of getting it right is much higher, and most importantly, it's crystal clear, explicit, extensible, and your summer student will see exactly what's going on here.


I hope this is helpful, and it comes closer to "how to deal with fixed-width files".
Cheers,
 
Boris



> On Oct 5, 2017, at 12:46 PM, jean-philippe <jeanphilippe.fontaine at gssi.infn.it> wrote:
> 
> dear Jim,
> 
> 
> Yes I fixed the problem. Thanks again all of you for your contribution!
> This worked :
> 
> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
>           +            93, 114, 122, 127)
> data1<-read_fwf("lvg_table2.txt",skip=70, fwf_widths(diff(start)))
> 
> Well now I know how to deal with fixed-width files :)
> 
> 
> Cheers
> 
> 
> Jean-Philippe
> 
> On 05/10/2017 18:42, jim holtman wrote:
>> You should be able to use that header information to create the
>> correct parameters to the read_fwf function to read in the data.
>> 
>> Jim Holtman
>> Data Munger Guru
>> 
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>> 
>> 
>> On Thu, Oct 5, 2017 at 11:02 AM, jean-philippe
>> <jeanphilippe.fontaine at gssi.infn.it> wrote:
>>> dear Jim,
>>> 
>>> Thanks for your reply and your proposition.
>>> 
>>> I forgot to provide the header of the dataframe, here it is:
>>> ================================================================================
>>> Byte-by-byte Description of file: lvg_table2.dat
>>> --------------------------------------------------------------------------------
>>>    Bytes Format Units       Label   Explanations
>>> --------------------------------------------------------------------------------
>>>    1- 18 A18    ---         Name    Galaxy name in well-known catalogs
>>>   20- 21 I2     h           RAh     Hour of Right Ascension (J2000)
>>>   22- 23 I2     min         RAm     Minute of Right Ascension (J2000)
>>>   24- 27 F4.1   s           RAs     Second of Right Ascension (J2000)
>>>       28 A1     ---         DE-     Sign of the Declination (J2000)
>>>   29- 30 I2     deg         DEd     Degree of Declination (J2000)
>>>   31- 32 I2     arcmin      DEm     Arcminute of Declination (J2000)
>>>   33- 34 I2     arcsec      DEs     Arcsecond of Declination (J2000)
>>>   36- 40 F5.2   kpc         a26     ? Major linear diameter (1)
>>>   42- 43 I2     deg         inc     ? Inclination
>>>   45- 47 I3     km/s        Vm      ? Amplitude of rotational velocity (2)
>>>   49- 52 F4.2   mag         AB      ? Internal B band extinction (3)
>>>   54- 58 F5.1   mag         BMag    ? Absolute B band magnitude (4)
>>>   60- 63 F4.1   mag/arcsec2 SBB     ? Average B band surface brightness (5)
>>>   65- 69 F5.2   [solLum]    logKLum ? Log K_S_ band luminosity (6)
>>>   71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
>>>       77 A1     ---       l_logMHI  Limit flag on logMHI
>>>   78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
>>>   84- 87 I4     km/s        VLG     ? Radial velocity (9)
>>>   89- 92 F4.1   ---         Theta1  ? Tidal index (10)
>>>   94-116 A23    ---         MD      Main disturber name (11)
>>>  118-121 F4.1   ---         Theta5  ? Another tidal index (12)
>>>  123-127 F5.2   [-]         Thetaj  ? Log K band luminosity density (13)
>>> --------------------------------------------------------------------------------
>>> 
>>> The idea for me is to select only the galaxy name and the logMHI values for
>>> these galaxies, so quite a simple job when the dataset is tidy enough. I was
>>> thinking as usual to use select from dplyr.
>>> That is why I was just asking how to read this kind of files which, for me
>>> so far, are uncommon.
>>> 
>>> Doing what you propose, it formats most of the columns correctly except few
>>> ones, I will see how I can change some width to get it correctly:
>>> 
>>>           X1              X2    X3    X4    X5    X6    X7    X8 X9    X10
>>> X11   X12   X13   X14          X15   X16     X17
>>>        (chr)           (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)
>>> (chr) (int) (chr) (chr) (chr)        (chr) (dbl)   (chr)
>>> 1   UGC12894 000022.5+392944  2.78    33    21     0 -13.3  25.2 7.5 8  8.1
>>> 7   7.9 2  61 9 -1.    3 NGC7640    -1 0  0.12
>>> 2        WLM 000158.1-152740  3.25    90    22     0 -14.1 24.8 7.7 0 8.2
>>> 7   7.8 4  -1 6  0. 0 MESSIER031     0 2  1.75
>>> 3  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8 6.4 4  6.7
>>> 8 < 6.6 5  -4 4  0. 5 MESSIER031     0 6  1.54
>>> 4  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8 4.3      8
>>> NA    NA    NA    2. 8 MESSIER031     2 8  1.75
>>> 5  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1 5.5      9
>>> NA    NA   -10 8  2. 5 MESSIER031     2 5  1.75
>>> 6  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6 4.7      5
>>> NA    NA    10 3  2. 8 MESSIER031     2 8  1.75
>>> 7 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1 8.1 0  8.2
>>> 5   8.1 0  76 9 -2.    0 NGC0024    -1 5 -2.05
>>> 8  AGC748778 000634.4+153039 0.61 70     3     0 -10.4  24.9 6.3 9  5.7
>>> 0   6.6 4  48 6 -1.    9 NGC0253    -1 5 -2.72
>>> 9     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1 5.2 6  5.7
>>> 0    NA   -18 2  2. 4 MESSIER031     2 4  1.75
>>> 
>>> 
>>> Cheers, thanks again
>>> 
>>> 
>>> Jean-Philippe
>>> On 05/10/2017 16:49, jim holtman wrote:
>>>> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>>>>      +            92, 114, 121, 127)
>>>>      > read_fwf(input, fwf_widths(diff(start)))
>>> 
>>> --
>>> Jean-Philippe Fontaine
>>> PhD Student in Astroparticle Physics,
>>> Gran Sasso Science Institute (GSSI),
>>> Viale Francesco Crispi 7,
>>> 67100 L'Aquila, Italy
>>> Mobile: +393487128593, +33615653774
>>> 
> 
> -- 
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list