[R] dealing with a messy dataset

jean-philippe jeanphilippe.fontaine at gssi.infn.it
Thu Oct 5 18:46:18 CEST 2017


Dear Jim,


Yes, I fixed the problem. Thanks again to all of you for your contributions!
This worked:

library(readr)
# Start column of each field; diff(start) gives the field widths.
start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
           93, 114, 122, 127)
data1 <- read_fwf("lvg_table2.txt", skip = 70, fwf_widths(diff(start)))

Well, now I know how to deal with fixed-width files :)
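
For anyone finding this thread later, here is a minimal sketch of why the call works, assuming readr is installed. diff(start) converts consecutive start columns into field widths, so the last element of start only marks where the final field ends. The toy file, its two rows, and the column names Name and logMHI are invented for illustration and are not taken from lvg_table2.txt.

```r
library(readr)

# Start columns as used in the call above.
start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
           93, 114, 122, 127)

# diff() turns 18 start columns into 17 field widths.
widths <- diff(start)
length(widths)   # 17
widths[1:3]      # 19 15 6

# Toy fixed-width file (invented content): an 11-character name field
# followed by a 4-character numeric field.
toy <- tempfile(fileext = ".txt")
writeLines(sprintf("%-11s%4.2f", c("Galaxy A", "Galaxy B"), c(7.50, 8.25)), toy)

# Same mechanism on a small scale: start columns 1 and 12, end marker 16.
toy_start <- c(1, 12, 16)
toy_data <- read_fwf(toy,
                     fwf_widths(diff(toy_start), col_names = c("Name", "logMHI")),
                     col_types = "cd")
```

Passing col_names to fwf_widths() gives the columns meaningful names instead of X1, X2, and so on.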


Cheers,


Jean-Philippe

On 05/10/2017 18:42, jim holtman wrote:
> You should be able to use that header information to create the
> correct parameters to the read_fwf function to read in the data.
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Thu, Oct 5, 2017 at 11:02 AM, jean-philippe
> <jeanphilippe.fontaine at gssi.infn.it> wrote:
>> Dear Jim,
>>
>> Thanks for your reply and your suggestion.
>>
>> I forgot to provide the header of the data file; here it is:
>> ================================================================================
>> Byte-by-byte Description of file: lvg_table2.dat
>> --------------------------------------------------------------------------------
>>     Bytes Format Units       Label   Explanations
>> --------------------------------------------------------------------------------
>>     1- 18 A18    ---         Name    Galaxy name in well-known catalogs
>>    20- 21 I2     h           RAh     Hour of Right Ascension (J2000)
>>    22- 23 I2     min         RAm     Minute of Right Ascension (J2000)
>>    24- 27 F4.1   s           RAs     Second of Right Ascension (J2000)
>>        28 A1     ---         DE-     Sign of the Declination (J2000)
>>    29- 30 I2     deg         DEd     Degree of Declination (J2000)
>>    31- 32 I2     arcmin      DEm     Arcminute of Declination (J2000)
>>    33- 34 I2     arcsec      DEs     Arcsecond of Declination (J2000)
>>    36- 40 F5.2   kpc         a26     ? Major linear diameter (1)
>>    42- 43 I2     deg         inc     ? Inclination
>>    45- 47 I3     km/s        Vm      ? Amplitude of rotational velocity (2)
>>    49- 52 F4.2   mag         AB      ? Internal B band extinction (3)
>>    54- 58 F5.1   mag         BMag    ? Absolute B band magnitude (4)
>>    60- 63 F4.1   mag/arcsec2 SBB     ? Average B band surface brightness (5)
>>    65- 69 F5.2   [solLum]    logKLum ? Log K_S_ band luminosity (6)
>>    71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
>>        77 A1     ---       l_logMHI  Limit flag on logMHI
>>    78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
>>    84- 87 I4     km/s        VLG     ? Radial velocity (9)
>>    89- 92 F4.1   ---         Theta1  ? Tidal index (10)
>>    94-116 A23    ---         MD      Main disturber name (11)
>>   118-121 F4.1   ---         Theta5  ? Another tidal index (12)
>>   123-127 F5.2   [-]         Thetaj  ? Log K band luminosity density (13)
>> --------------------------------------------------------------------------------
>>
>> My goal is to select only the galaxy name and the logMHI values for
>> these galaxies, which is quite a simple job when the dataset is tidy
>> enough. I was planning, as usual, to use select from dplyr.
>> That is why I was asking how to read this kind of file, which is
>> uncommon for me so far.
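
Since only the galaxy name and logMHI are needed, one option is to read just those two fields by their byte ranges. This is a minimal sketch assuming readr is installed; the ranges 1-18 and 78-82 are copied from the byte-by-byte description quoted above, while make_line(), demo_file, and the two placeholder rows are hypothetical stand-ins for the real data file.

```r
library(readr)

# Byte ranges from the byte-by-byte description:
#   Name   = bytes  1-18
#   logMHI = bytes 78-82
cols <- fwf_positions(start = c(1, 78), end = c(18, 82),
                      col_names = c("Name", "logMHI"))

# Hypothetical helper: builds a 127-character line with a name in
# bytes 1-18, a value in bytes 78-82, and blanks everywhere else.
make_line <- function(name, logmhi) {
  sprintf("%-18s%59s%5.2f%45s", name, "", logmhi, "")
}

# Invented stand-in for the data file.
demo_file <- tempfile(fileext = ".txt")
writeLines(make_line(c("Galaxy A", "Galaxy B"), c(7.50, 8.25)), demo_file)

# Only the two requested fields are read; the bytes in between are ignored.
# For the real file, pass its path (and a suitable skip =) instead of demo_file.
galaxies <- read_fwf(demo_file, col_positions = cols, col_types = "cd")
```

Reading only the needed byte ranges makes a later select() unnecessary, and the other fields no longer have to be parsed correctly.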
>>
>> Your suggestion formats most of the columns correctly, except for a few.
>> I will see how I can change some widths to get them right:
>>
>>            X1              X2    X3    X4    X5    X6    X7    X8 X9    X10
>> X11   X12   X13   X14          X15   X16     X17
>>         (chr)           (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)
>> (chr) (int) (chr) (chr) (chr)        (chr) (dbl)   (chr)
>> 1   UGC12894 000022.5+392944  2.78    33    21     0 -13.3  25.2 7.5 8  8.1
>> 7   7.9 2  61 9 -1.    3 NGC7640    -1 0  0.12
>> 2        WLM 000158.1-152740  3.25    90    22     0 -14.1 24.8 7.7 0 8.2
>> 7   7.8 4  -1 6  0. 0 MESSIER031     0 2  1.75
>> 3  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8 6.4 4  6.7
>> 8 < 6.6 5  -4 4  0. 5 MESSIER031     0 6  1.54
>> 4  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8 4.3      8
>> NA    NA    NA    2. 8 MESSIER031     2 8  1.75
>> 5  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1 5.5      9
>> NA    NA   -10 8  2. 5 MESSIER031     2 5  1.75
>> 6  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6 4.7      5
>> NA    NA    10 3  2. 8 MESSIER031     2 8  1.75
>> 7 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1 8.1 0  8.2
>> 5   8.1 0  76 9 -2.    0 NGC0024    -1 5 -2.05
>> 8  AGC748778 000634.4+153039 0.61 70     3     0 -10.4  24.9 6.3 9  5.7
>> 0   6.6 4  48 6 -1.    9 NGC0253    -1 5 -2.72
>> 9     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1 5.2 6  5.7
>> 0    NA   -18 2  2. 4 MESSIER031     2 4  1.75
>>
>>
>> Cheers, thanks again
>>
>>
>> Jean-Philippe
>> On 05/10/2017 16:49, jim holtman wrote:
>>> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>>>            92, 114, 121, 127)
>>> read_fwf(input, fwf_widths(diff(start)))
>>
>> --
>> Jean-Philippe Fontaine
>> PhD Student in Astroparticle Physics,
>> Gran Sasso Science Institute (GSSI),
>> Viale Francesco Crispi 7,
>> 67100 L'Aquila, Italy
>> Mobile: +393487128593, +33615653774
>>



