[R] dealing with a messy dataset
Boris Steipe
boris.steipe at utoronto.ca
Thu Oct 5 17:10:38 CEST 2017
Since you have an authoritative description of the format, by all means use that - not a guess based on a visual inspection of where data appears in a sample row.
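A minimal sketch of that idea in base R, using only the byte ranges for Name, l_logMHI and logMHI from the description quoted below. The names `spec`, `read_columns` and `make_line` are mine and purely illustrative, and the two input lines are synthetic (made-up names and values placed at the documented positions), not rows from lvg_table2.dat.

```r
# Byte ranges copied from the "Byte-by-byte Description" quoted below.
spec <- data.frame(
  name  = c("Name", "l_logMHI", "logMHI"),
  start = c(1, 77, 78),
  end   = c(18, 77, 82),
  stringsAsFactors = FALSE
)

# Cut each documented field out of every line and strip the padding.
read_columns <- function(lines, spec) {
  cols <- lapply(seq_len(nrow(spec)), function(i) {
    trimws(substring(lines, spec$start[i], spec$end[i]))
  })
  names(cols) <- spec$name
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Hypothetical helper: builds a 127-character synthetic line with the
# three fields placed at their documented byte positions.
make_line <- function(name, flag, logMHI) {
  line <- strrep(" ", 127)
  substr(line, 1, 18)  <- sprintf("%-18s", name)
  substr(line, 77, 77) <- flag
  substr(line, 78, 82) <- sprintf("%5.2f", logMHI)
  line
}

# Synthetic input, for illustration only.
lines <- c(make_line("Galaxy A", " ", 7.92),
           make_line("Galaxy B", "<", 6.65))

result <- read_columns(lines, spec)
result$logMHI <- as.numeric(result$logMHI)
print(result)
```

For the real file, `lines` would come from something like readLines("lvg_table2.dat"), after dropping any header lines the file may carry. If you prefer to stay with readr, fwf_positions() accepts start and end vectors directly, so the same byte ranges can be passed to read_fwf().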
B.
> On Oct 5, 2017, at 11:02 AM, jean-philippe <jeanphilippe.fontaine at gssi.infn.it> wrote:
>
> Dear Jim,
>
> Thanks for your reply and your suggestion.
>
> I forgot to provide the header of the dataframe, here it is:
> ================================================================================
> Byte-by-byte Description of file: lvg_table2.dat
> --------------------------------------------------------------------------------
>   Bytes Format Units       Label    Explanations
> --------------------------------------------------------------------------------
>   1- 18 A18    ---         Name     Galaxy name in well-known catalogs
>  20- 21 I2     h           RAh      Hour of Right Ascension (J2000)
>  22- 23 I2     min         RAm      Minute of Right Ascension (J2000)
>  24- 27 F4.1   s           RAs      Second of Right Ascension (J2000)
>      28 A1     ---         DE-      Sign of the Declination (J2000)
>  29- 30 I2     deg         DEd      Degree of Declination (J2000)
>  31- 32 I2     arcmin      DEm      Arcminute of Declination (J2000)
>  33- 34 I2     arcsec      DEs      Arcsecond of Declination (J2000)
>  36- 40 F5.2   kpc         a26      ? Major linear diameter (1)
>  42- 43 I2     deg         inc      ? Inclination
>  45- 47 I3     km/s        Vm       ? Amplitude of rotational velocity (2)
>  49- 52 F4.2   mag         AB       ? Internal B band extinction (3)
>  54- 58 F5.1   mag         BMag     ? Absolute B band magnitude (4)
>  60- 63 F4.1   mag/arcsec2 SBB      ? Average B band surface brightness (5)
>  65- 69 F5.2   [solLum]    logKLum  ? Log K_S_ band luminosity (6)
>  71- 75 F5.2   [solMass]   logM26   ? Log mass within Holmberg radius (7)
>      77 A1     ---         l_logMHI Limit flag on logMHI
>  78- 82 F5.2   [solMass]   logMHI   ? Log hydrogen mass (8)
>  84- 87 I4     km/s        VLG      ? Radial velocity (9)
>  89- 92 F4.1   ---         Theta1   ? Tidal index (10)
>  94-116 A23    ---         MD       Main disturber name (11)
> 118-121 F4.1   ---         Theta5   ? Another tidal index (12)
> 123-127 F5.2   [-]         Thetaj   ? Log K band luminosity density (13)
> --------------------------------------------------------------------------------
>
> My goal is to select only the galaxy name and the logMHI value for each galaxy, which is a simple job once the dataset is tidy. As usual, I was planning to use select() from dplyr.
> That is why I was asking how to read this kind of file, which is new to me.
>
> With what you propose, most of the columns are parsed correctly, except for a few. I will see how to change some of the widths to get them right:
>
> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17
> (chr) (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr) (chr) (int) (chr) (chr) (chr) (chr) (dbl) (chr)
> 1 UGC12894 000022.5+392944 2.78 33 21 0 -13.3 25.2 7.5 8 8.1 7 7.9 2 61 9 -1. 3 NGC7640 -1 0 0.12
> 2 WLM 000158.1-152740 3.25 90 22 0 -14.1 24.8 7.7 0 8.2 7 7.8 4 -1 6 0. 0 MESSIER031 0 2 1.75
> 3 And XVIII 000214.5+450520 0.69 17 9 0 -8.7 26.8 6.4 4 6.7 8 < 6.6 5 -4 4 0. 5 MESSIER031 0 6 1.54
> 4 PAndAS-03 000356.4+405319 0.10 17 NA 0 -3.6 27.8 4.3 8 NA NA NA 2. 8 MESSIER031 2 8 1.75
> 5 PAndAS-04 000442.9+472142 0.05 22 NA 0 -6.6 23.1 5.5 9 NA NA -10 8 2. 5 MESSIER031 2 5 1.75
> 6 PAndAS-05 000524.1+435535 0.06 31 NA 0 -4.5 25.6 4.7 5 NA NA 10 3 2. 8 MESSIER031 2 8 1.75
> 7 ESO409-015 000531.8-280553 3.00 78 23 0 -14.6 24.1 8.1 0 8.2 5 8.1 0 76 9 -2. 0 NGC0024 -1 5 -2.05
> 8 AGC748778 000634.4+153039 0.61 70 3 0 -10.4 24.9 6.3 9 5.7 0 6.6 4 48 6 -1. 9 NGC0253 -1 5 -2.72
> 9 And XX 000730.7+350756 0.20 33 5 0 -5.8 27.1 5.2 6 5.7 0 NA -18 2 2. 4 MESSIER031 2 4 1.75
>
>
> Cheers, thanks again
>
>
> Jean-Philippe
> On 05/10/2017 16:49, jim holtman wrote:
>> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>> + 92, 114, 121, 127)
>> > read_fwf(input, fwf_widths(diff(start)))
>
> --
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774