[R] Eliminating 'Unprintable ASCII' characters
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Nov 25 09:26:08 CET 2009
I think you mean the control characters: there are other unprintable
characters (DEL, for example). The control characters are the range
[\001-\037]. E.g.
> test <- intToUtf8(1:40, multiple=TRUE)
> grepl("[\001-\037]", test)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE
If you want to include DEL, use "[\001-\037\177]". I have omitted NUL
(\000), which cannot occur in R character strings.
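The same character class can be handed to gsub() to delete the matches
rather than just detect them. What follows is only a minimal sketch: the
input vector x is made up for illustration, and it assumes an ordinary
(non-stateful) locale.

```r
# Hypothetical input: strings with embedded control characters and DEL
x <- c("ab\001c\037d", "tab\there", "clean", "del\177eted")

# Delete the control characters \001-\037 and DEL (\177)
cleaned <- gsub("[\001-\037\177]", "", x)
cleaned
```

Note that tab (\011), line feed (\012) and carriage return (\015) fall
inside that range and are removed too. If those should survive, a narrower
class such as "[\001-\010\013\014\016-\037\177]" is one possibility.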
You didn't give us the sessionInfo() output the posting guide asked
you for, so I am presuming you are not doing this in an unusual
locale: I wouldn't trust the regexp code in one of the stateful
locales used for Japanese.
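If the regexp code is in doubt, one possible alternative is to filter the
bytes directly. This is a rough sketch, not a tested recipe for stateful
locales: strip_control_bytes is a hypothetical helper, and it assumes
ASCII or UTF-8 input, where bytes 1-31 and 127 never occur inside a
multi-byte character. It does not handle NA and does not preserve any
declared Encoding() of the input.

```r
# Report the character-type locale the session is running in
Sys.getlocale("LC_CTYPE")

# Hypothetical helper: drop control bytes without using a regular expression.
# Assumes ASCII or UTF-8 input (an assumption, not something the thread states).
strip_control_bytes <- function(s) {
  vapply(s, function(one) {
    b <- as.integer(charToRaw(one))    # bytes of the string as integers
    keep <- b > 31L & b != 127L        # drop \001-\037 and DEL (\177)
    rawToChar(as.raw(b[keep]))
  }, character(1), USE.NAMES = FALSE)
}

stripped <- strip_control_bytes(c("ab\001c\037d", "del\177eted", ""))
stripped
```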
On Wed, 25 Nov 2009, Steven Kang wrote:
> Hi all,
>
> I have a csv file containing words with *UNPRINTABLE ASCII* characters
> (described in the following table).
>
> Is there any viable method for eliminating these characters?
>
> I realise that *EXTENDED ASCII* characters (i.e , ?, ?, ?, ? etc) can be
> removed or replaced via the *"gsub"* or *"gregexpr"* functions. But I am
> not certain about the *UNPRINTABLE ASCII* characters.
>
> Your help in resolving this problem would be highly appreciated.
>
> Thanks
>
> Steven
>
> ASCII control characters (character code 0-31)
> The first 32 characters in the ASCII table are unprintable control codes
> and are used to control peripherals such as printers.
>
> DEC  OCT  HEX  BIN       Symbol  Description
>   0  000  00   00000000  NUL     Null char
>   1  001  01   00000001  SOH     Start of Heading
>   2  002  02   00000010  STX     Start of Text
>   3  003  03   00000011  ETX     End of Text
>   4  004  04   00000100  EOT     End of Transmission
>   5  005  05   00000101  ENQ     Enquiry
>   6  006  06   00000110  ACK     Acknowledgment
>   7  007  07   00000111  BEL     Bell
>   8  010  08   00001000  BS      Back Space
>   9  011  09   00001001  HT      Horizontal Tab
>  10  012  0A   00001010  LF      Line Feed
>  11  013  0B   00001011  VT      Vertical Tab
>  12  014  0C   00001100  FF      Form Feed
>  13  015  0D   00001101  CR      Carriage Return
>  14  016  0E   00001110  SO      Shift Out / X-On
>  15  017  0F   00001111  SI      Shift In / X-Off
>  16  020  10   00010000  DLE     Data Line Escape
>  17  021  11   00010001  DC1     Device Control 1 (oft. XON)
>  18  022  12   00010010  DC2     Device Control 2
>  19  023  13   00010011  DC3     Device Control 3 (oft. XOFF)
>  20  024  14   00010100  DC4     Device Control 4
>  21  025  15   00010101  NAK     Negative Acknowledgement
>  22  026  16   00010110  SYN     Synchronous Idle
>  23  027  17   00010111  ETB     End of Transmit Block
>  24  030  18   00011000  CAN     Cancel
>  25  031  19   00011001  EM      End of Medium
>  26  032  1A   00011010  SUB     Substitute
>  27  033  1B   00011011  ESC     Escape
>  28  034  1C   00011100  FS      File Separator
>  29  035  1D   00011101  GS      Group Separator
>  30  036  1E   00011110  RS      Record Separator
>  31  037  1F   00011111  US      Unit Separator
>
> [[alternative HTML version deleted]]
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595