	     Unicode Conversion Module for Ruby
			version 0.4.0

		       Yoshida Masato



- Introduction

This module converts strings between ISO/IEC 10646 (Unicode)
encodings and Japanese encodings.

Supported character encodings are UCS-4, UTF-16, UTF-8,
EUC-JP, and CP932 (a variant of Shift_JIS for Japanese Windows).

This module cannot detect character encodings automatically.

Note that the EUC-JP conversion table has been changed.


- Install

This module works with ruby-1.4. I recommend using
ruby-1.4.2 or later.

Extract this package.

  cd ext
  gzip -dc < uconv-0.2.tar.gz | tar xvf -
  cd uconv

If you do not need EUC-JP or CP932 conversion, you can
undefine USE_EUC or USE_SJIS to reduce the size of this
module.

Then build and install as usual.
For example, when Ruby supports dynamic linking on your OS:

  ruby extconf.rb
  make
  make install


- Usage

If this module is not statically linked with Ruby, load it with

  require "uconv"

before using it.


- Module Function

  UTF-16 and UCS-4 strings must be little-endian unless you
  convert them with u16swap (u2swap) or u4swap.

  The functions that used to handle UCS-2 can now handle UTF-16.

  All ZERO WIDTH NO-BREAK SPACE characters (U+FEFF) are
  regarded as BYTE ORDER MARKs (BOM) and are deleted by some
  functions.

  The function matrix is as follows.

             |               dest
             |  EUC-JP    CP932     UTF-8    UTF-16    UCS-4
    ---------+------------------------------------------------
       EUC-JP|  \         -         euctou8  euctou16  -
    s  CP932 |  -         \         sjistou8 sjistou16 -
    r  UTF-8 |  u8toeuc   u8tosjis  \        u8tou16   u8tou4
    c  UTF-16|  u16toeuc  u16tosjis u16tou8  u16swap   u16tou4
       UCS-4 |  -         -         u4tou8   u4tou16   u4swap


  utf16 = Uconv.u16swap(utf16)
  ucs2 = Uconv.u2swap(ucs2)
  utf16 = Uconv.u16swap!(utf16)
  ucs2 = Uconv.u2swap!(ucs2)
    Byte-swap a UTF-16 string. A little-endian string is
    converted into a big-endian string.
    The bang functions modify the parameter string in place.

  ucs4 = Uconv.u4swap(ucs4)
  ucs4 = Uconv.u4swap!(ucs4)
    Byte-swap a UCS-4 string. A 1234-ordered string is
    converted into a 4321-ordered string.
    The bang function modifies the parameter string in place.
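
    As an illustration of what the byte swap does, here is a
    pure-Ruby sketch that needs no extension module. The helper
    names swap16 and swap32 are hypothetical, and this is not
    necessarily how the module implements u16swap and u4swap.

```ruby
# Swap the two bytes of every 16-bit unit (little-endian <-> big-endian).
def swap16(str)
  str.unpack("v*").pack("n*")
end

# Reverse the four bytes of every 32-bit unit (1234 order <-> 4321 order).
def swap32(str)
  str.unpack("V*").pack("N*")
end

utf16le = "\x42\x30".b            # U+3042 in little-endian UTF-16
utf16be = swap16(utf16le)         # the same character, big-endian
ucs4le  = "\x42\x30\x00\x00".b    # U+3042 in little-endian UCS-4
ucs4be  = swap32(ucs4le)          # the same character, 4321 order
```

    Applying either helper twice gives back the original bytes.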

  utf16 = Uconv.u8tou16(utf8)
  ucs2 = Uconv.u8tou2(utf8)
    Convert a UTF-8 string into a UTF-16 string. An
    illegal UTF-8 sequence raises an exception. A
    character outside the range U-00000000 to
    U-0010FFFF also raises an exception.
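
    For comparison, the same conversion can be sketched with the
    String#encode method built into Ruby 1.9 and later, which
    postdates this module. This is a stand-in and not Uconv
    itself. The exception class shown belongs to String#encode;
    this README does not say which exception Uconv raises.

```ruby
utf8  = "\u3042"                  # HIRAGANA LETTER A
utf16 = utf8.encode("UTF-16LE")   # little-endian, no BOM is added

# An illegal UTF-8 sequence makes String#encode raise an EncodingError.
illegal = "\xE3\x81".dup.force_encoding("UTF-8")   # truncated sequence
begin
  illegal.encode("UTF-16LE")
  raised = false
rescue EncodingError
  raised = true
end
```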

  utf8 = Uconv.u16tou8(utf16)
  utf8 = Uconv.u2tou8(ucs2)
    Convert a UTF-16 string into a UTF-8 string. ZWNBSPs
    (U+FEFF) are deleted. An illegal surrogate pair raises
    an exception.
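
    The U+FEFF deletion can be pictured with the following
    sketch. It uses Ruby's built-in String#encode (Ruby 1.9 and
    later) as a stand-in for Uconv, so it only mimics the
    behaviour described here.

```ruby
# Little-endian UTF-16 for U+FEFF followed by U+3042.
utf16 = "\xFF\xFE\x42\x30".b.force_encoding("UTF-16LE")

# String#encode keeps U+FEFF, so it is removed explicitly here to
# mimic the deletion described above.
utf8 = utf16.encode("UTF-8").delete("\uFEFF")
```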

  utf8 = Uconv.u4tou8(ucs4)
    Convert a UCS-4 string into a UTF-8 string. ZWNBSPs
    (U+FEFF) are deleted.

  ucs4 = Uconv.u8tou4(utf8)
    Convert a UTF-8 string into a UCS-4 string. An illegal
    UTF-8 sequence raises an exception.
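
    To show the little-endian UCS-4 layout these functions work
    with, here is a pure-Ruby sketch built on pack and unpack.
    The helper names utf8_to_ucs4le and ucs4le_to_utf8 are
    hypothetical, and this is not the module's own code.

```ruby
# UTF-8 -> little-endian UCS-4: one 32-bit unit per code point.
# String#unpack("U*") raises ArgumentError on malformed UTF-8.
def utf8_to_ucs4le(utf8)
  utf8.unpack("U*").pack("V*")
end

# Little-endian UCS-4 -> UTF-8, dropping U+FEFF as described for u4tou8.
def ucs4le_to_utf8(ucs4)
  ucs4.unpack("V*").reject { |cp| cp == 0xFEFF }.pack("U*")
end

ucs4 = utf8_to_ucs4le("A\u3042")
utf8 = ucs4le_to_utf8("\xFF\xFE\x00\x00".b + ucs4)   # BOM + "A" + U+3042
```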

  utf16 = Uconv.u4tou16(ucs4)
    Convert a UCS-4 string into a UTF-16 string. A
    character outside the range U-00000000 to
    U-0010FFFF raises an exception.

  ucs4 = Uconv.u16tou4(utf16)
    Convert a UTF-16 string into a UCS-4 string. An illegal
    surrogate pair raises an exception.
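
    The range check and the surrogate-pair check mentioned for
    u4tou16 and u16tou4 follow from the UTF-16 arithmetic, which
    can be sketched in pure Ruby as below. The helper names and
    the exception classes are hypothetical choices for this
    sketch, and surrogate code points passed to to_utf16_units
    are not rejected here.

```ruby
# Encode one code point as one or two UTF-16 code units.
def to_utf16_units(cp)
  raise RangeError, "out of range: #{cp}" if cp < 0 || cp > 0x10FFFF
  return [cp] if cp < 0x10000
  v = cp - 0x10000
  [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
end

# Decode UTF-16 code units into code points, rejecting bad pairs.
def from_utf16_units(units)
  out = []
  i = 0
  while i < units.length
    u = units[i]
    if (0xD800..0xDBFF).cover?(u)
      lo = units[i + 1]
      unless lo && (0xDC00..0xDFFF).cover?(lo)
        raise ArgumentError, "unpaired high surrogate"
      end
      out << 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
      i += 2
    elsif (0xDC00..0xDFFF).cover?(u)
      raise ArgumentError, "unpaired low surrogate"
    else
      out << u
      i += 1
    end
  end
  out
end

units = to_utf16_units(0x20000)   # a code point outside the BMP
cps   = from_utf16_units(units)
```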

  euc  = Uconv.u16toeuc(utf16)
  euc  = Uconv.u2toeuc(ucs2)
    Convert a UTF-16 string into an EUC-JP string. If the
    "Uconv.unknown_unicode_handler" function is not defined,
    a character that cannot be converted is replaced with '?'.
    ZWNBSPs (U+FEFF) are deleted.

  utf16 = Uconv.euctou16(euc)
  ucs2 = Uconv.euctou2(euc)
    Convert an EUC-JP string into a UTF-16 string.

  euc  = Uconv.u8toeuc(utf8)
    Convert a UTF-8 string into an EUC-JP string. This is
    equal to u16toeuc(u8tou16(utf8)).

  utf8 = Uconv.euctou8(euc)
    Convert an EUC-JP string into a UTF-8 string. This is
    equal to u16tou8(euctou16(euc)).
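
    The equivalence between the direct conversion and the route
    through UTF-16 can be illustrated with Ruby's built-in
    String#encode (Ruby 1.9 and later). This is a stand-in for
    Uconv, and its EUC-JP table is Ruby's own, which may differ
    from this module's table for some characters.

```ruby
utf8 = "\u3042"   # HIRAGANA LETTER A

direct  = utf8.encode("EUC-JP")                      # one step
via_u16 = utf8.encode("UTF-16LE").encode("EUC-JP")   # through UTF-16

back = direct.encode("UTF-8")                        # and back again
```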

  sjis  = Uconv.u16tosjis(utf16)
  sjis  = Uconv.u2tosjis(ucs2)
    Convert a UTF-16 string into a CP932 string. If the
    "Uconv.unknown_unicode_handler" function is not defined,
    a character that cannot be converted is replaced with '?'.
    ZWNBSPs (U+FEFF) are deleted.

  utf16 = Uconv.sjistou16(sjis)
  ucs2 = Uconv.sjistou2(sjis)
    Convert a CP932 string into a UTF-16 string.

  sjis  = Uconv.u8tosjis(utf8)
    Convert a UTF-8 string into a CP932 string. This is
    equal to u16tosjis(u8tou16(utf8)).

  utf8 = Uconv.sjistou8(sjis)
    Convert a CP932 string into a UTF-8 string. This is
    equal to u16tou8(sjistou16(sjis)).
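
    Because CP932 is a variant of Shift_JIS, the same bytes can
    map to different Unicode characters depending on the table.
    The sketch below shows this with Ruby's built-in encodings
    (Ruby 1.9 and later), where CP932 is named "Windows-31J".
    These are Ruby's tables; this README does not state which
    mapping this module's CP932 table uses for these bytes.

```ruby
bytes = "\x81\x60".b

# The same two bytes decoded with the two tables built into Ruby.
as_cp932 = bytes.dup.force_encoding("Windows-31J").encode("UTF-8")
as_sjis  = bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")

# An ordinary character encodes to the same bytes in this case.
hiragana_a = "\u3042".encode("Windows-31J")
```

    In Ruby's tables the first result is FULLWIDTH TILDE
    (U+FF5E) and the second is WAVE DASH (U+301C).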
 
  euc = Uconv.unknown_unicode_handler(unicode)
    When a UTF-16 or UTF-8 string is converted into an EUC-JP
    or CP932 string, this function is called if a character
    that cannot be converted is detected. The parameter is a
    Unicode character code as an integer. You must return a
    string. This function is not defined initially.
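
    A handler might look like the sketch below. Defining it as a
    singleton method on Uconv is an assumption based on the
    wording "is not defined" above, and the replacement format
    is only an example. The rescue branch creates an empty
    stand-in module so that the sketch also loads where the
    extension is not installed.

```ruby
begin
  require "uconv"
rescue LoadError
  # Stand-in only; the real module comes from the extension.
  module Uconv; end
end

# Called with the Unicode code point (an Integer) of a character that
# has no EUC-JP or CP932 mapping. Must return a String.
def Uconv.unknown_unicode_handler(unicode)
  format("&#x%04X;", unicode)
end

replacement = Uconv.unknown_unicode_handler(0x20AC)   # EURO SIGN
```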

  unicode = Uconv.unknown_euc_handler(euc)
    When an EUC-JP string is converted into a UTF-16 or UTF-8
    string, this function is called if a character that is
    undefined in JIS X 0208 or JIS X 0212 is detected.
    The parameter is an EUC-JP string (2 bytes or 3 bytes).
    You must return a Unicode value as an integer (0-65535).

  unicode = Uconv.unknown_sjis_handler(sjis)
    When a CP932 string is converted into a UTF-16 or UTF-8
    string, this function is called if a character that is
    undefined in CP932 is detected. The parameter is a
    CP932 string (1 byte or 2 bytes).
    You must return a Unicode value as an integer (0-65535).
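
    The two handlers for undefined input characters could be
    sketched as follows. Defining them as singleton methods on
    Uconv is an assumption, and returning GETA MARK (U+3013) is
    only one possible choice of placeholder. The rescue branch
    creates an empty stand-in module so that the sketch also
    loads where the extension is not installed.

```ruby
begin
  require "uconv"
rescue LoadError
  # Stand-in only; the real module comes from the extension.
  module Uconv; end
end

# Called with the undefined EUC-JP byte sequence (2 or 3 bytes).
# Must return a Unicode value as an Integer in 0..65535.
def Uconv.unknown_euc_handler(euc)
  0x3013   # GETA MARK
end

# Called with the undefined CP932 byte sequence (1 or 2 bytes).
def Uconv.unknown_sjis_handler(sjis)
  0x3013   # GETA MARK
end

from_euc  = Uconv.unknown_euc_handler("\xA9\xA1".b)
from_sjis = Uconv.unknown_sjis_handler("\x85\x40".b)
```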

When you manipulate a UTF-16 or UTF-8 string as a binary
string, you may need to set $KCODE="NONE".


- Copying

This extension module is copyrighted free software by
Yoshida Masato.

You can redistribute it and/or modify it under the same terms
as Ruby.


- Author

 Yoshida Masato <yoshidam@inse.co.jp>, <yoshidam@yoshidam.net>


- History

 Nov  5, 1999 version 0.4.0 Support CP932
 Mar 29, 1999 version 0.3.1 Remove xmallocs
 Feb 22, 1999 version 0.3.0 Support UCS-4 and UTF-16
 Jan 13, 1999 version 0.2.2 Support Japanese supplement characters
 Aug 15, 1998 version 0.2.1 Append this README file
 Jul 24, 1998 version 0.2
 Jul  8, 1998 version 0.1
