UTF-9 - Note

A standard called UTF-9 was published in 2005. I had no idea.
http://www.ietf.org/rfc/rfc4042.txt
http://nikki.hio.jp/files/2005/rfc4042.txt
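In UTF-8 a CJK character ends up taking 3*8 = 24 bits, but UTF-9 squeezes it into 2*9 = 18 bits. Neat.
The catch is that handling it means working in 9-bit units. That is fine on a 36-bit machine with 9-bit bytes, which is the kind of hardware the RFC has in mind, but on a CPU where a byte is 8 bits, storing each nonet in a 16-bit unit is far too wasteful.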
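To make those numbers concrete, here is a minimal sketch of the arithmetic. It is not taken from the RFC: `Size_Compare`, `UTF8_Octets`, and `UTF9_Nonets` are names made up for illustration, and U+65E5 is just one sample CJK ideograph. It assumes the usual UTF-8 length ranges and UTF-9's rule of one nonet per significant octet of the code point.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Size_Compare is

   --  Octets UTF-8 needs for a code point (standard UTF-8 ranges).
   function UTF8_Octets (Code : Natural) return Positive is
   begin
      if Code < 16#80# then
         return 1;
      elsif Code < 16#800# then
         return 2;
      elsif Code < 16#1_0000# then
         return 3;
      else
         return 4;
      end if;
   end UTF8_Octets;

   --  Nonets UTF-9 needs: one per significant octet of the code point.
   function UTF9_Nonets (Code : Natural) return Positive is
   begin
      if Code < 16#100# then
         return 1;
      elsif Code < 16#1_0000# then
         return 2;
      elsif Code < 16#100_0000# then
         return 3;
      else
         return 4;
      end if;
   end UTF9_Nonets;

   Code : constant Natural := 16#65E5#;  --  U+65E5, a sample CJK ideograph

begin
   Put_Line ("UTF-8 bits:" & Integer'Image (8 * UTF8_Octets (Code)));
   Put_Line ("UTF-9 bits, packed:" & Integer'Image (9 * UTF9_Nonets (Code)));
   --  What it costs if each nonet is padded out to a 16-bit unit instead.
   Put_Line ("UTF-9 bits, one nonet per 16-bit unit:"
             & Integer'Image (16 * UTF9_Nonets (Code)));
end Size_Compare;
```

For this code point it should report 24, 18, and 32 bits respectively, which is why padding nonets to 16 bits throws away the whole advantage.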
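This is unquestionably a job for Ada, which (now that D has recently dropped its bit type) is probably the only language in the world*1 that lets you allocate variables freely at bit granularity. So I ported the sample conversion code from the RFC.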

package UTF9 is

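   --  One nonet: 8 data bits plus a continuation flag (8#400#), so 0 .. 8#777#.
   type UTF9_Character is new Wide_Wide_Character
      range Wide_Wide_Character'Val (0) .. Wide_Wide_Character'Val (8#777#);
   for UTF9_Character'Size use 9;

   --  Nonets packed back to back, 9 bits per element.
   type UTF9_String is array (Positive range <>) of UTF9_Character;
   for UTF9_String'Component_Size use 9;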

   function UTF9_to_UCS4 (Single_Character : UTF9_String)
      return Wide_Wide_Character;
   function UCS4_to_UTF9 (Single_Character : Wide_Wide_Character)
      return UTF9_String;

end UTF9;

with Interfaces; use Interfaces;
package body UTF9 is

   function UTF9_to_UCS4 (Single_Character : UTF9_String)
      return Wide_Wide_Character
   is
      I : Natural := Single_Character'First;
      nonet : Unsigned_32 := UTF9_Character'Pos (Single_Character (I));
      ucs4 : Unsigned_32 := nonet and 8#377#;
   begin
      while (nonet and 8#400#) /= 0 loop
         I := I + 1;
         nonet := UTF9_Character'Pos (Single_Character (I));
         ucs4 := Shift_Left (ucs4, 8) or (nonet and 8#377#);
      end loop;
      return Wide_Wide_Character'Val (ucs4);
   end UTF9_to_UCS4;

   function UCS4_to_UTF9 (Single_Character : Wide_Wide_Character)
      return UTF9_String
   is
      ucs4 : Unsigned_32 := Wide_Wide_Character'Pos (Single_Character);
      Result : UTF9_String (1 .. 4);
      I : Natural := 1;
   begin
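      --  >= rather than >: a code point exactly on a boundary (e.g. U+0100)
      --  still needs the extra leading nonet.
      if ucs4 >= 8#400# then
         if ucs4 >= 8#200000# then
            if ucs4 >= 8#100000000# then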
               Result (I) := UTF9_Character'Val
                  (8#400# or (Shift_Right (ucs4, 24) and 8#377#));
               I := I + 1;
            end if;
            Result (I) := UTF9_Character'Val
               (8#400# or (Shift_Right (ucs4, 16) and 8#377#));
            I := I + 1;
         end if;
         Result (I) := UTF9_Character'Val
            (8#400# or (Shift_Right (ucs4, 8) and 8#377#));
         I := I + 1;
      end if;
      Result (I) := UTF9_Character'Val (ucs4 and 8#377#);
      return Result (1 .. I);
   end UCS4_to_UTF9;

end UTF9;

A test:

with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;
with UTF9; use UTF9;
procedure Test is
   --  GOTHIC LETTER AHSA
   subtype UTF9_3 is UTF9_String (1 .. 3);
   S : UTF9_3 := UCS4_to_UTF9 (Wide_Wide_Character'Val (16#10330#));
begin
   for I in S'Range loop
      Put (UTF9_Character'Pos (S (I)), Base => 8);
   end loop;
   New_Line;
   Put (UTF9_3'Size);
   New_Line;
end Test;

...>test
     8#401#     8#403#      8#60# --  matches the example given in the RFC
         27 --  3 nonets occupy 27 bits

Yep, perfect. I never thought the day would come when I'd actually be using octal.
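Another quick sanity check is a round trip: encode a code point, decode the result, and compare. The sketch below assumes the `UTF9` package listed earlier is compiled alongside it. `Round_Trip` and the three sample code points are my own choices for illustration, not anything from the RFC.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with UTF9; use UTF9;

procedure Round_Trip is
   --  Illustrative samples: LATIN CAPITAL LETTER A, a CJK ideograph,
   --  and GOTHIC LETTER AHSA from the test above.
   Samples : constant array (1 .. 3) of Wide_Wide_Character :=
     (Wide_Wide_Character'Val (16#41#),
      Wide_Wide_Character'Val (16#65E5#),
      Wide_Wide_Character'Val (16#10330#));
begin
   for I in Samples'Range loop
      declare
         --  Encode to nonets, then decode straight back.
         Encoded : constant UTF9_String := UCS4_to_UTF9 (Samples (I));
         Decoded : constant Wide_Wide_Character := UTF9_to_UCS4 (Encoded);
      begin
         Put ("Code point" & Integer'Image (Wide_Wide_Character'Pos (Samples (I)))
              & " ->" & Integer'Image (Encoded'Length) & " nonet(s): ");
         if Decoded = Samples (I) then
            Put_Line ("round trip OK");
         else
            Put_Line ("MISMATCH");
         end if;
      end;
   end loop;
end Round_Trip;
```

The three samples should come out as 1, 2, and 3 nonets, so each branch of the encoder except the 4-nonet one gets exercised.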
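As you can see, Ada can handle even arrays of 9-bit elements. If UTF-9 ever comes into general use as a character encoding, then on platforms where a byte is 8 bits, the gap between Ada and other languages in both efficiency and ease of use, across every kind of string processing, will be obvious. So, in preparation for the coming spread of UTF-9, let's all use Ada.
Incidentally, I'm told you are not supposed to look at the date on the spec.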

*1: I'm making this up. There are probably plenty of obscure languages that can do the same thing.