REBOL [
    Title: "UTF-8"
    Date: 2-Dec-2002
    Name: "UTF-8"
    Version: 1.0.2
    File: %utf-8.r
    Author: "Jan Skibinski"
    Needs: [%hof.r]
    Purpose: {Encoding and decoding of UCS strings
to and from UTF-8 strings.
}
    History: {
    Version 1.0.2: Added a simulation of Unicode support.
    Version 1.0.1:
        A full range of optimizations has been applied,
        resulting in much improved speed. The entire scheme
        has been redesigned and the algorithms simplified.
        Most of the data tables have been removed.
        There are fewer functions but they are more generic.

    Version 1.0.0:
        Basic UTF-8 encoding and decoding functions.
        Limitations: Does not handle big/little-endian
        signatures (byte-order marks) yet. Needs thorough testing
        and algorithm optimizations.
    }
    library: [
        level: 'intermediate
        platform: 'all
        type: 'Tool
        domain: 'text
        tested-under: none
        support: none
        license: none
        see-also: none
    ]
    Email: jan.skibinski@sympatico.ca
    Category: [crypt 4]
    Acknowledgments: {
        Inspired by the script 'utf8-encode.r of RebOldes
        and Romano Paulo Tenca, which encodes Latin-1 strings.

        I'd like to thank RebOldes for his suggested improvements
        to version 1.0.0. However, the completely redesigned
        version 1.0.1 was already under way, so his changes did not
        make it here.

        My thanks also go to Romano Paulo Tenca for converting
        my 'while loop into a 'parse loop and for a number of other
        optimization tricks, such as inlining and "precompilation"
        of the fetch functions.
    }
]
comment {

    UCS means: Universal Character Set (or Unicode)
    UCS-2 means: 2-byte representation of a character in UCS.
    UCS-4 means: 4-byte representation of a character in UCS.
    UTF-8 means: UCS Transformation Format using 8-bit octets.


    The following excerpt from:
        UTF-8 and Unicode FAQ for Unix/Linux, by Markus Kuhn
        http://www.cl.cam.ac.uk/~mgk25/unicode.html
    provides motivations for using UTF-8.

    <>

    A copy of the aforementioned Annex D can be found on Markus Kuhn's
    site: http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html.
    The encoding and decoding functions implemented here are based
    on the algorithm descriptions found in that Annex D.

    Testing: The page http://www.cl.cam.ac.uk/~mgk25/unicode.html
    has many pointers to a variety of test data. One of them
    is the UTF-8 sampler from the Kermit pages at Columbia University,
    http://www.columbia.edu/kermit/utf8.html, where the
    phrase "I can eat glass and it doesn't hurt me." is
    rendered in dozens of world languages.

}

comment {
------------------------------------------------------------
SUMMARY of script UTF-8.R
------------------------------------------------------------
encode      (integer -> string -> string)
decode      (integer -> string -> string)
to-ucs      (integer -> string -> string)

}
; do %/e/rebol/view/public/www.reboltech.com/library/scripts/hof.r
; do %/e/rebol/view/public/www.reboltech.com/library/scripts/utf-8.r
; x: "chars: ìšèøžýáíé"
; y: encode 1 x
; t0: now/time/precise loop 1000 [decode 1 y] to decimal! now/time/precise - t0

    comment
    {
        A table of the three fetch functions for cases k = 1, 2, 4,
        where k is the number of octets to fetch.
        Each function reads k octets and returns a single character
        where possible; otherwise it computes the integer value
        represented by those octets.
        Case 3 is just a dummy placeholder.
    }
    fetch: reduce [
        ; k = 1: return the single octet itself
        func [u] [first u]
        ; k = 2: a plain character when the high octet is zero,
        ; otherwise the integer value of both octets
        func [u] [
            either 0 < first u [
                0 + (second u) + (256 * first u)][second u]]
        ; k = 3: dummy placeholder
        3
        ; k = 4: a plain character when the three high octets are zero,
        ; otherwise the integer value of all four octets
        func [u /local z] [
            z: 16777216 * first u
                 + (65536 * second u)
                 + (256 * third u)
            either 0 < z [z + fourth u][fourth u]
        ]
    ]
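
    comment {
        Usage sketch (illustration only): 'fetch is indexed by the
        octet width k, so 'pick selects the matching reader.
    }
    ; f: pick fetch 2
    ; f "^@A"          ; == #"A"   (high octet is zero, so a plain character)
    ; f "^(04)^(10)"   ; == 1040   (U+0410, CYRILLIC CAPITAL LETTER A)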

    comment
    {
        Data used by both the 'encode and 'decode functions.
        Every non-zero element here has the property that
        its index within the block equals the count of its
        consecutive most-significant 1-bits.
        That count tells the decoder how many octets
        to use when decoding a character.
    }
    udata: [0 192 224 240 248 252]
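
    comment {
        A quick check of the property above (illustration only):
        the binary form of each non-zero element starts with as many
        1-bits as its 1-based index in 'udata.
    }
    ; repeat i 6 [print [i enbase/base to binary! to char! pick udata i 2]]
    ; 1 00000000
    ; 2 11000000
    ; 3 11100000
    ; ... up to 6 11111100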


    encode: func [
        {
        Encode a string of k-wide characters into a UTF-8 string,
        where k = 1, 2 or 4.
            (integer -> string -> string)
        }
        k [integer!]
        ucs [string!]
        /local c f x result [string!]
    ][
        result: make string! length? ucs
        f: pick fetch k
        parse/all ucs [any [c: k skip (
            either 128 > x: f c [
                insert tail result x
            ][
                either x < 256 [
                    insert insert tail result x / 64 or 192 x and 63 or 128
                ][
                    result: tail result
                    until [
                        insert result to char! x and 63 or 128
                        128 > x: x and -64 / 64
                    ]
                    insert result to char! x or pick udata 1 + length? result
                ]
            ]
        )]]
        head result
    ]
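
    comment {
        Usage sketch (illustration only): encode a Latin-1 string held
        as 1-octet characters; characters above 127 grow to two octets.
    }
    ; utf: encode 1 "café"
    ; length? "café"    ; == 4
    ; length? utf       ; == 5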


    decode: func [
        {
        Decode a UTF-8 encoded string into a UCS-k string,
        where k = 1, 2 or 4.
        Encoded strings that originated from Latin-1
        can be decoded with k = 1, 2, or 4.
        Encoded Latin-m (m > 1) strings can be
        decoded with k = 2 or k = 4,
        but not with k = 1.
        }
        k [integer!]
        xs [string!]
        /local m x c result [string!]
    ][
        result: make string! (length? xs) * k
        while [not tail? xs][
            x: first xs
            either x < 128 [
                insert insert/dup tail result #"^@" k - 1 x
            ][
                m: 8 - length? find enbase/base to binary! x 2 #"0"
                x: x xor pick udata m
                loop m - 1 [x: 64 * x + (63 and first xs: next xs)]
                result: tail result
                loop k - 1 [
                    insert result to char! x and 255
                    x: x and -256 / 256
                ]
                insert result to char! x
            ]
            xs: next xs
        ]
        head result
    ]
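
    comment {
        Usage sketch (illustration only): 'decode reverses 'encode, and
        a wider k pads each decoded character with leading null octets.
    }
    ; utf: encode 1 "café"
    ; decode 1 utf          ; == "café"
    ; length? decode 2 utf  ; == 8   (four characters, two octets each)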


    to-ucs: func [
        {
        Convert an 'ansi string to a string
        of wide characters: 1, 2, or 4 octets per
        character.
        This is an auxiliary function, just for testing.

        Note that when a UTF-8 encoded string is already
        available, this function is no longer needed, because
        such a string can be converted to a UCS-k string
        by simply invoking
            decode k utf-8-string

        The condition is k >= k-minimum. Hence
        any Latin-1 encoded string can be decoded to UCS-1,
        UCS-2 or UCS-4. Similarly, any Latin-2 encoded
        string can be decoded to either UCS-2 or UCS-4,
        but not to UCS-1.
        }
        k [integer!]
        ansi [string!]
        /local c result [string!]
    ][
        either k > 1 [
            result: make string! (length? ansi) * k
            parse/all ansi [any [c:  skip (
                insert insert/dup tail result #"^@" (k - 1) first c
            )]]
        ][
            result: copy ansi
        ]
        result
    ]
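
    comment {
        Usage sketch (illustration only): widening puts k - 1 null
        octets in front of every character.
    }
    ; to-ucs 2 "AB"   ; == "^@A^@B"
    ; to-ucs 1 "AB"   ; == "AB"   (just a copy)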



    unicode-dir: %/e/rebol/unicode/

    comment {
        Some Microsoft Windows codepages:
    }
    charset-windows: map (func[u][rejoin [unicode-dir u]]) [
        %CP1252.TXT     ; Western Europe
        %CP1250.TXT     ; Central Europe
        ;%CP1257.TXT     ; Baltic
        %CP1251.TXT     ; Cyrillic
        ;%CP1253.TXT     ; Greek
        ;%CP1254.TXT     ; Turkish
        ;%CP1255.TXT     ; Hebrew
        ;%CP1256.TXT     ; Arabic
        ;%CP1258.TXT     ; Viet Nam
        ]

    comment {
        ISO-8859 character sets
    }
    charset-iso: map (func[u][rejoin [unicode-dir u]]) [
        %8859-1.TXT %8859-2.TXT %8859-3.TXT %8859-4.TXT %8859-5.TXT
        %8859-6.TXT %8859-7.TXT %8859-8.TXT %8859-9.TXT %8859-10.TXT
        %8859-11.TXT %8859-13.TXT %8859-14.TXT %8859-15.TXT %8859-16.TXT]


    comment {
        If the standard 'debase does not work or crashes,
        use this replacement.
    debase-16: func[
        x
    ][
        head insert tail insert x "16#{" "}"
    ]
    }


    cross-maps: func [
        {
        A sorted union of the cross-maps of all mapping files 'xs
        (such as 'charset-windows or 'charset-iso), flattened
        into a hash of [unicode-code charset-code] pairs.
        }
        xs
        /local zs result
    ][
        zs: sort foldl1 :union (map :cross-map xs)
        result: make block! 2 * (length? zs)
        while [not tail? zs][
            insert tail result first zs
            zs: next zs
        ]
        to-hash result
    ]
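
    comment {
        Usage sketch (illustration only; assumes the mapping files
        listed in 'charset-windows have been downloaded from
        http://www.unicode.org/Public/MAPPINGS/ into 'unicode-dir,
        and that %hof.r supplies 'map, 'foldl1 and 'to-hash):
        build one flat unicode-to-charset lookup hash.
    }
    ; xmap: cross-maps charset-windows
    ; length? xmap    ; twice the number of distinct [unicode charset] pairs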


    cross-map: func [
        {
        A block of code pairs read from a mapping 'file
        that maps an ISO or a proprietary (MS, Apple, ...)
        character set to a subset of Unicode.

        All comments, empty lines and undefined elements
        are skipped. The 'file must be one of the
        cross-mapping files published at
        http://www.unicode.org/Public/MAPPINGS/ .
        }
        file
        /local x z u xs result
    ][
        xs: read/lines file
        result: copy []
        while [not tail? xs][
            x: first xs
            if (length? x) >= 11 [
                if (first x) <> #"#" [
                    z: make block! 2
                    u: debase/base copy/part skip x 7 4 16
                    if (length? u) > 0 [
                        insert z to-integer u
                        insert tail z to-integer debase/base copy/part skip x 2 2 16
                        insert/only tail result z
                    ]
                ]
            ]
            xs: next xs
        ]
        head result
    ]
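
    comment {
        Illustration only, assuming the published mapping-file layout:
        a typical data line and the columns the function extracts.
    }
    ; line: "0x80^-0x20AC^-#EURO SIGN"     ; hypothetical sample line (CP1252 style)
    ; copy/part skip line 2 2              ; == "80"   (codepage byte, hex)
    ; copy/part skip line 7 4              ; == "20AC" (unicode value, hex)
    ; to-integer debase/base "20AC" 16     ; == 8364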


    to-alias-string: func [
        {
        String of "narrow" characters mapped from a 'ucs
        string of k-wide characters to currently
        selected charset, where 'xs is a unicode-to-charset
        mapping hashtable.
        Unmatched unicode codes are substituted
        by a character corresponding to integer 'subst.
        }
        k [integer!]
        xs [hash!]
        str [string!]
        subst [integer!]
        /local f x result [string!]
    ][
        f: pick fetch k
        result: make string! (length? str)
        while [not tail? str][
            x: to integer! f str
            insert tail result to-alias-char xs x subst
            str: skip str k
        ]
        result
    ]
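
    comment {
        Usage sketch (illustration only), using a tiny hand-made
        unicode-to-charset hash in place of a real cross-map:
        U+00E9 (e-acute) is assumed to live at byte 130 of some codepage.
    }
    ; tiny: to-hash [233 130]                  ; hypothetical mapping
    ; to-alias-string 2 tiny to-ucs 2 "é" 63   ; maps the e-acute through 'tiny
    ; to-alias-string 2 tiny to-ucs 2 "x" 63   ; == "?"   (no match, substitute used)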


    to-alias-char: func [
        {
        The character mapped from a unicode integer 'x
        to one of the characters of the currently
        selected charset, where 'xs is a unicode-to-charset
        mapping hashtable.
        If no match is found, return the substitute
        character corresponding to integer 'subst.
        }
        xs [hash!]
        x [integer!]
        subst [integer!]
        /local result [char!]
    ][
        result: select/skip xs x 2
        either not none? result [
            to char! first result
        ][
            to char! subst
        ]
    ]
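
    comment {
        The character-level lookup behind 'to-alias-string, with the
        same tiny hand-made hash (illustration only):
    }
    ; to-alias-char tiny 233 63    ; looks up U+00E9 in 'tiny
    ; to-alias-char tiny 8364 63   ; == #"?"   (U+20AC is not in 'tiny)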

    comment {
    ----------------------------------------------------------
    Everything below this is just for testing. Comment it
    out or remove if you wish.
    ----------------------------------------------------------
    }


    comment
        {
        A short sentence translated into a number of languages,
        covering Latin-1, -2, -4 and -5. For every keyword, such as
        'Spanish, there is a corresponding UTF-8 string.
        }
    glass:  [
        English {I can eat glass and it doesn't hurt me.}
            ; -- Latin-1 --
        French {Je peux manger du verre, cela ne me fait pas mal.}
        Quebecois {J'peux manger d'la vitre, ça m'fa pas mal.}
        Spanish {Puedo comer vidrio, no me hace daño.}
        Portuguese {Posso comer vidro, não me faz mal.}
        Irish {Is féidir liom gloinne a ithe. Ní dhéanann sí dochar ar bith dom.}
        Norwegian {Eg kan eta glas utan å skada meg.}
        Swedish {Jag kan äta glas, det skadar mig inte.}
        Danish {Jeg kan spise glas, det gør ikke ondt på mig.}
        Dutch {Ik kan glas eten. Het doet me geen pijn.}
        Finnish {Pystyn syömään lasia. Se ei koske yhtään.}
            ; -- Latin-2 --
        Hungarian {Meg tudom enni az üveget, nem lesz tőle bajom}
        Polish {Mogę jeść szkło i mi nie szkodzi.}
        Czech {Mohu jíst sklo, neublíží mi.}
        Slovak {Môžem jesť sklo. Nezraní ma.}
        Belarusian-Lacinka {Ja mahu jeści škło, jano mne ne škodzić.}
            ; -- Latin-4 --
        Estonian {Ma võin klaasi süüa, see ei tee mulle midagi.}
        Latvian {Es varu ēst stiklu, tas man nekaitē.}
        Lithuanian {Aš galiu valgyti stiklą ir jis manęs nežeidžia.}
            ; -- Latin-5 --
        Russian {Я могу есть стекло, оно мне не вредит.}
        Belarusian-Cyrillic {Я магу есці шкло, яно мне не шкодзіць.}
        Ukrainian {Я можу їсти шкло, й воно мені не пошкодить.}
        Bulgarian {Мога да ям стъкло и не ме боли.}
    ]