Manual Pages for UNIX Darwin command on man Tcl_WinUtfToTChar

Tcl_GetEncoding(3)          Tcl Library Procedures          Tcl_GetEncoding(3)

NAME

Tcl_GetEncoding, Tcl_FreeEncoding, Tcl_ExternalToUtfDString,
Tcl_ExternalToUtf, Tcl_UtfToExternalDString, Tcl_UtfToExternal,
Tcl_WinTCharToUtf, Tcl_WinUtfToTChar, Tcl_GetEncodingName,
Tcl_SetSystemEncoding, Tcl_GetEncodingNames, Tcl_CreateEncoding,
Tcl_GetDefaultEncodingDir, Tcl_SetDefaultEncodingDir - procedures for
creating and using encodings.

SYNOPSIS

#include <tcl.h>

Tcl_Encoding
Tcl_GetEncoding(interp, name)

void
Tcl_FreeEncoding(encoding)

char *
Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)

int
Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr,
       dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)

char *
Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)

int
Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr,
       dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)

char *
Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)

TCHAR *
Tcl_WinUtfToTChar(src, srcLen, dstPtr)

CONST char *
Tcl_GetEncodingName(encoding)

int
Tcl_SetSystemEncoding(interp, name)

void
Tcl_GetEncodingNames(interp)

Tcl_Encoding
Tcl_CreateEncoding(typePtr)

CONST char *
Tcl_GetDefaultEncodingDir(void)

void
Tcl_SetDefaultEncodingDir(path)

ARGUMENTS

Tcl_Interp *interp (in)
       Interpreter to use for error reporting, or NULL if no error
       reporting is desired.

CONST char *name (in)
       Name of encoding to load.

Tcl_Encoding encoding (in)
       The encoding to query, free, or use for converting text. If
       encoding is NULL, the current system encoding is used.

CONST char *src (in)
       For the Tcl_ExternalToUtf functions, an array of bytes in the
       specified encoding that are to be converted to UTF-8. For the
       Tcl_UtfToExternal and Tcl_WinUtfToTChar functions, an array of
       UTF-8 characters to be converted to the specified encoding.

CONST TCHAR *tsrc (in)
       An array of Windows TCHAR characters to convert to UTF-8.

int srcLen (in)
       Length of src or tsrc in bytes. If the length is negative, the
       encoding-specific length of the string is used.

Tcl_DString *dstPtr (out)
       Pointer to an uninitialized or free Tcl_DString in which the
       converted result will be stored.

int flags (in)
       Various flag bits OR-ed together.

       TCL_ENCODING_START signifies that the source buffer is the first
       block in a (potentially multi-block) input stream, telling the
       conversion routine to reset to an initial state and perform any
       initialization that needs to occur before the first byte is
       converted.

       TCL_ENCODING_END signifies that the source buffer is the last
       block in a (potentially multi-block) input stream, telling the
       conversion routine to perform any finalization that needs to
       occur after the last byte is converted and then to reset to an
       initial state.

       TCL_ENCODING_STOPONERROR signifies that the conversion routine
       should return immediately upon reading a source character that
       doesn't exist in the target encoding; otherwise a default
       fallback character will automatically be substituted.

Tcl_EncodingState *statePtr (in/out)
       Used when converting a (generally long or indefinite length)
       byte stream in a piece by piece fashion. The conversion routine
       stores its current state in *statePtr after src (the buffer
       containing the current piece) has been converted; that state
       information must be passed back when converting the next piece
       of the stream so the conversion routine knows what state it was
       in when it left off at the end of the last piece. May be NULL,
       in which case the value specified for flags is ignored and the
       source buffer is assumed to contain the complete string to
       convert.

char *dst (out)
       Buffer in which the converted result will be stored. No more
       than dstLen bytes will be stored in dst.

int dstLen (in)
       The maximum length of the output buffer dst in bytes.

int *srcReadPtr (out)
       Filled with the number of bytes from src that were actually
       converted. This may be less than the original source length if
       there was a problem converting some source characters. May be
       NULL.

int *dstWrotePtr (out)
       Filled with the number of bytes that were actually stored in the
       output buffer as a result of the conversion. May be NULL.

int *dstCharsPtr (out)
       Filled with the number of characters that correspond to the
       number of bytes stored in the output buffer. May be NULL.

Tcl_EncodingType *typePtr (in)
       Structure that defines a new type of encoding.

CONST char *path (in)
       A path to the location of the encoding file.

INTRODUCTION

These routines convert between Tcl's internal character representation,
UTF-8, and character representations used by various operating systems
or file systems, such as Unicode, ASCII, or Shift-JIS. When operating
on strings, such as obtaining the names of files or displaying
characters using international fonts, the strings must be translated
into one or possibly multiple formats that the various system calls can
use. For instance, on a Japanese Unix workstation, a user might obtain
a filename represented in the EUC-JP file encoding and then translate
the characters to the jisx0208 font encoding in order to display the
filename in a Tk widget. The purpose of the encoding package is to help
bridge the translation gap. UTF-8 provides an intermediate staging
ground for all the various encodings. In the example above, text would
be translated into UTF-8 from whatever file encoding the operating
system is using. Then it would be translated from UTF-8 into whatever
font encoding the display routines require.

Some basic encodings are compiled into Tcl. Others can be defined by
the user or dynamically loaded from encoding files in a
platform-independent manner.

DESCRIPTION

Tcl_GetEncoding finds an encoding given its name. The name may refer to
a builtin Tcl encoding, a user-defined encoding registered by calling
Tcl_CreateEncoding, or a dynamically-loadable encoding file. The return
value is a token that represents the encoding and can be used in
subsequent calls to procedures such as Tcl_GetEncodingName,
Tcl_FreeEncoding, and Tcl_UtfToExternal. If the name did not refer to
any known or loadable encoding, NULL is returned and an error message
is left in interp.

The encoding package maintains a database of all encodings currently in
use. The first time name is seen, Tcl_GetEncoding returns an encoding
with a reference count of 1. If the same name is requested further
times, then the reference count for that encoding is incremented
without the overhead of allocating a new encoding and all its
associated data structures.

When an encoding is no longer needed, Tcl_FreeEncoding should be called
to release it. When an encoding is no longer in use anywhere (i.e., it
has been freed as many times as it has been gotten) Tcl_FreeEncoding
will release all storage the encoding was using and delete it from the
database.
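The reference-counting behaviour described above can be pictured with the
following standalone sketch. It does not include tcl.h and is not Tcl's
implementation: DemoEncoding, DemoGetEncoding, DemoFreeEncoding, and
DemoDatabaseSize are hypothetical names invented for illustration, and the
"database" is assumed to be a simple linked list.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an encoding record; not Tcl's real structure. */
typedef struct DemoEncoding {
    char *name;
    int refCount;
    struct DemoEncoding *next;
} DemoEncoding;

static DemoEncoding *demoDatabase = NULL;   /* encodings currently in use */

/* Return the encoding called name, creating its record on first use. */
DemoEncoding *DemoGetEncoding(const char *name)
{
    DemoEncoding *enc;

    for (enc = demoDatabase; enc != NULL; enc = enc->next) {
        if (strcmp(enc->name, name) == 0) {
            enc->refCount++;            /* already known: just count it */
            return enc;
        }
    }
    enc = malloc(sizeof *enc);
    if (enc == NULL) {
        return NULL;
    }
    enc->name = malloc(strlen(name) + 1);
    if (enc->name == NULL) {
        free(enc);
        return NULL;
    }
    strcpy(enc->name, name);
    enc->refCount = 1;                  /* first request for this name */
    enc->next = demoDatabase;
    demoDatabase = enc;
    return enc;
}

/* Release one reference; delete the record once no references remain. */
void DemoFreeEncoding(DemoEncoding *encoding)
{
    DemoEncoding **link;

    if (encoding == NULL || --encoding->refCount > 0) {
        return;
    }
    for (link = &demoDatabase; *link != NULL; link = &(*link)->next) {
        if (*link == encoding) {
            *link = encoding->next;     /* unlink from the database */
            break;
        }
    }
    free(encoding->name);
    free(encoding);
}

/* Number of distinct encodings currently held in the database. */
int DemoDatabaseSize(void)
{
    int count = 0;
    DemoEncoding *enc;

    for (enc = demoDatabase; enc != NULL; enc = enc->next) {
        count++;
    }
    return count;
}
```

In this sketch, two gets of the same name share one record, and the record
disappears only after the second matching free.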

Tcl_ExternalToUtfDString converts a source buffer src from the
specified encoding into UTF-8. The converted bytes are stored in
dstPtr, which is then null-terminated. The caller should eventually
call Tcl_DStringFree to free any information stored in dstPtr. When
converting, if any of the characters in the source buffer cannot be
represented in the target encoding, a default fallback character will
be used. The return value is a pointer to the value stored in the
DString.

Tcl_ExternalToUtf converts a source buffer src from the specified
encoding into UTF-8. Up to srcLen bytes are converted from the source
buffer and up to dstLen converted bytes are stored in dst. In all
cases, *srcReadPtr is filled with the number of bytes that were
successfully converted from src and *dstWrotePtr is filled with the
corresponding number of bytes that were stored in dst. The return value
is one of the following:

TCL_OK
       All bytes of src were converted.

TCL_CONVERT_NOSPACE
       The destination buffer was not large enough for all of the
       converted data; as many characters as could fit were converted
       though.

TCL_CONVERT_MULTIBYTE
       The last few bytes in the source buffer were the beginning of a
       multibyte sequence, but more bytes were needed to complete this
       sequence. A subsequent call to the conversion routine should
       pass a buffer containing the unconverted bytes that remained in
       src plus some further bytes from the source stream to properly
       convert the formerly split-up multibyte sequence.

TCL_CONVERT_SYNTAX
       The source buffer contained an invalid character sequence. This
       may occur if the input stream has been damaged or if the input
       encoding method was misidentified.

TCL_CONVERT_UNKNOWN
       The source buffer contained a character that could not be
       represented in the target encoding and TCL_ENCODING_STOPONERROR
       was specified.
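The carry-over pattern described for TCL_CONVERT_MULTIBYTE can be sketched
without Tcl itself. The code below is an illustration only: it uses UTF-8 as
the multibyte source purely for convenience, the DEMO_* codes are stand-ins
for the TCL_CONVERT_* values, and DemoScanChunk and DemoCarryOver are
hypothetical helpers, not library calls. DemoScanChunk reports how many
leading bytes form complete sequences (the role played by *srcReadPtr), and
DemoCarryOver moves the unconverted tail to the front of the buffer so the
next block of input can be appended after it.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_OK                 0
#define DEMO_CONVERT_MULTIBYTE  1
#define DEMO_CONVERT_SYNTAX     2

/* Sequence length implied by a UTF-8 lead byte; 0 if not a lead byte. */
static size_t DemoUtf8SeqLen(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Scan src and store in *srcReadPtr the number of leading bytes that form
 * complete sequences. Stops early at a split or malformed sequence.
 */
int DemoScanChunk(const unsigned char *src, size_t srcLen, size_t *srcReadPtr)
{
    size_t pos = 0;
    int result = DEMO_OK;

    while (pos < srcLen) {
        size_t need = DemoUtf8SeqLen(src[pos]);
        size_t i;

        if (need == 0) {
            result = DEMO_CONVERT_SYNTAX;       /* stray continuation byte */
            break;
        }
        if (need > srcLen - pos) {
            result = DEMO_CONVERT_MULTIBYTE;    /* sequence split by block end */
            break;
        }
        for (i = 1; i < need; i++) {
            if ((src[pos + i] & 0xC0) != 0x80) {
                result = DEMO_CONVERT_SYNTAX;   /* bad continuation byte */
                break;
            }
        }
        if (result != DEMO_OK) {
            break;
        }
        pos += need;
    }
    *srcReadPtr = pos;
    return result;
}

/* Move the unconverted tail to the start of buf; return its length. */
size_t DemoCarryOver(unsigned char *buf, size_t srcLen, size_t srcRead)
{
    size_t left = srcLen - srcRead;

    memmove(buf, buf + srcRead, left);
    return left;
}
```

A caller would append the next block of the stream at buf + left and scan
again, which mirrors passing "the unconverted bytes that remained in src plus
some further bytes" to the real conversion routine.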

Tcl_UtfToExternalDString converts a source buffer src from UTF-8 into
the specified encoding. The converted bytes are stored in dstPtr, which
is then terminated with the appropriate encoding-specific null. The
caller should eventually call Tcl_DStringFree to free any information
stored in dstPtr. When converting, if any of the characters in the
source buffer cannot be represented in the target encoding, a default
fallback character will be used. The return value is a pointer to the
value stored in the DString.
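To make the idea of an "encoding-specific null" concrete, the sketch below
measures a string whose terminator is either one or two zero bytes, which is
also what an "encoding-specific length" amounts to when srcLen is negative.
DemoEncodedLength is a hypothetical helper written for this page, not a Tcl
function, and it assumes double-byte strings are stored as whole 2-byte
characters.

```c
#include <stddef.h>

/*
 * Length in bytes of src, not counting its terminator. The terminator is
 * nullSize consecutive zero bytes; nullSize is assumed to be 1 or 2.
 */
size_t DemoEncodedLength(const char *src, int nullSize)
{
    size_t len = 0;

    if (nullSize == 1) {
        while (src[len] != '\0') {
            len++;
        }
    } else {
        /* Step one 2-byte character at a time; stop at a zero pair. */
        while (src[len] != '\0' || src[len + 1] != '\0') {
            len += 2;
        }
    }
    return len;
}
```

Stepping two bytes at a time matters: a double-byte character may contain a
single zero byte without ending the string.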

Tcl_UtfToExternal converts a source buffer src from UTF-8 into the
specified encoding. Up to srcLen bytes are converted from the source
buffer and up to dstLen converted bytes are stored in dst. In all
cases, *srcReadPtr is filled with the number of bytes that were
successfully converted from src and *dstWrotePtr is filled with the
corresponding number of bytes that were stored in dst. The return
values are the same as the return values for Tcl_ExternalToUtf.

Tcl_WinUtfToTChar and Tcl_WinTCharToUtf are Windows-only convenience
functions for converting between UTF-8 and Windows strings. On Windows
95 (as with the Macintosh and Unix operating systems), all strings
exchanged between Tcl and the operating system are "char" based. On
Windows NT, some strings exchanged between Tcl and the operating system
are "char" oriented while others are in Unicode. By convention, in
Windows a TCHAR is a character in the ANSI code page on Windows 95 and
a Unicode character on Windows NT.

If you planned to use the same "char" based interfaces on both Windows
95 and Windows NT, you could use Tcl_UtfToExternal and
Tcl_ExternalToUtf (or their Tcl_DString equivalents) with an encoding
of NULL (the current system encoding). On the other hand, if you
planned to use the Unicode interface when running on Windows NT and the
"char" interfaces when running on Windows 95, you would have to perform
the following type of test over and over in your program (as
represented in pseudo-code):

    if (running NT) {
        encoding <- Tcl_GetEncoding("unicode");
        nativeBuffer <- Tcl_UtfToExternal(encoding, utfBuffer);
        Tcl_FreeEncoding(encoding);
    } else {
        nativeBuffer <- Tcl_UtfToExternal(NULL, utfBuffer);
    }

Tcl_WinUtfToTChar and Tcl_WinTCharToUtf automatically handle this test
and use the proper encoding based on the current operating system.
Tcl_WinUtfToTChar returns a pointer to a TCHAR string, and
Tcl_WinTCharToUtf expects a TCHAR string pointer as the src string.
Otherwise, these functions behave identically to
Tcl_UtfToExternalDString and Tcl_ExternalToUtfDString.

Tcl_GetEncodingName is roughly the inverse of Tcl_GetEncoding. Given an
encoding, the return value is the name argument that was used to create
the encoding. The string returned by Tcl_GetEncodingName is only
guaranteed to persist until the encoding is deleted. The caller must
not modify this string.

Tcl_SetSystemEncoding sets the default encoding that should be used
whenever the user passes a NULL value for the encoding argument to any
of the other encoding functions. If name is NULL, the system encoding
is reset to the default system encoding, binary. If the name did not
refer to any known or loadable encoding, TCL_ERROR is returned and an
error message is left in interp. Otherwise, this procedure increments
the reference count of the new system encoding, decrements the
reference count of the old system encoding, and returns TCL_OK.

Tcl_GetEncodingNames sets the interp result to a list consisting of the
names of all the encodings that are currently defined or can be
dynamically loaded, searching the encoding path specified by
Tcl_SetDefaultEncodingDir. This procedure does not ensure that the
dynamically-loadable encoding files contain valid data, but merely that
they exist.

Tcl_CreateEncoding defines a new encoding and registers the C
procedures that are called back to convert between the encoding and
UTF-8. Encodings created by Tcl_CreateEncoding are thereafter visible
in the database used by Tcl_GetEncoding. Just as with the
Tcl_GetEncoding procedure, the return value is a token that represents
the encoding and can be used in subsequent calls to other encoding
functions. Tcl_CreateEncoding returns an encoding with a reference
count of 1. If an encoding with the specified name already exists, then
its entry in the database is replaced with the new encoding; the token
for the old encoding will remain valid and continue to behave as
before, but users of the new token will now call the new encoding
procedures.

The typePtr argument to Tcl_CreateEncoding contains information about
the name of the encoding and the procedures that will be called to
convert between this encoding and UTF-8. It is defined as follows:

    typedef struct Tcl_EncodingType {
        CONST char *encodingName;
        Tcl_EncodingConvertProc *toUtfProc;
        Tcl_EncodingConvertProc *fromUtfProc;
        Tcl_EncodingFreeProc *freeProc;
        ClientData clientData;
        int nullSize;
    } Tcl_EncodingType;

The encodingName provides a string name for the encoding, by which it
can be referred to in other procedures such as Tcl_GetEncoding. The
toUtfProc refers to a callback procedure to invoke to convert text from
this encoding into UTF-8. The fromUtfProc refers to a callback
procedure to invoke to convert text from UTF-8 into this encoding. The
freeProc refers to a callback procedure to invoke when this encoding is
deleted. The freeProc field may be NULL. The clientData contains an
arbitrary one-word value passed to toUtfProc, fromUtfProc, and freeProc
whenever they are called. Typically, this is a pointer to a data
structure containing encoding-specific information that can be used by
the callback procedures. For instance, two very similar encodings such
as ascii and macRoman may use the same callback procedure, but use
different values of clientData to control its behavior. The nullSize
specifies the number of zero bytes that signify end-of-string in this
encoding. It must be 1 (for single-byte or multi-byte encodings like
ASCII or Shift-JIS) or 2 (for double-byte encodings like Unicode).
Constant-sized encodings with 3 or more bytes per character (such as
CNS11643) are not accepted.

The callback procedures toUtfProc and fromUtfProc should match the type
Tcl_EncodingConvertProc:

    typedef int Tcl_EncodingConvertProc(
        ClientData clientData,
        CONST char *src,
        int srcLen,
        int flags,
        Tcl_EncodingState *statePtr,
        char *dst,
        int dstLen,
        int *srcReadPtr,
        int *dstWrotePtr,
        int *dstCharsPtr);

The toUtfProc and fromUtfProc procedures are called by the
Tcl_ExternalToUtf or Tcl_UtfToExternal family of functions to perform
the actual conversion. The clientData parameter to these procedures is
the same as the clientData field specified to Tcl_CreateEncoding when
the encoding was created. The remaining arguments to the callback
procedures are the same as the arguments, documented at the top, to
Tcl_ExternalToUtf or Tcl_UtfToExternal, with the following exceptions.
If the srcLen argument to one of those high-level functions is
negative, the value passed to the callback procedure will be the
appropriate encoding-specific string length of src. If any of the
srcReadPtr, dstWrotePtr, or dstCharsPtr arguments to one of the
high-level functions is NULL, the corresponding value passed to the
callback procedure will be a non-NULL location.
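As a rough picture of what a toUtfProc does with its buffers and output
counters, here is a standalone converter from ISO 8859-1 to UTF-8. It is a
sketch under stated assumptions rather than a real callback: it omits the
clientData, flags, and statePtr parameters so that it compiles without tcl.h,
DemoLatin1ToUtf is a hypothetical name, and the DEMO_* codes merely stand in
for TCL_OK and TCL_CONVERT_NOSPACE. As in the text above, the three output
pointers are assumed to be non-NULL.

```c
#include <stddef.h>

#define DEMO_OK               0
#define DEMO_CONVERT_NOSPACE  1

/*
 * Convert srcLen bytes of ISO 8859-1 text into UTF-8, writing at most
 * dstLen bytes. Each source byte becomes one character of one or two
 * UTF-8 bytes. The counters are filled in on every return path.
 */
int DemoLatin1ToUtf(const char *src, int srcLen, char *dst, int dstLen,
        int *srcReadPtr, int *dstWrotePtr, int *dstCharsPtr)
{
    int srcRead = 0, dstWrote = 0, dstChars = 0;
    int result = DEMO_OK;

    while (srcRead < srcLen) {
        unsigned char ch = (unsigned char) src[srcRead];
        int need = (ch < 0x80) ? 1 : 2;

        if (dstWrote + need > dstLen) {
            result = DEMO_CONVERT_NOSPACE;  /* keep what fit, report the rest */
            break;
        }
        if (need == 1) {
            dst[dstWrote++] = (char) ch;
        } else {
            dst[dstWrote++] = (char) (0xC0 | (ch >> 6));
            dst[dstWrote++] = (char) (0x80 | (ch & 0x3F));
        }
        srcRead++;
        dstChars++;
    }
    *srcReadPtr = srcRead;
    *dstWrotePtr = dstWrote;
    *dstCharsPtr = dstChars;
    return result;
}
```

When the destination is too small, the caller can see from the source-read
counter where to resume, which is the same contract the return codes of
Tcl_ExternalToUtf describe.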

The callback procedure freeProc, if non-NULL, should match the type
Tcl_EncodingFreeProc:

    typedef void Tcl_EncodingFreeProc(
        ClientData clientData);

This freeProc function is called when the encoding is deleted. The
clientData parameter is the same as the clientData field specified to
Tcl_CreateEncoding when the encoding was created.

Tcl_GetDefaultEncodingDir and Tcl_SetDefaultEncodingDir access and set
the directory to use when locating the default encoding files. If this
value is not NULL, the TclpInitLibraryPath routine adds the path to the
head of the search path, and uses this path as the first place to look
when trying to locate the encoding file.

ENCODING FILES

Space would prohibit precompiling into Tcl every possible encoding
algorithm, so many encodings are stored on disk as dynamically-loadable
encoding files. This behavior also allows the user to create additional
encoding files that can be loaded using the same mechanism. These
encoding files contain information about the tables and/or escape
sequences used to map between an external encoding and Unicode. The
external encoding may consist of single-byte, multi-byte, or
double-byte characters.

Each dynamically-loadable encoding is represented as a text file. The
initial line of the file, beginning with a ``#'' symbol, is a comment
that provides a human-readable description of the file. The next line
identifies the type of encoding file. It can be one of the following
letters:

[1] S
       A single-byte encoding, where one character is always one byte
       long in the encoding. An example is iso8859-1, used by many
       European languages.

[2] D
       A double-byte encoding, where one character is always two bytes
       long in the encoding. An example is big5, used for Chinese text.

[3] M
       A multi-byte encoding, where one character may be either one or
       two bytes long. Certain bytes are lead bytes, indicating that
       another byte must follow and that together the two bytes
       represent one character. Other bytes are not lead bytes and
       represent themselves. An example is shiftjis, used by many
       Japanese computers.

[4] E
       An escape-sequence encoding, specifying that certain sequences
       of bytes do not represent characters, but commands that describe
       how following bytes should be interpreted.

The rest of the lines in the file depend on the type.

Cases [1], [2], and [3] are collectively referred to as table-based
encoding files. The lines in a table-based encoding file are in the
same format as this example taken from the shiftjis encoding (this is
not the complete file):

    # Encoding file: shiftjis, multi-byte
    M
    003F 0 40
    00
    0000000100020003000400050006000700080009000A000B000C000D000E000F
    0010001100120013001400150016001700180019001A001B001C001D001E001F
    0020002100220023002400250026002700280029002A002B002C002D002E002F
    0030003100320033003400350036003700380039003A003B003C003D003E003F
    0040004100420043004400450046004700480049004A004B004C004D004E004F
    0050005100520053005400550056005700580059005A005B005C005D005E005F
    0060006100620063006400650066006700680069006A006B006C006D006E006F
    0070007100720073007400750076007700780079007A007B007C007D203E007F
    0080000000000000000000000000000000000000000000000000000000000000
    0000000000000000000000000000000000000000000000000000000000000000
    0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
    FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
    FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
    FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
    0000000000000000000000000000000000000000000000000000000000000000
    0000000000000000000000000000000000000000000000000000000000000000
    81
    0000000000000000000000000000000000000000000000000000000000000000
    0000000000000000000000000000000000000000000000000000000000000000
    0000000000000000000000000000000000000000000000000000000000000000
    0000000000000000000000000000000000000000000000000000000000000000
    300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
    FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
    301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
    FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
    00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
    FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
    25A125A025B325B225BD25BC203B301221922190219121933013000000000000
    000000000000000000000000000000002208220B2286228722822283222A2229
    000000000000000000000000000000002227222800AC21D221D4220022030000
    0000000000000000000000000000000000000000222022A52312220222072261
    2252226A226B221A223D221D2235222B222C0000000000000000000000000000
    212B2030266F266D266A2020202100B6000000000000000025EF000000000000

The third line of the file is three numbers. The first number is the
fallback character (in base 16) to use when converting from UTF-8 to
this encoding. The second number is a 1 if this file represents the
encoding for a symbol font, or 0 otherwise. The last number (in base
10) is how many pages of data follow.

Subsequent lines in the example above are pages that describe how to
map from the encoding into 2-byte Unicode. The first line in a page
identifies the page number. Following it are 256 double-byte numbers,
arranged as 16 rows of 16 numbers. Given a character in the encoding,
the high byte of that character is used to select which page, and the
low byte of that character is used as an index to select one of the
double-byte numbers in that page; the value obtained is the
corresponding Unicode character. By examination of the example above,
one can see that the characters 0x7E and 0x8163 in shiftjis map to 203E
and 2026 in Unicode, respectively.

Following the first page will be all the other pages, each in the same
format as the first: one number identifying the page followed by 256
double-byte Unicode characters. If a character in the encoding maps to
the Unicode character 0000, it means that the character doesn't
actually exist. If all characters on a page would map to 0000, that
page can be omitted.
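The page-and-index lookup described above can be tried out with a small
standalone sketch. It is not Tcl's loader: DemoTable, DemoPage,
DemoParseHeader, DemoParseRow, and DemoLookup are hypothetical names, and the
code assumes each row arrives as one line of 64 hexadecimal digits and that
the caller already knows which page and row a line belongs to.

```c
#include <stdio.h>
#include <string.h>

/* One page: 256 Unicode values indexed by the low byte of a character. */
typedef struct DemoPage {
    int present;                    /* 0 if the page was omitted */
    unsigned short toUnicode[256];
} DemoPage;

typedef struct DemoTable {
    unsigned int fallback;          /* first header number, base 16 */
    int symbol;                     /* second header number, 1 or 0 */
    int numPages;                   /* third header number, base 10 */
    DemoPage pages[256];            /* indexed by the high byte */
} DemoTable;

/* Parse the third line of the file, e.g. "003F 0 40". Returns 1 if OK. */
int DemoParseHeader(const char *line, DemoTable *table)
{
    return sscanf(line, "%x %d %d", &table->fallback, &table->symbol,
            &table->numPages) == 3;
}

/* Parse one row of 16 four-digit hex numbers into the given page row. */
int DemoParseRow(const char *line, DemoPage *page, int row)
{
    int col;

    if (strlen(line) < 64) {
        return 0;
    }
    for (col = 0; col < 16; col++) {
        unsigned int value;

        if (sscanf(line + 4 * col, "%4x", &value) != 1) {
            return 0;
        }
        page->toUnicode[row * 16 + col] = (unsigned short) value;
    }
    page->present = 1;
    return 1;
}

/* Map a character of the encoding to Unicode; 0 means no such character. */
unsigned int DemoLookup(const DemoTable *table, unsigned int ch)
{
    const DemoPage *page = &table->pages[(ch >> 8) & 0xFF];

    return page->present ? page->toUnicode[ch & 0xFF] : 0;
}
```

Feeding row 7 of page 00 and row 6 of page 81 from the shiftjis excerpt into
DemoParseRow is enough to reproduce the two mappings quoted in the text,
0x7E to 203E and 0x8163 to 2026. A DemoTable is large, so it is best
declared static or allocated on the heap.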

Case [4] is the escape-sequence encoding file. The lines in this type
of file are in the same format as this example taken from the
iso2022-jp encoding:

    # Encoding file: iso2022-jp, escape-driven
    E
    init           {}
    final          {}
    iso8859-1      \x1b(B
    jis0201        \x1b(J
    jis0208        \x1b$@
    jis0208        \x1b$B
    jis0212        \x1b$(D
    gb2312         \x1b$A
    ksc5601        \x1b$(C

In the file, the first column represents an option and the second
column is the associated value. init is a string to emit or expect
before the first character is converted, while final is a string to
emit or expect after the last character. All other options are names of
table-based encodings; the associated value is the escape-sequence that
marks that encoding. Tcl syntax is used for the values; in the above
example, for instance, ``{}'' represents the empty string and ``\x1b''
represents character 27.

When Tcl_GetEncoding encounters an encoding name that has not been
loaded, it attempts to load an encoding file called name.enc from the
encoding subdirectory of each directory specified in the library path
$tcl_libPath. If the encoding file exists, but is malformed, an error
message will be left in interp.

KEYWORDS

utf, encoding, convert

Tcl                                8.1                Tcl_GetEncoding(3)