Code Pages

Most applications written today handle character data primarily as Unicode, using the UTF-16 encoding. However, many legacy applications continue to use character sets based on code pages. Even new applications sometimes have to work with code pages, often for one of the following reasons:

To communicate with legacy applications.
To communicate with older mail and news servers, which might not always support Unicode.
To communicate with the Windows Console.

New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization.

Each code page is represented by a code page identifier, for example, 1252, and is handled by the Unicode and character set API functions. For a list of supported code page identifiers, see Code Page Identifiers. The "Code Pages" reference on the Microsoft Go Global Developer Center gives full descriptions of many code pages.

Windows code pages, commonly called "ANSI code pages", are code pages for which non-ASCII values (values greater than 127) represent international characters. These code pages are used natively in Windows Me, and are also available on Windows NT and later.

Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. That draft eventually became ISO 8859-1, but Windows code page 1252 was implemented before the standard became final, and is not exactly the same as ISO 8859-1.
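The difference between code page 1252 and ISO 8859-1 can be seen in a minimal sketch. It uses Python's built-in codecs as stand-ins ("cp1252" and "latin-1" are Python's codec names, not Windows API identifiers) and assumes the two encodings diverge in the byte range 0x80 through 0x9F.

```python
# Byte values 0x80-0x9F are where Windows-1252 and ISO 8859-1 diverge.
euro_byte = b"\x80"

as_cp1252 = euro_byte.decode("cp1252")   # a printable character in Windows-1252
as_latin1 = euro_byte.decode("latin-1")  # a C1 control character in ISO 8859-1

print(repr(as_cp1252), repr(as_latin1))

# Outside that range the two agree; for example, 0xE9 is U+00E9 in both.
same_outside_range = b"\xe9".decode("cp1252") == b"\xe9".decode("latin-1")
print(same_outside_range)
```

Text labeled as ISO 8859-1 but actually produced on Windows is therefore often decoded more faithfully as code page 1252.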

Many Windows API functions have "A" (ANSI) and "W" (wide, Unicode) versions. The "A" version handles text based on Windows code pages, while the "W" version handles Unicode text. See Windows Data Types for Strings and Conventions for Function Prototypes.

Windows code pages are also sometimes referred to as "active code pages" or "system active code pages". A Windows operating system always has one currently active Windows code page. All ANSI versions of API functions use the currently active code page.

Original equipment manufacturer (OEM) code pages are code pages for which non-ASCII values represent line drawing and punctuation characters. These code pages were originally used for MS-DOS and are still used for console applications. They are also used for the non-extended file names in the FAT12, FAT16, and FAT32 file systems, as described in Character Sets Used in File Names. The usual OEM code page for English is code page 437.

For both Windows code pages and OEM code pages, the code values 0x00 through 0x7F correspond to the 7-bit ASCII character set. Code values 0x00 through 0x1F and 0x7F always represent standardized control characters, and 0x20 through 0x7E represent standardized displayable characters. Characters represented by the remaining codes, 0x80 through 0xFF, vary among character sets. Each character set includes different special characters, typically customized for a language or group of languages. Windows code page 1252 and OEM code page 437 are generally used in the United States.
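As a rough illustration of the shared ASCII range and the differing upper range, the following sketch compares Python's "cp1252" and "cp437" codecs. These are Python's implementations of the two code pages, used here as an assumption in place of the Windows conversion tables.

```python
import unicodedata

# The 7-bit range decodes identically under both code pages.
ascii_bytes = bytes(range(0x80))
ascii_same = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp437")

# A byte above 0x7F means different things in each code page.
high = b"\xc4"
windows_char = high.decode("cp1252")  # an accented letter in the Windows code page
oem_char = high.decode("cp437")       # a line-drawing character in the OEM code page

print(ascii_same)
print(unicodedata.name(windows_char))
print(unicodedata.name(oem_char))
```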

In addition to Windows and OEM code pages, your applications can use non-native code pages. Examples are EBCDIC and Macintosh code pages.

Two encodings of Unicode (UTF-7 and UTF-8) are implemented as code pages. Like other code pages, each page is known by a numeric identifier and can be handled with many of the same Unicode and character set API functions.

Code pages can be either single-byte character set (SBCS) pages or double-byte character set (DBCS) pages. In SBCS pages, each byte directly encodes a single character, so that it is possible to represent at most 256 distinct characters (including control characters, letters, digits, punctuation, symbols, and the like). DBCS code pages are used for languages such as Japanese and Chinese. In such a code page, some characters have two-byte encodings, with certain byte values (always values greater than 127) serving as "lead bytes". Instead of encoding characters in their own right, lead bytes can be mapped to a character only in conjunction with a "trail byte".
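The lead-byte and trail-byte behavior can be sketched with Python's "cp932" codec, used here as an assumed stand-in for the Japanese (Shift-JIS) code page. The sample string is illustrative only.

```python
# One ASCII letter followed by one Japanese character.
text = "A\u3042"
encoded = text.encode("cp932")

# The ASCII letter takes one byte; the Japanese character takes two.
lead, trail = encoded[1], encoded[2]
print(len(encoded), hex(lead), hex(trail))

# A lead byte on its own does not map to any character.
try:
    bytes([lead]).decode("cp932")
    lone_lead_decodes = True
except UnicodeDecodeError:
    lone_lead_decodes = False
print(lone_lead_decodes)
```

Code that walks a DBCS string one byte at a time has to account for this, or it risks splitting a character between its lead and trail bytes.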

Some legacy protocols require the use of SBCS and DBCS code pages. Each SBCS/DBCS code page supports different characters, but no code page supports the full breadth of characters provided by Unicode. Each SBCS/DBCS code page supports a different subset, differently encoded.

Data converted from one SBCS or DBCS code page to another is subject to corruption, because the same data value on different code pages can encode a different character. Data converted from Unicode to SBCS or DBCS is subject to data loss, because a given code page might not be able to represent every character used in that particular Unicode data.
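The corruption described above, where the same byte value encodes different characters on different code pages, can be reproduced in a short sketch. It again relies on Python's "cp1252" and "cp437" codecs as assumed equivalents of the Windows and OEM code pages.

```python
original = "caf\u00e9"

# Encode with the Windows code page.
raw = original.encode("cp1252")

# Reading the same bytes with the OEM code page yields a different last character.
misread = raw.decode("cp437")

# Reading them with the code page they were written in restores the text.
restored = raw.decode("cp1252")

print(raw, repr(misread), repr(restored))
```

The bytes themselves are unchanged throughout; only the interpretation differs, which is why the code page in use has to travel with the data.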


In addition to SBCS and DBCS code pages, your applications have available the multibyte character set code pages 52936, 54936, 51949, and 5022x, which use an approach similar to that for a DBCS. A multibyte character set code page goes beyond two-byte encodings of some characters, however. UTF-7 and UTF-8 use a similar approach to encode Unicode, based on 7-bit and 8-bit bytes, respectively. For more information, see Unicode.
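To make the variable-length idea concrete, the sketch below measures how many bytes a few characters occupy. It assumes Python's "gb18030" codec corresponds to code page 54936, and uses Python's "utf-8" and "utf-7" codecs for the Unicode encodings. The sample characters are arbitrary.

```python
# An ASCII letter, a common Chinese character, and a supplementary-plane emoji.
samples = ("A", "\u4e2d", "\U0001F600")

# A multibyte code page can use more than two bytes for some characters.
gb18030_lengths = {ch: len(ch.encode("gb18030")) for ch in samples}
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(gb18030_lengths)
print(utf8_lengths)

# UTF-7 output stays within 7-bit byte values, even for non-ASCII text.
utf7 = "\u00e9".encode("utf-7")
utf7_is_7bit = all(b < 0x80 for b in utf7)
print(utf7, utf7_is_7bit)
```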

Several Unicode and character set functions allow your applications to handle code pages. An application can use the GetCPInfo and GetCPInfoEx functions to obtain information about a code page. This information includes the default character used when a character in a converted string has no corresponding entry in the code page.

An application can use the MultiByteToWideChar and WideCharToMultiByte functions to convert between strings based on Windows code pages and Unicode strings. Although their names refer to "MultiByte", these functions work equally well with SBCS, DBCS, and multibyte character set code pages.

WideCharToMultiByte can lose some data if the supplied code page cannot represent all characters in a Unicode string.
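The same kind of loss can be demonstrated without the Windows API. The sketch below is an analogy, not a call to WideCharToMultiByte: it uses Python's "cp437" codec, and the "?" substitute is Python's replacement character, whereas on Windows the default character is the one reported for the code page by GetCPInfo.

```python
text = "Price: 5\u20ac"

# A strict conversion fails because the code page has no euro sign.
try:
    text.encode("cp437")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy conversion substitutes a default character instead.
lossy = text.encode("cp437", errors="replace")
restored = lossy.decode("cp437")

print(strict_ok, lossy, repr(restored))
```

Once the substitution has happened, the original character cannot be recovered from the converted bytes.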

Your application can convert between Windows code pages and OEM code pages using the standard C runtime library functions. However, use of these functions presents a risk of data loss because the characters that can be represented by each code page do not match exactly.

Your applications can also call the GetACP function. This function retrieves the identifier of the current Windows (ANSI) code page.
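From a scripting language, GetACP is reachable through a foreign-function interface. The following is a minimal sketch using Python's ctypes. The helper name active_code_page is hypothetical, and the function returns None on non-Windows platforms, where kernel32 is not available.

```python
import ctypes
import sys


def active_code_page():
    """Return the current Windows (ANSI) code page identifier, or None off Windows."""
    if sys.platform != "win32":
        return None
    # GetACP takes no arguments and returns the code page identifier.
    return int(ctypes.windll.kernel32.GetACP())


print(active_code_page())
```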

2.2.1 Supported Codepage in Windows

Windows assigns an integer, called the codepage ID, to every supported codepage.
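Outside Windows, a codepage ID often has to be mapped to a converter by name. The sketch below assumes Python's convention of naming codecs "cp" followed by the ID. That convention covers only some of the IDs that Windows supports, so the hypothetical helper codec_for_codepage returns None when no such codec is registered.

```python
import codecs


def codec_for_codepage(codepage_id):
    """Return the name of Python's codec for a codepage ID, or None if there is none."""
    try:
        return codecs.lookup(f"cp{codepage_id}").name
    except LookupError:
        return None


for cp_id in (1252, 437, 99999):
    print(cp_id, codec_for_codepage(cp_id))
```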

Based on usage, the codepages supported in Windows fall into the following categories: ANSI codepages, OEM codepages, and extended codepages.

Windows codepages are also sometimes referred to as active codepages or system active codepages. Windows always has one currently active Windows codepage. All ANSI Windows functions use the currently active codepage.

The usual ANSI codepage ID for US English is codepage 1252.

Windows codepage 1252, the codepage commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. That draft eventually became ISO 8859-1, but Windows codepage 1252 was implemented before the standard became final, and is not exactly the same as ISO 8859-1.

Extended codepages cannot be used as ANSI codepages or OEM codepages. Windows can support conversions between Unicode and these codepages. They are generally used to exchange information with international or national standards or with legacy systems. Examples are UTF-8, UTF-7, EBCDIC, and Macintosh codepages.

The following table lists all the codepages supported by Windows. The Codepage ID column lists the integer assigned to a codepage. ANSI/OEM codepages are in bold face. The Codepage Description column describes the codepage. The Codepage Notes column lists the category of a codepage and the section of this document that contains the relevant protocol information.

IBM EBCDIC US-Canada

Extended codepage; for processing rules, see section 3.1.5.1.1.

OEM United States

OEM codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC International

Extended codepage; for processing rules, see section 3.1.5.1.1.

Arabic (ASMO 708)

Extended codepage; for processing rules, see section 3.1.5.1.1.

Arabic (Transparent ASMO); Arabic (DOS)

Extended codepage; for processing rules, see section 3.1.5.1.1.

OEM Greek (formerly 437G); Greek (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Baltic; Baltic (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Multilingual Latin 1; Western European (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Latin 2; Central European (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Cyrillic (primarily Russian)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Turkish; Turkish (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Multilingual Latin 1 + Euro symbol

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Portuguese; Portuguese (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Icelandic; Icelandic (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Hebrew; Hebrew (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM French Canadian; French Canadian (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Arabic; Arabic (864)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Nordic; Nordic (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Russian; Cyrillic (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

OEM Modern Greek; Greek, Modern (DOS)

OEM codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2

Extended codepage; for processing rules, see section 3.1.5.1.1.

ANSI/OEM Thai; Thai (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Greek Modern

Extended codepage; for processing rules, see section 3.1.5.1.1.

ANSI/OEM Japanese; Japanese (Shift-JIS)

ANSI/OEM codepage; for processing rules, see section 3.1.5.1.1.

ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)

ANSI/OEM codepage; for processing rules, see section 3.1.5.1.1.

ANSI/OEM Korean (Unified Hangul Code)

ANSI/OEM codepage; for processing rules, see section 3.1.5.1.1.

ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)

ANSI/OEM codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Turkish (Latin 5)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Latin 1/Open System

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)

Extended codepage; for processing rules, see section 3.1.5.1.1.

Unicode UTF-16, little-endian byte order (BMP of ISO 10646); available only to managed applications

Not used in Windows.

Unicode UTF-16, big-endian byte order; available only to managed applications

Not used in Windows.

ANSI Central European; Central European (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Cyrillic; Cyrillic (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Latin 1; Western European (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Greek; Greek (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Turkish; Turkish (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Hebrew; Hebrew (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Arabic; Arabic (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI Baltic; Baltic (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

ANSI/OEM Vietnamese; Vietnamese (Windows)

ANSI codepage; for processing rules, see section 3.1.5.1.1.

Extended codepage; for processing rules, see section 3.1.5.1.1.

MAC Roman; Western European (Mac)

Extended codepage; for processing rules, see section 3.1.5.1.1.

MAC Traditional Chinese (Big5); Chinese Traditional (Mac)

Extended codepage; for processing rules, see section 3.1.5.1.1.

MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)

Extended codepage; for processing rules, see section 3.1.5.1.1.

MAC Latin 2; Central European (Mac)

Extended codepage; for processing rules, see section 3.1.5.1.1.

Unicode UTF-32, little-endian byte order; available only to managed applications

Not used in Windows.

Unicode UTF-32, big-endian byte order; available only to managed applications

Not used in Windows.

CNS Taiwan; Chinese Traditional (CNS)

Extended codepage; for processing rules, see section 3.1.5.1.1.

Eten Taiwan; Chinese Traditional (Eten)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IA5 German (7-bit)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IA5 Swedish (7-bit)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IA5 Norwegian (7-bit)

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 6937 Non-Spacing Accent

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Germany

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Denmark-Norway

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Finland-Sweden

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Italy

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Latin America-Spain

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC United Kingdom

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Japanese Katakana Extended

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC France

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Arabic

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Greek

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Hebrew

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Korean Extended

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Thai

Extended codepage; for processing rules, see section 3.1.5.1.1.

Russian (KOI8-R); Cyrillic (KOI8-R)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Icelandic

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Cyrillic Russian

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Turkish

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)

Extended codepage; for processing rules, see section 3.1.5.1.1.

Japanese (JIS 0208-1990 and 0121-1990)

Extended codepage; for processing rules, see section 3.1.5.1.1.

Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)

Extended codepage; for processing rules, see section 3.1.5.1.1.

IBM EBCDIC Cyrillic Serbian-Bulgarian

Extended codepage; for processing rules, see section 3.1.5.1.1.

Ext Alpha Lowercase

Extended codepage; for processing rules, see section 3.1.5.1.1. NOTE: Although this codepage is supported, it has no known use.

Ukrainian (KOI8-U); Cyrillic (KOI8-U)

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-1 Latin 1; Western European (ISO)

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-2 Central European; Central European (ISO)

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-3 Latin 3

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-4 Baltic

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-5 Cyrillic

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-6 Arabic

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-7 Greek

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-8 Hebrew; Hebrew (ISO-Visual)

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-9 Turkish

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-13 Estonian

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-15 Latin 9

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 8859-8 Hebrew; Hebrew (ISO-Logical)

Extended codepage; for processing rules, see section 3.1.5.1.1. Use [CODEPAGEFILES] 28598.txt.

ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)

Extended codepage; for processing rules, see section 3.1.5.1.1.

ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)

Extended codepage; for processing rules, see section 3.1.5.1.2.

ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana — SO/SI)

Extended codepage; for processing rules, see section 3.1.5.1.2.

ISO 2022 Korean

Extended codepage; for processing rules, see section 3.1.5.1.2.

ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)

Extended codepage; for processing rules, see section 3.1.5.1.2.

ISO 2022 Traditional Chinese

Extended codepage; for processing rules, see section 3.1.5.1.2.

Extended codepage; for processing rules, see section 3.1.5.1.2. Use [CODEPAGEFILES] 20949.txt.

HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)

Extended codepage; for processing rules, see section 3.1.5.1.2.

GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)

Extended codepage; for processing rules, see section 3.1.5.1.3.

Extended codepage; for processing rules, see section 3.1.5.1.4.

Extended codepage; for processing rules, see section 3.1.5.1.4.

ISCII Odia (was Oriya)

Extended codepage; for processing rules, see section 3.1.5.1.4.

Extended codepage; for processing rules, see section 3.1.5.1.5.

Extended codepage; for processing rules, see section 3.1.5.1.6.
