Code Pages
It is IBM’s intention, as far as I know, that you will never need any other server than a System z running z/OS. Any configuration can be applied to z/OS as well. That gave us USS and z/Linux, WebSphere Application Server, and Java in CICS and batch. To accommodate all these demands on System z, IBM had to invent new instructions that work with the PC and Unix architectures. 1)
Here comes a very brief introduction to the new world.
It started in the early 60’s, when S/360 used punched cards as input while others used punched tape. That choice had an enormous impact on the whole architecture of both worlds:
S/370 | Intel
Punched cards | Punched tape
Records | Variable string length with null terminator
Datasets with records | Files with one long line of characters
EBCDIC as code page | ASCII as code page
Big Endian Integers | Little Endian Integers
Differences between record oriented and string oriented platforms
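The last row of the table can be illustrated on a PC. This is a minimal Python sketch, not part of the mainframe programs in this article; it only shows how the same four byte integer is laid out in storage under the two conventions.

```python
import struct

value = 1

# ">" asks for big endian (System z), "<" for little endian (Intel).
big = struct.pack(">i", value)
little = struct.pack("<i", value)

print(big.hex())     # 00000001
print(little.hex())  # 01000000
```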
Code Pages
I will start with code page conversion. Both ASCII and EBCDIC are one byte character sets. However, there are quite a few differences. 2)
EBCDIC | ASCII
Numbers are last in collating sequence | Numbers are first in collating sequence
Capital letters come after lower case | Capital letters come before lower case
Letters come in three blocks with gaps between | Letters come in one block with no gaps
256 8 bit characters | 128 7 bit characters, later 8 bit
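The collating differences are easy to see from a PC. The sketch below is in Python and assumes code page 037 as a stand-in for EBCDIC, because Python, as far as I know, has no codec for the Danish code page; digits and unaccented letters sit at the same positions in both. The function names are made up for the illustration.

```python
def ebcdic_sort(chars):
    """Sort characters by their EBCDIC (code page 037) byte value."""
    return sorted(chars, key=lambda c: c.encode("cp037"))

def ascii_sort(chars):
    """Sort characters by their ASCII byte value."""
    return sorted(chars, key=lambda c: c.encode("ascii"))

sample = ["a", "A", "1"]
print(ascii_sort(sample))   # ['1', 'A', 'a']
print(ebcdic_sort(sample))  # ['a', 'A', '1']

# The gap between the letter blocks: I and J are not neighbours in EBCDIC.
print("I".encode("cp037").hex(), "J".encode("cp037").hex())  # c9 d1
```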
All computer companies have now agreed on a common character set called Unicode. Its best-known encoding, UTF16, uses two bytes for most characters and is, for example, used in Java; the rarer characters above X’FFFF’ take four bytes. Unicode also has a compact encoding that is identical to ASCII for the first 128 characters, whereas all other characters take two, three or four bytes. It is called Unicode Transformation Format 8 (UTF8). The 8 refers to the number of bits in one unit, so UTF16 works in units of 16 bits. In the rest of this article “Unicode” means UTF16.
I have three examples:
Convert from EBCDIC to ASCII
Convert ASCII to UTF16
Convert UTF16 to UTF8
Convert from EBCDIC to ASCII
This is very straightforward: you use the Translate instruction (TR). It has been around as long as I remember (40 years). It goes through a string of EBCDIC bytes and uses each byte value as an offset into a 256 byte table. Example: X’F1’ is the EBCDIC character 1 (one), but the byte value is decimal 241. At position 241 in the translate table is the value X’31’, which is the ASCII character 1 (one), and it replaces the EBCDIC byte in the string.
Result: EBCDIC 1 is changed to ASCII 1.
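The table lookup that TR performs can be mimicked on a PC. This is only a sketch in Python under stated assumptions: code page 037 stands in for EBCDIC, the table is built from Python's own codec rather than from my assembler table, and the name `tr` is made up for the illustration.

```python
# 256 byte translate table: every entry is an ASCII space
# unless the EBCDIC byte maps to a printable ASCII character.
table = bytearray(b"\x20" * 256)
for code in range(0x20, 0x7F):
    ebcdic = chr(code).encode("cp037")[0]   # offset into the table
    table[ebcdic] = code                    # ASCII value stored there

def tr(data: bytes) -> bytes:
    """Replace each byte by the table entry at that offset, like TR."""
    return data.translate(bytes(table))

record = "HELLO 1".encode("cp037")
print(record.hex())   # c8c5d3d3d640f1
print(tr(record))     # b'HELLO 1'
```

Position 241 of the table holds X'31', exactly as in the example above.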
You can see my EBCDIC to ASCII translation table at the bottom.3)
The assembler statements are like this:
         L     R1,RECADR          Address of EBCDIC input
         MVC   OUTREC,0(R1)       Move input to output
         TR    OUTREC,ASCII       Translate it to ASCII
……………………….
ASCII    DC    256X'20'           Space if it can not be translated
         ORG   ASCII+C'0'
NUMBERS  DC    10AL1(X'30'+(*-NUMBERS))
………...etc……………
Convert ASCII to UTF16/Unicode
It gets a little more complicated when you wish to convert ASCII into UTF16/Unicode. Any character in the European and many other languages has its own value in a table of 65,536 values. The Western European national letters, such as the Danish ones, lie between X’00A0’ and X’00FF’.
Once again you must make a translation table, but this time each entry is two bytes. The Data Constant has a “new” attribute CU (Character Unicode). Each character occupies 2 bytes; you will recognize that the ASCII values have not changed, but the national characters have changed value.
Example: Danish “Å” has the value X’00C5’. The high order bit in the second byte is on, hence the value is above decimal 127.
0000A4 0042004C00C50042 57 DC CU'BLÅBÆRGRØD'
0000AC 00C6005200470052
0000B4 00D80044
Data Constant with UTF-16 attribute.
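The generated constant can be cross-checked on a PC. The Python sketch below is not part of the assembly; it simply encodes the same word as big endian UTF16 without a byte order mark, which should give the same 20 bytes as the listing.

```python
text = "BLÅBÆRGRØD"

# Big endian UTF-16, two bytes per character, no byte order mark.
utf16 = text.encode("utf-16-be")

print(len(utf16))    # 20
print(utf16.hex())   # 0042004c00c5004200c600520047005200d80044
```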
Please note that the assembler source is in Danish EBCDIC code page 277, or here 1142, which is the same code page with the euro sign. You have to tell the assembler that Danish national characters must be converted according to that specific code page. The code page is specified in the PARM.
//ASM EXEC PGM=ASMA90,PARM='DECK,NOOBJECT,CODEPAGE(1142)'
Translate from Danish code page
There is a new instruction to translate from a one byte code page to a two byte code page. It is called Translate One To Two or just TROT. It works nearly the same way as the old fashioned Translate, except that it moves the translated characters into a new output area, here OUTREC. Register 1 points to the translate table, and the mask B’0001’ tells TROT to skip the comparison with the test character in register 0.
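         XR    0,0                Test character (not used)
         LA    R2,OUTREC          R2 -> output area
         LA    R3,L'OUTREC*2      R3: TROT reads it as input length
         ST    R3,OUTPUT_LENGTH
         L     R4,RECADR          R4 -> one byte input
         LA    R5,L'OUTREC
         LA    1,UTF16            Register 1 -> translate table
         TROT  R2,R4,B'0001'      No test character comparison
……………………………
UTF16    DS    0D                 Double word boundary
UTF16L   DC    128AL2((*-UTF16)/2)
UTF16H   DC    128XL2'0020'
         ORG   UTF16+X'86'*2      å
         DC    X'00E5'
……...etc………..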
Use of TROT - Translate table must be on double word boundary
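What TROT does with the table can be sketched on a PC. The Python below is an illustration only: it rebuilds the 256 entry table with the six Danish overrides from my conversion table, and it assumes that the one byte input is PC code page 850, where the Danish letters have exactly those byte values. The name `trot` is made up.

```python
# 256 entries of two bytes: the first 128 map to themselves,
# the last 128 default to X'0020' (space).
table = list(range(128)) + [0x0020] * 128

# National characters overwrite their slots, as the ORG/DC pairs do.
overrides = {0x86: 0x00E5, 0x8F: 0x00C5, 0x91: 0x00E6,
             0x92: 0x00C6, 0x9B: 0x00F8, 0x9D: 0x00D8}
for pc_byte, code_point in overrides.items():
    table[pc_byte] = code_point

def trot(data: bytes) -> bytes:
    """Each input byte selects one two byte, big endian table entry."""
    return b"".join(table[b].to_bytes(2, "big") for b in data)

source = "BLÅBÆRGRØD".encode("cp850")
result = trot(source)
print(len(source), len(result))   # 10 20
print(result.hex())               # 0042004c00c5004200c600520047005200d80044
```

The output is always twice as long as the input, which is why the output area must have room for two bytes per input byte.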
Convert UTF16 to UTF8
UTF8 is usually one byte per character but special characters can occupy 2, 3 or even 4 bytes. The coding is like this:
One byte (ASCII) | Two bytes | Three bytes | Four bytes
B’0xxxxxxx’ (X’00’ - X’7F’) | B’110xxxxx’ B’10xxxxxx’ | B’1110xxxx’ B’10xxxxxx’ B’10xxxxxx’ | B’11110xxx’ B’10xxxxxx’ B’10xxxxxx’ B’10xxxxxx’
The number of leading one bits in the first byte tells the number of bytes in the character. ‘x’ is a character bit.
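To see how the character bits are spread over the bytes, here is a hand written encoder as a Python sketch. It is only an illustration of the bit patterns, not something you would use in practice; the name `utf8_encode` is made up, and it makes no attempt to reject invalid values such as surrogates.

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack the bits of one Unicode value into 1 to 4 UTF-8 bytes."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    return bytes([0b11110000 | (code_point >> 18),   # 11110xxx + 3 x 10xxxxxx
                  0b10000000 | ((code_point >> 12) & 0x3F),
                  0b10000000 | ((code_point >> 6) & 0x3F),
                  0b10000000 | (code_point & 0x3F)])

# Danish Å, X'00C5', becomes the two bytes X'C385'.
print(" ".join(f"{b:08b}" for b in utf8_encode(0x00C5)))  # 11000011 10000101
```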
This would be very cumbersome if you had to write a conversion routine yourself for every translation. It would also take a lot of CPU cycles. And we do not want that, do we? MIPS is a scarce resource and very expensive. IBM invented a whole range of instructions for the purpose of converting Unicode. Their names all start with “Convert”, like “CONVERT UTF-16 TO UTF-8” (CU21). They are supposed to be faster because the conversion is done in the microcode.
This time you do not need a translate table.
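         LA    R2,OUTREC          R2 -> UTF8 output area
         LA    R3,L'OUTREC        R3 = length of output area
         ST    R3,OUTPUT_LENGTH
         L     R4,RECADR          R4 -> UTF16 input
         LR    R5,R3              R5 = length of input
         CU21  R2,R4              Convert UTF16 to UTF8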
Translate from UTF16 to UTF8
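The input length and the output length are set to the same value here. That is enough as long as the special characters are national letters like the Danish ones, which take 2 bytes in both UTF16 and UTF8. Characters from X’0800’ and up take 3 bytes in UTF8, so in the worst case the output is one and a half times as long as the input.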
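The lengths can be checked on a PC before you size the output area. The Python sketch below does with two codec calls what CU21 does in one instruction; the name `cu21` is only borrowed for the illustration and the sample strings are my own.

```python
def cu21(utf16: bytes) -> bytes:
    """Re-encode big endian UTF-16 as UTF-8."""
    return utf16.decode("utf-16-be").encode("utf-8")

# Danish text: 7 plain letters and 3 national letters.
danish = "BLÅBÆRGRØD".encode("utf-16-be")
print(len(danish), len(cu21(danish)))   # 20 13

# Only national letters: output is as long as the input.
national = "ÅÆØ".encode("utf-16-be")
print(len(national), len(cu21(national)))   # 6 6

# Euro signs (X'20AC'): output is longer than the input.
euro = "€€€€".encode("utf-16-be")
print(len(euro), len(cu21(euro)))       # 8 12
```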
Notes:
1) That is what I call plastic computers as opposed to the real iron. If you can carry it, it is not a proper computer.
2) Some say that EBCDIC is an encryption algorithm.
3) You can display special code pages in ISPF Browse by entering a command:
DISP ASCII
DISP UTF16
etc
ASCII to Unicode conversion table
The first 128 characters are not changed. The last 128 are X’0020’ (space) unless the entry is overwritten by a national character.
UTF16 DS 0D
UTF16L DC 128AL2((*-UTF16)/2)
UTF16H DC 128XL2'0020'
ORG UTF16+X'86'*2 å
DC X'00E5'
ORG UTF16+X'8F'*2 Å
DC X'00C5'
ORG UTF16+X'91'*2 æ
DC X'00E6'
ORG UTF16+X'92'*2 Æ
DC X'00C6'
ORG UTF16+X'9B'*2 ø
DC X'00F8'
ORG UTF16+X'9D'*2 Ø
DC X'00D8'
ORG UTF16+(256*2)
EBCDIC to ASCII conversion table
ASCII DC 256X'20' spaces
ORG ASCII+C'0'
NUMBERS DC 10AL1(X'30'+(*-NUMBERS))
ORG ASCII+C'A'
UCA2I DC 9AL1(X'41'+(*-UCA2I))
ORG ASCII+C'J'
UCJ2R DC 9AL1(X'4A'+(*-UCJ2R))
ORG ASCII+C'S'
UCS2Z DC 8AL1(X'53'+(*-UCS2Z))
ORG ASCII+C'a'
LCA2I DC 9AL1(X'61'+(*-LCA2I))
ORG ASCII+C'j'
LCJ2R DC 9AL1(X'6A'+(*-LCJ2R))
ORG ASCII+C's'
LCS2Z DC 8AL1(X'73'+(*-LCS2Z))
ORG ASCII+C'æ'
DC X'91'
ORG ASCII+C'Æ'
DC X'92'
ORG ASCII+C'ø'
DC X'9B'
ORG ASCII+C'Ø'
DC X'9D'
ORG ASCII+C'å'
DC X'86'
ORG ASCII+C'Å'
DC X'8F'
ORG ASCII+C' '
DC X'20'
ORG ASCII+C'!'
DC X'21'
ORG ASCII+C'"'
DC X'22'
ORG ASCII+C'#'
DC X'23'
ORG ASCII+C'$'
DC X'24'
ORG ASCII+C'%'
DC X'25'
ORG ASCII+X'50' &
DC X'26'
ORG ASCII+C''''
DC X'27'
ORG ASCII+C'('
DC X'28'
ORG ASCII+C')'
DC X'29'
ORG ASCII+C'*'
DC X'2A'
ORG ASCII+C'+'
DC X'2B'
ORG ASCII+C','
DC X'2C'
ORG ASCII+C'-'
DC X'2D'
ORG ASCII+C'.'
DC X'2E'
ORG ASCII+C'/'
DC X'2F'
ORG ASCII+256