System z Assembler: Brave new world (1)

Code Pages

As far as I know, it is IBM’s intention that you will never need any server other than a System z running z/OS. Any configuration should be possible on z/OS as well. That gave us USS and z/Linux, WebSphere Application Server, and Java in CICS and batch. To accommodate all these demands on System z, IBM had to invent new instructions that work with the PC and Unix architectures. 1)

Here comes a very brief introduction to the new world.

It started in the early 60’s, when S/360 used punched cards as input while others used punched tape. That had an enormous impact on the whole architecture of both worlds:

S/370	Intel
Punched cards	Punched tape
Records	Variable string length with null terminator
Datasets with records	Files with one long line of characters
EBCDIC as code page	ASCII as code page
Big Endian Integers	Little Endian Integers

Differences between record oriented and string oriented platforms
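Two rows of the table are easy to show in a few lines. The following is a minimal sketch in Python rather than assembler, so it can be run anywhere; the value and the 8-byte record length are arbitrary choices of mine, not taken from any real dataset.

```python
import struct

value = 0x12345678

# Big endian (System z): the most significant byte is stored first.
big = struct.pack(">I", value)
# Little endian (Intel): the least significant byte is stored first.
little = struct.pack("<I", value)

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# A fixed-length record is padded with blanks to its full length.
record = "ABC".ljust(8).encode("ascii")
# A C-style string is as long as its text plus one null terminator.
c_string = "ABC".encode("ascii") + b"\x00"

print(len(record), len(c_string))  # 8 4
```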

Code Pages

I will start with code page conversion. Both ASCII and EBCDIC are one byte character sets. However, there are quite a few differences. 2)

EBCDIC	ASCII
Numbers are last in collating sequence	Numbers are first in collating sequence
Capital letters come after lower case	Capital letters come before lower case
Letters come in three blocks with gaps between	Letters come in one block with no gaps
256 8 bit characters	128 7 bit characters, later 8 bit
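The collating differences can be checked with a small Python sketch. It uses Python's cp037 codec (US EBCDIC) as a stand-in for the Danish code page, on the assumption that letters and digits sit at the same positions in both; the helper names are my own.

```python
def ebcdic(ch):
    """Byte value of a character in EBCDIC code page 037."""
    return ch.encode("cp037")[0]

def ascii_value(ch):
    """Byte value of a character in ASCII."""
    return ch.encode("ascii")[0]

# Sort one lower case letter, one capital and one digit in each code page.
ebcdic_order = sorted("aA0", key=ebcdic)
ascii_order = sorted("aA0", key=ascii_value)
print(ebcdic_order)  # ['a', 'A', '0']
print(ascii_order)   # ['0', 'A', 'a']

# The gap between the letter blocks: i and j are neighbours in ASCII only.
ebcdic_gap = ebcdic("j") - ebcdic("i")
ascii_gap = ascii_value("j") - ascii_value("i")
print(ebcdic_gap, ascii_gap)  # 8 1
```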

All computer companies have now agreed on a common character set called Unicode. Its 16-bit encoding, UTF16, uses two bytes for most characters (and four bytes for the rarer ones) and is used, for example, in Java. Unicode also has a compact encoding that is identical to ASCII for the first 128 characters, while all other characters take two, three or four bytes. It is called Unicode Transformation Format 8 (UTF8), where 8 refers to the number of bits in each unit. UTF16 likewise works in 16-bit units, and when this article says Unicode it means UTF16.
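To make the sizes concrete, here is a small Python sketch that encodes four sample characters both ways. The sample characters are my own choice; the last one lies outside the first 65,536 values and therefore needs four bytes in both encodings.

```python
# A, Å, the euro sign and an emoji (U+1F600), written as escapes.
samples = ["A", "\u00c5", "\u20ac", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big endian, no byte order mark
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}  UTF8={utf8.hex()}  UTF16={utf16.hex()}")

# For the first 128 characters UTF8 is byte for byte the same as ASCII.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
```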

I have three examples:

Convert from EBCDIC to ASCII
Convert ASCII to UTF16
Convert UTF16 to UTF8

Convert from EBCDIC to ASCII

This is very straightforward: you use the Translate instruction (TR). It has been around for as long as I can remember (40 years). It goes through a string of EBCDIC bytes and uses each byte value as an offset into a 256-byte table. Example: X’F1’ is the EBCDIC character 1 (one), and its byte value is decimal 241. Position 241 in the translate table holds X’31’, which is the ASCII character 1 (one), and that value replaces the EBCDIC byte in place.

Result: EBCDIC 1 is changed to ASCII 1.

You can see my EBCDIC to ASCII translation table at the bottom.3)

The assembler statements are like this:

L R1,RECADR Address of EBCDIC input

MVC OUTREC,0(R1) Move input to output

TR OUTREC,ASCII Translate it to ASCII

……………………….

ASCII DC 256X'20' Space if it can not be translated

ORG ASCII+C'0'

NUMBERS DC 10AL1(X'30'+(*-NUMBERS))

………...etc……………
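The same table lookup can be sketched in Python, where bytes.translate plays the role of TR. This is only an illustration of the technique, not the assembler program itself: the fill helper and the sample input are my own, and only the digits and capital letters are filled in.

```python
# Every position defaults to an ASCII space, like DC 256X'20'.
table = bytearray(b"\x20" * 256)

def fill(ebcdic_start, ascii_start, count):
    """Like ORG table+ebcdic_start followed by a run of consecutive ASCII values."""
    for i in range(count):
        table[ebcdic_start + i] = ascii_start + i

fill(0xF0, 0x30, 10)  # digits 0-9
fill(0xC1, 0x41, 9)   # A-I
fill(0xD1, 0x4A, 9)   # J-R
fill(0xE2, 0x53, 8)   # S-Z

def tr(data):
    """Replace each byte by the table entry at that byte's value."""
    return bytes(data).translate(bytes(table))

# "HELLO 1" in EBCDIC; X'40' is the EBCDIC space and falls back to the default.
ebcdic_input = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x40, 0xF1])
print(tr(ebcdic_input))  # b'HELLO 1'
```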

Convert ASCII to UTF16/Unicode

It gets a little more complicated when you wish to convert ASCII into UTF16/Unicode. Every character in the European languages, and in many others, has its own value in the 65,536-value table. The European letters all lie between X’0000’ and X’0FFF’.

Once again you must make a translation table, but this time each entry is two bytes. The Data Constant has a “new” type, CU (Character Unicode). Each character occupies 2 bytes. You will recognize that the ASCII values have not changed, but the national characters have new values.

Example: Danish “Å” has the value X’00C5’. The high-order bit of the second byte is on, hence the value is above decimal 127.

0000A4 0042004C00C50042 57 DC CU'BLÅBÆRGRØD'

0000AC 00C6005200470052

0000B4 00D80044

Data Constant with UTF-16 attribute.
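The bytes in the listing can be cross-checked outside the assembler. A small Python sketch, assuming the big-endian form of UTF16 without a byte order mark, which is what the listing shows:

```python
text = "BL\u00c5B\u00c6RGR\u00d8D"   # BLÅBÆRGRØD, national letters as escapes
encoded = text.encode("utf-16-be")

# Split into two-byte units to compare with the assembler listing.
units = [encoded[i:i + 2].hex().upper() for i in range(0, len(encoded), 2)]
print(" ".join(units))
# 0042 004C 00C5 0042 00C6 0052 0047 0052 00D8 0044
```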

Please note that the assembler source is in Danish EBCDIC code page 277, or here 1142. You have to tell the assembler that Danish national characters must be converted according to that specific code page. This is specified in the PARM.

//ASM EXEC PGM=ASMA90,PARM='DECK,NOOBJECT,CODEPAGE(1142)'

Translate from Danish code page

There is a new instruction to translate from a one-byte code page to a two-byte code page. It is called Translate One To Two, or just TROT. It works nearly the same way as the old-fashioned Translate, except that it moves the translated characters into a new output area, here OUTREC. Register 1 points to the translate table.

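XR 0,0 Clear R0, the test character (not used here)

LA R2,OUTREC Address of UTF16 output area

LA R3,L'OUTREC*2

ST R3,OUTPUT_LENGTH

L R4,RECADR Address of ASCII input

LA R5,L'OUTREC

LA 1,UTF16 Address of translate table

TROT R2,R4,B'0001' Translate, no test character compare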

……………………………

UTF16 DS 0D

UTF16L DC 128AL2((*-UTF16)/2)

UTF16H DC 128XL2'0020'

ORG UTF16+X'86'*2 å

DC X'00E5'

……...etc………..

Use of TROT. The translate table must be on a doubleword boundary
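What TROT does with such a table can be imitated in Python. This is a sketch of the lookup only, not of the instruction's register interface; the function name trot and the sample input are my own, and the national positions are the ones from the ASCII to Unicode table at the bottom.

```python
import struct

# 256 two-byte entries: the first 128 map to themselves, the rest default to a space.
table = list(range(128)) + [0x0020] * 128

# National characters overwrite their positions, like the ORG/DC pairs.
table[0x86] = 0x00E5  # å
table[0x8F] = 0x00C5  # Å
table[0x91] = 0x00E6  # æ
table[0x92] = 0x00C6  # Æ
table[0x9B] = 0x00F8  # ø
table[0x9D] = 0x00D8  # Ø

def trot(source):
    """Translate each input byte to the two-byte entry at that position."""
    return b"".join(struct.pack(">H", table[b]) for b in source)

# BLÅBÆRGRØD in the one-byte code page.
source = bytes([0x42, 0x4C, 0x8F, 0x42, 0x92, 0x52, 0x47, 0x52, 0x9D, 0x44])
result = trot(source)
print(result.hex().upper())
print(len(source), len(result))  # 10 20
```

The output is always twice as long as the input, which is why the output area has to be sized accordingly.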

Convert UTF16 to UTF8

UTF8 is usually one byte per character but special characters can occupy 2, 3 or even 4 bytes. The coding is like this:

One byte ASCII	Two bytes	Three bytes	Four bytes
X’00’ - X’7F’	B’110xxxxx’ B’10xxxxxx’	B’1110xxxx’ B’10xxxxxx’ B’10xxxxxx’	B’11110xxx’ B’10xxxxxx’ B’10xxxxxx’ B’10xxxxxx’

The number of leading one bits in the first byte tells how many bytes the character occupies. ‘x’ is a character bit.
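To show what a hand-written conversion involves, here is a minimal Python sketch of a UTF8 encoder for a single Unicode value. It uses the standard bit patterns (first byte 110xxxxx, 1110xxxx or 11110xxx, then 10xxxxxx for each following byte) and does no validation of the input; the function name is my own.

```python
def utf8_encode(cp):
    """Encode one Unicode value (an integer) as UTF8 bytes."""
    if cp < 0x80:                       # one byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # four bytes: 11110xxx and three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x00C5).hex().upper())  # C385, Danish Å
```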

It would be very cumbersome to write a conversion routine for every translation, and it would take a lot of CPU cycles. And we do not want that, do we? MIPS is a scarce and very expensive resource. IBM invented a whole range of instructions for converting Unicode. They are all called “Convert” something, like “CONVERT UTF-16 TO UTF-8” (CU21). They are supposed to be faster because the conversion is done in the microcode.

This time you do not need a translate table.

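LA R2,OUTREC Address of UTF8 output area

LA R3,L'OUTREC Length of output area

ST R3,OUTPUT_LENGTH Save output length

L R4,RECADR Address of UTF16 input

LR R5,R3 Input length = output length

CU21 R2,R4 Convert UTF16 to UTF8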

Translate from UTF16 to UTF8

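The input length and the output length are the same. That is enough for European text: every character below X’0800’ takes two bytes in UTF16 and at most two bytes in UTF8. Characters from X’0800’ upwards take three bytes in UTF8, so text containing them needs an output area one and a half times the size of the input.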
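How large the output area must be depends on the characters involved. The following Python sketch compares input and output lengths for two sample strings of my own choosing; it uses Python's codecs rather than CU21, so it only illustrates the sizes.

```python
def utf8_length_of_utf16(data):
    """Number of UTF8 bytes needed for a big-endian UTF16 byte string."""
    return len(data.decode("utf-16-be").encode("utf-8"))

# Danish text: every character is below X'0800'.
danish = "BL\u00c5B\u00c6RGR\u00d8D".encode("utf-16-be")
# Three euro signs (X'20AC'): each takes three bytes in UTF8.
euros = "\u20ac\u20ac\u20ac".encode("utf-16-be")

print(len(danish), utf8_length_of_utf16(danish))  # 20 13
print(len(euros), utf8_length_of_utf16(euros))    # 6 9
```

For the Danish text an output area of the same size as the input is more than enough; for the euro signs it would be too small.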

Notes:

1) That is what I call plastic computers, as opposed to the real iron. If you can carry it, it is not a proper computer.
2) Some say that EBCDIC is an encryption algorithm.
3) You can display special code pages in ISPF Browse by entering a command:

DISP ASCII
DISP UTF16
etc

Unicode definition: http://www.unicode.org/versions/Unicode8.0.0/

ASCII to Unicode conversion table

The first 128 characters are not changed. The last 128 are X’0020’ (space) unless the entry is overwritten by a national character.

UTF16 DS 0D

UTF16L DC 128AL2((*-UTF16)/2)

UTF16H DC 128XL2'0020'

ORG UTF16+X'86'*2 å

DC X'00E5'

ORG UTF16+X'8F'*2 Å

DC X'00C5'

ORG UTF16+X'91'*2 æ

DC X'00E6'

ORG UTF16+X'92'*2 Æ

DC X'00C6'

ORG UTF16+X'9B'*2 ø

DC X'00F8'

ORG UTF16+X'9D'*2 Ø

DC X'00D8'

ORG UTF16+(256*2)

EBCDIC to ASCII conversion table

ASCII DC 256X'20' spaces

ORG ASCII+C'0'

NUMBERS DC 10AL1(X'30'+(*-NUMBERS))

ORG ASCII+C'A'

UCA2I DC 9AL1(X'41'+(*-UCA2I))

ORG ASCII+C'J'

UCJ2R DC 9AL1(X'4A'+(*-UCJ2R))

ORG ASCII+C'S'

UCS2Z DC 8AL1(X'53'+(*-UCS2Z))

ORG ASCII+C'a'

LCA2I DC 9AL1(X'61'+(*-LCA2I))

ORG ASCII+C'j'

LCJ2R DC 9AL1(X'6A'+(*-LCJ2R))

ORG ASCII+C's'

LCS2Z DC 8AL1(X'73'+(*-LCS2Z))

ORG ASCII+C'æ'

DC X'91'

ORG ASCII+C'Æ'

DC X'92'

ORG ASCII+C'ø'

DC X'9B'

ORG ASCII+C'Ø'

DC X'9D'

ORG ASCII+C'å'

DC X'86'

ORG ASCII+C'Å'

DC X'8F'

ORG ASCII+C' '

DC X'20'

ORG ASCII+C'!'

DC X'21'

ORG ASCII+C'"'

DC X'22'

ORG ASCII+C'#'

DC X'23'

ORG ASCII+C'$'

DC X'24'

ORG ASCII+C'%'

DC X'25'

ORG ASCII+X'50' &

DC X'26'

ORG ASCII+C''''

DC X'27'

ORG ASCII+C'('

DC X'28'

ORG ASCII+C')'

DC X'29'

ORG ASCII+C'*'

DC X'2A'

ORG ASCII+C'+'

DC X'2B'

ORG ASCII+C','

DC X'2C'

ORG ASCII+C'-'

DC X'2D'

ORG ASCII+C'.'

DC X'2E'

ORG ASCII+C'/'

DC X'2F'

ORG ASCII+256

Friday, 20 November 2015
