Thursday, 26 November 2015

Brave new world (3) - The war between endians

The war between endians

This last article in the series is about the most peculiar thing I have ever seen in the computer industry: the different implementations of integers.

As you know, a proper integer looks like this: X’12345678’, which is decimal 305,419,896. On Intel computers the bytes are stored the other way round: X’78563412’ in storage is decimal 305,419,896. Note that it is the order of the bytes that is reversed, not the digits within each byte. That really made me dizzy when I saw it the first time. Why do they do it that way? There can be several reasons, but my explanation goes back to the paper tape. You could mix text with integers on a paper tape. The reader knew when to expect an integer, and as soon as the first byte was retrieved it could start calculating, for example adding it to an integer in storage. The bytes arrive in calculating order, least significant first.
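To make the two layouts concrete, here is a minimal sketch in Python (my choice of language for illustration only, it is not part of the mainframe code in this article). It stores the same 32-bit value in both byte orders and reads it back.

```python
# Sketch: the same 32-bit value laid out big-endian and little-endian.
value = 0x12345678  # decimal 305419896

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex().upper())     # 12345678
print(little.hex().upper())  # 78563412

# Reading the bytes back with the matching byte order gives the same number.
same_from_big = int.from_bytes(big, "big")
same_from_little = int.from_bytes(little, "little")
print(same_from_big == same_from_little == 305419896)  # True
```

Reading the little-endian bytes as if they were big-endian would give a completely different number, which is exactly the trap when data moves between the two worlds.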

The two types of CPUs had an argument about which integer was better, and the formats were named “Big-endian” (the right way ;-) and “Little-endian” (the Intel way) after which end of the number, the most or the least significant byte, is stored at the lowest address.

Wikipedia writes:

“... the "endian" names were pointedly drawn from Jonathan Swift's 1726 satirical fantasy novel, Gulliver’s Travels, in which civil war erupts over whether the big or the small end of a soft-boiled egg is the proper end to crack open.”

To accommodate the peculiarities of the “plastic” world, System z also had to get some instructions that deal with little-endian integers *1).

Move Inverse - MVCIN
Load Reversed - LRV
Store Reversed - STRV

*2)

The LRV and STRV instructions are very straightforward. They load and store integers just like L and ST; they merely reverse the byte order in the process. MVCIN, however, addresses the last byte of the “from” field and moves right-to-left from there, while the “to” field is filled left-to-right. Look at my example and it will probably make sense.

INTEGERS DS CL8

INTEGER DS F

STORE_REVERSED DS F

MOVE_INVERSED DS F

LOAD_REVERSED DS F

-------------------------------------------

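MVC INTEGERS,=C'INTEGERS' Eye catcher for the dump

MVC INTEGER,BIG_ENDIAN Plain copy, byte order unchanged

L R1,BIG_ENDIAN Load the big-endian integer

STRV R1,STORE_REVERSED Store it with the bytes reversed

MVCIN MOVE_INVERSED,BIG_ENDIAN+L'INTEGER-1 Source = last byte

LRV R1,BIG_ENDIAN Load with the bytes reversed

ST R1,LOAD_REVERSED Store the register unchanged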

--------------------------------------------

BIG_ENDIAN DC A(305419896)

Code example of reverse instructions

The storage dump of the integers looks as below

1F700E40 +58 00006FF0 C9D5E3C5 C7C5D9E2 12345678 *..?0INTEGERS....*

1F700E50 +68 78563412 78563412 78563412 *............ *

Storage dump of little-endians and big-endians
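The effect of the three instructions can be modelled in a few lines of Python. This is only a sketch of the byte movement, not of the real instructions; the function names `reversed_word` and `move_inverse` are my own, hypothetical helpers.

```python
import struct

# BIG_ENDIAN DC A(305419896): four bytes, most significant first.
big_endian = struct.pack(">I", 305419896)

def reversed_word(word: bytes) -> bytes:
    """Return the bytes in opposite order (what LRV/STRV do to a word)."""
    return word[::-1]

def move_inverse(source: bytes) -> bytes:
    """Model of MVCIN: read the source right-to-left, write left-to-right."""
    out = bytearray()
    for i in range(len(source) - 1, -1, -1):
        out.append(source[i])
    return bytes(out)

store_reversed = reversed_word(big_endian)  # L followed by STRV
move_inversed = move_inverse(big_endian)    # MVCIN
load_reversed = reversed_word(big_endian)   # LRV followed by ST

print(big_endian.hex(), store_reversed.hex(),
      move_inversed.hex(), load_reversed.hex())
# 12345678 78563412 78563412 78563412
```

All three routes give the same little-endian byte string, which matches the three identical words in the storage dump.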

*1) If you can’t remember which is which, just think: Little Endians are for small computers.

*2) Pardon my English but I do not know the difference between "Inverse" and "Reverse"

Wednesday, 25 November 2015

Brave new world (2) - Strings

Strings

In this brave new world of Intel servers etc., System z and z/OS have to deal with strings. A string is a sequence of (ASCII) characters ended with a “String Terminator”, usually “Null” (binary zero). You have to count the characters if you wish to know the length of the string. That is a legacy from the paper tapes that contained ASCII strings terminated with null. Strings have carried over to files, TCP/IP communication etc. Strings are built into programming languages like C, C++, and Java, hence we have to deal with them, so two new instructions have been developed, which I will describe below:

CLST - Compare Logical String
MVST - Move String

CLST - Compare Logical String

CLST compares two strings of bytes. Only if they have the same length and every byte is equal does it satisfy a Branch Equal (BE). Otherwise the first operand is either lower or higher than the second operand. The terminating character can be any 8-bit byte, usually X’00’, but in the first example it is EBCDIC space (X’40’). The string terminator is always the low-order byte of register 0. Make sure that the rest of register 0 is zero. CLST may also stop before it reaches a terminator and set condition code 3; in that case the registers have been updated and the instruction is simply executed again.

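XR 0,0 Clear register 0

ICM 0,B'0001',=C' ' Terminator = EBCDIC space

LA R1,PARM_STRING Address of first string

LA R2,PARM Address of second string

CLST R1,R2 Compare up to the terminator

BO *-4 CC 3: not finished, resume

BE START_PROCESS Strings are equal

B WRONG_PARM Strings differ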

PARM_STRING DC C'STRING '

CLST - Compare Logical String
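The comparison rule can be sketched in Python. This is a model of the behaviour as I understand it, not the instruction itself: the function name `compare_strings` is hypothetical, and I use Python's cp037 codec (US EBCDIC) only to get the EBCDIC space X'40' as terminator.

```python
def compare_strings(first: bytes, second: bytes, terminator: int = 0x00) -> int:
    """Return 0 if equal, -1 if first is low, 1 if first is high."""
    for a, b in zip(first, second):
        if a == terminator and b == terminator:
            return 0            # both strings end here: equal
        if a == terminator:
            return -1           # first string is shorter: low
        if b == terminator:
            return 1            # second string is shorter: first is high
        if a != b:
            return -1 if a < b else 1
    raise ValueError("no terminator found in the compared range")

SPACE = 0x40  # EBCDIC space used as terminator, as in the example above

parm_string = "STRING ".encode("cp037")
print(compare_strings(parm_string, "STRING ".encode("cp037"), SPACE))   # 0
print(compare_strings(parm_string, "STRINGS ".encode("cp037"), SPACE))  # -1
print(compare_strings("STRONG ".encode("cp037"), parm_string, SPACE))   # 1
```

Note that a shorter string compares low even if all its bytes match the start of the longer one, which is why "same length and every byte equal" is the only way to get equal.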

MVST - Move String

MVST copies a string from one location to another. It is the string terminator of the second operand that decides the length of the copied string. The example below copies a string to OUTREC and pads the rest of OUTREC with ASCII space (X’20’).

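L R2,RECADR Address of input string

XR 0,0 Terminator = X'00'

LA R1,OUTREC Address of output area

MVC OUTPUT_LENGTH,=A(L'OUTREC)

MVST R1,R2 Copy including the terminator

BO *-4 CC 3: not finished, resume

LA R2,OUTREC+L'OUTREC End of OUTREC

SR R2,R1 Bytes left from the terminator

MVI 0(R1),X'20' Replace terminator with space

S R2,=A(2) Length code for the padding MVC

EX R2,PADMVC Propagate the space to the end

B PADMVC+L'PADMVC Skip the executed instruction

PADMVC MVC 1(0,R1),0(R1)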

MVST - Move String
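What the example achieves, copy up to the terminator and pad the rest with spaces, can be sketched like this in Python. The function name `move_string` and its parameters are my own assumptions for illustration.

```python
def move_string(source: bytes, out_len: int,
                terminator: int = 0x00, pad: int = 0x20) -> bytes:
    """Copy source up to its terminator into a fixed-length output area
    and fill the remainder (including the terminator position) with pad."""
    end = source.index(terminator)  # offset of the terminator in the source
    if end > out_len:
        raise ValueError("string does not fit in the output area")
    out = bytearray([pad]) * out_len  # start with an area full of pad bytes
    out[:end] = source[:end]          # same-size slice, length is unchanged
    return bytes(out)

outrec = move_string(b"HELLO\x00whatever follows", 10)
print(outrec)       # b'HELLO     '
print(len(outrec))  # 10
```

The assembler version gets the same result by overwriting the copied terminator with a space and letting an overlapping MVC propagate that space to the end of OUTREC.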

Friday, 20 November 2015

Brave new world (1) - Code Pages

Code Pages

It is IBM’s intention, as far as I know, that you will never need any other server than a System z with z/OS. Any configuration can be applied to z/OS as well. That gave us USS and z/Linux, WebSphere Application Server, and Java in CICS and batch. In order to meet all these demands, IBM had to invent new instructions that work with the PC and Unix architecture. 1)

Here comes a very brief introduction to the new world.

It started in the early 60’s, when S/360 used punched cards as input while others used punched tape. That had an enormous impact on the whole architecture of both worlds:

S/370	Intel
Punched cards	Punched tape
Records	Variable string length with null terminator
Datasets with records	Files with one long line of characters
EBCDIC as code page	ASCII as code page
Big Endian Integers	Little Endian Integers

Differences between record oriented and string oriented platforms

Code Pages

I will start with code page conversion. Both ASCII and EBCDIC are one byte character sets. However, there are quite a few differences. 2)

EBCDIC	ASCII
Numbers are last in collating sequence	Numbers are first in collating sequence
Capital letters come after lower case	Capital letters come before lower case
Letters come in three blocks with gaps between	Letters come in one block with no gaps
256 8 bit characters	128 7 bit characters, later 8 bit
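The difference in collating sequence is easy to see with a small Python sketch. I use Python's cp037 codec (US EBCDIC) as a stand-in, because I am not sure the Danish code page 277 is available there; digits and unaccented letters sit in the same places.

```python
samples = ["a", "A", "1"]

# Sort the same characters by their byte value in each code page.
ascii_order = sorted(samples, key=lambda c: c.encode("ascii"))
ebcdic_order = sorted(samples, key=lambda c: c.encode("cp037"))

print(ascii_order)   # ['1', 'A', 'a']
print(ebcdic_order)  # ['a', 'A', '1']

# The gap between the letter blocks: I and J are neighbours in ASCII
# but not in EBCDIC.
ascii_gap = ord("J".encode("ascii")) - ord("I".encode("ascii"))
ebcdic_gap = ord("J".encode("cp037")) - ord("I".encode("cp037"))
print(ascii_gap, ebcdic_gap)  # 1 8
```

This is why a sorted file can change order when it is moved from one platform to the other, even though every character is converted correctly.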

All computer companies have now agreed on a common character set called Unicode. Each character has a numbered code point, and there are several ways to store it. UTF-16 (Unicode Transformation Format 16) uses two bytes for most characters, and four bytes, a so-called surrogate pair, for characters beyond X’FFFF’; it is for example used in Java. UTF-8 is a more compact format that is identical to ASCII for the first 128 characters, while all other characters take two, three or four bytes. The 8 and the 16 refer to the number of bits in the basic unit. So when I say Unicode in the following, I mean UTF-16.
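As a quick illustration, the following Python sketch prints how many bytes a few characters occupy in the two formats. The sample characters are my own choice.

```python
# One character from each size class: A, Å, the euro sign, an emoji.
samples = ["A", "\u00c5", "\u20ac", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))  # big-endian, no byte order mark
    sizes[ord(ch)] = (utf8_len, utf16_len)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```

Plain ASCII is smaller in UTF-8, the Danish letters are the same size in both, and only characters outside the first 65,536 code points need four bytes in UTF-16.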

I have three examples:

Convert from EBCDIC to ASCII
Convert ASCII to UTF16
Convert UTF16 to UTF8

Convert from EBCDIC to ASCII

This is very straightforward, since you use the Translate instruction (TR). It has been around as long as I remember (40 years). It goes through a string of EBCDIC bytes and uses each byte value as an offset into a 256-byte table. Example: X’F1’ is the EBCDIC character 1 (one), and the byte value is decimal 241. At offset 241 in the translate table is the value X’31’, which is the ASCII character 1 (one), and it is moved into the EBCDIC byte’s position.

Result: EBCDIC 1 is changed to ASCII 1.
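Python's `bytes.translate` works on the same principle as TR, so the table lookup can be sketched like this. Only the digits are filled in; everything else defaults to space, just as in a table that starts out as 256 spaces.

```python
# 256-byte table, initialised to ASCII space for untranslatable bytes.
table = bytearray([0x20] * 256)

# EBCDIC digits X'F0'..X'F9' map to ASCII digits X'30'..X'39'.
for i in range(10):
    table[0xF0 + i] = 0x30 + i

ebcdic_input = bytes([0xF1, 0xF9, 0xF6, 0xF4])  # EBCDIC "1964"

# Each input byte is used as an offset into the table.
ascii_output = ebcdic_input.translate(bytes(table))

print(table[241] == 0x31)  # True: offset 241 holds ASCII "1"
print(ascii_output)        # b'1964'
```

Any byte that has no entry of its own, for example an EBCDIC letter in this reduced table, comes out as a space.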

You can see my EBCDIC to ASCII translation table at the bottom.3)

The assembler statements are like this:

L R1,RECADR Address of EBCDIC input

MVC OUTREC,0(R1) Move input to output

TR OUTREC,ASCII Translate it to ASCII

……………………….

ASCII DC 256X'20' Space if it can not be translated

ORG ASCII+C'0'

NUMBERS DC 10AL1(X'30'+(*-NUMBERS))

………...etc……………

Convert ASCII to UTF16/Unicode

It gets a little more complicated when you wish to convert ASCII into UTF16/Unicode. Any character in the European and some other languages has its own value in the table of 65,536 values. However, the European letters are between X’0000’ and X’0FFF’.

Once again you must make a translation table, but this time each entry is two bytes. The Data Constant has a “new” type, CU (Character Unicode). Each character occupies 2 bytes, and you will recognize that the ASCII values have not changed while the national characters have changed value.

Example: Danish “Å” has the value X’00C5’. The high-order bit of the second byte is on, so the value is above decimal 127 and outside the 7-bit ASCII range.

0000A4 0042004C00C50042 57 DC CU'BLÅBÆRGRØD'

0000AC 00C6005200470052

0000B4 00D80044

Data Constant with UTF-16 attribute.

Please note that the assembler source is in Danish EBCDIC code page 277 or, as here, 1142 (the same code page with the euro sign). You have to tell the assembler that Danish national characters must be converted according to that specific code page. The specification is set in the parm.

//ASM EXEC PGM=ASMA90,PARM='DECK,NOOBJECT,CODEPAGE(1142)'

Translate from Danish code page

There is a new instruction to translate from a one-byte code page to a two-byte code page. It is called Translate One To Two or just TROT. It works nearly the same way as the old-fashioned Translate, except that it moves the translated characters into a new output area, here OUTREC. Register 1 points to the translate table.

XR 0,0

LA R2,OUTREC

LA R3,L'OUTREC*2

ST R3,OUTPUT_LENGTH

L R4,RECADR

LA R5,L'OUTREC

LA 1,UTF16

TROT R2,R4,B'0001'

……………………………

UTF16 DS 0D

UTF16L DC 128AL2((*-UTF16)/2)

UTF16H DC 128XL2'0020'

ORG UTF16+X'86'*2 å

DC X'00E5'

……...etc………..

Use of TROT - Translate table must be on double word boundary
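The one-to-two lookup can be sketched in Python as a table of 256 two-byte entries. This is a model of the idea only: the function name `translate_one_to_two` is hypothetical, the sketch ignores the test character in register 0, and only å and Å are filled in among the national characters.

```python
# Entries 0..127 map to themselves (UTF16L), 128..255 default to space (UTF16H).
table = list(range(128)) + [0x0020] * 128
table[0x86] = 0x00E5  # å in the PC code page used in this article
table[0x8F] = 0x00C5  # Å

def translate_one_to_two(source: bytes) -> bytes:
    """Replace every input byte with the two-byte entry it selects."""
    out = bytearray()
    for b in source:
        out += table[b].to_bytes(2, "big")
    return bytes(out)

result = translate_one_to_two(b"BL\x8f")
print(result.hex().upper())  # 0042004C00C5
```

The output is always exactly twice as long as the input, and the first three characters match the start of the BLÅBÆRGRØD constant in the listing earlier.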

Convert UTF16 to UTF8

UTF8 is usually one byte per character but special characters can occupy 2, 3 or even 4 bytes. The coding is like this:

One byte ASCII	Two bytes	Three bytes	Four bytes
X’00’ - X’7F’	B’110xxxxx’ B’10xxxxxx’	B’1110xxxx’ B’10xxxxxx’ B’10xxxxxx’	B’11110xxx’ B’10xxxxxx’ B’10xxxxxx’ B’10xxxxxx’

The number of leading one bits in the first byte tells the number of bytes in the character; every following byte starts with B’10’. ‘x’ is a character bit.
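To show how the character bits are distributed, here is a hand-written encoder in Python that follows the standard UTF-8 layout (lead byte B'110xxxxx', B'1110xxxx' or B'11110xxx', continuation bytes B'10xxxxxx'). It is a sketch for illustration; the function name `utf8_encode` is my own, and it does not reject surrogate values as a real encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by shifting and masking."""
    if cp < 0x80:                       # 7 bits: plain ASCII
        return bytes([cp])
    if cp < 0x800:                      # 11 bits: two bytes
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                    # 16 bits: three bytes
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 21 bits: four bytes
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

print(utf8_encode(0x41).hex().upper())    # 41
print(utf8_encode(0xC5).hex().upper())    # C385
print(utf8_encode(0x20AC).hex().upper())  # E282AC
```

Danish Å, X'00C5' in UTF-16, becomes the two bytes X'C385' in UTF-8.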

This could be very cumbersome if you had to write a conversion routine for every translation. It would also take a lot of CPU cycles. And we do not want that, do we? MIPS is a scarce and very expensive resource. IBM invented a whole range of instructions for the purpose of converting Unicode. They are all called something with “Convert”, like “CONVERT UTF-16 TO UTF-8” (CU21). They are supposed to be faster because the conversion is done in the microcode.

This time you do not need a translate table.

LA R2,OUTREC

LA R3,L'OUTREC

ST R3,OUTPUT_LENGTH

L R4,RECADR

LR R5,R3

CU21 R2,R4

Translate from UTF16 to UTF8

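In this example the input length and the output length are the same, because the input might consist only of national characters, which occupy 2 bytes in both formats. Characters from X’0800’ upwards take 3 bytes in UTF-8, so for general text the output area should be one and a half times the input length.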
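How much room the output needs can be checked with a small Python sketch. It uses Python's own codecs, not the CU21 instruction, and the helper name `utf16_to_utf8` and the sample texts are my own.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Convert big-endian UTF-16 bytes to UTF-8 bytes."""
    return data.decode("utf-16-be").encode("utf-8")

# "GRØD" (ASCII plus one Danish letter) and two euro signs (U+20AC).
lengths = {}
for text in ["GR\u00d8D", "\u20ac\u20ac"]:
    source = text.encode("utf-16-be")
    target = utf16_to_utf8(source)
    lengths[text] = (len(source), len(target))
    print(len(source), "->", len(target))
```

The Danish word shrinks from 8 to 5 bytes, while the two euro signs grow from 4 to 6 bytes, so an output area of the same size as the input is only safe for characters below X'0800'.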

Notes:

1) That is what I call plastic computers as opposed to the real iron. If you can carry it, it is not a proper computer.
2) Some say that EBCDIC is an encryption algorithm.
3) You can display special code pages in ISPF Browse by entering a command:

DISP ASCII
DISP UTF16
etc

Unicode definition: http://www.unicode.org/versions/Unicode8.0.0/

ASCII to Unicode conversion table

The first 128 characters are not changed. The last 128 are X’20’ (space) unless the character is overwritten by a national character.

UTF16 DS 0D Double word boundary for TROT

UTF16L DC 128AL2((*-UTF16)/2)

UTF16H DC 128XL2'0020'

ORG UTF16+X'86'*2 å

DC X'00E5'

ORG UTF16+X'8F'*2 Å

DC X'00C5'

ORG UTF16+X'91'*2 æ

DC X'00E6'

ORG UTF16+X'92'*2 Æ

DC X'00C6'

ORG UTF16+X'9B'*2 ø

DC X'00F8'

ORG UTF16+X'9D'*2 Ø

DC X'00D8'

ORG UTF16+(256*2)

EBCDIC to ASCII conversion table

ASCII DC 256X'20' spaces

ORG ASCII+C'0'

NUMBERS DC 10AL1(X'30'+(*-NUMBERS))

ORG ASCII+C'A'

UCA2I DC 9AL1(X'41'+(*-UCA2I))

ORG ASCII+C'J'

UCJ2R DC 9AL1(X'4A'+(*-UCJ2R))

ORG ASCII+C'S'

UCS2Z DC 8AL1(X'53'+(*-UCS2Z))

ORG ASCII+C'a'

LCA2I DC 9AL1(X'61'+(*-LCA2I))

ORG ASCII+C'j'

LCJ2R DC 9AL1(X'6A'+(*-LCJ2R))

ORG ASCII+C's'

LCS2Z DC 8AL1(X'73'+(*-LCS2Z))

ORG ASCII+C'æ'

DC X'91'

ORG ASCII+C'Æ'

DC X'92'

ORG ASCII+C'ø'

DC X'9B'

ORG ASCII+C'Ø'

DC X'9D'

ORG ASCII+C'å'

DC X'86'

ORG ASCII+C'Å'

DC X'8F'

ORG ASCII+C' '

DC X'20'

ORG ASCII+C'!'

DC X'21'

ORG ASCII+C'"'

DC X'22'

ORG ASCII+C'#'

DC X'23'

ORG ASCII+C'$'

DC X'24'

ORG ASCII+C'%'

DC X'25'

ORG ASCII+X'50' &

DC X'26'

ORG ASCII+C''''

DC X'27'

ORG ASCII+C'('

DC X'28'

ORG ASCII+C')'

DC X'29'

ORG ASCII+C'*'

DC X'2A'

ORG ASCII+C'+'

DC X'2B'

ORG ASCII+C','

DC X'2C'

ORG ASCII+C'-'

DC X'2D'

ORG ASCII+C'.'

DC X'2E'

ORG ASCII+C'/'

DC X'2F'

ORG ASCII+256