Thursday, 26 November 2015

Brave new world (3) - The war between endians

The war between endians

The last article in this series is about the most peculiar thing I have ever seen in the computer industry: the different implementations of integers.

As you know, a proper integer looks like this: X’12345678’, which is decimal 305,419,896. On Intel computers it is the other way round: X’78563412’ is decimal 305,419,896. That really made me dizzy when I saw it for the first time. Why do they do it that way? There can be several reasons, but my explanation goes back to the paper tape. You could mix text with integers on a paper tape. The reader knew when to expect an integer, and as soon as the first byte was retrieved it could start calculating, for example adding it to an integer in storage. The bytes arrive in calculating order, least significant byte first.

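The two types of CPUs had an argument about which integer was better, and the two layouts were named “big-endian” (the right way ;-) and “little-endian” (the Intel way) after which end of the integer is stored first: big-endian puts the most significant byte at the lowest storage address, little-endian puts the least significant byte there.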
Wikipedia writes:
“...  the "endian" names were pointedly drawn from Jonathan Swift's 1726 satirical fantasy novel, Gulliver’s Travels, in which civil war erupts over whether the big or the small end of a soft-boiled egg is the proper end to crack open.”
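The two byte orders can be checked in a few lines. This is a small Python illustration, not System z code; the variable names are made up for the example.

```python
# The integer from the text: X'12345678' = decimal 305419896.
value = 0x12345678

big = value.to_bytes(4, "big")        # most significant byte first (System z)
little = value.to_bytes(4, "little")  # least significant byte first (Intel)

print(big.hex().upper())     # 12345678
print(little.hex().upper())  # 78563412

# Read back with the matching byte order, both give the same number.
same = int.from_bytes(big, "big") == int.from_bytes(little, "little")
print(same)                  # True
```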

In order to incorporate all the peculiarities of the “plastic” world, System z also had to get some instructions that deal with little-endian integers *1):

  1. Move Inverse - MVCIN
  2. Load Reversed - LRV
  3. Store Reversed - STRV
*2)

The LRV and STRV instructions are very straightforward. They load and store integers just like L and ST, except that they reverse the bytes of the integer in the process. MVCIN is different: its second operand addresses the last byte of the “from” field, and the instruction reads that field from right to left while it fills the “to” field from left to right. Look at my example and it will probably make sense.
INTEGERS       DS CL8
INTEGER        DS F  
STORE_REVERSED DS F  
MOVE_INVERSED  DS F  
LOAD_REVERSED  DS F  
-------------------------------------------
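        MVC   INTEGERS,=C'INTEGERS'                Eye catcher for dump
*
        MVC   INTEGER,BIG_ENDIAN                   Unchanged copy
*
        L     R1,BIG_ENDIAN                        Load as is
        STRV  R1,STORE_REVERSED                    Store bytes reversed
*
        MVCIN MOVE_INVERSED,BIG_ENDIAN+L'INTEGER-1 Start at last byte
*
        LRV   R1,BIG_ENDIAN                        Load bytes reversed
        ST    R1,LOAD_REVERSED                     Store as is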
--------------------------------------------
BIG_ENDIAN DC  A(305419896)
Code example of reverse instructions

The storage dump of the integers looks like this:
1F700E40       +58  00006FF0 C9D5E3C5 C7C5D9E2 12345678 *..?0INTEGERS....*
1F700E50       +68  78563412 78563412 78563412          *............    *
Storage dump of little-endians and big-endians
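What the three instructions do to the bytes can be modelled in Python. This is only a sketch of the byte movement: the helpers `reverse_word` and `mvcin` are hypothetical names for this illustration, not real APIs, and registers and addressing are left out.

```python
def reverse_word(word: bytes) -> bytes:
    """Model of the byte reversal done by LRV and STRV on a 4-byte integer."""
    return word[::-1]

def mvcin(storage: bytearray, to: int, last_from: int, length: int) -> None:
    """Model of MVCIN: read right-to-left from last_from, write left-to-right at to."""
    for i in range(length):
        storage[to + i] = storage[last_from - i]

big_endian = (305419896).to_bytes(4, "big")    # BIG_ENDIAN DC A(305419896)

store_reversed = reverse_word(big_endian)      # L followed by STRV
load_reversed = reverse_word(big_endian)       # LRV followed by ST

# Offsets 0-3 hold BIG_ENDIAN, offsets 4-7 hold MOVE_INVERSED.
storage = bytearray(big_endian + bytes(4))
mvcin(storage, to=4, last_from=3, length=4)    # second operand = last source byte
move_inversed = bytes(storage[4:8])

print(store_reversed.hex(), move_inversed.hex(), load_reversed.hex())
# 78563412 78563412 78563412
```

All three results match the second line of the storage dump.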

*1) If you can’t remember which is which, just remember that Little Endians are for small computers
*2) Pardon my English but I do not know the difference between "Inverse" and "Reverse"

Wednesday, 25 November 2015

Brave new world (2) - Strings

Strings

In this brave new world of Intel servers etc., System z and z/OS have to deal with strings. A string is a sequence of (ASCII) characters ended with a “string terminator”, usually “null” (binary zero). You have to count the characters if you wish to know the length of the string. That is a legacy from the paper tapes, which contained ASCII strings terminated with null. Strings have lived on in files, TCP/IP communication etc. Strings are built into programming languages like C, C++, and Java, hence we have to deal with them, so two new instructions have been developed, which I will describe below:
  • CLST - Compare Logical String
  • MVST - Move String

CLST - Compare Logical String

CLST compares two strings of bytes. Only if they have the same length and every byte is equal does it satisfy a Branch Equal (BE). Otherwise the first operand is either lower or higher than the second operand. The terminating character can be any 8-bit byte, usually X’00’, but in the example below it is EBCDIC space (X’40’). The string terminator is always the low-order byte of register zero. Make sure that the rest of register 0 is zero.
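        XR    0,0                 Clear register 0
        ICM   0,B'0001',=C' '     Terminator = EBCDIC space
        LA    R1,PARM_STRING      First operand
        LA    R2,PARM             Second operand
        CLST  R1,R2               Compare up to the terminator
        BO    *-4                 CC 3: not finished, resume
        BE    START_PROCESS       Strings are equal
        B     WRONG_PARM          Strings differ
PARM_STRING DC C'STRING '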
CLST - Compare String
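The comparison rules can be sketched in Python. This is a simplified model of my reading of CLST, not the instruction itself: `clst` is a hypothetical helper, it returns -1, 0 or 1 instead of a condition code, and it assumes both operands contain the terminator.

```python
def clst(first: bytes, second: bytes, terminator: int) -> int:
    """Return 0 if equal, -1 if the first operand is low, 1 if it is high."""
    for a, b in zip(first, second):
        a_ends = a == terminator
        b_ends = b == terminator
        if a_ends and b_ends:
            return 0                  # both strings end here: equal
        if a_ends:
            return -1                 # first string is shorter: low
        if b_ends:
            return 1                  # second string is shorter: first is high
        if a != b:
            return -1 if a < b else 1
    raise ValueError("no terminator found")

SPACE = 0x40                          # EBCDIC space, as in the example above

def ebcdic(text: str) -> bytes:
    # cp037 is used as a stand-in EBCDIC code page (an assumption).
    return text.encode("cp037")

print(clst(ebcdic("STRING "), ebcdic("STRING "), SPACE))    # 0
print(clst(ebcdic("STRING "), ebcdic("STRINGS "), SPACE))   # -1
print(clst(ebcdic("STRONG "), ebcdic("STRING "), SPACE))    # 1
```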

MVST - Move String

MVST copies a string from one position to another. The string terminator in the second operand decides the length of the copied string; as with CLST, the terminator is taken from the low-order byte of register 0. The example below copies a string to OUTREC and pads the rest of OUTREC with ASCII space (X’20’).
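        L     R2,RECADR                  Address of input string
        XR    0,0                        Terminator is X'00'
        LA    R1,OUTREC                  Address of output area
        MVC   OUTPUT_LENGTH,=A(L'OUTREC)
        MVST  R1,R2                      Copy string incl. terminator
        BO    *-4                        CC 3: not finished, resume
        LA    R2,OUTREC+L'OUTREC         End of output area
        SR    R2,R1                      Bytes left from terminator on
        MVI   0(R1),X'20'                Terminator becomes ASCII space
        S     R2,=A(2)                   Length code for the MVC
        EX    R2,MVC                     Propagate space to the end
        B     MVC+L'MVC                  Skip the executed MVC
MVC      MVC   1(0,R1),0(R1)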

MVST - Move String
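The copy-and-pad logic of the example can be modelled in Python. This is a sketch under assumptions: `mvst` is a hypothetical helper, the output area is assumed to be large enough, and the 16-byte `outrec` and the input string are invented test data.

```python
def mvst(dest: bytearray, src: bytes, terminator: int = 0) -> int:
    """Copy src up to and including the terminator.

    Returns the offset of the terminator in dest, which is what
    register 1 points to after the real instruction.
    """
    end = src.index(terminator)
    dest[: end + 1] = src[: end + 1]
    return end

outrec = bytearray(16)                       # stand-in for OUTREC
r1 = mvst(outrec, b"HELLO\x00not copied")

# The MVI plus the executed MVC: the terminator and everything after it
# are overwritten with ASCII space.
outrec[r1:] = b"\x20" * (len(outrec) - r1)

print(bytes(outrec))    # b'HELLO           '
```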



Friday, 20 November 2015

Brave new world (1) - Code Pages

Code Pages

It is IBM’s intention, as far as I know, that you will never need any other server than a System z with z/OS. Any configuration can be run on z/OS as well. That gave us USS and z/Linux, WebSphere Application Server, and Java in CICS and batch. In order to accommodate all these demands on System z, IBM had to invent new instructions that work with the PC and Unix architectures. 1)
Here comes a very brief introduction to the new world.


It started in the early 60’s, when S/360 used punched cards as input while others used punched tape. It had an enormous impact on the whole architecture of both worlds:
S/370                    Intel
Punched cards            Punched tape
Records                  Variable string length with null terminator
Datasets with records    Files with one long line of characters
EBCDIC as code page      ASCII as code page
Big Endian Integers      Little Endian Integers
Differences between record oriented and string oriented platforms

Code Pages

I will start with code page conversion. Both ASCII and EBCDIC are one-byte character sets. However, there are quite a few differences. 2)
EBCDIC                                           ASCII
Numbers are last in collating sequence           Numbers are first in collating sequence
Capital letters come after lower case            Capital letters come before lower case
Letters come in three blocks with gaps between   Letters come in one block with no gaps
256 8 bit characters                             128 7 bit characters, later 8 bit
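The differences in collating sequence can be seen with Python’s built-in codecs. This is only an illustration: it assumes code page 037 (US EBCDIC) as a stand-in, because Python has no codec for the Danish code page 277, and the helper names are made up.

```python
def ebcdic(ch: str) -> int:
    """Byte value of a character in EBCDIC code page 037 (assumed stand-in)."""
    return ch.encode("cp037")[0]

def ascii_value(ch: str) -> int:
    """Byte value of a character in ASCII."""
    return ch.encode("ascii")[0]

sample = ["a", "A", "1"]
ebcdic_order = sorted(sample, key=ebcdic)
ascii_order = sorted(sample, key=ascii_value)
print(ebcdic_order)   # ['a', 'A', '1']  lower case, capitals, numbers
print(ascii_order)    # ['1', 'A', 'a']  numbers, capitals, lower case

# The gap between the letter blocks: i and j are neighbours in ASCII only.
ebcdic_gap = ebcdic("j") - ebcdic("i")
ascii_gap = ascii_value("j") - ascii_value("i")
print(ebcdic_gap, ascii_gap)   # 8 1
```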


All computer companies have now agreed on a common character set called Unicode. Its most common characters fit in two bytes, and that two-byte form (UTF16) is used, for example, in Java; rarer characters take four bytes. Unicode also has a compact encoding that is identical to ASCII for the first 128 characters, while all other characters take two, three or four bytes. It’s called Unicode Transformation Format 8 (UTF8). The 8 refers to the number of bits in the smallest unit, so UTF16 is the encoding built from 16-bit units.
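The byte counts of the two encodings can be checked with Python’s standard codecs. The four sample characters are my own choice for this illustration: A, Å, the euro sign and an emoji.

```python
# Code point -> (bytes in UTF-8, bytes in UTF-16)
sizes = {}
for ch in ["A", "\u00c5", "\u20ac", "\U0001F600"]:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))
    sizes[f"U+{ord(ch):04X}"] = (utf8_len, utf16_len)

for code_point, (utf8_len, utf16_len) in sizes.items():
    print(code_point, utf8_len, utf16_len)
# U+0041 1 2
# U+00C5 2 2
# U+20AC 3 2
# U+1F600 4 4
```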


I have three examples:
  1. Convert from EBCDIC to ASCII
  2. Convert ASCII to UTF16
  3. Convert UTF16 to UTF8


Convert from EBCDIC to ASCII

This is very straightforward: you use the Translate instruction (TR). It has been around for as long as I can remember (40 years). It goes through a string of EBCDIC bytes and uses each byte value as an offset into a 256-byte table. Example: X’F1’ is the EBCDIC character 1 (one), but the byte value is decimal 241. At offset 241 in the translate table is the value X’31’, which is the ASCII character 1 (one), and it replaces the EBCDIC byte in place.
Result: EBCDIC 1 is changed to ASCII 1.
You can see my EBCDIC to ASCII translation table at the bottom.3)


The assembler statements are like this:
        L     R1,RECADR      Address of EBCDIC input
        MVC   OUTREC,0(R1)   Move input to output    
        TR    OUTREC,ASCII   Translate it to ASCII   
……………………….
ASCII    DC    256X'20'       Space if it can not be translated     
        ORG   ASCII+C'0'              
NUMBERS  DC    10AL1(X'30'+(*-NUMBERS))
………...etc……………
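Python’s `bytes.translate` does the same table lookup as TR, so the idea can be tried out directly. This is a sketch, not the author’s table: it only fills in digits and letters, and it assumes code page 037 as the EBCDIC input because that codec ships with Python.

```python
import string

# ASCII DC 256X'20': every byte becomes a space unless overridden.
table = bytearray(b"\x20" * 256)

# The ORG/DC pairs: put the ASCII value at the offset of the EBCDIC value.
for ch in string.digits + string.ascii_letters:
    table[ch.encode("cp037")[0]] = ord(ch)

ebcdic_input = "Record 1".encode("cp037")
ascii_output = ebcdic_input.translate(bytes(table))   # the TR instruction

print(hex(table[241]))   # 0x31, the ASCII value of 1
print(ascii_output)      # b'Record 1'
```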


Convert ASCII to UTF16/Unicode

It gets a little more complicated when you wish to convert ASCII into UTF16/Unicode. Any character in the European and some other languages has its own value in the 65,536-value table. However, the European letters are between X’0000’ and X’0FFF’.


Once again you must make a translation table, but this time each entry is two bytes. The Data Constant has a “new” attribute, CU (Character Unicode). Each character occupies 2 bytes. You will recognize that the ASCII values have not changed, but the national characters have changed value.
Example: Danish “Å” has the value X’00C5’. The high-order bit in the second byte is on, hence the value is above decimal 127.


0000A4 0042004C00C50042    57    DC    CU'BLÅBÆRGRØD'
0000AC 00C6005200470052                                              
0000B4 00D80044                                                      
Data Constant with UTF-16 attribute.


Please note that the assembler source is in the Danish EBCDIC code page 277 or, as here, 1142. You have to tell the assembler that Danish national characters must be converted according to that specific code page. The specification is set in the parm:
//ASM      EXEC PGM=ASMA90,PARM='DECK,NOOBJECT,CODEPAGE(1142)'
Translate from Danish code page


There is a new instruction to translate from a one-byte code page to a two-byte code page. It is called Translate One To Two, or just TROT. It works nearly the same way as the old-fashioned Translate, except that it moves the translated bytes into a new output area, here OUTREC. Register 1 points to the translate table.
        XR    0,0             
        LA    R2,OUTREC       
        LA    R3,L'OUTREC*2   
        ST    R3,OUTPUT_LENGTH
        L     R4,RECADR       
        LA    R5,L'OUTREC     
        LA    1,UTF16         
        TROT  R2,R4,B'0001'   
……………………………
UTF16    DS    0D                  
UTF16L   DC    128AL2((*-UTF16)/2)
UTF16H   DC    128XL2'0020'        
        ORG   UTF16+X'86'*2 å     
        DC    X'00E5'             
……...etc………..
Use of TROT - the translate table must be on a doubleword boundary
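The table lookup behind TROT can be modelled in Python. This is a sketch under assumptions: `trot` is a hypothetical helper, the input is assumed to be PC code page 850 (where å is X'86', matching the ORG statement above), and length registers and the test character are left out.

```python
# 256 two-byte entries, indexed by the one-byte input value.
table = list(range(128)) + [0x0020] * 128   # UTF16L and UTF16H
table[0x86] = 0x00E5   # å
table[0x8F] = 0x00C5   # Å
table[0x91] = 0x00E6   # æ
table[0x92] = 0x00C6   # Æ
table[0x9B] = 0x00F8   # ø
table[0x9D] = 0x00D8   # Ø

def trot(source: bytes) -> bytes:
    """Replace every input byte by its two-byte table entry."""
    return b"".join(table[byte].to_bytes(2, "big") for byte in source)

# BLÅBÆRGRØD, written with escapes so the source file stays plain ASCII.
text = "BL\u00c5B\u00c6RGR\u00d8D"
result = trot(text.encode("cp850"))

print(result.hex().upper())
# 0042004C00C5004200C600520047005200D80044
```

The output is the same byte sequence as the DC CU constant in the listing above.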


Convert UTF16 to UTF8

UTF8 is usually one byte per character but special characters can occupy 2, 3 or even 4 bytes. The coding is like this:
Length         Byte 1           Byte 2         Byte 3         Byte 4
One byte       X’00’ - X’7F’ (ASCII)
Two bytes      B’110xxxxx’    B’10xxxxxx’
Three bytes    B’1110xxxx’    B’10xxxxxx’    B’10xxxxxx’
Four bytes     B’11110xxx’    B’10xxxxxx’    B’10xxxxxx’    B’10xxxxxx’
The number of leading one bits in the first byte tells the number of bytes in the character. ‘x’ is a character bit
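The bit layout can be turned into a small hand-written encoder. This Python sketch is only meant to show the layout: `utf8_encode` is a hypothetical helper that takes one Unicode value and does not check for invalid values such as surrogates.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode value by hand, following the bit layout."""
    if cp < 0x80:        # one byte:    0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # two bytes:   110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x0041).hex())   # 41
print(utf8_encode(0x00C5).hex())   # c385  (Danish Å)
```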


This could be very cumbersome if you had to write a conversion routine for every translation. It would also take a lot of CPU cycles. And we do not want that, do we? MIPS are a scarce and very expensive resource. IBM invented a whole range of instructions for the purpose of converting Unicode. They are all called something with “Convert”, like “CONVERT UTF-16 TO UTF-8” (CU21). They are supposed to be faster because the conversion is done in the microcode.


This time you do not need a translate table.
        LA    R2,OUTREC       
        LA    R3,L'OUTREC     
        ST    R3,OUTPUT_LENGTH
        L     R4,RECADR       
        LR    R5,R3
        CU21  R2,R4
Translate from UTF16 to UTF8


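The input length and the output length are the same because the input string might consist only of special characters that also occupy 2 bytes in UTF8. Note that characters above X’07FF’ need three bytes in UTF8, so input containing them would require a larger output area.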
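The relation between input and output length can be checked with Python’s codecs. This is an illustration of the conversion result only, not of CU21 itself; the test strings are my own.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Convert big-endian UTF-16 bytes to UTF-8 bytes."""
    return data.decode("utf-16-be").encode("utf-8")

# BLÅBÆRGRØD as UTF-16: ten characters, three of them Danish.
utf16 = bytes.fromhex("0042004C00C5004200C600520047005200D80044")
utf8 = utf16_to_utf8(utf16)
print(len(utf16), len(utf8))            # 20 13

# Only national characters: output is as long as the input.
danish = "\u00c6\u00d8\u00c5".encode("utf-16-be")    # ÆØÅ
print(len(danish), len(utf16_to_utf8(danish)))       # 6 6

# A character above X'07FF' (the euro sign) grows from 2 to 3 bytes.
euro = "\u20ac".encode("utf-16-be")
print(len(euro), len(utf16_to_utf8(euro)))           # 2 3
```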


Notes:

  1. - That is what I call plastic computers as opposed to the real iron.
      If you can carry it, it is not a proper computer
  2. - Some say that EBCDIC is an encryption algorithm
  3. - You can display special code pages in ISPF - Browse by entering a command:
  • DISP ASCII
  • DISP UTF16
  • etc




ASCII to Unicode conversion table

The first 128 characters are not changed. The last 128 are X’0020’ (space) unless the entry is overwritten by a national character.
UTF16    DS    0D                 
UTF16L   DC    128AL2((*-UTF16)/2)
UTF16H   DC    128XL2'0020'       
        ORG   UTF16+X'86'*2 å    
        DC    X'00E5'            
        ORG   UTF16+X'8F'*2 Å    
        DC    X'00C5'            
        ORG   UTF16+X'91'*2 æ    
        DC    X'00E6'            
        ORG   UTF16+X'92'*2 Æ    
        DC    X'00C6'            
        ORG   UTF16+X'9B'*2 ø    
        DC    X'00F8'            
        ORG   UTF16+X'9D'*2 Ø    
        DC    X'00D8'            
        org   UTF16+(256*2)      


EBCDIC to ASCII conversion table



ASCII    DC    256X'20'     spaces     
        ORG   ASCII+C'0'              
NUMBERS  DC    10AL1(X'30'+(*-NUMBERS))
        ORG   ASCII+C'A'              
UCA2I    DC    9AL1(X'41'+(*-UCA2I))   
        ORG   ASCII+C'J'              
UCJ2R    DC    9AL1(X'4A'+(*-UCJ2R))   
        ORG   ASCII+C'S'              
UCS2Z    DC    8AL1(X'53'+(*-UCS2Z))   
        ORG   ASCII+C'a'              
LCA2I    DC    9AL1(X'61'+(*-LCA2I))   
        ORG   ASCII+C'j'              
LCJ2R    DC    9AL1(X'6A'+(*-LCJ2R))   
        ORG   ASCII+C's'              
LCS2Z    DC    8AL1(X'73'+(*-LCS2Z))                                          
        ORG   ASCII+C'æ'              
        DC    X'91'      
        ORG   ASCII+C'Æ'
        DC    X'92'      
        ORG   ASCII+C'ø'
        DC    X'9B'      
        ORG   ASCII+C'Ø'
        DC    X'9D'      
        ORG   ASCII+C'å'
        DC    X'86'      
        ORG   ASCII+C'Å'
        DC    X'8F'                                
        ORG   ASCII+C' '
        DC    X'20'      
        ORG   ASCII+C'!'
        DC    X'21'      
        ORG   ASCII+C'"'     
        DC    X'22'          
        ORG   ASCII+C'#'     
        DC    X'23'          
        ORG   ASCII+C'$'     
        DC    X'24'          
        ORG   ASCII+C'%'     
        DC    X'25'          
        ORG   ASCII+X'50'  &
        DC    X'26'          
        ORG   ASCII+C''''    
        DC    X'27'          
        ORG   ASCII+C'('     
        DC    X'28'          
        ORG   ASCII+C')'     
        DC    X'29'          
        ORG   ASCII+C'*'
        DC    X'2A'     
        ORG   ASCII+C'+'
        DC    X'2B'     
        ORG   ASCII+C','
        DC    X'2C'     
        ORG   ASCII+C'-'
        DC    X'2D'     
        ORG   ASCII+C'.'
        DC    X'2E'     
        ORG   ASCII+C'/'
        DC    X'2F'     
        ORG   ASCII+256