Friday, 20 November 2015

Brave new world (1) - Code Pages

Code Pages

As far as I know, IBM’s intention is that you should never need any server other than a System z running z/OS, and that any workload can run on z/OS as well. That gave us USS and z/Linux, WebSphere Application Server, and Java in CICS and batch. To accommodate all these demands on System z, IBM had to invent new instructions that work with the conventions of the PC and Unix architectures. 1)
Here comes a very brief introduction to the new world.


It started in the early 1960s, when S/360 used punched cards as input while others used punched tape. That had an enormous impact on the whole architecture of both worlds:

S/370                        Intel
Punched cards                Punched tape
Records                      Variable string length with null terminator
Datasets with records        Files with one long line of characters
EBCDIC as code page          ASCII as code page
Big Endian Integers          Little Endian Integers

Differences between record-oriented and string-oriented platforms
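
The integer difference in the last row can be shown off the mainframe with a few lines of Python. This is only a sketch for illustration; the variable names are made up for this example.

```python
# Store the same 4-byte integer in both byte orders.
value = 1

big = value.to_bytes(4, byteorder="big")        # S/370 style: high-order byte first
little = value.to_bytes(4, byteorder="little")  # Intel style: low-order byte first

print(big.hex())     # 00000001
print(little.hex())  # 01000000
```

The same four bytes read with the wrong byte order give a completely different number, which is why binary fields cannot simply be copied between the two worlds.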

Code Pages

I will start with code page conversion. Both ASCII and EBCDIC are single-byte character sets. However, there are quite a few differences. 2)

EBCDIC                                            ASCII
Numbers are last in collating sequence            Numbers are first in collating sequence
Capital letters come after lower case             Capital letters come before lower case
Letters come in three blocks with gaps between    Letters come in one block with no gaps
256 8-bit characters                              128 7-bit characters, later 8-bit
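
The collating differences are easy to check on a PC. The sketch below assumes Python's built-in "cp037" codec (US EBCDIC) as a stand-in, because Python does not ship the Danish code page; digits and unaccented letters sit at the same positions in both.

```python
# Sort the same three characters by their ASCII and by their EBCDIC byte values.
# Assumption: "cp037" stands in for the EBCDIC code page.
chars = ["a", "A", "1"]

by_ascii = sorted(chars, key=lambda c: c.encode("ascii"))
by_ebcdic = sorted(chars, key=lambda c: c.encode("cp037"))

# Distance between 'i' and 'j': 1 in ASCII, more in EBCDIC because of the gap.
gap_ascii = ord("j") - ord("i")
gap_ebcdic = "j".encode("cp037")[0] - "i".encode("cp037")[0]

print(by_ascii)    # ['1', 'A', 'a']
print(by_ebcdic)   # ['a', 'A', '1']
print(gap_ascii, gap_ebcdic)  # 1 8
```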


The industry has now agreed on a common character set called Unicode. Each character has a number (a code point), and there are several ways to store it. UTF-16 uses two-byte units and is used, for example, in Java; most characters fit in one unit, while rarer ones take two units (four bytes). There is also a compact format that is identical to ASCII for the first 128 characters, while other characters take two, three or four bytes. It is called Unicode Transformation Format 8 (UTF-8), where 8 is the number of bits in the basic unit; likewise, UTF-16 is built from 16-bit units. In the rest of this post, “Unicode” means UTF-16.
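
To make the sizes concrete, here is a small Python sketch that counts the bytes each format needs for a few sample characters. The helper name byte_lengths is made up for this example.

```python
# Bytes needed per character in UTF-8 and in UTF-16 (big endian, no byte order mark).
def byte_lengths(ch):
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# An ASCII letter, a Danish letter, the euro sign, and a character outside
# the first 65,536 code points (written as an escape to keep the source ASCII-safe).
samples = ["A", "\u00c5", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

Only the plain ASCII letter is a single byte in UTF-8, and only the last sample needs four bytes in UTF-16.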


I have three examples:
  1. Convert from EBCDIC to ASCII
  2. Convert ASCII to UTF16
  3. Convert UTF16 to UTF8


Convert from EBCDIC to ASCII

This is very straightforward: you use the Translate instruction (TR). It has been around as long as I remember (40 years). It goes through a string of EBCDIC bytes and uses each byte value as an offset into a 256-byte table. Example: the byte X’F1’ is the EBCDIC character 1 (one), but its byte value is decimal 241. At offset 241 in the translate table is the value X’31’, which is the ASCII code for 1 (one), and it is moved into the position of the EBCDIC byte.
Result: EBCDIC 1 is changed to ASCII 1.
You can see my EBCDIC to ASCII translation table at the bottom.3)


The assembler statements are like this:
        L     R1,RECADR      Address of EBCDIC input
        MVC   OUTREC,0(R1)   Move input to output    
        TR    OUTREC,ASCII   Translate it to ASCII   
……………………….
ASCII    DC    256X'20'       Space if it can not be translated     
        ORG   ASCII+C'0'              
NUMBERS  DC    10AL1(X'30'+(*-NUMBERS))
………...etc……………
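
For readers without an assembler at hand, the table lookup that TR performs can be imitated in Python. This is a sketch only: it assumes the "cp037" codec to find the EBCDIC positions of digits and unaccented letters, it leaves out the national and special characters, and the names build_table and tr are made up.

```python
import string

def build_table():
    # 256 entries, all ASCII space to begin with (like DC 256X'20').
    table = bytearray([0x20] * 256)
    # Put each ASCII code at the offset given by the EBCDIC byte value.
    for ch in string.digits + string.ascii_letters:
        table[ch.encode("cp037")[0]] = ord(ch)
    return bytes(table)

ASCII_TABLE = build_table()

def tr(data):
    # bytes.translate replaces every byte with the table entry at that offset.
    return data.translate(ASCII_TABLE)

print(tr(b"\xf1"))                      # b'1'
print(tr("Hello 42".encode("cp037")))   # b'Hello 42'
```

Bytes without an entry, such as X'00', come out as a space, just as in the assembler table.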


Convert ASCII to UTF16/Unicode

It gets a little more complicated when you wish to convert ASCII into UTF16/Unicode. Every character in the European and many other languages has its own value in the table of 65,536 two-byte values. The letters of most European languages lie at the low end, between X’0000’ and X’0FFF’.


Once again you must make a translation table, but this time each entry is two bytes. The Data Constant has a “new” type, CU (Character Unicode). Each character occupies two bytes; you will recognize that the ASCII values have not changed (they just have X’00’ in front), but the national characters have changed value.
Example: Danish “Å” has the value X’00C5’. The high-order bit of the second byte is on, so the value is above decimal 127.


0000A4 0042004C00C50042    57    DC    CU'BLÅBÆRGRØD'
0000AC 00C6005200470052                                              
0000B4 00D80044                                                      
Data Constant with UTF-16 attribute.
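
The bytes in the listing can be cross-checked on any machine with Python, which produces the same big endian UTF-16 values. The variable names are made up for this sketch.

```python
# Encode the same Danish word as big endian UTF-16 and show it in hex.
word = "BL\u00c5B\u00c6RGR\u00d8D"   # BLÅBÆRGRØD, written with escapes

utf16_hex = word.encode("utf-16-be").hex().upper()
print(utf16_hex)
# 0042004C00C5004200C600520047005200D80044
```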


Please note that the assembler source is in the Danish EBCDIC code page 277 or, as here, 1142 (the same code page with the euro sign added). You have to tell the assembler that Danish national characters must be converted according to that specific code page. This is specified with the CODEPAGE option in the PARM.
//ASM      EXEC PGM=ASMA90,PARM='DECK,NOOBJECT,CODEPAGE(1142)'
Translate from Danish code page


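There is a new instruction to translate from a one-byte code page to a two-byte code page. It is called Translate One To Two, or just TROT. It works nearly the same way as the old-fashioned Translate, except that it moves the translated characters into a new output area, here OUTREC, instead of replacing the input. Register 1 points to the translate table, where each of the 256 entries is two bytes long. Register 0 holds a test character that can stop the translation; the mask B'0001' turns that test off. The odd register of the first operand pair (here R3) holds the number of input bytes, and condition code 3 means the instruction must be resumed.
        XR    0,0              Test character (unused with mask)
        LA    R3,L'OUTREC*2    Two output bytes per input byte
        ST    R3,OUTPUT_LENGTH
        LA    R2,OUTREC        R2 = address of output area
        LA    R3,L'OUTREC      R3 = number of input bytes
        L     R4,RECADR        R4 = address of ASCII input
        LA    1,UTF16          GR1 = address of translate table
        TROT  R2,R4,B'0001'    Translate one byte to two
        JO    *-4              CC 3: not finished, resume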
……………………………
UTF16    DS    0D                  
UTF16L   DC    128AL2((*-UTF16)/2)
UTF16H   DC    128XL2'0020'        
        ORG   UTF16+X'86'*2 å     
        DC    X'00E5'             
……...etc………..
Use of TROT - the translate table must be on a doubleword boundary
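
What TROT does with the table can be sketched in Python as a lookup that emits two bytes for every input byte. The six national entries are the ones from the conversion table at the bottom of this post; the names build_utf16_table and trot are made up, and the sketch ignores the test character and condition codes.

```python
def build_utf16_table():
    # First 128 entries: the ASCII value itself. Last 128: X'0020' (space).
    entries = list(range(128)) + [0x0020] * 128
    # National characters overwrite their slots (values from the table at the bottom).
    overrides = {0x86: 0x00E5, 0x8F: 0x00C5, 0x91: 0x00E6,
                 0x92: 0x00C6, 0x9B: 0x00F8, 0x9D: 0x00D8}
    for offset, code in overrides.items():
        entries[offset] = code
    return entries

UTF16_TABLE = build_utf16_table()

def trot(data, table):
    out = bytearray()
    for byte in data:
        # Each input byte selects a two-byte entry, stored high-order byte first.
        out += table[byte].to_bytes(2, "big")
    return bytes(out)

ascii_input = b"BL\x8fB\x92RGR\x9dD"   # BLÅBÆRGRØD in the 8-bit ASCII code page
print(trot(ascii_input, UTF16_TABLE).hex().upper())
```

The output is twice as long as the input, which is why the output area must be sized accordingly.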


Convert UTF16 to UTF8

UTF8 is usually one byte per character but special characters can occupy 2, 3 or even 4 bytes. The coding is like this:
Length              Byte 1        Byte 2        Byte 3        Byte 4
One byte (ASCII)    B’0xxxxxxx’   (X’00’ - X’7F’)
Two bytes           B’110xxxxx’   B’10xxxxxx’
Three bytes         B’1110xxxx’   B’10xxxxxx’   B’10xxxxxx’
Four bytes          B’11110xxx’   B’10xxxxxx’   B’10xxxxxx’   B’10xxxxxx’
The number of leading one bits in the first byte tells how many bytes the character occupies. ‘x’ is a character bit.
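
The UTF-8 bit layout can be written out by hand in a few lines of Python and compared with the built-in encoder. This is a sketch: the function name utf8_encode is made up, and it does not reject invalid code points such as surrogates.

```python
def utf8_encode(cp):
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x00C5).hex().upper())   # C385
```

Danish “Å” (X'00C5') therefore becomes the two bytes X'C385' in UTF-8.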


This could be very cumbersome if you had to write a conversion routine yourself for every translation. It would also take a lot of CPU cycles. And we do not want that, do we? MIPS is a scarce resource and very expensive. IBM invented a whole range of instructions for converting between the Unicode formats. Their names all begin with “Convert”, for example “CONVERT UTF-16 TO UTF-8” (CU21). They are supposed to be faster because the conversion is done in the microcode.


This time you do not need a translate table.
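        LA    R2,OUTREC        R2 = address of UTF-8 output
        LA    R3,L'OUTREC      R3 = length of output area
        ST    R3,OUTPUT_LENGTH
        L     R4,RECADR        R4 = address of UTF-16 input
        LR    R5,R3            R5 = length of input in bytes
        CU21  R2,R4            Convert UTF-16 to UTF-8
        JO    *-4              CC 3: not finished, resume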
Translate from UTF16 to UTF8


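The input length and the output length are set to the same value because the input string might consist only of national characters, which occupy two bytes in both UTF-16 and UTF-8. Characters from X’0800’ upward need three bytes in UTF-8, so text that contains them requires an output area one and a half times the input length.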
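
The length relationship can be checked in Python, where decoding UTF-16 and encoding UTF-8 gives the same result as the conversion. The function name utf16_to_utf8 is made up for this sketch, and it knows nothing about condition codes or partial completion.

```python
def utf16_to_utf8(data):
    # Decode big endian UTF-16, then encode the same characters as UTF-8.
    return data.decode("utf-16-be").encode("utf-8")

for text in ["BL\u00c5B\u00c6RGR\u00d8D",   # BLÅBÆRGRØD: mixed ASCII and national
             "\u00c6\u00d8\u00c5",          # ÆØÅ: only national characters
             "\u20ac"]:                     # euro sign: above X'0800'
    source = text.encode("utf-16-be")
    print(len(source), len(utf16_to_utf8(source)))
```

The three cases give 20 and 13 bytes, 6 and 6 bytes, and 2 and 3 bytes.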


Notes:

  1. - That is what I call plastic computers as opposed to the real iron.
      If you can carry it, it is not a proper computer
  2. - Some say that EBCDIC is an encryption algorithm
  3. - You can display special code pages in ISPF - Browse by entering a command:
  • DISP ASCII
  • DISP UTF16
  • etc




ASCII to Unicode conversion table

The first 128 characters are not changed. The last 128 are X’0020’ (space) unless the entry is overwritten by a national character. The table is aligned on a doubleword boundary, as TROT requires.
UTF16    DS    0D
UTF16L   DC    128AL2((*-UTF16)/2)
UTF16H   DC    128XL2'0020'       
        ORG   UTF16+X'86'*2 å    
        DC    X'00E5'            
        ORG   UTF16+X'8F'*2 Å    
        DC    X'00C5'            
        ORG   UTF16+X'91'*2 æ    
        DC    X'00E6'            
        ORG   UTF16+X'92'*2 Æ    
        DC    X'00C6'            
        ORG   UTF16+X'9B'*2 ø    
        DC    X'00F8'            
        ORG   UTF16+X'9D'*2 Ø    
        DC    X'00D8'            
        ORG   UTF16+(256*2)


EBCDIC to ASCII conversion table



ASCII    DC    256X'20'     spaces     
        ORG   ASCII+C'0'              
NUMBERS  DC    10AL1(X'30'+(*-NUMBERS))
        ORG   ASCII+C'A'              
UCA2I    DC    9AL1(X'41'+(*-UCA2I))   
        ORG   ASCII+C'J'              
UCJ2R    DC    9AL1(X'4A'+(*-UCJ2R))   
        ORG   ASCII+C'S'              
UCS2Z    DC    8AL1(X'53'+(*-UCS2Z))   
        ORG   ASCII+C'a'              
LCA2I    DC    9AL1(X'61'+(*-LCA2I))   
        ORG   ASCII+C'j'              
LCJ2R    DC    9AL1(X'6A'+(*-LCJ2R))   
        ORG   ASCII+C's'              
LCS2Z    DC    8AL1(X'73'+(*-LCS2Z))                                          
        ORG   ASCII+C'æ'              
        DC    X'91'      
        ORG   ASCII+C'Æ'
        DC    X'92'      
        ORG   ASCII+C'ø'
        DC    X'9B'      
        ORG   ASCII+C'Ø'
        DC    X'9D'      
        ORG   ASCII+C'å'
        DC    X'86'      
        ORG   ASCII+C'Å'
        DC    X'8F'                                
        ORG   ASCII+C' '
        DC    X'20'      
        ORG   ASCII+C'!'
        DC    X'21'      
        ORG   ASCII+C'"'     
        DC    X'22'          
        ORG   ASCII+C'#'     
        DC    X'23'          
        ORG   ASCII+C'$'     
        DC    X'24'          
        ORG   ASCII+C'%'     
        DC    X'25'          
        ORG   ASCII+X'50'  &
        DC    X'26'          
        ORG   ASCII+C''''    
        DC    X'27'          
        ORG   ASCII+C'('     
        DC    X'28'          
        ORG   ASCII+C')'     
        DC    X'29'          
        ORG   ASCII+C'*'
        DC    X'2A'     
        ORG   ASCII+C'+'
        DC    X'2B'     
        ORG   ASCII+C','
        DC    X'2C'     
        ORG   ASCII+C'-'
        DC    X'2D'     
        ORG   ASCII+C'.'
        DC    X'2E'     
        ORG   ASCII+C'/'
        DC    X'2F'     
        ORG   ASCII+256

