Unicode Lab Report
"Arial Unicode MS" Viewer
The CJK Unified Ideographs span about eighty-two 256-character sectors. 
The default Unicode string below shows a variety of interesting characters.
      
Use the Cyrillic block to enter "Seven minus five equals two"

Purpose
The purpose of this project is to display any of the 65,536 possible 16-bit Unicode characters along with descriptive information.  Any single character can also be printed or saved to a 1-bit BMP file, which may be quite useful when studying a new language.

Materials and Equipment

Software Requirements
Windows NT/2000 (Sorry, this program does not work in Windows 95/98)
Office 2000 with font Arial Unicode MS installed
Delphi 4/5 (to recompile)
Unicode.EXE
NamesList.TXT file (with Unicode description information)

Hardware Requirements
800-by-600 video display (I couldn't quite get everything to fit on a 640-by-480 form).

Procedure

  1. If necessary, install font Arial Unicode MS (ArialUni.TTF) from the Office 2000 CD -- this font is not installed by default because of its 23 MB size.  The steps to install this font are something like this:
    a.  Control Panel
    b.  Add/Remove Programs
    c.  Microsoft Office 2000 Professional
    d.  Change
    e.  Add or Remove Features
    f.  Office Tools
    g.  International Support
    h.  Universal Font
  2. Double-click the Unicode.EXE icon to start the program.
  3. Select the Block Combobox pulldown and select any Unicode block.  The sectors in the selected block will be highlighted in the Sector StringGrid.  Most Unicode blocks are a single sector, or a part of a sector.  Note that the blocks listed in the Block Combobox do not cover every sector and character.
  4. Using the Sector StringGrid, select any sector (where a sector is identified by the first two hex digits of a Unicode value).  All of the possible sectors can be selected this way.
  5. For a given Unicode sector, select any of the 256 characters in that sector from the Character StringGrid.  When a character is selected in the Character StringGrid, and the Add Checkbox is checked, the selected character will be appended to the text area below the Character StringGrid.  When a Unicode value is selected, information about that character will appear below the large graphic at the lower left (if the NamesList.Txt file is present).
  6. Use the Clear and Backspace buttons for simple editing in the Unicode text field.
  7. After selecting a string of Unicode characters, select the Copy to Clipboard button to copy this WideString to another application, such as Microsoft Word 2000.  (When the program starts, the text field already holds a default Unicode string.)

  8. Use Paste in the desired application, such as Word 2000.

  9. Select the pasted text, and change the font to Arial Unicode MS to see the correct Unicode string in Microsoft Word 2000.

  10. With any Unicode character selected, pick the File/Print button to create a Unicode bitmap of the specified size.  This bitmap can be saved to a specified file, or printed.  (Better print results will be obtained if you create a larger bitmap.  Typically, a size of 1024 should be used when printing the single-page character printouts.)

 

Discussion

Delphi supports Unicode characters and strings through the WideChar, pWideChar, and WideString types.

In Delphi you can work with WideChars and pass them to the Windows API, but you cannot display Unicode strings in native Delphi controls (at this time).  This program therefore draws Unicode text with the wide (W) API calls.

A WideChar occupies two bytes.  Since a byte can hold 256 possible values, two bytes can represent 256*256 = 65,536 possible Unicode code values.
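
To make the two-byte arithmetic concrete, here is a minimal sketch (not taken from the program's source) of how a 16-bit code value splits into a sector, its high byte, and an offset within that sector, its low byte.  The names SectorOf, OffsetOf, SectorFirst, and SectorLast are hypothetical helpers introduced only for this illustration.

```pascal
program SectorDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helpers for illustration; not part of the Unicode.EXE source.

// Sector = high byte of the 16-bit code value (the first two hex digits).
function SectorOf(w: WideChar): Byte;
begin
  Result := Ord(w) shr 8;
end;

// Offset = low byte, the position within the 256-character sector.
function OffsetOf(w: WideChar): Byte;
begin
  Result := Ord(w) and $FF;
end;

// First code value in a sector.
function SectorFirst(Sector: Byte): Word;
begin
  Result := Word(Sector) shl 8;
end;

// Last code value in a sector.
function SectorLast(Sector: Byte): Word;
begin
  Result := (Word(Sector) shl 8) or $FF;
end;

var
  w: WideChar;
begin
  w := WideChar($0416);   // CYRILLIC CAPITAL LETTER ZHE
  Writeln('Sector ', IntToHex(SectorOf(w), 2),
          ', offset ', IntToHex(OffsetOf(w), 2),
          ', range ', IntToHex(SectorFirst(SectorOf(w)), 4),
          '..', IntToHex(SectorLast(SectorOf(w)), 4));
end.
```

Compiled as a console program, this should print "Sector 04, offset 16, range 0400..04FF".  The same arithmetic accounts for the CJK Unified Ideographs figure: the block runs from 4E00 to 9FFF, that is sectors 4E through 9F, or 82 sectors.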

A Unicode string is a sequence of two-byte words.  The WideString type represents a dynamically allocated string of 16-bit Unicode characters.  In most respects it is similar to AnsiString, but it is less efficient because it does not implement reference-counting and copy-on-write semantics.

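The first 256 Unicode characters match ISO 8859-1 (Latin-1).  They agree with the Windows ANSI character set (code page 1252) except for the values $80 through $9F, which Unicode reserves for control characters.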
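
A minimal sketch of what this mapping means in code, assuming Latin-1 input: widening a byte to a WideChar leaves its value unchanged.  The one range that needs care is $80..$9F, which Unicode treats as control characters, while Windows code page 1252 places printable characters such as the euro sign there.  Latin1ToWide and IsC1Control are hypothetical helper names, not routines from UnicodeLibrary.

```pascal
program Latin1Demo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helpers for illustration only.

// Widening a Latin-1 byte keeps the numeric value: $E9 becomes U+00E9.
function Latin1ToWide(b: Byte): WideChar;
begin
  Result := WideChar(b);
end;

// True for U+0080..U+009F, which Unicode defines as control characters.
function IsC1Control(w: WideChar): Boolean;
begin
  Result := (Ord(w) >= $80) and (Ord(w) <= $9F);
end;

begin
  Writeln(IntToHex(Ord(Latin1ToWide($E9)), 4));   // 00E9
  Writeln(IsC1Control(Latin1ToWide($80)));        // inside the control range
  Writeln(IsC1Control(Latin1ToWide($A0)));        // just past the control range
end.
```

A check like IsC1Control may help explain why sector 00 shows nothing useful between 0080 and 009F even though a Windows text file can hold printable characters at those byte values.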

When the Unicode program starts, three checks are made in TFormUnicode.FormCreate before the screen appears (see the source program for complete details):

  1. Operating System Check

The Operating System must be Windows NT or Windows 2000 because of Unicode and certain API calls that are used.  This is enforced with the following code:

Operating System Check:  Only run Unicode.EXE under Windows NT/2000

VAR
  OSVersionInfo:  TOSVersionInfo;
...    
begin
  // Adapted from example on p. 706, "The Tomes of Delphi 3:  Win32 Core API"
  OSVersionInfo.dwOSVersionInfoSize := SizeOF(TOSVersionInfo);

  GetVersionEx(OSVersionInfo);

  IF   OSVersionInfo.dwPlatformID <> VER_PLATFORM_WIN32_NT
  THEN BEGIN
    ShowMessage('This program requires Windows NT/2000.' + #$0A +
                'Your OS Version is ' +
                IntToStr(OSVersionInfo.dwMajorVersion) +
                '.' +
                IntToStr(OSVersionInfo.dwMinorVersion) +
                ' Build ' +
                IntToStr(OSVersionInfo.dwBuildNumber) + '.'#$0A +
                'The application will be terminated.');
    Application.Terminate
  END;
...

Without the above check, the Unicode program dies with some cryptic error messages in Windows 95.

  2. NamesList.TXT File Check

NamesList.TXT File Check:  This file contains descriptive information about each Unicode character.

CONST
  NamesListFile = 'NamesList.TXT';
VAR
  filename      :  STRING;
...   
begin
...  
  filename := ExtractFilePath(ParamStr(0)) + NamesListFile;
  IF   FileExists(filename)
  THEN BEGIN
    UnicodeLibrary.LoadNamesListFile(filename);
    FormUnicode.Caption := UnicodeLibrary.UnicodeFilename
  END
  ELSE ShowMessage('Missing file ' + NamesListFile + '.'#$0A +
                   'Info about each Unicode will not be available.');

The UnicodeLibrary unit has a number of routines for working with Unicode, including parsing the NamesList.TXT file, which is loaded into memory with the LoadNamesListFile routine shown above.  You can edit the NamesList.Txt file to add additional information, but use an editor that shows you the location of the tab characters.  Do not remove or expand the tabs in this file or the parsing routine will no longer work correctly.
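
The warning about tabs follows from how entries in the file are delimited.  As a rough illustration only (the actual parser is in the UnicodeLibrary unit and is not reproduced here), the sketch below assumes that each character entry has the form hex code, tab, name, and that annotation lines begin with a tab.  ParseNameLine is a hypothetical name.

```pascal
program NamesListDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper; the real parsing in UnicodeLibrary may differ.
// Assumes a character entry has the form  <hex code><TAB><name>.
// Returns False for annotation lines (leading tab), header lines ('@...'),
// and anything else that does not start with a hex code value.
function ParseNameLine(const Line: string; var Code: Integer;
                       var CharName: string): Boolean;
var
  TabPos: Integer;
  Value : Integer;
begin
  Result := False;
  TabPos := Pos(#9, Line);
  if TabPos < 2 then Exit;               // no tab, or tab in column 1
  Value := StrToIntDef('$' + Copy(Line, 1, TabPos - 1), -1);
  if Value < 0 then Exit;                // text before the tab was not hex
  Code     := Value;
  CharName := Copy(Line, TabPos + 1, MaxInt);
  Result   := True;
end;

var
  Code    : Integer;
  CharName: string;
begin
  if ParseNameLine('0041' + #9 + 'LATIN CAPITAL LETTER A', Code, CharName) then
    Writeln(IntToHex(Code, 4), ' -> ', CharName);

  // Tabs expanded to spaces: the entry is no longer recognized.
  if not ParseNameLine('0041    LATIN CAPITAL LETTER A', Code, CharName) then
    Writeln('not a character entry');
end.
```

The second call shows the failure mode described above: once an editor expands the tab to spaces, a parser of this kind no longer finds the boundary between the code value and the name.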

  3. Arial Unicode MS Font Check

Arial Unicode MS Font Check (font is from Office 2000)

CONST
  UnicodeFont = 'Arial Unicode MS';  
...
IF   Screen.Fonts.IndexOf(UnicodeFont) < 0
THEN BEGIN
  FormUnicode.Caption := 'Unicode font <' + UnicodeFont +
                         '> is not available.';
  ShowMessage(FormUnicode.Caption + #$0A + 'Unicode characters will not appear.')
END;

Three main controls perform roughly the same actions.  A selection in the Block Combobox, or a click on either the Sector or Character StringGrid, results in at least the following steps:

  1. Define a range of Unicode characters (UnicodeRangeFrom and UnicodeRangeTo).

  2. Redraw StringGrids

  3. Draw Unicode Graphics

  4. Display information about the selected Unicode character from the NamesList.Txt file.

The last three steps are performed in TFormUnicode.UpdateEverything.  

A single Unicode character can be displayed in a bitmap using the UnicodeLibrary routine GetUnicodeBitmap:

How to Display a Single Unicode Character in a Bitmap

// Calling program responsible for freeing bitmap
FUNCTION GetUnicodeBitmap(CONST w:  WideChar; CONST BitmapSize:  INTEGER;
                          VAR Size:  TSize):  TBitmap;
  CONST
    UnicodeFont = 'Arial Unicode MS';
  VAR
    Palette: TMaxLogPalette;
BEGIN
  // Need black/white palette for pf1bit bitmap
  Palette.palVersion := $300;
  Palette.palNumEntries := 2;
  WITH Palette.palPalEntry[0] DO
  BEGIN
    peRed   := 0;
    peGreen := 0;
    peBlue  := 0;
    peFlags := 0
  END;
  WITH Palette.palPalEntry[1] DO
  BEGIN
    peRed   := 255;
    peGreen := 255;
    peBlue  := 255;
    peFlags := 0
  END;
  RESULT := TBitmap.Create;
  RESULT.PixelFormat := pf1bit;  
  RESULT.Height := BitmapSize;
  RESULT.Width  := BitmapSize;
  RESULT.Palette := CreatePalette(pLogPalette(@Palette)^);
  // Not clear why this is needed to clear whole background to white;
  // otherwise -- only sometimes -- part of bitmap is black instead.
  RESULT.Canvas.Brush.Color := clWhite;
  RESULT.Canvas.FillRect( RESULT.Canvas.ClipRect );
  RESULT.Canvas.Font.Name := UnicodeFont;
  RESULT.Canvas.Font.Height := BitmapSize;
  GetTextExtentPoint32W(RESULT.Canvas.Handle, @w, 1, Size);
  TextOutW(RESULT.Canvas.Handle,
          (RESULT.Width - size.cx) DIV 2, 0,   // center left-to-right
          @w, 1)
END {GetUnicodeBitmap};

GetUnicodeBitmap was used both in TFormUnicode.DisplayUnicodeGraphic and TFormSave.DisplayUnicodeGraphic.  Displaying a Unicode String is about the same amount of work:

How to Display a Unicode String in a Bitmap

PROCEDURE TFormUnicode.DisplayUnicodeString;
  VAR
    Bitmap:  TBitmap;
    Size  :  TSize;
BEGIN
  Bitmap := TBitmap.Create;
  TRY
    Bitmap.Width  := ImageUnicodeString.Width;
    Bitmap.Height := ImageUnicodeString.Height;
    Bitmap.PixelFormat := pf1bit;
   
    Bitmap.Canvas.Brush.Color := clWhite;
    Bitmap.Canvas.FillRect(Bitmap.Canvas.ClipRect);
    Bitmap.Canvas.Font.Name := UnicodeFont;
    Bitmap.Canvas.Font.Height := Bitmap.Height;
    // "scroll left" as necessary to make sure most recent character fits
    GetTextExtentPoint32W(Bitmap.Canvas.Handle, pWideChar(UnicodeString),
                           Length(UnicodeString), size);
    WHILE Size.cx > Bitmap.Width DO
    BEGIN
      Delete(UnicodeString, 1, 1);
      GetTextExtentPoint32W(Bitmap.Canvas.Handle, pWideChar(UnicodeString),
                           Length(UnicodeString), size)
    END;
    TextOutW(Bitmap.Canvas.Handle, 0, 0, pWideChar(UnicodeString), 
             Length(UnicodeString));
    ImageUnicodeString.Picture.Graphic := Bitmap
  FINALLY
    Bitmap.Free
  END;
  ...
END {DisplayUnicodeString};

Why doesn't DisplayUnicodeString deal with palettes when GetUnicodeBitmap does?  I don't know -- this was discovered quite by accident.  For now, I'm calling this a pf1bit enigma (and will investigate this more when time permits).  For some reason, setting the PixelFormat to pf1bit after setting a Bitmap's height and width results in the "correct" black and white bitmap, BUT setting the PixelFormat first (before setting the Bitmap's height and width) does not result in the correct colors and requires setting the Bitmap's palette for the two colors it contains.  Setting the PixelFormat first, especially with pf1bit bitmaps, is usually desirable to reduce memory use.

Instead of using the TextOutW API call, the ExtTextOutW API call can also be used (as pointed out by Mike Lischke) to display a Unicode string:

VAR
  Rect  :  TRect;
...
//  Alternate way to display Unicode string as used in Mike Lischke's code
    Rect := Bitmap.Canvas.ClipRect;
    ExtTextOutW(Bitmap.Canvas.Handle, 0,0, ETO_CLIPPED,
               @Rect,
               pWideChar(UnicodeString), Length(UnicodeString), NIL);

Unfortunately, Arial Unicode MS apparently is defined using the Unicode 2.1 standard, while the NamesList.TXT file used in this program is defined using the Unicode 3.0 standard, so a character described in NamesList.TXT may have no glyph in the font.

Conclusions
Many of the Unicode characters are quite interesting.  Explore them using this program.


Feedback

Mike Lischke's comments (23 May 2000):  "... with ExtTextOutW you can use Unicode on Win9x platforms too. ...Additionally, you should perhaps also point out that Arial Unicode MS is not the only Unicode font. Without Office 2000 (which many people won't have) you still can get remarkable results with Courier New, Lucidia Sans Unicode etc. (which are all freely downloadable form Microsoft's site)."

Thanks, Mike.  I may rework the program to use ExtTextOutW and the other fonts some day.  -- efg

Peter's comments (6 May 2001):   "In your Unicode program, the characters 128 (80) to 159 (9F) are not displayed properly.   ...  The characters 128 (0080) to 159 (009F) seem to be control characters in Unicode.  See http://www.unicode.org/charts/PDF/U0080.pdf for more info."

I cannot explain your observation.  Arial Unicode MS, used in my program, apparently is defined using the Unicode 2.1 standard and you're citing a Unicode 3.0 standard -- I don't know if that matters.  Since I use the same API call for each of the 64 K Unicode characters, the characters I'm displaying are defined by Microsoft, not me.  -- efg

Mark O'Farrell's comments (27 July 2003):  "By the way I disabled the OS Version checking and the program runs perfectly under Win98SE."


Resources

General Info
The corresponding ISO standard for Unicode 3.0 is ISO/IEC 10646-1:2000.
From "Unicode: A Primer":
  character:  abstract meaning of a particular shape
  glyph:  visual representation of a character
  font:  collection of glyphs for each character

Unicode's characters
http://czyborra.com/unicode/characters.html 

Worried about localization problems? Content-enabled software, through Unicode, may be what really matters, Byte Magazine, March 1997
www.byte.com/art/9703/sec7/art5.htm 

AnsiToUnicode
FUNCTION AnsiToUnicode(s:  STRING; VAR NewSize:   INTEGER):  pWideChar;
Delphi 2 Unleashed, p. 49.  Calls the MultiByteToWideChar API function.

Character Sets

Codepage & Co.
http://czyborra.com/charsets/codepages.html 

Charts

Unicode codepages
www.microsoft.com/typography/unicode/unicodecp.htm 

Fonts
Multilingual Unicode TrueType Fonts in the Internet
www.ccss.de/slovo/unifonts.htm

Samples/Tests
Unicode Test Page
www.cogsci.ed.ac.uk/~richard/unicode-sample.html

Unicode, Inc.
www.unicode.org

Unicode Data (most files dated 9 Sept 1999)
(includes Blocks.txt, NameList.txt, and UnicodeData.txt)
ftp://ftp.unicode.org/Public/UNIDATA 

Unicode Character Database Version 3.0
(includes Blocks-3.txt, NamesList-3.0.0.txt, and Unicode-3.0.0.txt)
ftp://ftp.unicode.org/Public/3.0-Update 

Unicode Character Database Version 2.0
(includes Blocks-1.txt, NamesList-1.txt, and Unicode-2.0.14.txt)
ftp://ftp.unicode.org/Public/2.0-Update 

UnicodeToAnsi
FUNCTION UnicodeToAnsi(s:  pWideChar):  STRING;
Delphi 2 Unleashed, p. 49.  Calls the WideCharToMultiByte API function.

Unicomp
Unicode components for Windows 2000/XP.  This component package includes Edit, Memo, RichEdit, Listbox, Listview, Treeview, Combogrid, and Stringgrid.
http://delphi.icm.edu.pl/ftp/d50share/Unicomp.zip

UseNet Posts
Steve Schafer's UseNet post about how to view Unicode on a form
Mike Lischke's UseNet post about how to use MultiByteToWideChar to convert a string to Unicode

Web

Unicode and Multilingual Support in Web Browsers and HTML
http://www.alanwood.net/unicode/ 

Latin 1 and Unicode characters in &ampersand; entities
www.pemberley.com/janeinfo/latin1.html 


Keywords
Unicode, Arial Unicode MS font, Unicode Block, Unicode NamesList.TXT, WideChar, WideString, SetLength, Delete, TextOutW, ExtTextOutW, GetTextExtentPoint32W, StringGridDrawCell, StringGridClick, TBitmap, TImage, palette, TRichEdit, TCombobox, Invalidate, GetVersionEx, TOSVersionInfo, Fonts.IndexOf, ShellExecute, OpenClipboard, EmptyClipboard, SetClipboardData, CloseClipboard, CF_UNICODETEXT, StretchDIBits, Printer.Canvas

Download
Delphi 4/5 Source and EXE:  Unicode.ZIP (359 KB zip file includes NamesList.Txt file)

UPDATE (17 Sept 2007):  Ciarán Ó Duibhín added a combo box to choose among the installed fonts, and also got the program to run under all versions of Windows from 95B (at least) up.  Getting it to run in 95B required removal of the PixelFormat assignments to pf1bit; with these, the Unicode buffer and the enlarged bitmap remain blank in Win 95 (though not in Win 98).  The character grid is not subject to any assignment of PixelFormat and worked in all versions of Windows.  Ó Duibhín's changes are all marked by comments in this file:  UnicodeViewer-Duibhin.zip


Updated 14 Jun 2009
Since 21 May 2000