{"id":563,"date":"2009-09-26T23:55:32","date_gmt":"2009-09-26T11:55:32","guid":{"rendered":"https:\/\/www.deltics.co.nz\/blog\/?p=563"},"modified":"2009-09-26T23:55:32","modified_gmt":"2009-09-26T11:55:32","slug":"delphi-unicode-wide-ansi","status":"publish","type":"post","link":"https:\/\/www.deltics.co.nz\/blog\/posts\/563\/","title":{"rendered":"Delphi Unicode = Wide-ANSI"},"content":{"rendered":"<span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">[Estimated Reading Time: <\/span> <span class=\"rt-time\"> 7<\/span> <span class=\"rt-label rt-postfix\">minutes]<\/span><\/span><p>Be careful what you wish for.  A lot of people were overjoyed to hear that Unicode support was coming to Delphi. Some were skeptical of the chosen implementation approach however, it all seemed just a little bit <em>too<\/em> easy.  I was one, and sadly it seems I was right.<br \/>\n<!--more--><br \/>\nI&#8217;ve just started updating a whole host of code to Delphi 2009. \u00a0Since Unicode is what my code has to speak whether I like it or not (if I want to use the latest\/greatest Delphi compilers) then I may as well get with the program and drag my ANSI code kicking and screaming into the UTF-16 world.<\/p>\n<p>&#8220;Kicking and Screaming&#8221; is certainly involved. \u00a0But mostly it&#8217;s me doing it.<\/p>\n<p>In utter frustration.<\/p>\n<p>There are currently two head-shaped dents in the wall.  Allow me to share my pain.<\/p>\n<h2>Dent #1: UTF8ToANSI()<\/h2>\n<p>The help says:<\/p>\n<blockquote><p>Call Utf8ToAnsi to convert a UTF-8 string to Ansi. S is a string encoded in  UTF-8. Utf8ToAnsi returns the corresponding string that uses the Ansi character  set.<\/p><\/blockquote>\n<p>Well if you <span style=\"text-decoration: underline;\">do<\/span> call <strong>UTF8ToAnsi <\/strong>and expect it to do what that says it will do then you are in for a disappointment. 
\u00a0Because it actually returns a <strong>UnicodeString<\/strong>.<\/p>\n<p>Not &#8220;what it says on the tin&#8221; and sure as eggs is eggs not what the caller wants or expects, no matter how quickly and dirtily they want to hack their old ANSI code into a warning-free ANSI-on-WideAPI &#8220;pseudo-Unicode&#8221; application.<\/p>\n<p>Dumb. \u00a0Dumb. \u00a0Dumb.<\/p>\n<p>But it gets even dumber.  There <em>is<\/em> now also a function <strong>UTF8ToString()<\/strong>!  And if we inspect the source for that we find the following puzzling implementation:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\n  function UTF8ToString(const S: RawByteString): string;\r\n  begin\r\n  {$IFDEF UNICODE}\r\n    Result := UTF8ToUnicodeString(S);\r\n  {$ELSE}\r\n    Result := UTf8ToAnsi(S);\r\n  {$ENDIF}\r\n  end;\r\n<\/pre>\n<p>You may be thinking that this isn&#8217;t puzzling at all.  That it is a perfectly sensible implementation &#8211; if we&#8217;re compiling with UNICODE defined then defer to the UnicodeString implementation.  Otherwise defer to the ANSI implementation.<\/p>\n<p>But hang on, since when was UNICODE an option in Delphi 2009?  
CodeGear decided that it was to be &#8220;Unicode or the Hi-rode&#8221;.<\/p>\n<p>But it gets better, because of course in deferring to the ANSI implementation &#8211; even assuming this is ever compiled with UNICODE <strong>not<\/strong> defined &#8211; they are of course calling the so-called-ANSI implementation that &#8211; UNICODE or not &#8211; calls the Unicode implementation!<\/p>\n<p>You can&#8217;t help but think that CodeGear managed to confuse even themselves with their approach to Unicode.<\/p>\n<h2>OK, But What&#8217;s this Wide-ANSI Nonsense?<\/h2>\n<p>OK, so <strong>UTF8ToANSI()<\/strong> is not, I&#8217;m guessing, something that many people are actually using (my code is working with the <a href=\"http:\/\/www.apple.com\/support\/bonjour\/\">Apple Bonjour SDK<\/a> which makes extensive use of UTF-8, and the code originated in Delphi 7, hence the ANSI\/UTF8 conversions).<\/p>\n<p>So I ran into an <a href=\"http:\/\/en.wikipedia.org\/wiki\/Edge_case\">edge-case<\/a>.  No big deal, right?  Granted.  But sadly there are other, wider [sic] issues which people <em>will<\/em> undoubtedly run into.<\/p>\n<p>Let me give you an example that I&#8217;m certain will be encountered by more people than that UTF8\/ANSI scenario.<\/p>\n<h2>Dent #2: Uppercase()<\/h2>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\nvar ws: String;  \/\/ UnicodeString in Delphi 2009\/10\r\n\r\nws := 'a\u00e0';\r\nws := Uppercase(ws);\r\n<\/pre>\n<p>What does <strong>ws<\/strong> contain after the call to <strong>Uppercase()<\/strong>? \u00a0If you said\u00a0<strong>A\u00c0<\/strong> then congratulations &#8211; you are 100% <span style=\"text-decoration: underline;\">wrong<\/span>. 
\u00a0It actually contains\u00a0<strong>A\u00e0<\/strong>.<\/p>\n<p>Sadly, that is NOT the uppercase version of the input string according to the Unicode specification.<\/p>\n<p>Now, to be fair, the documentation in this case is quite correct and clear that Uppercase() does not handle Unicode strings as Unicode: it converts only the 7-bit ASCII letters 'a' to 'z'.  Which makes an absolute mockery of a strongly typed language such as Delphi.<\/p>\n<p><strong>Uppercase()<\/strong> accepts a <strong>String<\/strong> parameter.  In Delphi 2009 <strong>String<\/strong> is a Unicode data type.<\/p>\n<p>In Delphi 2009+ <strong>Uppercase()<\/strong> is &#8211; by contract &#8211; a Unicode function and should jolly well behave as such.<\/p>\n<p>Frankly, this decision by CodeGear speaks to me of a cavalier disregard for what makes Delphi, well, Delphi.<\/p>\n<p>Anyway&#8230; Now, the hard part.  Can you guess what function you <em>should<\/em> use to properly convert a UnicodeString to uppercase?<\/p>\n<p>Well, you actually have a choice of two:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\n    WideUppercase( );\r\n    ANSIUppercase( );\r\n<\/pre>\n<p>No, wait.  That can&#8217;t be right, surely?  <strong>ANSI<\/strong>Uppercase?<\/p>\n<p>Yes.  
Of course the alarm bells should be ringing as soon as you realise that <strong>ANSIUppercase<\/strong> accepts not an <strong>ANSIString<\/strong> parameter but a plain <strong>String<\/strong> parameter, which of course made perfect sense pre-Delphi 2009 when <strong>String<\/strong> meant <strong>ANSIString<\/strong>, but in Delphi 2009+ CodeGear found themselves painted into a corner.<\/p>\n<p>Having decided that they, and only they, would get to choose what <strong>String<\/strong> meant to the compiler (although I think that the $ifdef UNICODE in System.pas is evidence that they weren&#8217;t &#8211; at one time at least &#8211; as convinced that this was the right and obvious thing to do as they seemed to suggest), if they properly modified these ANSI RTL functions to accept ANSI strings, then a lot of code currently calling them with <strong>String<\/strong> parameters would start throwing up warnings when <strong>String<\/strong> transformed from <strong>ANSIString<\/strong> to <strong>UnicodeString<\/strong>.<\/p>\n<p>But at the same time the vast, and I mean VAST majority of code would simply be calling <strong>Uppercase()<\/strong>, and they didn&#8217;t want any warnings coming from that either.<\/p>\n<p>What a pickle.<\/p>\n<h2>Fight Fire With Fire<\/h2>\n<p>It seems they chose to resolve this dilemma by adding a little more confusion into the mix, creating a situation where someone explicitly calling an ANSI routine will obtain a Wide operation and someone calling the &#8220;default&#8221; implementation, presumably expecting it to yield a Wide operation (because Delphi 2009 is Unicode through and through, right?) 
is rewarded with an ANSI operation.<\/p>\n<p>Perhaps they thought that if they made our heads spin enough we wouldn&#8217;t notice what a pig&#8217;s ear they&#8217;d made of the whole thing?<\/p>\n<p>But what does all this mean for someone with some old apps that they can&#8217;t wait to get updated to the latest Unicode compiler so that they can start selling to their customers who have been demanding Unicode support?<\/p>\n<p>Well, having compiled your pre-Unicode source with the Unicode compiler you may have been very happy to find that you had only a handful of warnings to deal with and then BINGO, you had a Unicode application.<\/p>\n<p>Sorry.<\/p>\n<p>I hate to break it to you, but you may not have finished yet.<\/p>\n<p>What you have at the moment is an application that in all likelihood is still behaving very much like an ANSI application; it&#8217;s just that it&#8217;s now sitting atop the Windows Wide APIs, <em>pretending<\/em> to be a Unicode application.<\/p>\n<p>In many respects it may fool you, perhaps for a long time, because anyone who hadn&#8217;t previously tackled Unicode head-on almost certainly doesn&#8217;t have any actual need for Unicode and their application will not be stressing those parts of Unicode support that actually separate a &#8220;Unicode&#8221; application from a &#8220;non-Unicode&#8221; application.<\/p>\n<h2>A Thought Experiment<\/h2>\n<p>This is a &#8220;Thought Experiment&#8221; in the sense of &#8220;You know what Thought did?&#8221;  One answer to which is &#8220;He thought he did, but he didn&#8217;t.&#8221;<\/p>\n<p>Let&#8217;s take a simple and, I think, highly common situation.  
A text entry field for some user-specified code that is required to accept only alphanumeric characters and enforce uppercase.<\/p>\n<p>Pre-Delphi 2009 such a field may well have had a key filter installed in an event handler:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\n  procedure TForm1.Edit1KeyPress(Sender: TObject; var Key: Char);\r\n  begin\r\n    if (Key in &#x5B;'a'..'z']) then\r\n      Key := UpCase(Key)  \/\/ force lowercase letters to uppercase\r\n    else if NOT (Key in &#x5B;'0'..'9', 'A'..'Z']) then\r\n      Key := #0;  \/\/ swallow anything that is not a letter or a digit\r\n  end;\r\n<\/pre>\n<p>In Delphi 2009 this throws up the &#8220;WideChar reduced to byte char&#8221; warning that <a href=\"http:\/\/www.google.co.nz\/search?q=WideChar+reduced+to+byte+char\">everyone &#8220;doing Unicode&#8221; in Delphi has been talking about<\/a>, and good little Delphi developers do what CodeGear tell them and change this code.  But let us imagine that we know a little bit about Unicode and are actually migrating to Delphi 2009 because we want to be able to market our product as a Unicode application.<\/p>\n<p>So rather than simply calling <strong>CharInSet<\/strong>, because that still only deals with single-byte characters, we adjust the routine to call the Windows Unicode support routine <strong>IsCharAlphaNumeric<\/strong> instead:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\n  procedure TForm1.Edit1KeyPress(Sender: TObject; var Key: Char);\r\n  begin\r\n    if IsCharAlphaNumeric(Key) then\r\n      Key := UpCase(Key)\r\n    else\r\n      Key := #0;\r\n  end;\r\n<\/pre>\n<p>With this simple change the code now compiles without any warnings.<\/p>\n<p>YAY!  We didn&#8217;t fall into the trap laid for us by CharInSet!  We dealt with this properly and now we have a Unicode application!<\/p>\n<p>Don&#8217;t we?<\/p>\n<p>No.<\/p>\n<p>Can you guess what that <strong>WideChar<\/strong> version of <strong>UpCase()<\/strong> does?  Yep.  Behaves exactly the same as the ANSI version.<\/p>\n<p>Dumb. Dumb. 
Dumb.<\/p>\n<p>There is not even the excuse of needing to maintain backward compatibility in this case &#8211; there was no WideChar version of <strong>UpCase()<\/strong> prior to Delphi 2009.<\/p>\n<p>Now, anyone who understands Unicode and specifically the properties and characteristics of UTF-16 will be able to tell you why <strong>UpCase( WideChar )<\/strong> does not perform a Unicode case conversion &#8211; it simply <strong>can<\/strong>not.  A single WideChar may not represent an entire character &#8211; it may be one half of a surrogate pair, and whilst case-convertible characters that require surrogate pairs are rare, they do exist (the Deseret alphabet, for one) and more could be added.<\/p>\n<p>So what else could it do but echo the ANSI implementation?<\/p>\n<p>Well, one option would have been to reflect the true nature of Unicode and simply not try to create the illusion of supporting something that cannot be supported.<\/p>\n<p>But more acceptable, I think, would have been to perform a Unicode conversion on those chars that it could (non-surrogates) and, if a surrogate was supplied as input, simply return it unmodified.<\/p>\n<p>As it is, the previous (and current) &#8220;ANSI&#8221; implementation is incomplete with respect to ANSI, providing only ASCII case conversion.  It would have been easy to argue that a Unicode implementation that only operated on characters in the BMP (Basic Multilingual Plane) was the natural and obvious behaviour for a Wide UpCase implementation.<\/p>\n<p>So I&#8217;m afraid if you do want and\/or need proper Unicode support, you&#8217;ve still got some work to do before you can get there, and unfortunately the compiler is not going to help you from this point on.  
Furthermore, the VCL now seems to go out of its way to make it harder by, in some cases, completely breaking the type-safety that you can normally expect when working in Delphi and which would have guided you toward the answers to the numerous questions that arise when contemplating proper Unicode support.<\/p>\n<p>The compiler assistance in &#8220;helping you find the things you need to change for Unicode support&#8221; only really works if you don&#8217;t actually need <em>proper<\/em> Unicode support and just want to get your ANSI code running over the Wide API in Windows.<\/p>\n<p>Once you&#8217;ve reached that point &#8211; or even before that &#8211; if you then decide you want to do Unicode properly, I fear you will find that the design decisions made to facilitate the migration of Wide-ANSI applications will frustrate you and complicate your job no end.<\/p>\n<p>I have to wonder if it really was worth getting so excited about Unicode support if its main audience is people not actually supporting Unicode properly?<\/p>\n<p>I can only hope that the 64-bit support that it seems people increasingly need is not being delayed in order to make way for a cross-platform implementation that will suffer the same identity crisis that afflicts the Unicode implementation.<\/p>\n<p>Getting it done quickly is sometimes not as important as doing it right.<\/p>\n<p>In the case of Unicode I&#8217;m afraid I&#8217;ve not come across anything to make me think it was &#8220;done right&#8221; at all.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">[Estimated Reading Time: <\/span> <span class=\"rt-time\"> 7<\/span> <span class=\"rt-label rt-postfix\">minutes]<\/span><\/span>Be careful what you wish for. A lot of people were overjoyed to hear that Unicode support was coming to Delphi. 
Some were skeptical of the chosen implementation approach however, it all seemed just a little bit too easy. I was one, and sadly it seems I was right.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[4],"tags":[292,22],"class_list":["post-563","post","type-post","status-publish","format-standard","hentry","category-delphi","tag-delphi","tag-unicode"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1TKYv-95","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/563","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/comments?post=563"}],"version-history":[{"count":10,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/563\/revisions"}],"predecessor-version":[{"id":667,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/563\/revisions\/667"}],"wp:attachment":[{"href":"https:\/\/www.deltics.co.nz\/
blog\/wp-json\/wp\/v2\/media?parent=563"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/categories?post=563"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/tags?post=563"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}