Why doesn't printf format unicode parameters?

7

When using printf to format a double-byte string into a single-byte string:

printf("%ls\n", L"s:\\яшертыHello");   // %ls for a wide string (%s varies meaning depending on the project's unicode settings).

Clearly, some characters can't be represented as ASCII, so I have sometimes seen double-byte characters turned into a '?' character. But this seems to depend on the particular characters. For the printf above, the output is:

s:\

I was hoping I might get something like:

s:\??????Hello

I'm afraid I've lost the example, but I think that for one string, when printf encountered Unicode characters, it replaced the first one with a '?' and then gave up on the rest.

So my question is: what is supposed to happen when you format a wide string into a single-byte string? The documentation here: http://msdn.microsoft.com/en-us/library/hf4y5e3w.aspx says "Characters are displayed up to the first null character", but I'm not seeing that. Is this a bug in printf, or is the behaviour I'm seeing documented somewhere? If so, where?

Thanks for your help.

UPDATE

Thanks for the answers giving me alternatives to printf. I am going to switch to one, but out of curiosity I'd really like to know why printf doesn't have reliable, documented behaviour here. It almost appears as if the implementer went out of their way to make this not work.
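
For reference, the '?' fallback hoped for above can be approximated by converting one character at a time. This is only a minimal sketch in standard C, not what any printf implementation does: narrow_with_fallback is a hypothetical helper name, and the sketch assumes a stateless narrow encoding (it never emits a closing shift sequence).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts ws to the current locale's narrow encoding,
   writing '?' for every wide character that wcrtomb cannot convert.
   Output is truncated to fit dst_size and always null-terminated.
   Returns the number of bytes written, not counting the terminator. */
size_t narrow_with_fallback(char *dst, size_t dst_size, const wchar_t *ws)
{
    assert(dst != NULL && ws != NULL && dst_size > 0);

    size_t used = 0;
    mbstate_t state;
    memset(&state, 0, sizeof state);           /* initial conversion state */

    for (; *ws != L'\0'; ++ws) {
        char buf[MB_LEN_MAX];
        size_t n = wcrtomb(buf, *ws, &state);
        if (n == (size_t)-1) {                 /* not representable in this locale */
            buf[0] = '?';
            n = 1;
            memset(&state, 0, sizeof state);   /* state is unspecified after an error */
        }
        if (used + n >= dst_size)              /* keep room for the terminator */
            break;
        memcpy(dst + used, buf, n);
        used += n;
    }
    dst[used] = '\0';
    return used;
}
```

Called in the default "C" locale with a buffer of 32 bytes and L"s:\\\u044f\u0448Hello", this should leave s:\??Hello in the buffer. On Windows, where wchar_t holds UTF-16, a character outside the BMP would presumably come out as two '?' marks, one per surrogate.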

2012-04-04 08:13
by Scott Langham
Have you tried "%S" as the format specifier instead of "%ls"? - Daniel Schlößer 2012-04-04 08:20
Yes. I believe %S and %ls have the same meaning if your project doesn't have UNICODE defined - Scott Langham 2012-04-04 08:22
Reading the format specifications (which I agree aren't clear): %S is for a wide string when your project settings do not have UNICODE defined, and for a single-byte string when they do. %ls is for a wide string regardless of whether you're building for UNICODE. %s also varies in meaning; %hs is always for single-byte strings - Scott Langham 2012-04-04 08:23
It seems to be a sprintf/printf issue. wsprintfA works fine - Abyx 2012-04-04 08:34
OK, the meaning of %S and %s doesn't depend on whether UNICODE is defined; it depends on whether you use printf or wprintf - Scott Langham 2012-04-04 12:37


12

I expect your code to work -- and it works here on Linux -- but it is locale dependent. That means you have to set up the locale and your locale must support the character set used. Here is my test program:

#include <locale.h>
#include <stdio.h>

int main(void)
{
    char* l = setlocale(LC_ALL, "");
    if (l == NULL) {
        printf("Locale not set\n");
    } else {
        printf("Locale set to %s\n", l);
    }
    printf("%ls\n", L"s:\\яшертыHello");
    return 0;
}

and here is an execution trace:

$ env LC_ALL=en_US.utf8 ./a.out
Locale set to en_US.utf8
s:\яшертыHello

If it says that the locale isn't set or is set to "C", it is normal that you don't get the result you expect.
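
To make the locale dependence concrete: the C standard specifies that %ls converts each wide character as if by wcrtomb, and that printf returns a negative value on an encoding error. The same check can be run directly before printing. A minimal sketch, where narrow_length is a hypothetical helper name of mine:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns how many bytes ws would occupy in the current
   locale's narrow encoding (terminator not counted), or (size_t)-1 if some
   character cannot be represented in that encoding. */
size_t narrow_length(const wchar_t *ws)
{
    assert(ws != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    const wchar_t *src = ws;
    /* With a null destination, wcsrtombs only measures; the length
       argument is ignored and nothing is written. */
    return wcsrtombs(NULL, &src, 0, &state);
}
```

In the "C" locale, narrow_length(L"s:\\Hello") gives 8, while the same string with Cyrillic letters in it gives (size_t)-1. After a successful setlocale to a UTF-8 locale, the Cyrillic version should instead report a byte count.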

Edit: see the answers to this question for the equivalent of en_US.utf8 for Windows.

2012-04-04 08:31
by AProgrammer
Hmm. This answer seems in the right kind of area. I wonder how you've got your locale set to utf8 though... when I try that, setlocale fails. The docs here: http://msdn.microsoft.com/en-us/library/x99tb11d.aspx (search for utf-8) say it will fail if you try UTF-8. Maybe it just doesn't work in Microsoft's implementation - Scott Langham 2012-04-04 09:22
@ScottLangham, locale names aren't standardized and I don't know what is supported under Windows but I'd be surprised if they don't have any Unicode -- not necessarily UTF8 -- locale - AProgrammer 2012-04-04 11:23
Windows doesn't support a 'Unicode' locale. On all implementations wchar_t's encoding is locale independent, so a locale's encoding only relates to the narrow character encoding. So a 'Unicode' locale essentially requires UTF-8, and Windows doesn't provide any locales using UTF-8. Windows supports Unicode by using UTF-16 as the wchar_t encoding - bames53 2012-04-04 18:42


5

In C++ I usually use std::stringstream to create formatted text. I also implemented my own operator that uses a Windows function to do the encoding:

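// Requires <windows.h>, <ostream> and <vector>.
std::ostream & operator << ( std::ostream &os, const wchar_t * str )
{
  if ( ( str == 0 ) || ( str[0] == L'\0' ) )
    return os;
  // First call: ask for the required buffer size in bytes (includes the terminating null).
  int new_size = WideCharToMultiByte( CP_UTF8, 0, str, -1, NULL, 0, NULL, NULL );
  if ( new_size <= 0 )
    return os;
  std::vector<char> buffer(new_size);
  // Second call: perform the actual conversion to UTF-8.
  if ( WideCharToMultiByte( CP_UTF8, 0, str, -1, &buffer[0], new_size, NULL, NULL ) > 0 )
    os << &buffer[0];
  return os;
}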

This code converts to UTF-8. For other possibilities, check WideCharToMultiByte.
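
For readers wondering what UTF-8 output looks like at the byte level, here is a rough sketch in plain C that encodes a single code point. utf8_encode is a hypothetical helper, not part of the Windows API, and it takes a code point rather than UTF-16 input, so it illustrates the encoding only and is not a replacement for WideCharToMultiByte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8 into out,
   which must have room for at least 4 bytes. Returns the number of bytes
   written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not code points to encode */
        return 0;
    if (cp < 0x10000) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the letter я (U+044F) from the question's string encodes to the two bytes 0xD1 0x8F, so a console that is not interpreting its input as UTF-8 will show two unrelated characters for it.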

2012-04-04 08:26
by Naszta
Nice example of how to do this - jcoder 2012-04-04 08:35
@JohnB: thanks! - Naszta 2012-04-04 08:35