Incidencia #39687

wcsrtombs with NULL dest pointer doesn't ignore len parameter

Abrir Fecha: 2019-10-20 02:38 Última actualización: 2020-03-20 07:04

Informador:
Propietario:
Tipo:
Estado:
Open [Owner assigned]
Componente:
Hito:
(Ninguno)
Prioridad:
5 - Medium
Gravedad:
5 - Medium
Resolución:
Ninguno
Fichero:
2
Vote
Score: 0
No votes
0.0% (0/0)
0.0% (0/0)

Details

I'm using MinGW-gcc-6.3.0. The wcsrtombs() function as mentioned in the docs on cppreference should return the number of bytes that would have been written to src. However it doesn't do so on my end. It seems the implementation doesn't ignore the length parameter when dest is passed as NULL? A similar issue was reported and presumably fixed for the Mingw-w64 on the sourceforge site Currently working around by passing INT_MAX as the length parameter, so it finishes within the limit and returns the size.

I apologize in advance if this is just an issue from my end or if any other info is missing. First time submitting a ticket :)

Ticket History (3/23 Histories)

2019-10-20 02:38 Updated by: gallickgunner
  • New Ticket "wcsrtombs with null dest pointer doesn't ignore len parameter" created
2019-10-20 02:53 Updated by: gallickgunner
  • Details Updated
2019-10-23 02:27 Updated by: keith
  • Propietario Update from (Ninguno) to keith
  • Componente Update from GCC to WSL
  • Details Updated
  • Summary Updated
Comentario

Thanks for bringing this to my attention.

Yes, there are issues with the MinGW implementations of wcsrtombs(), and the related wcrtomb() functions. They go far beyond this mishandling of a NULL pointer destination in wcsrtombs(); in particular, they fail to defer to the Microsoft implementations, which were first introduced by MSVCR80.DLL, and subsequently retrofitted to MSVCRT.DLL in Vista, in addition to both being non-conforming (to ISO-C99), w.r.t. handling of a NULL destination pointer.

I'll look into providing conforming implementations, for mingwrt-5.3 — these are runtime library issues, not issues with GCC itself. In the meantime, it would be helpful if you can provide a SSCCE to both demonstrate the issue, and test any potential solution.

2019-10-24 16:14 Updated by: gallickgunner
Comentario

Reply To keith

Thanks for bringing this to my attention. Yes, there are issues with the MinGW implementations of wcsrtombs(), and the related wcrtomb() functions. They go far beyond this mishandling of a NULL pointer destination in wcsrtombs(); in particular, they fail to defer to the Microsoft implementations, which were first introduced by MSVCR80.DLL, and subsequently retrofitted to MSVCRT.DLL in Vista, in addition to both being non-conforming (to ISO-C99), w.r.t. handling of a NULL destination pointer. I'll look into providing conforming implementations, for mingwrt-5.3 — these are runtime library issues, not issues with GCC itself. In the meantime, it would be helpful if you can provide a SSCCE to both demonstrate the issue, and test any potential solution.

Thanks didn't know it was the runtime library issue. Anyways I'd be happy to post a code example. Is a simple concise main.cpp enough?

2019-10-24 19:44 Updated by: keith
Comentario

Reply To gallickgunner

I'd be happy to post a code example. Is a simple concise main.cpp enough?

Yes. Short and sweet is the order of the day. Concise is good ... provided it is sufficient to illustrate the fault, and can subsequently be used to show that any proposed solution is effective.

2019-10-25 16:12 Updated by: gallickgunner
Comentario

Sorry about the delay but I think the site encppreference gives a perfect example. Posting the code here,

#include <iostream>
#include <vector>
#include <string>
#include <cwchar>
#include <clocale>
 
void print_wide(const wchar_t* wstr)
{
    std::mbstate_t state = std::mbstate_t();
    std::size_t len = 1 + std::wcsrtombs(nullptr, &wstr, 0, &state);
    std::vector<char> mbstr(len);
    std::wcsrtombs(&mbstr[0], &wstr, mbstr.size(), &state);
    std::cout << "multibyte string: " << &mbstr[0] << '\n'
              << "Length, including '\\0': " << mbstr.size() << '\n';
}
 
int main()
{
    std::setlocale(LC_ALL, "");
    const wchar_t* wstr = L"Hello";
    print_wide(wstr);
}

Expected Output

multibyte string: Hello
Length, including '\0': 6

If Length is printed 1 or "Hello" doesnt get printed, the function doesn't work correctly.

(Edited, 2019-10-26 03:08 Updated by: gallickgunner)
2019-10-25 20:56 Updated by: keith
Comentario

Reply To gallickgunner

So if Length is printed 1 it means the function doesn't work correctly.

And, what should be printed, when the function is working correctly? That is vital information, to make this a valid test case.

2019-10-26 03:04 Updated by: gallickgunner
Comentario

Reply To keith

Reply To gallickgunner

So if Length is printed 1 it means the function doesn't work correctly.

And, what should be printed, when the function is working correctly? That is vital information, to make this a valid test case.

Real sorry. I updated the code and replaced the string literal with fancy characters with a simple "Hello". I'm not an expert at strings and encodings but I think the number of characters of the previous string would depend on the encoding used. As depending on the encoding the special characters can get translated to different multibyte characters. encppreference used en_US.utf8 for the locale but it isn't available on all platforms. I think the simple string Hello is gonna be converted to the same old 5 characters regardless of the encoding. If I'm wrong do correct me, anyone.

2019-10-30 05:23 Updated by: keith
Comentario

Reply To gallickgunner

I think the site encppreference gives a perfect example....

Alas, as it now stands, (and as it was originally), it isn't particularly useful. Its problems originate in this statement:

    std::setlocale(LC_ALL, "");
which, on MS-Windows, is somewhat dysfunctional; it does not respect LC_CLASS environment variables, but rather just aligns the runtime locale with the pre-configured console code page, (which, in my case, is English_United Kingdom.1252). In the majority of usage cases, for western language locales, that will represent a single byte character encoding, in which a significant majority of Unicode code points cannot be represented, so the conversion fails with errno = EILSEQ, when you give it any of the majority of wchar_t inputs in which the high order byte is non-zero. This is exactly what happens, with your original example, when run with code page 1252, (so the correctly displayed byte count for Length would be zero — i.e. 1 + (size_t)(-1)).

2019-11-05 07:34 Updated by: keith
Comentario

Unfortunately, converting a wchar_t representation of an SBCS string (such as L"Hello") doesn't really exercise the code effectively. Thus, I modified your original test case, to:

  1. #include <iostream>
  2. #include <vector>
  3. #include <string>
  4. #include <cstring>
  5. #include <cwchar>
  6. #include <clocale>
  7. #include <cerrno>
  8. void print_wide(const wchar_t* wstr)
  9. { std::mbstate_t state = std::mbstate_t();
  10. std::size_t len = 1 + std::wcsrtombs(nullptr, &wstr, 0, &state);
  11. if( len > 0 )
  12. { std::vector<char> mbstr(len);
  13. std::wcsrtombs(&mbstr[0], &wstr, mbstr.size(), &state);
  14. std::cout << "Multibyte string: " << &mbstr[0] << '\n'
  15. << "Length, including '\\0': " << mbstr.size() << '\n';
  16. }
  17. else
  18. { std::cerr << "Verified input length: " << len << "; "
  19. << std::strerror(errno) << '\n';
  20. }
  21. }
  22. int main()
  23. { const char *lang = std::getenv("LC_CTYPE");
  24. std::setlocale(LC_CTYPE, (lang == nullptr) ? "" : lang );
  25. std::cerr << "Locale: " << std::setlocale(LC_CTYPE, NULL) << '\n';
  26. const wchar_t* wstr = L"A wstring-\u00df\u6c34\U0001d10b"; // or L"A wstring-ß水𝄋"
  27. print_wide(wstr);
  28. }
When I compile this, and run it on my GNU/Linux host, I see:
$ g++ tc39687.cc 
]$ ./a.out
Locale: en_GB.utf8
Multibyte string: A wstring-ß水𝄋
Length, including '\0': 20
I shall attempt to reproduce this, with a modified MinGW implementation of wcsrtombs() and related functions.

2019-11-10 06:38 Updated by: keith
  • File wchar-to-mbcs-conversion.patch (File ID: 5575) is attached
2019-11-10 07:01 Updated by: keith
Comentario

I've attached a tentative patch, which goes at least part of the way towards resolving this issue. When I run my previous test case under wine-4.18, on my Manjaro Linux box, I see:

$ mingw32-g++ tc39687.cc 
$ LC_CTYPE="en_GB.65001" ./a.exe | cat
Locale: English_United Kingdom.Multibyte string: A wstring-ß水𝄋
Length, including '\0': 20
which is consistent with the behaviour I observe from my original (native) GNU/Linux test.

(Note that I have to give the LC_CTYPE spec in a mixed Linux/Microsoft format, because wine's use of libfontconfig raises complaints about a Microsoft-style language_region part, but not the .codeset part, which I do, currently, need to give as a Microsoft codepage number; the output from wine is piped through cat because wine, itself, garbles the display of non-ASCII UTF-8 code points, when the output is direct to the UTF-8 encoded terminal, whereas cat reads the correctly formatted UTF-8 output stream, and subsequently displays it correctly).

2019-11-12 08:39 Updated by: keith
Comentario

I feared that this may prove to be the case: if I run that same test case, which succeeded under wine, in a native Win7 virtual machine, it fails:

$ LC_CTYPE="English_United Kingdom.65001" /e/tc39687.exe
Locale: C
Verified input length: 0; Illegal byte sequence
I suspect that the problem arises from an unfortunate — but documented — limitation of Microsoft's setlocale() implementation; it is unable to process locale specifications in which the codeset element refers to any with an MB_CUR_MAX of greater than two bytes, (which precludes UTF-8 — code page 65001 — amongst a significant number of others).

I shall continue to explore possible work-arounds.

2019-11-13 00:11 Updated by: keith
Comentario

The problem is indeed the Microsoft setlocale() limitation, (which, although Microsoft's documentation doesn't acknowledge it as such, is clearly a bug). The clue is in the Locale: C output, following the conditional call which becomes equivalent to setlocale( LC_CTYPE, "English_United Kingdom.65001" ); the failure to change the locale setting, from the initial "C" state, is a clear indication that the 65001 (UTF-8) codeset specification has been rejected.

The ideal solution would be to have a correctly working setlocale() implementation, but Microsoft will never provide that, particularly in the case of any legacy Windows version which MinGW still endeavours to support. Next best would be a MinGW replacement setlocale() implementation, (which could also address other failings of the Microsoft implementation), but this would require significant development effort, and must, therefore, be considered only as a longer term development objective. In the shorter term, I propose to adapt the MinGW replacements for wcrtomb() and wcsrtombs(), such that they will initially identify a potential codeset identity from the string returned by setlocale( LC_CTYPE, NULL ), (as they do currently), but then override that by inspection of getenv( "LC_ALL" ), getenv( "LC_CTYPE" ), or getenv( "LANG" ), in that order, and taking the first non-NULL, if any, and then if, and only if, the string returned by setlocale( LC_CTYPE, NULL ) is a verbatim match for that returned by setlocale( LC_CTYPE, "" ), and any overriding codeset, so identified, is confirmed as having a non-zero encoding length, by GetCPInfo( codeset, &info ).

(Edited, 2019-11-13 00:22 Updated by: keith)
2019-11-13 07:40 Updated by: keith
  • File wchar-to-mbcs-conversion.patch (File ID: 5575) is deleted
2019-11-13 08:38 Updated by: keith
Comentario

I've updated my original patch, to implement the stratagem I outlined earlier. With this applied, and also with the following addition to my original test case:

  1. --- a/tc39687.cc 2019-11-12 23:06:48.688108980 +0000
  2. +++ b/tc39687.cc 2019-11-11 22:55:10.816277321 +0000
  3. @@ -19,12 +19,14 @@
  4. { std::cerr << "Verified input length: " << len << "; "
  5. << std::strerror(errno) << '\n';
  6. }
  7. }
  8. int main()
  9. -{ const char *lang = std::getenv("LC_CTYPE");
  10. +{
  11. + std::setlocale(LC_ALL, "" );
  12. + const char *lang = std::getenv("LC_CTYPE");
  13. std::setlocale(LC_CTYPE, (lang == nullptr) ? "" : lang );
  14. std::cout << "Locale: " << std::setlocale(LC_CTYPE, NULL) << '\n';
  15. const wchar_t* wstr = L"A wstring-\u00df\u6c34\U0001d10b"; // or L"A wstring-ß水𝄋"
  16. print_wide(wstr);
  17. }
With these changes, I can successfully run the test case on my Win7 VM, where I now see:
$ LC_CTYPE="English_United Kingdom.65001" /e/tc39687.exe
Locale: English_United Kingdom.1252
Multibyte string: A wstring-ÃYæ°´ð?"<
Length, including '\0': 20
Note that the output of the converted string data is garbled, (because the console is actually incapable of rendering UTF-8 encoded data), but the length determination is now correct.

I propose incorporating this modification into mingwrt-5.3.

2020-01-22 08:14 Updated by: keith
Comentario

Following on from this, I've reviewed the complementary mbrlen(), mbrtowc(), and mbsrtowcs() function implementations; all exhibit potentially serious, (and potentially more difficult to rectify), defects. In particular:

  • As with the original wcsrtombs() implementation, the output length limit prevents proper determination of required buffer length in mbsrtowcs(), when initially called with an output buffer specified as NULL.
  • The codeset conversion routine, shared by all three of these functions, is ill equipped to handle anything other than SBCS or DBCS inputs, in spite of Microsoft's eventual recognition of MBCS codesets with more than two bytes per code point.

I'm currently undecided as to whether or not mingwrt-5.3 should be delayed, pending resolution of these complementary issues.

2020-03-16 08:00 Updated by: keith
Comentario

I've now reimplemented all of wcsrtombs(), wcrtomb(), wctob(), mbsrtowcs(), mbrtowc(), mbrlen(), and btowc(), such that all will now attempt to delegate to a Microsoft DLL implementation, if available, and fall back to a new libmingwex implementation, in situations where a Microsoft implementation is unavailable, or may be unsuitable. All are included in mingwrt-5.3.

Also included in mingwrt-5.3, is a replacement for mbsinit() ... in libmingwex only; AFAICT, there never has been a Microsoft implementation for this, but the original libmingwex implementation would always report initial state, which is incorrect in cases where any of the restartable conversion functions have left pending state from an incomplete conversion.

The libmingwex implementations of each of these functions conform, as far as is practicable, to ISO-C99 and to POSIX.1; Microsoft's implementations are likely nonconforming. Users wishing to always use the libmingwex implementations, (to exploit their more reliable standards conformity), should avoid linking with any of Microsoft's non-free runtime DLLs — i.e. link with MSVCRT.DLL only, and do not define __MSVCRT_VERSION__ — and should define either _ISOC99_SOURCE, or _ISOC11_SOURCE, before including any header file, in any translation unit which calls any of the above-named functions. (These measures disable call delegation to the Microsoft function implementations, thus ensuring that the libmingwex implementations are used).

Of the new libmingwex implementations, mbrtowc() has proved particularly challenging; in fact, it is not possible to implement this in a truly ISO-C99 conforming manner, while limited by Microsoft's unfortunate choice of UTF-16LE as the underlying representation of wchar_t ... there is simply no safe way to return a surrogate pair in a single call! Thus, the new implementation incorporates special, non-standard handling for this abnormal case, as described in this manpage.

2020-03-16 21:50 Updated by: keith
Comentario

An oversight in the mingwrt-5.3 implementation of mbrtowc(), if the caller fails to check for return of a high surrogate, and consequently, doesn't correctly follow up by retrieving the complementary low surrogate, allows the pending surrogate pair conversion state to be incorrectly interpreted, on a following mbrtowc() call, as a new MBCS sequence instead of as a low surrogate awaiting retrieval. This is now corrected, in the repository, and will thus be corrected in mingwrt-5.3.1.

A similar defect, whereby pending conversion state could be misinterpreted as an already complete MBCS sequence, in the (perhaps less likely) event that the active codeset is changed between mbrtowc() calls, has also been corrected.

(Edited, 2020-03-23 00:19 Updated by: keith)
2020-03-20 06:42 Updated by: keith
Comentario

Due to my misinterpretation of Microsoft's documentation, the mingwrt-5.3 implementation of btowc() may incorrectly return 0xFFFD, when it should return WEOF, if the byte to be converted is not valid as a single byte character in the active codeset. This is corrected, in the repository, for whenever mingwrt-5.3.1 is published.

2020-03-20 07:04 Updated by: keith
Comentario

I've identified a further issue, this time with the mingwrt-5.3 implementation of mbsrtowcs(); if the cumulative count of wide characters stored reaches exactly one less than the nominated buffer length, and conversion of the next multibyte character results in a surrogate pair, this is stored anyway, causing a buffer overrun by one wide character. This too, is now corrected in the repository, for inclusion in mingwrt-5.3.1.

Attachment File List

Editar

Please login to add comment to this ticket » Entrar