Converting between std::wstring and std::string

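While researching ways to convert back and forth between std::wstring and std::string, I found this conversation on the MSDN forums.

Two of the functions there looked good to me. Specifically: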

std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0); 
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}

std::string ws2s(const std::wstring& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0); 
    char* buf = new char[len];
    WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, buf, len, 0, 0); 
    std::string r(buf);
    delete[] buf;
    return r;
}

However, the double allocation and the need to delete the buffer concern me (performance and exception safety), so I modified them to:

std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0); 
    std::wstring r(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, &r[0], len);
    return r;
}

std::string ws2s(const std::wstring& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0); 
    std::string r(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, &r[0], len, 0, 0); 
    return r;
}

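Unit testing indicates that this works in a nice, controlled environment, but will it hold up in the vicious and unpredictable world that is my client's computer?

Answer #1

Actually, my unit tests show that your code is wrong!

The problem is that you include the zero terminator in the output string, which is not supposed to happen with std::string and friends. Here is an example of why this can cause problems, especially if you use std::string::compare: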

// Allocate string with 5 characters (including the zero terminator as in your code!)
string s(5, '_');

memcpy(&s[0], "ABCD\0", 5);

// Comparing with strcmp is all fine since it only compares until the terminator
const int cmp1 = strcmp(s.c_str(), "ABCD"); // 0

// ...however the number of characters that std::string::compare compares is
// someString.size(), and since s.size() == 5, it is obviously not equal to "ABCD"!
const int cmp2 = s.compare("ABCD"); // 1

// And just to prove that string implementations automatically add a zero terminator
// if you call .c_str()
s.resize(3);
const int cmp3 = strcmp(s.c_str(), "ABC"); // 0
const char term = s.c_str()[3]; // 0

printf("cmp1=%d, cmp2=%d, cmp3=%d, terminator=%d\n", cmp1, cmp2, cmp3, (int)term);

I also found the added terminator annoying: in my case it broke string concatenation too. I ended up adding a boolean includeTerminator parameter to both methods.
– reallynice
Jul 30, 2015 at 13:01

@reallynice See: FlagArgument (martinfowler.com)
– Marc.2377
Sep 12, 2019 at 0:53
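To make the concatenation problem mentioned in the comments concrete, here is a minimal, portable sketch. The helper names makeWithTerminator and stripTerminator are hypothetical and exist only for illustration: the first imitates the question's mistake of sizing the result for the terminator as well, the second shows one possible repair.

```cpp
#include <string>

// Hypothetical helper that imitates the question's bug: the result is sized
// for the terminator too, so the '\0' becomes part of the string contents.
std::string makeWithTerminator(const std::string& text)
{
    std::string r(text.size() + 1, '\0');
    r.replace(0, text.size(), text);  // last character stays '\0'
    return r;
}

// Hypothetical repair: drop one trailing '\0' if it is present.
std::string stripTerminator(std::string s)
{
    if (!s.empty() && s.back() == '\0')
        s.pop_back();
    return s;
}
```

With these helpers, makeWithTerminator("ABCD") + "EF" has seven characters with a '\0' in the middle, while stripTerminator(makeWithTerminator("ABCD")) + "EF" equals "ABCDEF". In the question's functions, the equivalent repair would probably be to pass s.length() without the + 1, since the Win32 documentation describes the terminator as neither counted nor written when an explicit length is given.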

Answer #2

I would redesign (and have redesigned) this set of functions to resemble casts:

std::wstring x;
std::string y = string_cast<std::string>(x);

This can pay off later, when you start having to deal with some third-party library's idea of what a string should look like.

I like the syntax. Can you share the code?
– Jere.Jones
Jan 31, 2011 at 18:53

Oh, that looks nice. How does it work? Do you just write a template with specializations to convert between the various string types?
– Billy ONeal
Feb 5, 2011 at 18:21

@Billy, in case you are interested, someone posted a code review question about a string_cast implementation here.
– 狼人
Nov 5, 2011 at 15:20

Why, though? I think I would rather have to_string and to_wstring functions, similar to the standard library's (in my own namespace, of course).
– Marc.2377
Sep 13, 2019 at 23:55
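The answer does not show its implementation, so the following is only one way the "template with specializations" idea from the comments could look. Everything except the name string_cast is an assumption: the Caster dispatcher, the widenString and narrowString helpers, and the choice to convert through the global locale's std::codecvt facet are all hypothetical.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;

// Hypothetical helper: narrow -> wide through the global locale's facet.
inline std::wstring widenString(const std::string& in)
{
    if (in.empty()) return std::wstring();
    const Facet& f = std::use_facet<Facet>(std::locale());
    std::mbstate_t state = std::mbstate_t();
    std::wstring out(in.size(), L'\0');  // never more wide chars than bytes
    const char* inNext = nullptr;
    wchar_t* outNext = nullptr;
    if (f.in(state, in.data(), in.data() + in.size(), inNext,
             &out[0], &out[0] + out.size(), outNext) != std::codecvt_base::ok)
        throw std::runtime_error("string_cast: cannot widen input");
    out.resize(static_cast<std::size_t>(outNext - &out[0]));
    return out;
}

// Hypothetical helper: wide -> narrow through the global locale's facet.
inline std::string narrowString(const std::wstring& in)
{
    if (in.empty()) return std::string();
    const Facet& f = std::use_facet<Facet>(std::locale());
    std::mbstate_t state = std::mbstate_t();
    std::string out(in.size() * static_cast<std::size_t>(f.max_length()), '\0');
    const wchar_t* inNext = nullptr;
    char* outNext = nullptr;
    if (f.out(state, in.data(), in.data() + in.size(), inNext,
              &out[0], &out[0] + out.size(), outNext) != std::codecvt_base::ok)
        throw std::runtime_error("string_cast: cannot narrow input");
    out.resize(static_cast<std::size_t>(outNext - &out[0]));
    return out;
}

// Dispatcher: left undefined for unsupported pairs, which then fail to compile.
template <typename To, typename From> struct Caster;

template <typename T> struct Caster<T, T> {
    static T cast(const T& s) { return s; }  // same type: plain copy
};
template <> struct Caster<std::wstring, std::string> {
    static std::wstring cast(const std::string& s) { return widenString(s); }
};
template <> struct Caster<std::string, std::wstring> {
    static std::string cast(const std::wstring& s) { return narrowString(s); }
};

// Cast-like entry point: the target type is explicit, the source is deduced.
template <typename To, typename From>
To string_cast(const From& from)
{
    return Caster<To, From>::cast(from);
}
```

Supporting a third-party string type would then mean adding one more pair of Caster specializations, leaving call sites such as string_cast<std::string>(x) unchanged.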

Answer #3

It really depends on which encodings are used with std::wstring and std::string.

This answer assumes that the std::wstring holds UTF-16 and that the conversion to std::string produces UTF-8.

#include <codecvt>
#include <string>

std::wstring utf8ToUtf16(const std::string& utf8Str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8Str);
}

std::string utf16ToUtf8(const std::wstring& utf16Str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(utf16Str);
}

This answer uses the STL and does not rely on a platform-specific library.

This is the best answer, because it is the only way. Thanks.
– 罗马
Dec 24, 2017 at 9:02

dbj.org/c17-codecvt-deprecated-panic ... my shameless plug, which might help ...
– DBJDBJ
Jul 15, 2019 at 15:48

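@DBJDBJ The solution you propose is by no means a replacement for wstring_convert. You downplay the problem by calling the approach "somewhat controversial"; in my opinion it is much more than that. It is wrong. I have uses of wstring_convert that your solution cannot replace. It cannot convert true Unicode strings such as "おはよう", because it is not a real conversion. It is a cast. Consider making that more explicit in your text ;)
– Marc.2377
Sep 12, 2019 at 0:59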

@Marc.2377 Well, I think I put plenty of warnings in that text. I even mention the word "casting". There is even a link to "that article". Anyway, many thanks for reading.
– DBJDBJ
Sep 12, 2019 at 15:17

FFWD to 2019: deprecated
– DBJDBJ
Sep 12, 2019 at 16:15

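Answer #4

One thing that may cause problems is that this assumes the string is ANSI-encoded using the currently active code page (CP_ACP). If the string is UTF-8, you might want to use a specific code page or CP_UTF8 instead.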

This may be a dumb question, but how would I know? In my case these are usually filenames.
– Jere.Jones
Jan 31, 2011 at 19:49

How do you get the filenames? That will determine the correct code page to use.
– Ferruccio
Feb 1, 2011 at 1:30

@Jere.Jones: One approach is to check whether the string is valid UTF-8. If it is not, assume it is ANSI.
– dan04
Feb 5, 2011 at 16:31

@dan04: ANSI requires you to specify a code page. zh.wikipedia.org/wiki/Code_page.
– Ferruccio
Feb 5, 2011 at 21:33

An additional note: the MSDN documentation recommends against using CP_ACP for strings intended for permanent storage, because the active code page can change at any time.
– M.M
Feb 20, 2018 at 20:46
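The "check whether the string is valid UTF-8" heuristic suggested in the comments above can be sketched portably. The function name isValidUtf8 is hypothetical, and the check is only a heuristic: a short ANSI string can happen to be well-formed UTF-8, so a true result does not prove the text was meant as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true when `s` is well-formed UTF-8: no overlong encodings,
// no surrogate code points, and nothing above U+10FFFF.
bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point being assembled
        unsigned long lowest;  // smallest code point allowed for this length
        if (c < 0x80)                { extra = 0; cp = c;        lowest = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; lowest = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; lowest = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; lowest = 0x10000; }
        else return false;                     // stray continuation or bad lead byte
        if (extra > n - i - 1) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < lowest) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate range
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

A caller could then choose CP_UTF8 when the check succeeds and fall back to a specific ANSI code page otherwise, which is the strategy the comment describes.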

Answer #5

I would suggest changing this:

int len;
int slength = (int)s.length() + 1;
len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);

...to this:

int slength = (int)s.length() + 1;
int len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);

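It is slightly more concise, the scope of len is reduced, and you no longer have an uninitialized variable floating around (admittedly only for one line) as a trap for the unwary.

Answer #6

I don't do any Windows development, so I can't comment on whether the WideCharToMultiByte part is safe.

The one thing I would say is to make sure you use the proper types for everything. For example, string.length() returns a std::string::size_type (most likely a size_t; the constructor also takes a std::string::size_type, but that is less of an issue). It will probably never bite you, but it is something to be careful about so that you don't end up with overflows in other code you write.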

Well, it returns a std::string::size_type.
– Jon Purdy
Jan 30, 2011 at 9:15

@Jon: Yes, but I have never seen an implementation where it is not the same as size_t. I will amend my answer, though; thanks for the feedback.
– Mark Loeser
Jan 30, 2011 at 15:41

@Jon: std::string::size_type is always std::size_t.
– GManNickG
Jan 30, 2011 at 22:02

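@GMan: I'm just being pedantic out of boredom. SGI says it is "an unsigned integral type that can represent any nonnegative value of the container's distance type" (that is, difference_type), and that both must be typedefs for existing types, but that does not imply that size_type must be equivalent to size_t. Is there something else at work here?
– Jon Purdy
Jan 30, 2011 at 23:28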

@Jon: I'm not sure why SGI matters. The standard says std::string::size_type is allocator_type::size_type, and the default allocator's size_type is std::size_t.
– GManNickG
Jan 31, 2011 at 17:31
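The overflow concern in this answer comes from the (int) casts in the question: the Win32 functions take int lengths, while length() returns an unsigned size_type. One way to make the narrowing explicit is a small checked helper. The name checkedLength is hypothetical, and throwing std::length_error is just one possible policy.

```cpp
#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the string's length as an int, or throws
// if the length does not fit, instead of silently truncating in a cast.
template <typename String>
int checkedLength(const String& s)
{
    if (s.size() > static_cast<typename String::size_type>(INT_MAX))
        throw std::length_error("string too long for an int length parameter");
    return static_cast<int>(s.size());
}
```

The question's `(int)s.length()` could then become `checkedLength(s)`, which works for both std::string and std::wstring.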

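Answer #7

Since we use std::string and std::wstring, we need #include <string>.

There are no declarations of MultiByteToWideChar() or WideCharToMultiByte(). Their names suggest they may be thin wrappers around std::mbstowcs() and std::wcstombs() respectively, but it is hard to be sure without seeing them.

It would be easier to understand if we used the standard functions to convert between null-terminated multibyte strings and wide-character strings.

Depending on the use case, the std::codecvt<wchar_t, char, std::mbstate_t> facet might be more appropriate. Then hardly any code needs to be written, and in particular there is no need to guess the possible output length.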
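As a rough illustration of the standard-function route, here is a sketch built on std::mbstowcs and std::wcstombs. The names toWide and toNarrow are hypothetical. It relies on the common (POSIX-specified) behaviour that a null destination makes these functions return the required length without the terminator, it converts according to the current C locale, and it stops at the first embedded '\0' because it works on null-terminated input.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Multibyte -> wide, using the current C locale.
std::wstring toWide(const std::string& s)
{
    // Null destination: ask only for the length (terminator not counted).
    const std::size_t len = std::mbstowcs(nullptr, s.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("toWide: invalid multibyte sequence");
    std::wstring r(len, L'\0');
    if (len > 0)
        std::mbstowcs(&r[0], s.c_str(), len);  // writes exactly len characters
    return r;
}

// Wide -> multibyte, using the current C locale.
std::string toNarrow(const std::wstring& s)
{
    const std::size_t len = std::wcstombs(nullptr, s.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("toNarrow: character not representable");
    std::string r(len, '\0');
    if (len > 0)
        std::wcstombs(&r[0], s.c_str(), len);  // writes exactly len bytes
    return r;
}
```

Because the result is sized from the reported length, no terminator ends up inside the returned string. For anything beyond ASCII, the program would need to select a suitable locale first, for example with std::setlocale.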

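Answer #8

I have only looked at your code briefly. I haven't worked much with std::string, but I have worked a lot with the API.

Assuming you got all the lengths and arguments right (which can be tricky at times, what with terminators and wide versus multibyte lengths), I think you are on the right track. I think the first routines you posted allocate an extra buffer unnecessarily; it is not needed.

Answer #9

No, this is dangerous! The characters in a std::string may not be stored in a contiguous block of memory, and you must not use the pointer &r[0] to write to any character other than that one! That is why the c_str() function returns a const pointer.
It may work with MSVC, but it may break if you switch to another compiler or STL library.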

– Billy ONeal
Nov 5, 2011 at 13:59
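For context on the contiguity concern: since C++11 the standard requires std::basic_string to store its characters contiguously, so on a conforming implementation writing through &r[0] is well defined as long as the writes stay within [0, size()). Before C++11 this was common practice rather than a guarantee. A minimal sketch of the pattern follows; the helper name fillThroughPointer is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies `len` bytes into a std::string by writing
// through &r[0], the same pattern the question uses with the Win32 calls.
std::string fillThroughPointer(const char* src, std::size_t len)
{
    std::string r(len, '\0');          // allocate exactly len characters
    if (len > 0)
        std::memcpy(&r[0], src, len);  // writes stay inside [0, r.size())
    return r;                          // c_str() still supplies its own terminator
}
```

The important detail is that the string is sized for the characters only; writing the terminator into the string as well is what Answer #1 objects to.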
