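Can I slim down this UTF-8 encoding program?

Below is my entire program. Thanks to the comments and the specification in particular, you can read what it does.

My question is: can it be improved? For example, is it possible to avoid writing an fwrite() inside every if?

The whole program is based on this UTF-8 model, and it also considers the case where a bit is set at the 32nd position.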

#include <stdio.h>
#include <math.h>
#include <stdint.h>

double log(double a);

/*
* This program reads 4 byte codepoints (in BIG ENDIAN) from a file strictly called "input.data" and creates another file called "ENCODED.data" with the relative encoding in UTF8.
* 
* In order to compile this file, in Unix, you need to add the -lm clause because the library math.h function log() requires it.
* For example: gcc encoding.c -o encoding -lm
*/
int main() {

    unsigned char bufferCP[4]; //Buffer used to store the codepoints
    unsigned char bufferOut[6]; //Buffer used to store the UTF8-encoded codepoints

    FILE *ptr, *out;
    ptr = fopen("input.data", "rb"); //r for read, b for bynary
    out = fopen("ENCODED.data", "wb");

    int elem = 0, bytesRead = 0;
    unsigned char mask = 0x3F; //Mask used to keep bits interesting for analysis
    uint32_t codepoint = 0; //A codepoint must be an unsigned 32 bit integer

    //--------------------File-Reading--------------------
    while ((elem = fgetc(ptr)) != EOF) {
        //Stores the character in the buffer
        bufferCP[bytesRead++] = (unsigned char) elem;

        if (bytesRead == 4) { //A codepoint is ready to be managed              

            //Builds a codepoint from the buffer. Reads it in BIG ENDIAN.
            for(int j=3; j>=0; j--) {
                    codepoint <<= 8;
                    codepoint |= bufferCP[j];
            }
            //Searches the position of the most significant bit
            double logRes = (log(codepoint)/log(2)) + 1;
            int bitPos = (int) logRes;

            //--------------------UTF8-Encoding--------------------
            if (bitPos <= 7) {
                bufferOut[0] = (unsigned char) codepoint; //No need to manage this codepoint
                fwrite(bufferOut, 1, 1, out);

            } else if (bitPos <= 11) {
                bufferOut[0] = (codepoint >> 6) | 0xC0;
                bufferOut[1] = (codepoint & mask) | 0x80;
                fwrite(bufferOut, 1, 2, out); 

            } else if (bitPos <= 16) {
                bufferOut[0] = (codepoint >> 12) | 0xE0; 
                for(int i=1; i<3; i++)
                    bufferOut[i] = ((codepoint >> 6*(2-i)) & mask) | 0x80;
                fwrite(bufferOut, 1, 3, out);

            } else if (bitPos <= 21) {
                bufferOut[0] = (codepoint >> 18) | 0xF0; 
                for(int i=1; i<4; i++)
                    bufferOut[i] = ((codepoint >> 6*(3-i)) & mask) | 0x80;
                fwrite(bufferOut, 1, 4, out);

            } else if (bitPos <= 26) {
                bufferOut[0] = (codepoint >> 24) | 0xF8;
                for(int i=1; i<5; i++)
                    bufferOut[i] = ((codepoint >> 6*(4-i)) & mask) | 0x80;
                fwrite(bufferOut, 1, 5, out);

            } else if (bitPos <= 32) {
                if (bitPos == 32)
                    bufferOut[0] = (codepoint >> 30) | 0xFE; //UTF8-encoding first byte would be: 11111111?
                else
                    bufferOut[0] = (codepoint >> 30) | 0xFC;

                for(int i=1; i<6; i++)
                    bufferOut[i] = ((codepoint >> 6*(5-i)) & mask) | 0x80;
                fwrite(bufferOut, 1, 6, out);
            }

            bytesRead = 0; //Variable reset
        }
    }

}

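For a quick and easy function that converts wchar_t to UTF-8, see stackoverflow.com/a/148766/5987.

I have to do it myself. It's a university assignment...

Then you might get some ideas from it.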

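Answer #1

Efficient file I/O

By default, files opened with fopen() are buffered, which means that not every call to fread() or fwrite() results in a system call. Instead, the C library has an internal buffer and tries to read and write larger chunks at a time. However, you still pay the overhead of a regular function call every time you call fread() and fwrite(). To avoid that, it is best to read and write in large chunks in your own code as well.

While you could try to read the whole file at once, or even memory-map it with something like mmap(), you can already get very good performance by reading and writing blocks of about 64 KB at a time. This also avoids using a lot of memory.
Of course, you have to handle the last block, which is not necessarily exactly 64 KB in size, but that is easy to deal with. fread() lets you specify the size of an element and the number of elements to read, which is useful for making sure you read a whole number of 4-byte code points.

I would structure your code like this: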

uint32_t bufferIn[16384]; // 16384 4-byte code points = 64 kB
char bufferOut[65536];

size_t countIn;

while ((countIn = fread(bufferIn, sizeof *bufferIn, sizeof bufferIn / sizeof *bufferIn, ptr)) > 0) {
    // There are countIn codepoints in the buffer
    for (size_t i = 0; i < countIn; i++) {
         uint32_t codepoint = ...; // Convert bufferIn[i] to native endian here.

         // Write UTF-8 to bufferOut here.
         // If bufferOut is almost full, fwrite() it and start writing to it from the start.
    }
}

// Flush the remaining bytes in bufferOut here.
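
The outline above can be fleshed out roughly as follows. This is only a sketch under a few assumptions: the input buffer is declared as raw bytes rather than uint32_t so that the big-endian conversion is explicit, the output buffer is sized for the worst case of four bytes per code point so that one fwrite() per chunk is enough, and encode_utf8() and convert_stream() are hypothetical names. Input validation is left out for brevity.

```c
#include <stdint.h>
#include <stdio.h>

enum { CODEPOINTS_PER_CHUNK = 16384 };  /* 16384 * 4 bytes = 64 KiB of input */

/* Hypothetical helper: encodes one code point as UTF-8 and returns the
   number of bytes written to dst (1 to 4). Assumes cp <= 0x10FFFF. */
static size_t encode_utf8(uint32_t cp, unsigned char *dst)
{
    if (cp < 0x80) {
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (unsigned char)(0xF0 | ((cp >> 18) & 0x07));
    dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Reads big-endian 4-byte code points from in and writes UTF-8 to out,
   one chunk at a time. Returns 0 on success and -1 on an I/O error.
   A trailing partial code point is not reported here. */
int convert_stream(FILE *in, FILE *out)
{
    static unsigned char buffer_in[CODEPOINTS_PER_CHUNK * 4];
    static unsigned char buffer_out[CODEPOINTS_PER_CHUNK * 4];  /* worst case */
    size_t count_in;

    while ((count_in = fread(buffer_in, 4, CODEPOINTS_PER_CHUNK, in)) > 0) {
        size_t used = 0;
        for (size_t i = 0; i < count_in; i++) {
            const unsigned char *p = buffer_in + 4 * i;
            uint32_t cp = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
                        | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
            used += encode_utf8(cp, buffer_out + used);
        }
        if (fwrite(buffer_out, 1, used, out) != used)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

A caller would open both files in binary mode, check the results, and then call convert_stream(ptr, out).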

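Don't use floating-point math for integer problems

Avoid floating-point math when you are dealing with integers. It is hard to get right, and converting an int to double, doing some math, and then converting back again can be quite slow.

There are several ways to get the highest set bit in an integer. If you want a portable one, I recommend one of the bit-twiddling hacks. Sometimes the compiler will even recognize such bit twiddling and turn it into a single CPU instruction where possible.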
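
As an illustration, here is one portable way to find the highest set bit with shifts only. It is a sketch, not the specific hack the answer had in mind; the name bit_length() is an assumption, and the result is 1-based to match bitPos in the question.

```c
#include <stdint.h>

/* Returns the 1-based position of the highest set bit of v, or 0 if v == 0.
   Works as a binary search: each step checks whether the upper half of the
   remaining bits contains a set bit. */
unsigned bit_length(uint32_t v)
{
    unsigned n = 0;

    if (v >> 16) { n += 16; v >>= 16; }
    if (v >> 8)  { n += 8;  v >>= 8;  }
    if (v >> 4)  { n += 4;  v >>= 4;  }
    if (v >> 2)  { n += 2;  v >>= 2;  }
    if (v >> 1)  { n += 1;  v >>= 1;  }
    return n + (v != 0);
}
```

With this, `int bitPos = (int)bit_length(codepoint);` could replace the two log() lines, and the program would no longer need -lm.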

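Do you really think that, in a program I would expect to be I/O-bound, the overhead of a few extra function calls (not kernel calls) is anything more than negligible? That seems unlikely to me.
– Voo
May 18, 2020 at 11:27

@Voo: Is it I/O-bound? That really depends on your hardware configuration. And even if you are I/O-bound, wasted CPU cycles use more energy than necessary (which matters if your device runs on a battery), and if other processes are running in parallel, they get fewer CPU cycles. Also, it is not just the overhead of the function call itself: fread() also has to check whether the requested data is already in its internal buffer, and so on, and it does that for every read of just a few bytes.
– G. Sliepen
May 18, 2020 at 12:20

I ran some benchmarks on an AMD Ryzen 9 3900X with a Samsung 950 PRO NVMe drive and a 4 GB input file: the unoptimized code took 42.9 seconds on average, while the version that calls fread() and fwrite() in 64 KiB chunks took 3.6 seconds on average. According to hdparm, the drive's throughput is 2.5 GB/s, so the optimized code is roughly I/O-bound, and the unoptimized code is about 12 times slower.
– G. Sliepen
May 18, 2020 at 15:10

@MarkRansom: setvbuf()? It's C89.
– G. Sliepen
May 18, 2020 at 18:55

At least with glibc, the f* function calls carry overhead in the form of virtual function dispatch (because a FILE structure does not have to refer to an actual file), locking (because multiple threads may be performing file operations at the same time), bookkeeping (adjusting structure offsets, buffer pointers, allocations), and so on. They can be expensive enough that you might beat them with careful use of write.
– nneonneo
May 20, 2020 at 18:19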

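Answer #2

log is already declared in <math.h>. You don't need to declare it yourself. In fact, doing so could be harmful.

As the other answer says, please don't use floating-point math.

In fact, you don't need to know the exact position of the leftmost bit. For your purposes, the value of codepoint is enough. For example, bitPos <= 7 is equivalent to codepoint < (1 << 7).

I strongly recommend separating the I/O from the conversion logic. Consider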

while (read_four_bytes(input_fp, bufferCP) == 4) {
    size_t utf_char_size = convert_to_utf(bufferCP, bufferOut);
    write_utf_char(bufferOut, utf_char_size);
}

DRY. All the conversion clauses look very similar. Consider refactoring them into a function along these lines:

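static const unsigned char mask = 0x3F;
/* Lead-byte prefixes indexed by sequence length (2 and up; values assumed from the UTF-8 bit patterns). */
static const unsigned char special_mask[] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};

/* For multi-byte sequences only; single-byte code points are written unchanged. */
void convert_codepoint(uint32_t codepoint, int utf_char_size, unsigned char *bufferOut) {
    for (int i = 0; i < utf_char_size; i++) {
        /* Byte i carries the six bits that start at bit 6 * (utf_char_size - 1 - i). */
        bufferOut[i] = ((codepoint >> 6 * (utf_char_size - 1 - i)) & mask) | 0x80;
    }
    bufferOut[0] |= special_mask[utf_char_size];
}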

The resulting cascade of if ... else if tests on codepoint can also be converted into a loop.
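
One way such a loop could look is sketched below. It is an illustration rather than the answerer's code: the tables limit and lead and the name encode_with_loop() are assumptions, the code point is shifted after each output byte, and only sequences of up to four bytes are produced.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes codepoint into out (room for 4 bytes) and returns the length,
   or 0 if codepoint is above 0x10FFFF. Surrogates are not checked here. */
size_t encode_with_loop(uint32_t codepoint, unsigned char *out)
{
    /* Exclusive upper bound and lead-byte prefix for lengths 1 to 4. */
    static const uint32_t limit[] = {0x80, 0x800, 0x10000, 0x110000};
    static const unsigned char lead[] = {0x00, 0xC0, 0xE0, 0xF0};

    for (size_t n = 0; n < 4; n++) {
        if (codepoint < limit[n]) {
            size_t size = n + 1;

            /* Fill the continuation bytes from the end, six bits at a time. */
            for (size_t i = size - 1; i > 0; i--) {
                out[i] = (unsigned char)(0x80 | (codepoint & 0x3F));
                codepoint >>= 6;
            }
            /* Whatever is left fits in the lead byte. */
            out[0] = (unsigned char)(lead[n] | codepoint);
            return size;
        }
    }
    return 0;
}
```

The caller then needs a single `fwrite(bufferOut, 1, size, out)` after the call, instead of one fwrite() per branch.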

How do I convert it into a loop? Could you give me an example? I'm an if-else fan, but I'm curious how it would be done with a loop.
– lettomobile
May 19, 2020 at 22:41

You could even use a loop instead of the succession of if ... else if ... tests, and possibly shift the code point after outputting each byte.
– jcaron
May 20, 2020 at 21:57

Answer #3

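* This program reads 4 byte codepoints (in BIG ENDIAN) from a file strictly called "input.data" and creates another file called "ENCODED.data" with the relative encoding in UTF8.

Needless to say, this is a somewhat odd way of storing code points. I know UTF-16, but UTF-32BE (which is simply the code point in big-endian form) is not widely used, although Python seems to use it internally to encode strings. Now that you know what this encoding is called, I wonder whether you need to write the conversion yourself or whether you could use a library.

That is really an implementation detail. Generally, we don't create conversion applications that restrict themselves to specific files (or even to files at all).

unsigned char bufferCP[4]; //Buffer used to store the codepoints

If a comment has to explain what a variable means, it is usually better to spell that out in the variable name: utf32be_buffer would be a good name.

The value 4 has no meaning on its own, which becomes a problem once you split the main method into functions (as you should).

unsigned char bufferOut[6]

How about utf8_buffer?

int elem = 0, bytesRead = 0;

Split the variable declarations over separate lines. elem is also assigned directly, so there is no need at all to initialize it to zero.

unsigned char mask = 0x3F; //Mask used to keep bits interesting for analysis

This comment really raises a question for the reader: which bits are "interesting"?

uint32_t codepoint = 0; //A codepoint must be an unsigned 32 bit integer

A completely unnecessary comment. "Must be" also raises a question: for this program, or according to some standard?

//--------------------File-Reading--------------------

How about a read_into_buffer method instead of a comment?

if (bytesRead == 4) { //A codepoint is ready to be managed

A repeated literal, while utf32be_buffer has already been given a size. Use that.

Again a comment that looks like it should introduce a method, such as convert_code_point(). You can almost hear yourself defining them.

Finally, what happens if the file does not contain a multiple of 4 bytes? It looks like you simply drop the last bytes without any warning or error.

The loop that builds the code point is another repetition of the same literal 4, but now disguised as 3, that is, 4 - 1.

I actually use a constant for this in Java, but you can be excused for using the literal 8 here, especially since this code should perform well.

It has already been pointed out that you should use bit operations to find the most significant bit. Please put that in a method; there is an answer on StackOverflow for it.

When I first read the comment, I was afraid you would skip it. Fortunately, that is not the case.

We use zero-based indexing in C-style languages. What, according to you, is the bit at position 32?

It also shows that the variable name bytesRead is wrong: it represents the number of bytes in the buffer, not the number of bytes read from the file.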

UTF-32LE is sometimes used on Linux, but I have never come across it.
– 鸭鸭
May 19, 2020 at 17:09

Answer #4

Regarding:

ptr = fopen("input.data", "rb"); 
out = fopen("ENCODED.data", "wb");

Always check the returned value (!= NULL) to make sure the operation succeeded. If it did not succeed (== NULL), call:

perror( "your error message" );

This outputs both your error message and the textual reason the system thinks the error occurred to stderr.
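
Put together, the check might look like the sketch below. The helper name open_files() and its return convention are assumptions made for illustration; fopen(), perror(), and fclose() are the standard library calls.

```c
#include <stdio.h>

/* Hypothetical helper: opens the input for reading and the output for
   writing, both in binary mode. Returns 0 on success. On failure it reports
   the reason with perror(), leaves no file open, and returns -1. */
int open_files(const char *in_name, const char *out_name, FILE **in, FILE **out)
{
    *in = fopen(in_name, "rb");
    if (*in == NULL) {
        perror(in_name);   /* prints "<name>: <reason>" to stderr */
        return -1;
    }

    *out = fopen(out_name, "wb");
    if (*out == NULL) {
        perror(out_name);
        fclose(*in);
        *in = NULL;
        return -1;
    }
    return 0;
}
```

In main(), this could be used as `if (open_files("input.data", "ENCODED.data", &ptr, &out) != 0) return 1;` before any reading starts.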

Answer #5

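As others have said, don't use floating-point math, but in a sense that is reviewing the wrong layer. The real problem behind it is that you should not be branching on a derived quantity, namely the number of bits. Instead, branch on ranges of the code point value (the original input). My own implementation, for example, branches this way.

For the problem at hand (UTF-8), proper error handling is essential. The boundaries that do not correspond to an exact number of bits (between D800 and DFFF, and greater than 10FFFF) mark erroneous input, which should not be emitted as malformed UTF-8 but rejected in some way.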
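
A sketch of that approach follows. It is not the answerer's own implementation; the name utf8_encode_checked() and the convention of returning 0 for rejected input are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Branches directly on the code point value. Returns the number of UTF-8
   bytes written to out (1 to 4), or 0 when cp is a surrogate
   (0xD800..0xDFFF) or lies above 0x10FFFF and must be rejected. */
size_t utf8_encode_checked(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogate: not a valid scalar value */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

What the caller does with a 0 result (stop with an error, or write U+FFFD instead) is a policy decision that this sketch leaves open.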

Could you explain why the solution in the answer is better?
– pacmaninbw
May 18, 2020 at 22:33

Answer #6

The code fails to detect invalid code points

There are 1,112,064 valid Unicode code points, not 2^32.

The valid range is [0x0-0x10FFFF], except for [0xD800-0xDFFF]. This latter subrange is reserved for surrogates.
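
The count follows from the two ranges: 0x110000 values in total, minus the 0x800 surrogates. A small sketch of both the check and the arithmetic, with hypothetical function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* True for values in [0x0, 0x10FFFF] that are not surrogates. */
bool is_valid_code_point(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Size of the whole range minus the size of the surrogate range:
   0x110000 - 0x800 = 1114112 - 2048 = 1112064. */
uint32_t count_valid_code_points(void)
{
    return 0x110000u - (0xDFFFu - 0xD800u + 1u);
}
```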

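UTF-8 is not defined for 4-byte values outside that range. Unless the code states that it implements the obsolete 1993 version of UTF-8, it should not attempt to create six-byte "UTF-8".

The code silently discards extra bytes

If the input ends with 1, 2, or 3 extra bytes, the code reads them but gives no error indication.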
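
One way to surface that condition is to let the reading routine distinguish a clean end of file from a truncated code point. The following is a sketch; read_code_point() and its three return values are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Reads one big-endian 4-byte code point from in.
   Returns 1 when a full code point was stored in *codepoint,
   0 at a clean end of file, and -1 when the input ends in the
   middle of a code point or a read error occurs. */
int read_code_point(FILE *in, uint32_t *codepoint)
{
    unsigned char bytes[4];
    size_t got = fread(bytes, 1, sizeof bytes, in);

    if (got == 0 && !ferror(in))
        return 0;                         /* nothing left: normal end */
    if (got != sizeof bytes)
        return -1;                        /* 1 to 3 stray bytes, or an error */

    *codepoint = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16)
               | ((uint32_t)bytes[2] << 8) | (uint32_t)bytes[3];
    return 1;
}
```

A main loop of the form `while ((status = read_code_point(ptr, &codepoint)) == 1) { ... }` can then report an error when status ends up as -1.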

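You absolutely have to check that the bytes were read correctly. I didn't do that at all because this assignment isn't worth much, so I skipped the checks that a more rigorous converter should perform. As for the UTF-8 version, I edited the question to show the model the program is based on.
– lettomobile
May 19, 2020 at 19:56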
