Calculating the entropy of a string

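We're calculating the entropy of a string in a few places in Stack Overflow as a signifier of low quality. I whipped up this simple method, which counts the unique characters in a string, but it is quite literally the first thing that popped into my head. It's the "dumbest thing that works".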

/// <summary>
/// returns the # of unique characters in a string as a rough 
/// measurement of entropy
/// </summary>
public static int Entropy(this string s)
{
  var d = new Dictionary<char, bool>();
  foreach (char c in s)
      if (!d.ContainsKey(c)) d.Add(c, true);
  return d.Count();
}

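Is there a better / more elegant / more accurate way to calculate the entropy of a string?

Efficiency is also good, though we never call this on large strings, so it is not a huge concern.

en.wikipedia.org/wiki/Entropy_(information_theory)

Your question reminded me of a Dr. Dobb's article I read about 20 years ago. Fortunately, it is available online. It includes simple .c code: drdobbs.com/security/184408492

Jeff, please tell me you are not trying to use this code to make it even harder to post short comments like "Yes." by preventing users from adding dots or dashes...

I don't know what you intend to use it for, but one way to estimate the entropy of data is to compress it and take the length of the result. The length of the data is the upper bound for the entropy. The better the compressor, the better the estimate.

Technically, a known string has no entropy. A process that produces strings has entropy. What you are doing is hypothesizing a space of processes, estimating which process produced this string, and giving the entropy of that process.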

Answer #1

string name = "lltt";
int uniqueCharacterCount = name.Distinct().Count();

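will return 2

Given that Distinct probably uses a HashSet, I think this is the most concise and clear implementation.
– ICR
Feb 20, 2011 at 21:43

This is beside the point. The goal is to calculate the entropy of the string, not to find a fancy way to count characters. Counting characters was one attempt (in the OP): counting characters more elegantly is not significantly better at calculating entropy.
– Y牛
Jan 13, 2015 at 14:37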

Answer #2

public static int Entropy(this string s)
{
    HashSet<char> chars = new HashSet<char>(s);
    return chars.Count;
}

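I'm always amazed at how simple algorithms become, or in this case disappear entirely, just by using the right data structure. My favorite example is computing a histogram of discrete values, which is literally just new MultiSet(sourceData).
– Jörg W Mittag
Feb 21, 2011 at 17:56

What about different characters that represent the same glyph, or glyphs that cannot be represented by a single character?
– dfhwze
May 19, 2019 at 12:30

@dfhwze The s parameter is a stream of tokens. In the implementation given in this answer, each character is already a token, so no preprocessing is needed; in your case, you would need to "tokenize" your input first. (And the parameter would not be a string, but something more like an IEnumerable or similar.)
– Gabriel
May 20, 2019 at 17:37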

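Answer #3

I also came up with this, based on Shannon entropy.

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits.

It is a more "formal" calculation of entropy than simply counting letters: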

/// <summary>
/// returns bits of entropy represented in a given string, per 
/// http://en.wikipedia.org/wiki/Entropy_(information_theory) 
/// </summary>
public static double ShannonEntropy(string s)
{
    var map = new Dictionary<char, int>();
    foreach (char c in s)
    {
        if (!map.ContainsKey(c))
            map.Add(c, 1);
        else
            map[c] += 1;
    }

    double result = 0.0;
    int len = s.Length;
    foreach (var item in map)
    {
        var frequency = (double)item.Value / len;
        result -= frequency * (Math.Log(frequency) / Math.Log(2));
    }

    return result;
}

There are some subtleties here. What you're calculating is not the entropy of the string but the entropy of a character in the string. You should consider whether to include a pseudo-character with frequency 1 for the string terminator (a unary number has some content), and whether to multiply by the length of the string.
– Peter Taylor
Feb 22, 2011 at 16:42

Sorry, I just noticed this; it is equivalent to the code I posted later. Jeff, this is definitely a better solution. I think the top-voted answer to this question misses the point.
– BlueRaja-Danny Pflughoeft
Feb 24, 2011 at 2:03

Here we see another case where a frequency data structure would be useful. var map = new FrequencyTable(); foreach (char c in s) { map.Add(c); }
– ICR
Feb 27, 2011 at 14:14

If you're not using the keys, would it be clearer to do foreach (var value in map.Values)?
– ICR
Feb 27, 2011 at 14:15

Not that it would be a big deal, but I would move the Math.Log(2) calculation out of the loop.
– Jesse C. Slicer
Nov 21, 2011 at 19:15
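The suggestions in the comments above (iterate over counts only, compute Math.Log(2) once, and optionally multiply by the string length) can be folded into one short sketch. This is only an illustration, not the answer's code: the class name ShannonSketch and the method names BitsPerChar and TotalBits are made up for this example.

```csharp
using System;
using System.Linq;

// Hypothetical names, used only for this sketch.
public static class ShannonSketch
{
    // Shannon entropy of the character distribution, in bits per character.
    public static double BitsPerChar(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;

        double len = s.Length;
        double log2 = Math.Log(2); // computed once, outside the summation

        return s.GroupBy(c => c)
                .Select(g => g.Count() / len)       // only the counts are used
                .Sum(p => -p * Math.Log(p) / log2);
    }

    // Per-character entropy scaled by the length, as a rough total for the string.
    public static double TotalBits(string s)
    {
        return string.IsNullOrEmpty(s) ? 0.0 : BitsPerChar(s) * s.Length;
    }
}
```

For "lltt" this gives 1 bit per character and 4 bits in total, while a string made of a single repeated character gives 0 either way.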

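Answer #4

In theory, you can only measure entropy from the point of view of a given model. For instance, the digits of PI are evenly distributed, but is the entropy actually high? Not at all, since the infinite sequence can be compressed into a small program that computes all the digits. However, I want to suggest a few things that can make a very simple but practical model. Comparing identical characters is this in some way, but the generalization is to build a frequency table and check the distribution.

Given a string of length N, how many 'A' characters should I expect on average, given my model (which can be the English distribution, or a natural distribution)?

But then what about "abcdefg"? There is no repetition here, but this is not random at all.
So what you want here is to also take the first derivative, and check the distribution of the first derivative. Subtract the first character from the second, the second from the third, and so on, so our example string turns into: "abcdefg" => 1,1,1,1,1,1

Now what about "ababab"...? This appears to have a better distribution, since the derivative is 1,-1,1,-1,... So what you actually want here is to take the absolute value.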
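A rough sketch of the first-derivative idea, assuming "character" means a UTF-16 code unit and that the distribution is summarized with Shannon entropy. The names DerivativeSketch, AbsoluteDeltas and Bits are hypothetical and only serve this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names, used only for this sketch.
public static class DerivativeSketch
{
    // Absolute differences between consecutive UTF-16 code units.
    public static List<int> AbsoluteDeltas(string s)
    {
        var deltas = new List<int>();
        for (int i = 1; i < s.Length; i++)
            deltas.Add(Math.Abs(s[i] - s[i - 1]));
        return deltas;
    }

    // Shannon entropy (bits per symbol) of any sequence of values.
    public static double Bits<T>(IEnumerable<T> values)
    {
        var list = values.ToList();
        if (list.Count == 0) return 0.0;

        double n = list.Count;
        return list.GroupBy(v => v)
                   .Sum(g => -(g.Count() / n) * Math.Log(g.Count() / n, 2));
    }
}
```

Bits("abcdefg") is high because every character is different, but Bits(AbsoluteDeltas("abcdefg")) is 0 because every delta is 1. The same holds for "ababab" once the absolute value is taken.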

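Long strings

If the string is long enough, the no-brainer approach is: try to compress it, and calculate the ratio between the compressed output and the input.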
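One way the compression ratio could be computed, assuming UTF-8 bytes as the input and the framework's DeflateStream as the compressor. The choice of compressor and the names CompressionSketch and CompressionRatio are assumptions made for this sketch.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical names, used only for this sketch.
public static class CompressionSketch
{
    // Compressed size divided by original size (UTF-8 bytes).
    // Lower values mean the text is more repetitive.
    public static double CompressionRatio(string s)
    {
        byte[] input = Encoding.UTF8.GetBytes(s);
        if (input.Length == 0) return 0.0;

        using (var output = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the
            // DeflateStream is disposed, which is what flushes the data.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
            {
                deflate.Write(input, 0, input.Length);
            }
            return (double)output.Length / input.Length;
        }
    }
}
```

On very short inputs the compressed form can be larger than the original, which is why this is suggested for long strings only.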

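Tricky... asdfghjkl; is also a pretty poor string.
– Sam Saffron
Feb 20, 2011 at 22:17

@Sam: the first-derivative test will actually flag your string as low entropy. Of course, here you are changing the model, that is, following the position of the characters on the keyboard, which is also a good model. You can add this to the mix too, of course.
– Antirez
Feb 20, 2011 at 22:19

A very interesting approach. Keep in mind that our entropy test is mainly targeted at really short strings. This is a classic example of where it is used in combination with some other algorithms (stackoverflow.com/review/…)
– Sam Saffron
Feb 20, 2011 at 22:24

You can't tell by looking at a string (abc) whether it was produced randomly. If you pick 3 characters from a uniform distribution, then abc, aaa, zzz, zur and apk are all equally likely. Of course, in your example you chose abcdef deliberately and not at random, but that doesn't prove that a random generator couldn't have produced it.
– user unknown
Feb 21, 2011 at 11:59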

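Answer #5

How about actually computing entropy? Also, it's not clear that character-level entropy will help, but here goes. It's in my mother tongue, C++, but surely you can convert this to Java using Array instead of std::vector.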

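#include <cmath>
#include <vector>

// Shannon entropy of the byte distribution of str, in bits per character.
float CharacterEntropy(const char *str) {
  std::vector<unsigned> counts(256);
  for (const char *i = str; *i; ++i)
    ++counts[static_cast<unsigned char>(*i)];
  unsigned int total = 0;
  for (unsigned i = 0; i < 256; ++i)
    total += counts[i];
  if (total == 0) return 0.0f;  // empty string: nothing to measure
  float total_float = static_cast<float>(total);
  float ret = 0.0f;
  for (unsigned i = 0; i < 256; ++i) {
    if (counts[i] == 0) continue;  // 0 * log(0) is taken as 0
    float p = static_cast<float>(counts[i]) / total_float;
    ret -= p * logf(p);
  }
  // logf is the natural log; dividing by ln 2 converts nats to bits.
  return ret / M_LN2;
}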

Note that 0 * log(0) -> 0
– Neil G
Feb 21, 2011 at 0:25

Not Java, I guess it's C#. In Java it's 'String', not 'string'. :)
– user unknown
Feb 21, 2011 at 13:18

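Answer #6

Similar to zngu's answer, I think that better than just counting the number of characters would be calculating the character entropy of the message: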

public double CalculateEntropy(string entropyString)
{
    Dictionary<char, int> characterCounts = new Dictionary<char, int>();
    foreach(char c in entropyString.ToLower())
    {
        if(c == ' ') continue;
        int currentCount;
        characterCounts.TryGetValue(c, out currentCount);
        characterCounts[c] = currentCount + 1;
    }

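    // Note: Math.Log is the natural logarithm, so the result is in nats, and
    // the divisor is the full string length, including the spaces skipped above.
    IEnumerable<double> characterEntropies = 
        from c in characterCounts.Keys
        let frequency = (double)characterCounts[c]/entropyString.Length
        select -1*frequency*Math.Log(frequency);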

    return characterEntropies.Sum();
}

Here are some tests:

private void CalculateEntropyTest(object sender, EventArgs e)
{
    string[] testStrings = {
        "Hello world!",
        "This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs",
        String.Join("", "This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs".ToCharArray().OrderBy(o => o).Select(o => o.ToString()).ToArray()),
        "Won't this work too?\nstring name = \"lltt\";\nint uniqueCharacterCount = name.Distinct().Count();\nwill return 2",
        "Pull the entropy finding source from any compression algotithm, i.e. Huffman",
        "float CharacterEntropy(const char *str) {\n  std::vector<unsigned> counts(256);\n  for (const char *i = str; *i; ++i)\n    ++counts[static_cast<unsigned char>(*i)];\n  unsigned int total = 0;\n  for (unsigned i = 0; i < 256; ++i)\n    total += counts[i];\n  float total_float = static_cast<float>(total);\n  float ret = 0.0;\n  for (unsigned i = 0; i < 256; ++i) {\n    float p = static_cast<float>(counts[i]) / total_float;\n    ret -= p * logf(p);\n  }\n  return p * M_LN2;\n}",
        "~~~~~~No.~~~~~~",
        "asdasdasdasdasdasd",
        "abcdefghijklmnopqrstuvwxyz",
        "Fuuuuuuu-------",                
    };
    foreach(string str in testStrings)
    {
        Console.WriteLine("{0}\nEntropy: {1:0.000}\n", str, CalculateEntropy(str));
    }
}

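Results:

Hello world!
Entropy: 1.888

This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs
Entropy: 2.593

                       -TTaaaaaaabccccddeeeeeeeeeeeeeeffgggggghhhhhhhiiiiiiiijkllllllllmnnnnnnnnnooooooppqrrrsssssssttttttttuuuvwxyyz
Entropy: 2.593

Won't this work too?
string name = "lltt";
int uniqueCharacterCount = name.Distinct().Count();
will return 2
Entropy: 2.838

Pull the entropy finding source from any compression algotithm, i.e. Huffman
Entropy: 2.641

float CharacterEntropy(const char *str) {
  std::vector<unsigned> counts(256);
  for (const char *i = str; *i; ++i)
    ++counts[static_cast<unsigned char>(*i)];
  unsigned int total = 0;
  for (unsigned i = 0; i < 256; ++i)
    total += counts[i];
  float total_float = static_cast<float>(total);
  float ret = 0.0;
  for (unsigned i = 0; i < 256; ++i) {
    float p = static_cast<float>(counts[i]) / total_float;
    ret -= p * logf(p);
  }
  return p * M_LN2;
}
Entropy: 2.866

~~~~~~No.~~~~~~
Entropy: 0.720

asdasdasdasdasdasd
Entropy: 1.099

abcdefghijklmnopqrstuvwxyz
Entropy: 3.258

Fuuuuuuu-------
Entropy: 0.892

In reality, I think it would be better to do some frequency analysis, but I don't know anything about the frequencies of the symbols used in code. The best place to determine that would be the stackoverflow data dump. I'll have to get back to you after it finishes downloading, in 2 years.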

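Answer #7

I don't understand the point of the bool. You never appear to set it to false, so we could use a List<T> instead, and since only unique items matter, a HashSet<T> is the better fit. This method should be equivalent and faster: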

/// <summary>
/// returns the # of unique characters in a string as a rough 
/// measurement of entropy
/// </summary>
public static int Entropy(this string s)
{
    var hs = new HashSet<char>();
    foreach (char c in s)
        hs.Add(c);
    return hs.Count();
}

Although I agree that using a HashSet is clearer than using a Dictionary and just ignoring its values, I don't see any reason why it would be faster.
– sepp2k
Feb 20, 2011 at 21:45

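Answer #8

Why not divide the number of unique characters in the given string by the total number of characters in that string? That would give a more accurate measure of entropy.

For example, going by your formula, an entropy of 3 for a string of 5 characters should be fine, but an entropy of 3 for a string of 8 characters is poor. Your formula cannot differentiate between the two results, whereas the formula above would give a more accurate measure.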
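A minimal sketch of the ratio described above, assuming "characters" means C# chars. The names RatioSketch and UniqueRatio are made up for the example.

```csharp
using System.Linq;

// Hypothetical names, used only for this sketch.
public static class RatioSketch
{
    // Number of distinct chars divided by the total number of chars.
    public static double UniqueRatio(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;
        return (double)s.Distinct().Count() / s.Length;
    }
}
```

With this, 3 unique characters in a 5-character string such as "aabbc" score 0.6, while 3 unique characters in an 8-character string such as "aaabbbcc" score 0.375, so the two cases from the example are told apart.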

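Answer #9

I think antirez is right in suggesting that an entropy approach needs a model. So, assuming we're talking English, examining the string's character distribution and how closely it aligns with the "average" will likely show that the text is mostly English. But is that what you want to achieve? Presumably many things are code or pseudo-code. Compression is a great idea, but it will give the highest entropy for random text. Is high entropy bad? Low entropy would indicate lots of repetition, maybe verbosity, but one can write long, drawn-out sentences with frilly words and convey little information (e.g. this comment).

Answer #10

I just whipped this algorithm together, so I have no idea how good it is. I fear that it will cause an overflow exception if used on very long strings.

Key concepts of this algorithm:

When a character is encountered for the first time, the maximum value is added to the un-normalized entropy total. The "maximum value" is the length of the string.
If a character is encountered again, we count the number of positions between this occurrence and the last occurrence, then subtract the total number of times this character has appeared in the string so far. We then add that value to the un-normalized entropy total.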

public static int Entropy(this string s)
{
    int entropy = 0;

    var mapOfIndexByChar = new Dictionary<char, CharEntropyInfo>();

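    // Note: index is never advanced inside the loop below, so (index - LastIndex)
    // is always 0 and each repeated character only subtracts its occurrence count.
    int index = 0;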
    foreach (char c in s)
    {
        CharEntropyInfo charEntropyInfo;
        if (mapOfIndexByChar.TryGetValue(c, out charEntropyInfo))
        {
            // If this character has occurred previously, then only add the number of characters from
            // the last occurrence to this occurrence, and subtract the number of previous occurrences.
            // Many repeated characters can actually result in the entropy total being negative.
            entropy += ((index - charEntropyInfo.LastIndex) - charEntropyInfo.Occurrences);

            // update the last index and number of occurrences of this character
            mapOfIndexByChar[c] = new CharEntropyInfo(index, charEntropyInfo.Occurrences + 1);
        }
        else
        {
            // each newly found character adds the maximum possible value to the entropy total
            entropy += s.Length;

            // record the first index of this character
            mapOfIndexByChar.Add(c, new CharEntropyInfo(index, 1));
        }
    }

    // divide the entropy total by the length of the string to "normalize" the result
    return entropy / s.Length;
}

struct CharEntropyInfo
{
    int _LastIndex;
    int _Occurrences;

    public int LastIndex
    {
        get { return _LastIndex; }
    }
    public int Occurrences
    {
        get { return _Occurrences; }
    }

    public CharEntropyInfo(int lastIndex, int occurrences)
    {
        _LastIndex = lastIndex;
        _Occurrences = occurrences;
    }
}

Quick test:

        var inputs = new[]{
            "Hi there!",
            "Hi there, bob!",
            "ababababababababababababab",
            @"We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality.

I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the ""dumbest thing that works""."
        };

        foreach (string s in inputs)
        {
            System.Console.WriteLine("{1}: \"{0}\"", s, s.Entropy());
        }

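Resulting entropy values:

7: "Hi there!"
10: "Hi there, bob!"
25: "We're calculating entropy of a string ..."

Answer #11

You could probably extend this to bigrams and trigrams to catch things like "sdsdsdsdsdsdsdsdsdsdsdsd" (although yours would catch that one too). Would a Bayesian approach, like the one spam filters use, be appropriate for what you're trying to achieve?

The first derivative would also catch this easily.
– Antirez
Feb 20, 2011 at 22:04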
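A sketch of how the bigram and trigram idea might look, assuming overlapping n-grams of C# chars and Shannon entropy as the summary. The names NGramSketch and NGramBits are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names, used only for this sketch.
public static class NGramSketch
{
    // Shannon entropy (in bits) of the distribution of overlapping n-grams.
    public static double NGramBits(string s, int n)
    {
        if (s == null || n <= 0 || s.Length < n) return 0.0;

        var counts = new Dictionary<string, int>();
        int total = s.Length - n + 1;
        for (int i = 0; i < total; i++)
        {
            string gram = s.Substring(i, n);
            int current;
            counts.TryGetValue(gram, out current); // current stays 0 if absent
            counts[gram] = current + 1;
        }

        double bits = 0.0;
        foreach (int count in counts.Values)
        {
            double p = (double)count / total;
            bits -= p * Math.Log(p, 2);
        }
        return bits;
    }
}
```

A string like "sdsdsdsd" only ever produces the bigrams "sd" and "ds", so NGramBits(s, 2) stays below 1 bit no matter how long the string gets, while "abcdefgh" yields seven different bigrams.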

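Answer #12

I'm going to assume this is English (since that's all we do). Wouldn't it be better to keep a HashSet<string> of stop words (the most common words in English that don't convey meaning), tokenize the string into words, and count the number of words that aren't stop words?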
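A small sketch of the stop-word count, with a deliberately tiny, illustrative word list and a naive split on whitespace and punctuation. The list contents, the separators and the names StopWordSketch and MeaningfulWordCount are all assumptions for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names, used only for this sketch.
public static class StopWordSketch
{
    // Illustrative only; a usable list would be much longer.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "the", "is", "it", "of", "to", "and", "in", "that"
        };

    // Counts the words that are not in the stop-word set.
    public static int MeaningfulWordCount(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0;

        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        return s.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                .Count(word => !StopWords.Contains(word));
    }
}
```

For "It is the end of the road." only "end" and "road" survive the filter, so the count is 2.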

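Answer #13

I would try to count each character and verify that it roughly matches the normal frequency of English letters. It could be more precise (on sufficiently large inputs) than counting the number of letters.

If you sort the letters by their number of appearances, you should, statistically speaking, get something like ETAONRISHDLFCMUGYPWBVKXJQZ. You could use the edit distance between this string and the letters sorted by number of appearances to give a rough measurement of entropy. (If you do that, I suggest you exclude code snippets from the count...)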
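The comparison described above could be sketched as follows, assuming only the letters A to Z are counted (case-insensitively), ties are broken alphabetically, and "edit distance" means plain Levenshtein distance. The names LetterOrderSketch, LettersByFrequency, EditDistance and DistanceFromEnglish are made up for this example.

```csharp
using System;
using System.Linq;

// Hypothetical names, used only for this sketch.
public static class LetterOrderSketch
{
    private const string EnglishOrder = "ETAONRISHDLFCMUGYPWBVKXJQZ";

    // Letters A-Z that occur in the text, most frequent first
    // (ties broken alphabetically).
    public static string LettersByFrequency(string text)
    {
        var letters = text.ToUpperInvariant()
                          .Where(c => c >= 'A' && c <= 'Z')
                          .GroupBy(c => c)
                          .OrderByDescending(g => g.Count())
                          .ThenBy(g => g.Key)
                          .Select(g => g.Key);
        return new string(letters.ToArray());
    }

    // Classic dynamic-programming Levenshtein distance.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Smaller values mean the letter ranking looks more like English.
    public static int DistanceFromEnglish(string text)
    {
        return EditDistance(LettersByFrequency(text), EnglishOrder);
    }
}
```

Short inputs rarely contain all 26 letters, so their distance is dominated by the missing ones, which fits the remark that this needs sufficiently large inputs.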

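As a second cut beyond the raw count of unique characters, I suggest calculating the variance of the counts of each unique character. That way you aren't biased toward English versus code, but simply require that some characters appear less frequently than others.
– David Harkness
Feb 20, 2011 at 23:29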
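The variance suggested in this comment might be computed like this, assuming the population variance over the per-character counts. The names VarianceSketch and CountVariance are hypothetical.

```csharp
using System.Linq;

// Hypothetical names, used only for this sketch.
public static class VarianceSketch
{
    // Population variance of the occurrence counts of each distinct char.
    public static double CountVariance(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;

        var counts = s.GroupBy(c => c).Select(g => (double)g.Count()).ToList();
        double mean = counts.Average();
        return counts.Sum(c => (c - mean) * (c - mean)) / counts.Count;
    }
}
```

A string in which every character appears equally often, such as "abcd", has variance 0, while "aaab" (counts 3 and 1) has variance 1.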

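Answer #14

I have seen many answers that suggest counting the number of distinct characters. But beware that this only works for 16-bit characters!

A char in C# is a UTF-16 code unit. Extended Unicode characters are stored in multiple C# chars. CharUnicodeInfo.GetUnicodeCategory lets us detect whether a C# char represents a real character, or whether it is part of an extended Unicode character (UnicodeCategory.Surrogate) or a combining character.

Test (faux) entropy: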

 public static void Main()
 {
     var value = "\U00020B20";

     // yields 2, even though \U00020B20 represents a single unicode-character '𠬠'
     var entropyTest = value.Distinct().Count(); 
 }

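To count characters (not C# chars), we need to augment our algorithm. I am using a class called Grapheme to do the trick. This class is able to detect extended Unicode characters and diacritics.

Test entropy: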

 public static void Main()
 { 
     var grapheme = Grapheme.Parse("\U00020B20");

     // yields 1, as \U00020B20 represents a single unicode-character '𠬠'.
     var entropyTest = grapheme.Select(x => x.Glyph).Distinct().Count();

     // yields 2, as \U00020B20 is stored in 2 C# characters.
     var codeUnits = grapheme.Single().CodeUnits.Length;
 }

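Final notes:

Testing a string for entropy is not context-free. Depending on the font used, certain characters or combined characters yield identical glyphs. Entropy can therefore only be calculated in the context of a font. The Grapheme class does not take this into account, since different fonts would yield different entropies. The Grapheme class is said to be context-free.

(A) Two different characters could have the exact same glyph (homoglyphs)
(B) A combined character could have the same glyph as another character

Examples:

A: \u0061 and \u0430 both render as the letter 'a' in certain fonts
B: Å is both the character \u00C5 and the combined character A followed by a combining ring above (\u030A)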
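Case (B) can be handled without font knowledge by normalizing before counting. The sketch below does not use the Grapheme class; it assumes the framework's StringInfo text elements are an acceptable stand-in for "characters", and the names TextElementSketch and DistinctTextElements are made up for the example. Case (A), homoglyphs, is not addressed by it.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical names, used only for this sketch.
public static class TextElementSketch
{
    // Counts distinct text elements after composing to Unicode form C.
    public static int DistinctTextElements(string s)
    {
        var seen = new HashSet<string>();

        // FormC composes sequences such as "A" + U+030A into U+00C5 where possible.
        string normalized = s.Normalize(NormalizationForm.FormC);

        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(normalized);
        while (e.MoveNext())
            seen.Add(e.GetTextElement());

        return seen.Count;
    }
}
```

Here "\U00020B20" counts as one element even though it occupies two C# chars, and "\u00C5" and "A\u030A" are counted as the same element. "\u0061" and "\u0430" still count as two, because telling homoglyphs apart depends on the font.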

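Appendix: Grapheme

using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Guard is a helper class that is not shown here.
public class Grapheme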
{
    private char[] _codeUnits;
    private Grapheme[] _diacritics;
    private string _glyph;

    public Grapheme(string glyph) {

        Guard.NotNull(glyph, "glyph");
        _glyph = StringInfo.GetNextTextElement(glyph);
        Guard.Condition(_glyph.Length != glyph.Length, "glyph", "Invalid glyph specified");

        var codeUnits = new List<char>();
        var diacritics = new List<Grapheme>();
        var buffer = _glyph;

        if (buffer.Length > 0) {
            var cu0 = CharUnicodeInfo.GetUnicodeCategory(buffer[0]);
            switch (cu0) {
                case UnicodeCategory.Surrogate:
                    codeUnits.AddRange(buffer.Take(2));
                    buffer = buffer.Substring(2);
                    break;
                default:
                    codeUnits.Add(buffer[0]);
                    buffer = buffer.Substring(1);
                    break;
            }
            diacritics.AddRange(Parse(buffer));
        }

        _codeUnits = codeUnits.ToArray();
        _diacritics = diacritics.ToArray();

        if (_codeUnits.Length == 2) {
            Guard.Condition(!char.IsSurrogatePair(new string(_codeUnits), 0),
                "glyph", "Invalid surrogate pair specified");
        }
    }

    public static Grapheme[] Parse(string value) {
        Guard.NotNull(value, "value");
        return StringInfo.ParseCombiningCharacters(value).Select(i 
                        => new Grapheme(StringInfo.GetNextTextElement(value, i))).ToArray();
    }

    public static int[] ParseIndices(string value) {
        Guard.NotNull(value, "value");
        return StringInfo.ParseCombiningCharacters(value).ToArray();
    }

    public static Grapheme ParseNext(string value, int index) {
        return new Grapheme(StringInfo.GetNextTextElement(value, index));
    }

    public static Grapheme ParseNext(string value) {
        return ParseNext(value, 0);
    }

    public char[] CodeUnits { 
        get { 
            return _codeUnits; 
        }
    }

    public Grapheme[] Diacritics {
        get { 
            return _diacritics; 
        }
    }

    public string Glyph {
        get { 
            return _glyph;
        }
    }

    public Grapheme[] Flatten() {
        return new[] { this }.Concat(_diacritics.SelectMany(x => x.Flatten())).ToArray();
    }

    public Grapheme Normalize() {
        return new Grapheme(_glyph.Normalize());
    }

    public Grapheme Normalize(NormalizationForm form) {
        return new Grapheme(_glyph.Normalize(form));
    }

    public override bool Equals(object obj) {
        if (obj is Grapheme) {
            return string.Equals(((Grapheme)obj)._glyph, _glyph);
        }
        return false;
    }

    public override int GetHashCode() {
        return _glyph.GetHashCode();
    }

    public override string ToString() {
        return _glyph;
    }
}