Markov country name generator

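I wrote a country name generator in Python 3.5. My goal is to get random names that look as much like real-world names as possible. Each name must have a noun form and an adjective form (e.g. Italy and Italian). The names in the input file are broken into syllables, with the noun and adjective endings separated from the stem (e.g. i-ta-l y/ian). The program splits each name into syllables, and each syllable into three segments: onset, nucleus, and coda (i.e. the leading consonants, the vowels, and the trailing consonants). It then uses the frequencies of these segments relative to one another to drive a Markov process that produces names. (It is not a pure Markov process, because I wanted to ensure that the distribution of syllable counts resembles that of the input set. I also special-cased the endings.) Several kinds of undesirable names are rejected.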
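As a small illustration of the segmentation described above, the sketch below applies the same regular expression that the program uses to the hyphen-separated pieces of one stem. The helper name splitSyllable is hypothetical and is not part of the program.

```python
import re

# Same pattern as syllableRgx in the program: onset, nucleus, coda
syllableRgx = re.compile(r"(y|[^aeiouy]*)([aeiouy]+|$)([^aeiouy]*)")


def splitSyllable(syllable):
    """Return the (onset, nucleus, coda) segments of one syllable.

    splitSyllable is a hypothetical helper used only for this illustration.
    """
    return syllableRgx.match(syllable).groups()


# The stem of "i-ta-l y/ian" splits into two full syllables plus a final
# partial syllable that consists of an onset only.
for piece in "i-ta-l".split("-"):
    print(repr(piece), splitSyllable(piece))
```

The last piece, "l", yields an empty nucleus and an empty coda, which is the case the program treats as the partial syllable right before the ending.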

#!/usr/bin/python3

import re, random

# A regex that matches a syllable, with three groups for the three
# segments of the syllable: onset (initial consonants), nucleus (vowels),
# and coda (final consonants).
# The regex also matches if there is just an onset (even an empty
# onset); this case corresponds to the final partial syllable of the
# stem, which is usually the consonant before a vowel ending (for
# example, the d in "ca-na-d a").
syllableRgx = re.compile(r"(y|[^aeiouy]*)([aeiouy]+|$)([^aeiouy]*)")
nameFile = "names.txt"

# Dictionary that holds the frequency of each syllable count (note that these
# are the syllables *before* the ending, so "al-ba-n ia" only counts two)
syllableCounts = {}

# List of four dictionaries (for onsets, nuclei, codas, and endings):
# Each dictionary's key/value pairs are prevSegment:segmentDict, where
# segmentDict is a frequency dictionary of various onsets, nuclei, codas,
# or endings, and prevSegment is a segment that can be the last nonempty
# segment preceding them. A prevSegment of None marks segments at the
# beginnings of names.
segmentData = [{}, {}, {}, {}]
ONSET = 0
NUCLEUS = 1
CODA = 2
ENDING = 3

# Read names from file and generate the segmentData structure
with open(nameFile) as f:
    for line in f.readlines():
        # Strip whitespace, ignore blank lines and comments
        line = line.strip()
        if not line:
            continue
        if line[0] == "#":
            continue
        stem, ending = line.split()
        # Endings should be of the format noun/adj
        if "/" not in ending:
            # The noun ending is given; the adjective ending can be
            # derived by appending -n
            ending = "{}/{}n".format(ending, ending)
        # Syllable count is the number of hyphens
        syllableCount = stem.count("-")
        if syllableCount in syllableCounts:
            syllableCounts[syllableCount] += 1
        else:
            syllableCounts[syllableCount] = 1

        # Add the segments in this name to segmentData
        prevSegment = None
        for syllable in stem.split("-"):
            segments = syllableRgx.match(syllable).groups()
            if segments[NUCLEUS] == segments[CODA] == "":
                # A syllable with empty nucleus and coda comes right before
                # the ending, so we only process the onset
                segments = (segments[ONSET],)
            for segType, segment in enumerate(segments):
                if prevSegment not in segmentData[segType]:
                    segmentData[segType][prevSegment] = {}
                segFrequencies = segmentData[segType][prevSegment]
                if segment in segFrequencies:
                    segFrequencies[segment] += 1
                else:
                    segFrequencies[segment] = 1
                if segment:
                    prevSegment = segment
        # Add the ending to segmentData
        if prevSegment not in segmentData[ENDING]:
            segmentData[ENDING][prevSegment] = {}
        endFrequencies = segmentData[ENDING][prevSegment]
        if ending in endFrequencies:
            endFrequencies[ending] += 1
        else:
            endFrequencies[ending] = 1


def randFromFrequencies(dictionary):
    "Returns a random dictionary key, where the values represent frequencies."

    keys = dictionary.keys()
    frequencies = dictionary.values()
    index = random.randrange(sum(dictionary.values()))
    for key, freq in dictionary.items():
        if index < freq:
            # Select this one
            return key
        else:
            index -= freq
    # Weird, should have returned something
    raise ValueError("randFromFrequencies didn't pick a value "
                     "(index remainder is {})".format(index))

def markovName(syllableCount):
    "Generate a country name using a Markov-chain-like process."

    prevSegment = None
    stem = ""
    for syll in range(syllableCount):
        for segType in [ONSET, NUCLEUS, CODA]:
            try:
                segFrequencies = segmentData[segType][prevSegment]
            except KeyError:
                # In the unusual situation that the chain fails to find an
                # appropriate next segment, it's too complicated to try to
                # roll back and pick a better prevSegment; so instead,
                # return None and let the caller generate a new name
                return None
            segment = randFromFrequencies(segFrequencies)
            stem += segment
            if segment:
                prevSegment = segment

    endingOnset = None
    # Try different onsets for the last syllable till we find one that's
    # legal before an ending; we also allow empty onsets. Because it's
    # possible we won't find one, we also limit the number of retries
    # allowed.
    retries = 10
    while (retries and endingOnset != ""
           and endingOnset not in segmentData[ENDING]):
        segFrequencies = segmentData[ONSET][prevSegment]
        endingOnset = randFromFrequencies(segFrequencies)
        retries -= 1
    stem += endingOnset
    if endingOnset != "":
        prevSegment = endingOnset
    if prevSegment in segmentData[ENDING]:
        # Pick an ending that goes with the prevSegment
        endFrequencies = segmentData[ENDING][prevSegment]
        endings = randFromFrequencies(endFrequencies)
    else:
        # It can happen, if we used an empty last-syllable onset, that
        # the previous segment does not appear before any ending in the
        # data set. In this case, we'll just use -a(n) for the ending.
        endings = "a/an"
    endings = endings.split("/")
    nounForm = stem + endings[0]
    # Filter out names that are too short or too long
    if len(nounForm) < 3:
        # This would give two-letter names like Mo, which don't appeal
        # to me
        return None
    if len(nounForm) > 11:
        # This would give very long names like Imbadossorbia that are too
        # much of a mouthful
        return None
    # Filter out names with weird consonant clusters at the end
    for consonants in ["bl", "tn", "sr", "sn", "sm", "shm"]:
        if nounForm.endswith(consonants):
            return None
    # Filter out names that sound like anatomical references
    for bannedSubstring in ["vag", "coc", "cok", "kok", "peni"]:
        if bannedSubstring in stem:
            return None
    if nounForm == "ass":
        # This isn't a problem if it's part of a larger name like Assyria,
        # so filter it out only if it's the entire name
        return None
    return stem, endings

Test code

def printCountryNames(count):
    for i in range(count):
        syllableCount = randFromFrequencies(syllableCounts)
        nameInfo = markovName(syllableCount)
        while nameInfo is None:
            nameInfo = markovName(syllableCount)
        stem, endings = nameInfo
        stem = stem.capitalize()
        noun = stem + endings[0]
        adjective = stem + endings[1]
        print("{} ({})".format(noun, adjective))

if __name__ == "__main__":
    printCountryNames(10)

Sample names.txt contents

# Comments are ignored
i-ta-l y/ian
# A suffix can be empty
i-ra-q /i
# The stem can end with a syllable break
ge-no- a/ese
# Names whose adjective suffix just adds an -n need only list the noun suffix
ar-me-n ia
sa-mo- a

My full names.txt file, along with the code, can be found in this Gist.

Sample output

Generated using the full data file:

Slorujarnia (Slorujarnian)
Ashmar (Ashmari)
Babya (Babyan)
Randorkia (Randorkian)
Esanoa (Esanoese)
Manglalia (Manglalic)
Konara (Konaran)
Lilvispia (Lilvispian)
Cenia (Cenian)
Rafri (Rafrian)

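Questions

Is my code readable? Are the variable and function names clear? Are there enough comments?
Should I reorganize anything?
Are there Python 3 features I could use, or use better? I'm especially unsure about format and the various ways of using it.

If you see anything else that could be improved, let me know. Just one exception: I know the PEP standard is snake_case, but I want to use camelCase and I don't intend to change that. Other formatting tips are welcome.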

You might consider running pep8 and pyflakes on your code.

I was really hoping to see "Eblonia" in the sample output.

@That1Guy I did get "Elbonia" in one test run. :^D

Answer #1

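It is better to follow PEP 8, which says that import statements such as yours (import re, random) should be on separate lines, one module per line.
Whatever programming language you use, it is better to avoid input/output operations as far as possible. So instead of storing the country names in a text file, you could choose a suitable Python data structure for the purpose.
When someone reads your main program, they should immediately know what it does. That is not the case with your main.py file, where I see a lot of distracting information and noise. For example, you should keep all those constants in a separate module, which you could call configuration.py, cfg.py, settings.py, or whatever name fits your project's architecture.
Choose meaningful names: while most of the names you chose are edible, I believe you could still improve a few of them. That is the case, for example, for nameFile, which is too vague and conveys nothing beyond what the assignment nameFile = "names.txt" already says. Of course it is the name of a file, but one can only guess what nameFile means after reading the program's statements, so I would suggest a more suitable and meaningful name right away, such as countries_names. Note that my suggestion does not contain the name of the container. I mean that the reader of your code should not have to learn programming details such as whether you store the information in a file or in some other data structure. A name should be "high level" and independent of the data structure it represents. This gives you the advantage of not having to hunt down and rewrite the same name throughout your program just because you changed the storage from a file to another data structure. The same applies to syllableRgx: sure, when someone reads syllableRgx = re.compile(r"..."), they know you are storing the result of a regular expression. But for the reasons I explained above, you should change this name to a better one.
You should follow the standard naming conventions. For example, syllableCounts and segmentData should be written syllable_counts and segment_data respectively. It is like joining a team of developers at a new company: you adapt to them, you do not ask them to adapt to your habits and wishes.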
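A minimal sketch of what the in-memory alternative to names.txt might look like, assuming the same stem and ending notation as the sample file in the question. The names COUNTRY_NAMES and normalize_ending are hypothetical, and the snake_case spelling simply follows the convention recommended in this answer.

```python
# Hypothetical in-memory replacement for names.txt: (stem, ending) pairs
# using the same notation as the sample file in the question.
COUNTRY_NAMES = (
    ("i-ta-l", "y/ian"),
    ("i-ra-q", "/i"),
    ("ge-no-", "a/ese"),
    ("ar-me-n", "ia"),
    ("sa-mo-", "a"),
)


def normalize_ending(ending):
    """Return the ending in noun/adj form.

    When only the noun ending is given, the adjective ending is derived
    by appending -n, as the original program does.
    """
    if "/" not in ending:
        return "{0}/{0}n".format(ending)
    return ending


for stem, ending in COUNTRY_NAMES:
    print(stem, normalize_ending(ending))
```

Whether this is preferable to a data file is a judgment call: a tuple in a module removes the file parsing, while a plain text file stays editable without touching the code.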

Hmm. Edible names. (Readable?)
– TRiG
Aug 14 '17 at 10:19

Edible (rather a French way of expressing oneself): I mean "meaningful names that tell the purpose of the variable, function, or class".
– Billal Begueradj
Aug 14 '17 at 10:56

Sometimes information needs to be stored in a language-agnostic way. In that situation, would you keep the structure as it is, or first put everything into a variable and then print the whole variable once at the end?
– Dennis Jaheruddin
Aug 14 '17 at 15:15

Yes, constants should be written in capital letters, as the OP did, but syllable_counts is a dictionary that is initialized empty at the start and filled with data afterwards, so it is not a constant @Wyrmwood
– Billal Begueradj
Aug 14 '17 at 15:56

I think "digestible" (or better, "easy to digest") is better English while still staying true to your French intent.
– Sanchises
Aug 14 '17 at 20:04

Answer #2

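Looping over the lines of a file

This may be a minor nitpick, but when using the file object returned by open(), you can iterate over the object itself instead of calling readlines(), like this: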

# Read names from file and generate the segmentData structure
with open(nameFile) as input_names:
    for line in input_names:

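From the documentation:

readlines(hint=-1)

Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

Note that it's already possible to iterate on file objects using for line in file: ... without calling file.readlines().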

Testing whether any element matches a condition

...can be done with any(), map(), and an appropriate function, like so:

# Filter out names with weird consonant clusters at the end
weird_consonant_clusters = ["bl", "tn", "sr", "sn", "sm", "shm"]
if any(map(nounForm.endswith, weird_consonant_clusters)):
    return None

For bannedSubstring, however, there is no direct equivalent for the in operator: you would have to use count() or write a lambda, so there is probably not much to gain here.
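A generator expression is one way around this, since it lets any() use the in operator directly without count() or a lambda. The sketch below also relies on the fact that str.endswith accepts a tuple of suffixes. The function and constant names are hypothetical; the string lists are the ones from the question.

```python
WEIRD_CONSONANT_CLUSTERS = ("bl", "tn", "sr", "sn", "sm", "shm")
BANNED_SUBSTRINGS = ("vag", "coc", "cok", "kok", "peni")


def ends_with_weird_cluster(noun_form):
    """Return True if the name ends with one of the rejected clusters."""
    # str.endswith accepts a tuple of suffixes and tests each of them
    return noun_form.endswith(WEIRD_CONSONANT_CLUSTERS)


def contains_banned_substring(stem):
    """Return True if any banned substring occurs anywhere in the stem."""
    # The generator expression applies the `in` operator to each candidate
    return any(banned in stem for banned in BANNED_SUBSTRINGS)


print(ends_with_weird_cluster("kasm"))
print(contains_banned_substring("arme"))
```

Both helpers stop at the first match, so they behave like the original loops that return None as soon as a name is rejected.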

Increment or set

For the frequencies, wherever you increment or set a value, you can use the get method or a defaultdict, so that:

if ending in endFrequencies:
    endFrequencies[ending] += 1
else:
    endFrequencies[ending] = 1

becomes:

endFrequencies[ending] = endFrequencies.get(ending, 0) + 1

or, if endFrequencies is a defaultdict(int):

endFrequencies[ending] += 1
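The same idea extends to the nested tables keyed by the previous segment. Below is a minimal sketch, assuming a defaultdict whose values are collections.Counter objects (a standard-library class that also starts every count at zero). The sample pairs and the name ending_frequencies are made up for illustration.

```python
from collections import Counter, defaultdict

# One frequency table per previous segment; a missing previous segment
# starts out as an empty Counter, and a missing ending counts as zero.
ending_frequencies = defaultdict(Counter)

# Hypothetical (previous segment, ending) pairs standing in for parsed names
samples = [("l", "y/ian"), ("n", "ia/ian"), ("n", "ia/ian"), ("q", "/i")]
for prev_segment, ending in samples:
    ending_frequencies[prev_segment][ending] += 1

for prev_segment, table in sorted(ending_frequencies.items()):
    print(prev_segment, dict(table))
```

One caveat with this design: indexing a defaultdict with a missing key creates an empty entry instead of raising KeyError, so code such as markovName that relies on except KeyError would need an explicit membership test (prevSegment in table) instead.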

There is a special collection class designed exactly for this: collections.Counter. Just initialize with endFrequencies = Counter() and update with endFrequencies[ending] += 1.
– Chortos-2
Aug 14 '17 at 12:09

[facepalm] Forgot about iterating over a file. Following a suggestion in another answer, I'm no longer reading a file, but thanks. defaultdict is another good idea, though it looks like @Chortos-2's collections.Counter is exactly what I need.
– DLosc
Aug 14 '17 at 20:07
