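How can I split a text into sentences?

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a period being used in abbreviations.

My old regular expression works badly: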

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

I want to do this, but I want to split wherever there is either a period or a newline.
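
To make the pitfall concrete, here is a minimal sketch (not from the original post) that splits after a period, exclamation mark, or question mark followed by whitespace, and also at newlines. The helper name `naive_split` is hypothetical. It is only meant to show why abbreviations are the hard part.

```python
import re

def naive_split(text):
    """Split after . ! ? followed by whitespace, and at line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]

sample = "Mr. Smith arrived. He was late!\nNobody minded."
print(naive_split(sample))
# ['Mr.', 'Smith arrived.', 'He was late!', 'Nobody minded.']
```

The abbreviation "Mr." is cut off as its own "sentence", which is the kind of case the answers below try to handle.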

Answer #1

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates that this does it:

import nltk.data

# Requires the Punkt model data to be available locally.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

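@Artyom: It can probably work with Russian. See "Can NLTK/pyNLTK work per language (i.e. non-English), and how?"

– martineau
Jan 2, 2011 at 0:28

@Artyom: Here's a direct link to the online documentation for nltk.tokenize.punkt.PunktSentenceTokenizer.

– martineau
Jan 2, 2011 at 0:32

You might have to run nltk.download() first and download Models -> punkt.

– Martin Thoma
Jan 12, 2015 at 18:36

This fails on cases with ending quotation marks. If we have a sentence that ends like "this."

– Fosa
Feb 21, 2018 at 5:16

Alright, you convinced me. But I just tested it and it does not seem to fail. My input is 'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.' and my output is ['This fails on cases with ending quotation marks.', 'If we have a sentence that ends like "this."', 'This is another sentence.'] That seems correct to me.

– szedjani
Oct 31, 2019 at 10:37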

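Answer #2

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."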

# -*- coding: utf-8 -*-
import re
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # Raw strings keep \1 as a backreference; a plain "\1" would be the
    # control character chr(1). <prd> marks a period that is not a sentence end.
    text = re.sub(prefixes, r"\1<prd>", text)
    text = re.sub(websites, r"<prd>\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", r" \1<prd> ", text)
    text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
    text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", r" \1<prd>", text)
    # Move terminators outside closing quotes so the quote stays with its sentence.
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    # Mark every remaining terminator as a stop, then restore protected periods.
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

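This is an awesome solution. However, I added two more lines to it: digits = "([0-9])" in the declaration of the regular expressions, and text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text) in the function. Now it does not split the line at decimals such as 5.5. Thank you for this answer.

– Ameya Kulkarni
Jul 17, 2016 at 11:12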
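
The decimal fix from this comment can be tried in isolation. The sketch below is a stripped-down stand-in for the full function, not the answer's code: the helper names `protect_decimals` and `split_simple` are made up for illustration, and only the period terminator is handled.

```python
import re

digits = "([0-9])"

def protect_decimals(text):
    """Replace the dot between two digits with the <prd> placeholder."""
    return re.sub(digits + "[.]" + digits, r"\1<prd>\2", text)

def split_simple(text):
    """Protect decimals, mark stops, split, then restore the protected dots."""
    text = protect_decimals(text)
    text = text.replace(".", ".<stop>")
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

print(split_simple("I paid 5.5 dollars. It was worth it."))
# ['I paid 5.5 dollars.', 'It was worth it.']
```

Because matches cannot overlap, a string like 1.2.3 would only have its first dot protected; lookarounds such as (?<=[0-9])[.](?=[0-9]) would be one way around that.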

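How did you parse the entire Huckleberry Finn? Where is it available in text format?

– Pascal Kooten
Feb 4, 2017 at 10:52

A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.", "e<prd>g<prd>") and if "i.e." in text: text = text.replace("i.e.", "i<prd>e<prd>"), and it fully solved my problem.

– Sisay Chala
Jun 1, 2017 at 8:09

Great solution with very helpful comments! Just to make it a little more robust: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...", "<prd><prd><prd>")

– Dascienz
Jan 26, 2018 at 19:02

Can this function be made to treat something like the following as one sentence: When a child asks her mother "Where do babies come from?", what should one reply to her?

– whale
Apr 29, 2018 at 6:54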
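
One possible direction for the quoted-question case, offered as a sketch rather than a change to the answer's function: only treat a terminator as a boundary when it is followed by whitespace and then an uppercase letter or an opening quote. The name `split_keep_quotes` is hypothetical, and abbreviations such as "Mr." are not handled here.

```python
import re

# Split at whitespace that follows a terminator (optionally followed by a
# closing quote) and precedes an uppercase letter or an opening quote.
BOUNDARY = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+(?=["“A-Z])')

def split_keep_quotes(text):
    return [s.strip() for s in BOUNDARY.split(text) if s.strip()]

text = ('When a child asks her mother "Where do babies come from?", '
        'what should one reply to her? Tell the truth.')
for sentence in split_keep_quotes(text):
    print(sentence)
```

The question mark inside the quotation is followed by a quote and a comma rather than whitespace, so it does not trigger a split.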

Answer #3

Instead of using a regex to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

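Reference: https://stackoverflow.com/a/9474645/2877052

A better, simpler, and more reusable example than the accepted answer.

– Jay D.
Aug 8, 2019 at 14:49

If you remove the space after a period, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() does! Hmm...

– Leonid Ganeline
Aug 8, 2019 at 21:32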
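
If the input is known to be missing spaces after terminators, one workaround is to restore them before tokenizing. This is an assumption-laden sketch, not something NLTK provides: the helper `add_missing_spaces` is hypothetical, and it only acts when a lowercase letter precedes the terminator and an uppercase letter follows it, so that strings like craigslist.org or U.S.A are left alone.

```python
import re

def add_missing_spaces(text):
    """Insert a space after . ! ? when a lowercase letter comes right before
    it and an uppercase letter comes right after it."""
    return re.sub(r"(?<=[a-z])([.!?])(?=[A-Z])", r"\1 ", text)

print(add_missing_spaces("Good morning.The patient is waiting.Please hurry."))
# Good morning. The patient is waiting. Please hurry.
```

The result could then be passed to whichever sentence tokenizer you use. Identifiers such as file.Name would be altered too, so this only suits plain prose.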

for sentence in tokenize.sent_tokenize(text): print(sentence)

– Victoria Stuart
Feb 27 at 19:35

Answer #4

You can try using spaCy instead of a regex. I use it and it does the job.

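import spacy
# spaCy 3.x no longer accepts the 'en' shortcut; load a named model instead.
nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    # Span.string was removed in spaCy 3.x; Span.text is the replacement.
    print(sent.text.strip())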

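spaCy is mega great. But if you just need to separate text into sentences, passing the text to spaCy takes too long when you are dealing with a data pipeline.

– JFerro
Jun 19, 2019 at 19:22

@Berlines I agree, but I couldn't find any other library that does the job as cleanly as spaCy. If you have any suggestions, I can try them.

– Elf
Aug 16, 2019 at 11:19

Also, for the AWS Lambda serverless users out there: spaCy's supporting data files are many hundreds of MB (the large English model is over 400 MB), so you can't use things like this out of the box, very sadly (huge fan of spaCy here).

– Julian H
Jun 16 at 4:12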

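Answer #5

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example '.' vs. '."'.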

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry:
Find all occurrences of a substring in Python

Perfect approach! The others don't catch ... and ?!.

– Shane Smiskol
Jul 30, 2016 at 7:28
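
For comparison, runs of terminators such as ... and ?! can also be caught with a single regular expression. This is a different technique from the index-based answer above, sketched under simplifying assumptions: abbreviations are not handled, and any trailing text without a terminator is dropped. The name `find_sentences_regex` is made up for this example.

```python
import re

# One sentence = text up to a run of terminators, plus any closing wrappers.
SENTENCE = re.compile(r"[^.!?]+[.!?]+[\"')\]}]*")

def find_sentences_regex(text):
    return [m.group().strip() for m in SENTENCE.finditer(text)]

print(find_sentences_regex('Wait... What?! He said "no." Fine.'))
# ['Wait...', 'What?!', 'He said "no."', 'Fine.']
```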

Answer #6

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

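The regex is *\. +, which matches a period with 0 or more spaces to its left and 1 or more to its right (to prevent something like the period in re.split being counted as a sentence break).

Obviously, this is not the most robust solution, but it will do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).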
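
The capital-letter check suggested in the parenthesis could look roughly like this. It is a sketch under assumptions: the helper `merge_lowercase_starts` is hypothetical, and the split uses a lookbehind so the terminator stays attached to each fragment, which differs from the regex in the answer.

```python
import re

def merge_lowercase_starts(fragments):
    """Glue a fragment onto the previous one when it does not start with an
    uppercase letter, on the guess that the split before it was spurious."""
    merged = []
    for frag in fragments:
        if merged and frag and not frag[0].isupper():
            merged[-1] = merged[-1] + " " + frag
        else:
            merged.append(frag)
    return merged

text = "The firm, est. in 1990, is small. It sells tea."
# The lookbehind keeps the terminator attached to its sentence.
fragments = re.split(r"(?<=[.?!])\s+", text)
print(merge_lowercase_starts(fragments))
# ['The firm, est. in 1990, is small.', 'It sells tea.']
```

This only repairs splits where the next fragment starts in lowercase (or with a digit); a title such as "Dr. Adams" would still be split, because "Adams" is capitalized.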

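You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?)

– Ned Batchelder
Jan 1, 2011 at 22:37

@Ned wow, I can't believe I was that stupid. I must have been drunk or something.

– Rafe Kettler
Jan 1, 2011 at 22:39

I am using Python 2.7.2 on Win 7 x86, and the regex in the code above gives me this error: SyntaxError: EOL while scanning string literal, pointing to the closing parenthesis (after text). Also, the regex you reference in your text does not exist in your code sample.

– Sabuncu
Jul 23, 2013 at 18:35

The regex is not completely correct; it should be r' *[\.\?!][\'"\)\]]* +'

– Society
Sep 9, 2015 at 20:39

This may cause many issues and may also chop a sentence into smaller chunks. Consider the case "I paid $3.5 for this ice cream": the chunks are "I paid $3" and "5 for this ice cream". Using the default NLTK sentence tokenizer is safer!

– Reihan_amn
Feb 23, 2018 at 19:19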
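
The difference between the regex as posted and the corrected one from the comments can be checked directly on the $3.5 example. The variable names `loose` and `strict` are just labels for this comparison.

```python
import re

loose = re.compile(r' *[\.\?!][\'"\)\]]* *')   # as posted: trailing spaces optional
strict = re.compile(r' *[\.\?!][\'"\)\]]* +')  # as suggested: at least one space

text = "I paid $3.5 for this ice cream. It was good."
print(loose.split(text))
# ['I paid $3', '5 for this ice cream', 'It was good', '']
print(strict.split(text))
# ['I paid $3.5 for this ice cream', 'It was good.']
```

Requiring a space after the terminator keeps the decimal intact, although abbreviations followed by a space are still split.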

Answer #7

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

Answer #8

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Good luck,
Marilena.
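
A more compact way to get a similar token list is a single regular expression. This is an alternative sketch, not Marilena's function: the name `regex_tokenizer` is hypothetical, and it treats every non-word, non-space character as its own token, so a hyphenated word would be split where the function above would keep it together.

```python
import re

# An ellipsis, a run of word characters, or any single other non-space character.
TOKEN = re.compile(r"\.\.\.|\w+|[^\w\s]")

def regex_tokenizer(text):
    return TOKEN.findall(text)

print(regex_tokenizer('вы выполняете поиск, используя Google SSL;'))
# ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']
```

In Python 3, \w matches Unicode letters for str patterns, so Cyrillic words are kept whole.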

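Answer #9

Also, be wary of additional top-level domains that aren't included in some of the answers above.
For example, .info, .biz, .ru, and .online will throw off some sentence parsers but aren't included above.
Here is some information on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/
This can be addressed by editing the code above to read: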

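alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
# The dot in co.uk is written as [.] so it only matches a literal period.
# Note: the replacement in the function only protects the leading dot, so the
# dot inside co.uk would still need its own handling.
websites = r"[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)"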
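
Rather than editing the pattern by hand each time, the alternation could be built from a list. The following is a sketch, not part of the answer: the list contents, the `\b` word boundary, and the helper `protect_websites` are all assumptions added for illustration. It replaces every dot inside the matched suffix, which also covers multi-part entries like co.uk.

```python
import re

# Hypothetical list; extend it with whatever domains appear in your data.
tlds = ["com", "net", "org", "io", "gov", "ai", "edu", "co.uk", "ru", "info", "biz", "online"]

# Longest entries first so they are tried before shorter ones;
# re.escape makes the dot in "co.uk" literal.
alternatives = "|".join(re.escape(t) for t in sorted(tlds, key=len, reverse=True))
websites = re.compile(r"[.](?:" + alternatives + r")\b")

def protect_websites(text):
    """Replace every dot inside a matched domain suffix with <prd>."""
    return websites.sub(lambda m: m.group(0).replace(".", "<prd>"), text)

print(protect_websites("See example.co.uk and foo.info today."))
# See example<prd>co<prd>uk and foo<prd>info today.
```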

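This is helpful information, but it might be more appropriate to add it as a short comment on the original answer.

– vlz
Oct 23 at 13:32

That was my original plan, but apparently I don't have the reputation for it yet. I thought this might help someone, so I posted it as best I could. If there is a way to do it and get around the "you need 50 reputation" requirement first, I'd love to :)

– cogijl
Oct 26 at 0:30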

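Answer #10

No doubt NLTK is the most suitable tool for this purpose. But getting started with NLTK is quite painful (although once you install it, you just reap the rewards).

So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html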

# split up a paragraph into sentences
# using regular expressions
import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
#   (followed by an empty line, from the empty string after the final "?")
Yes, but this fails very easily, for example with: "Mr. Smith knows this is a sentence."

– Thomas
Feb 11, 2014 at 10:15

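Answer #11

I had to read subtitle files and split them into sentences. After pre-processing (such as removing the time information in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below split it neatly into sentences. I was probably lucky that the sentences always ended (correctly) with a space. Try this first, and if there are any exceptions, add more checks and balances.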

import re

# Very approximate way to split the text into sentences - break after ? . and !
# fullFile is assumed to already hold the pre-processed subtitle text.
# The raw string keeps \1 as a backreference to the matched terminator.
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
with open("./sentences.out", "w+") as sentFile:
    for line in sentences:
        sentFile.write(line)
        sentFile.write("\n")

Oh well! I now realize that since my content was Spanish, I did not have to deal with issues like "Mr. Smith". Still, if someone wants a quick and dirty parser...
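
The pre-processing step mentioned in this answer is not shown. A minimal sketch of what it might look like follows, assuming the usual .srt layout of a cue number, a timing line, and one or more text lines per cue. The helper name `srt_to_text` is hypothetical, and a subtitle line consisting only of digits would be dropped along with the cue numbers.

```python
import re

# Timing lines look like: 00:00:01,000 --> 00:00:03,000
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_text(srt):
    """Drop blank lines, cue numbers, and timing lines; join the rest."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    return " ".join(kept)

srt = "1\n00:00:01,000 --> 00:00:03,000\nHola.\n\n2\n00:00:04,000 --> 00:00:06,500\n¿Qué tal?\nBien.\n"
print(srt_to_text(srt))
# Hola. ¿Qué tal? Bien.
```

The returned string could then play the role of fullFile in the code above.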

Answer #12

Hope this helps with Latin, Chinese, and Arabic text.

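import re

# The whole run of terminators is captured as group 2 so it can be kept.
punctuation = re.compile(r"([^\d+])((?:\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+)")
lines = []

with open('myData.txt', 'r', encoding="utf-8") as myFile:
    # Keep the preceding character (\1) and the terminators (\2), then mark the break.
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]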

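Answer #13

I was working on a similar task and came across this question. After following a few links and working through a few NLTK exercises, the code below worked like magic for me.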

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

Output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

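Answer #14

import spacy
nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    # Span.text is used here; Span.string was removed in spaCy 3.x.
    print(sent.text.strip())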
