Library to grammatically classify English words (noun, verb, adverb, etc.)

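I have 1600 English words, but I don't know their grammatical type. I'd like to know, for each word in the list, whether it is a noun, a verb, an adverb, or some other part of speech. If a word can be used as several types, I'd like to get all of them (but the most common one would be enough). I'd like something that is easy to get started with: reading one tutorial and writing 20 lines of code would be ideal.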

Requirements:
Types for English words
Any programming language
Any license
Free of charge

Wouldn't this question be a better fit for Programmers?

@Izzy No, Programmers is more theoretical. I don't know which site this question would fit.

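@Izzy Software Engineering would treat this as a question asking for a programming language recommendation. I think the question is actually fine here: at its core it asks for a library recommendation, and the programming language follows from that.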

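@Gilles A library recommendation should already come with a language. "Recommend me a library in some random language that can do x" may be too broad, as Tim pointed out in his answer.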

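@Seth Tim's answer does not say that a library recommendation must specify a language. Many languages have easy cross-language bindings, so it is common to use a library written in language A from a program written in language B. Tim does say that "recommend which language I should use to build this project" is too broad, but here the "project" is basically calling one library function in a loop.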

Answer #1

You are looking for a POS tagger (= part-of-speech tagger).

The Stanford POS tagger is one of the most accurate and most widely used:

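Free (except for commercial, non-open-source software)
Open source, written in Java
Well documented
Trained models for English, Arabic, Chinese, French, and German
Bindings available for many other languages: Ruby, Python (NLTK 2.0+ includes an interface to the Stanford POS tagger), PHP, F#/C#/.NET, etc.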

Example:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.List;

import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

class TaggerDemo {

  private TaggerDemo() {}

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("usage: java TaggerDemo modelFile fileToTag");
      return;
    }
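    // Load the trained tagger model given as the first argument
    MaxentTagger tagger = new MaxentTagger(args[0]);
    // Split the input file into sentences, each a list of tokens
    List<List<HasWord>> sentences = MaxentTagger.tokenizeText(new BufferedReader(new FileReader(args[1])));
    for (List<HasWord> sentence : sentences) {
      // Tag one sentence and print it as word/TAG pairs
      List<TaggedWord> tSentence = tagger.tagSentence(sentence);
      System.out.println(Sentence.listToString(tSentence, false));
    }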
  }

}

Other tools: http://en.wikipedia.org/wiki/Part-of-speech_tagging

Answer #2

It is not a built-in function, but you can do it with Python and NLTK.

A simple version looks like this:

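import nltk

# Placeholder path to the input file with the words to tag
path = "words.txt"

with open(path) as f:
    for line in f:
        # Split the line into tokens, then tag each token
        tokens = nltk.word_tokenize(line)
        print(nltk.pos_tag(tokens))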

You can find a description of each tag here (Figure 5.1).
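Since the question only asks for coarse categories (noun, verb, adverb, and so on), the fine-grained tags can be collapsed by prefix. A minimal sketch, assuming the tagger emits Penn Treebank tags (which nltk.pos_tag does by default for English); the name `coarse_category` and the label strings are made up for illustration:

```python
# Penn Treebank tag prefixes mapped to coarse categories.
# Only the four major open classes are covered; everything else is "other".
COARSE_BY_PREFIX = [
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]

def coarse_category(tag):
    """Collapse a Penn Treebank tag into a coarse part-of-speech label."""
    for prefix, label in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return label
    return "other"

print(coarse_category("NNS"))  # noun
print(coarse_category("VBD"))  # verb
print(coarse_category("DT"))   # other
```

Tags outside these four groups (determiners, prepositions, modals, and so on) all fall into "other" here; extend the table if you need them separately.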

The catch is that it returns only the most likely tag for each word, not every possible tag.
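One possible way around this, if a tagged corpus is at hand, is to count every tag each word was seen with. A minimal sketch on a hand-made sample (the pairs below are illustrative, not real corpus counts); the helper name `tags_per_word` is hypothetical. In practice the pairs could come from a tagged corpus such as NLTK's `nltk.corpus.brown.tagged_words()`, which needs the corpus data downloaded and uses its own tag set:

```python
from collections import Counter, defaultdict

def tags_per_word(tagged_pairs):
    """Map each lowercased word to all tags seen for it, most frequent first."""
    counts = defaultdict(Counter)
    for word, tag in tagged_pairs:
        counts[word.lower()][tag] += 1
    return {w: [t for t, _ in c.most_common()] for w, c in counts.items()}

# Hand-made (word, tag) pairs, for illustration only.
sample = [("run", "VB"), ("run", "NN"), ("Run", "VB"), ("quickly", "RB")]
result = tags_per_word(sample)
print(result["run"])      # ['VB', 'NN']
print(result["quickly"])  # ['RB']
```

The first entry of each list is then the most common type, and the rest are the alternatives the question asks about.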

If I run this, it says: Resource 'taggers/maxent_treebank_pos_tagger/english.pickle' not found. If I start nltk.download() in the Python console, what do I have to download?

–rubo77
Jun 18 '14 at 16:18

I'm afraid I can't answer that, because I downloaded everything when I first started using it.

–优点
Jun 18 '14 at 16:25

How do you download everything? I used apt-get install python-nltk

–rubo77
Jun 18 '14 at 16:25

After entering nltk.download(), select all and click download :). Note, though, that this is overkill for your task.

–优点
Jun 18 '14 at 16:28

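I selected "book" there for now. It took a long time, but it works! The English model is in there. It is stored in my user folder /home/rubo77/nltk_data; can I delete the other folders there?

–rubo77
Jun 18 '14 at 16:30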

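Answer #3

You can use the Apache OpenNLP library:

Free and open source, written in Java
Supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
Includes maximum entropy and perceptron-based machine learning.

POS tagger documentation:

Loading the model: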

// Declared outside the try block so the tagging code further down can use it
POSModel model = null;

// try-with-resources closes the stream automatically
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // Model loading failed, handle the error
  e.printStackTrace();
}

Tagging:

POSTaggerME tagger = new POSTaggerME(model);
String sent[] = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
                             "morning", "and", "afternoon", "newspapers", "."};         
String tags[] = tagger.tag(sent);
double probs[] = tagger.probs(); // confidence scores for each tag
Sequence topSequences[] = tagger.topKSequences(sent); // Some applications need to retrieve the n-best pos tag sequences and not only the best sequence

Answer #4

You can use IBM LanguageWare (Wikipedia), which is free of charge:

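(Registration is required to download it)
Java
Not sure how active the project is; the latest release is dated 2011-10-21.
Note that LanguageWare currently does not provide part-of-speech (POS) disambiguation, so all ambiguities are passed back to the calling application.
Based on UIMA (UIMA = Unstructured Information Management Architecture)
AFAIK it is not usually the first choice in academia, but IBM has made significant contributions to NLP.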

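LanguageWare Resource Workbench is an Eclipse application for building custom language analysis into IBM LanguageWare resources and their associated UIMA annotators. UIMA (see also the Apache UIMA project) is the only industry standard for content analytics and was used by IBM Watson to win the Jeopardy Challenge. UIMA was originally developed by IBM and is now open source.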

A good showcase of IBM LanguageWare: Natural Language Processing and Early-Modern Dirty Data: Applying IBM Languageware to the 1641 Depositions

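Answer #5

You can use TextBlob (open source, MIT license):

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Example: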

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

blob.translate(to="es")  # 'La amenaza titular de The Blob...'

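Features:

Noun phrase extraction
Part-of-speech tagging
Sentiment analysis
Classification (Naive Bayes, Decision Tree)
Language translation and detection powered by Google Translate
Tokenization (splitting text into words and sentences)
Word and phrase frequencies
Parsing
n-grams
Word inflection (pluralization and singularization) and lemmatization
Spelling correction
Add new models or languages through extensions
WordNet integration

Installation: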

pip install -U textblob
python -m textblob.download_corpora
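For a plain list of words like the one in the question, each word can be tagged as its own one-word text. A rough sketch, assuming TextBlob and its corpora are installed as above; `tag_word_list` and its `tag_text` parameter are hypothetical names introduced here so that the helper can be tried with any tagger. Tagging isolated words gives the tagger no context, so ambiguous words may come out differently than they would inside a sentence.

```python
def tag_word_list(words, tag_text=None):
    """Return {word: tag} by tagging each word as its own one-word text."""
    if tag_text is None:
        # Imported lazily so the helper can be exercised without TextBlob.
        from textblob import TextBlob
        tag_text = lambda text: TextBlob(text).tags
    result = {}
    for word in words:
        tags = tag_text(word)
        # The tagger may return nothing for some inputs; record None then.
        result[word] = tags[0][1] if tags else None
    return result

# Usage with TextBlob installed (output depends on the tagger model):
# print(tag_word_list(["house", "quickly", "run"]))
```

Passing a different callable as `tag_text` lets the same loop drive another tagger, as long as it returns a list of (word, tag) pairs.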

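Answer #6

You can use spaCy:

Python
Open source and free for research (GNU Affero General Public License v3); 5k USD per year for production use
Linux / Mac OS X. Windows is not supported.
First released in January 2015

Installation: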

pip install spacy
python -m spacy.en.download

Or:

conda install spacy
python -m spacy.en.download

Demo:

from spacy.parts_of_speech import ADV

def is_adverb(token):
    # token.pos holds the coarse part of speech as an integer ID
    return token.pos == ADV

# These are data-specific, so no constants are provided. You have to look
# up the IDs from the StringStore.
# `nlp` is assumed to be an already loaded English pipeline object.
NNS = nlp.vocab.strings['NNS']
NNPS = nlp.vocab.strings['NNPS']
def is_plural_noun(token):
    return token.tag == NNS or token.tag == NNPS

def print_coarse_pos(token):
    print(token.pos_)

def print_fine_pos(token):
    print(token.tag_)

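Answer #7

You can use the Python package polyglot, a natural language pipeline that supports massive multilingual applications:

Free (GPLv3 license)
Open source

It does part-of-speech tagging: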

import polyglot
from polyglot.text import Text, Word

text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))

Output:

Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT

The POS tagger model is described in: Al-Rfou, Rami, Bryan Perozzi, and Steven Skiena. "Polyglot: Distributed word representations for multilingual NLP." arXiv preprint arXiv:1307.1662 (2013).
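Once the (word, tag) pairs are available, sorting a word list into categories takes only a few lines. A minimal sketch using the pairs copied from the output above; the helper name `group_by_tag` is made up for illustration, and with polyglot installed the list could be replaced by `text.pos_tags`:

```python
from collections import defaultdict

# (word, tag) pairs copied from the output above.
pos_tags = [
    ("O", "DET"), ("primeiro", "ADJ"), ("uso", "NOUN"), ("de", "ADP"),
    ("desobediência", "NOUN"), ("civil", "ADJ"), ("em", "ADP"),
    ("massa", "NOUN"), ("ocorreu", "ADJ"), ("em", "ADP"),
    ("setembro", "NOUN"), ("de", "ADP"), ("1906", "NUM"), (".", "PUNCT"),
]

def group_by_tag(pairs):
    """Return {tag: [words]}, keeping first-seen order and dropping repeats."""
    groups = defaultdict(list)
    for word, tag in pairs:
        if word not in groups[tag]:
            groups[tag].append(word)
    return dict(groups)

groups = group_by_tag(pos_tags)
print(groups["NOUN"])  # ['uso', 'desobediência', 'massa', 'setembro']
print(groups["ADP"])   # ['de', 'em']
```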
