"Once upon a time"_编程黑洞网

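I'm developing a small program that performs various tasks on text content, mostly at the word level. I wrote these methods to help turn a raw text file into a more malleable form, such as a List<String>, on which I can later run various routines such as counting and sorting words.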

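Concerns:

In splitTextStringIntoWordList, I find myself having to split the textString parameter into a String[] array, and then immediately add the array's elements one at a time, parsed with a regular expression, to a List<String>.
Are some of my methods simply doing too much between the parameters and the return?
Is the JavaDoc clear, concise, and descriptive?
Am I making any novice mistakes?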

import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.Collections;
import java.util.regex.*;
import java.net.URL;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

public class TextFileWordSplitter {

    /**
     * Remove punctuation marks from a String using regular expressions.
     * <p>
     *     This will account for contractions such as "it's" and "can't", as well as
     *     hyphenated words such as "first-class" and "low-budget", which in both cases
     *     will be considered as whole words.
     * </p>
     * @param input  The String from which to remove punctuation
     * @return  The String with the punctuation removed, or empty String
     */
    static String removePunctuationFromString(String input) {

        Pattern regex = Pattern.compile("([A-Za-z]?[\\-']?[A-Za-z])+");
        Matcher matcher = regex.matcher(input);
        if (matcher.find()) {
            return matcher.group();
        } else {
            return "";
        }
    }

    /**
     * Create a String by fetching a text file at the provided URL.
     * @param url  The URL where the text file is located.
     * @return  The content of the text file, or null
     * @throws IOException
     */
    static String readUrlTextContent(String url) throws IOException {

        URL source = new URL(url);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(source.openStream()))) {
            StringBuilder builder = new StringBuilder();
            String line = reader.readLine();

            while (line != null) {
                builder.append(line);
                builder.append("\n");
                line = reader.readLine();
            }
            return builder.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Split a String into an ArrayList of individual words as separate elements.
     * <p>
     *     The words are all converted to uppercase, such that "Hello", "hello" and "HELLO"
     *     will all become the same word string, "HELLO".
     * </p>
     * @param textString  The String which is intended to be split into a list of words
     * @return  An ArrayList containing one word per element, or null
     */
    static List<String> splitTextStringIntoWordList (String textString) {

        try {
            String allWhiteSpace = "\\s+";
            String[] splitText = textString.toUpperCase().split(allWhiteSpace);
            List<String> wordList = new ArrayList();
            for (String word : splitText) {
                Collections.addAll(wordList, removePunctuationFromString(word));
            }
            wordList.removeAll(Arrays.asList("", null));
            return wordList;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) {

        String textString = null;
        try {
            textString = readUrlTextContent("http://textfiles.com/stories/antcrick.txt");
        } catch (IOException e) {
            e.printStackTrace();
        }

        List<String> wordList = splitTextStringIntoWordList(textString);

        /* Print each word along with its index */
        int wordIndex = 0;
        for (String word : wordList) {
            System.out.println("[" + wordIndex++ + "] " + word);
        }
    }
}

The program's output, using The Ant and the Cricket as the source, is as follows:

[0] THE
[1] ANT
[2] AND
[3] THE
[4] CRICKET
[5] ONCE
[6] UPON
[7] A
[8] TIME
...
[368] WELL
[369] TRY
[370] DANCING
[371] NOW

Full output on PasteBin

Answer #1

Problems

I'm not sure that TextFileWordSplitter is a good name for the class, especially since the text source is typically a network resource rather than a java.io.File.

List<String> wordList = new ArrayList(); should be List<String> wordList = new ArrayList<>(); to suppress a compiler warning.

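The readUrlTextContent() function declares that it throws IOException, but in practice it catches every IOException raised while reading and returns null. (The only IOException that can actually be thrown is a MalformedURLException.) You should just let all IOExceptions propagate naturally.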
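As a minimal sketch of what letting the exception propagate could look like, the helper below reads all text from a Reader and declares throws IOException instead of catching it. The class name PropagatingReader and the method readAll are hypothetical, and a Reader stands in for the URL stream so that the example runs without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PropagatingReader {
    /**
     * Returns the full text from the reader, with "\n" after each line.
     * Any IOException propagates to the caller; null is never returned.
     */
    public static String readAll(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            StringBuilder builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append('\n');
            }
            return builder.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // A StringReader is used here only as a stand-in for a network stream.
        System.out.print(readAll(new StringReader("once upon\na time")));
    }
}
```

With this shape the caller decides how to handle a failure, and no null check is needed on the result.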

下定决心哪些函数是公用的，哪些是私有的。默认访问权限几乎永远不是一个好选择。

Nitpicks

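Your regular expression doesn't need the capturing parentheses. It also doesn't need a backslash to quote the hyphen within the character class, since a hyphen is taken literally if it is the first or last character in a character class.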
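The two nitpicks above can be checked with a small sketch. The class name HyphenInCharClass and its helper methods are hypothetical; the patterns are the question's character class with and without the backslash, plus the question's word pattern rewritten with a non-capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenInCharClass {
    // Escaped hyphen, as in the question (the backslash is doubled in Java source).
    private static final Pattern ESCAPED = Pattern.compile("[\\-']");
    // Unescaped hyphen placed first in the class, where it is taken literally.
    private static final Pattern UNESCAPED = Pattern.compile("[-']");
    // The question's word pattern with a non-capturing group and no backslash.
    private static final Pattern WORD = Pattern.compile("(?:[A-Za-z]?[-']?[A-Za-z])+");

    /** Returns true if both character classes give the same verdict for s. */
    public static boolean agree(String s) {
        return ESCAPED.matcher(s).matches() == UNESCAPED.matcher(s).matches();
    }

    /** Returns the first word found in s, or an empty string if there is none. */
    public static String firstWord(String s) {
        Matcher m = WORD.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"-", "'", "a"}) {
            System.out.println(s + ": both classes agree = " + agree(s));
        }
        System.out.println(firstWord("\"first-class,\""));
    }
}
```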

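JavaDoc should avoid documenting implementation details such as "using regular expressions". That is irrelevant to the caller, unless you also want to document exactly which regular expression is used. JavaDoc should be written in the third-person indicative rather than the imperative.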

Decomposition

I like that you decomposed the work into functions, but I would decompose it differently.

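Hiding the .toUpperCase() call inside splitTextStringIntoWordList() is surprising. What does uppercase conversion have to do with splitting? I would move it into the removePunctuationFromString() function, and rename that function to normalizeWord().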

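readUrlTextContent() is potentially harmful. This task can be done without random access to the stream, so you don't need to buffer the entire text into a string. Just let the BufferedReader do its job: it buffers enough for good performance and discards the parts of the text that have already been processed.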

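Instead of calling String.split(), I suggest using a Scanner, which is a convenient way to fetch one word at a time. I would define two versions of a words() function: one that accepts a Scanner, and another that accepts a URL.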

import java.io.*;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    private static final Pattern WORD_PATTERN =
        Pattern.compile("[A-Za-z]*[-']?[A-Za-z]+");

    /**
     * Extracts an alphabetic word, possibly containing up to one hyphen or
     * apostrophe, and returns it in uppercase.
     *
     * @return The extracted word, or an empty string if the input is all
     *         punctuation.
     */
    private static String normalizeWord(String s) {
        Matcher m = WORD_PATTERN.matcher(s);
        return m.find() ? m.group().toUpperCase() : "";
    }

    public static List<String> words(URL url) throws IOException {
        try ( InputStream is = url.openStream();
              BufferedInputStream bis = new BufferedInputStream(is);
              Scanner scanner = new Scanner(bis) ) {
            return words(scanner);
        }
    }

    public static List<String> words(Scanner scanner) {
        List<String> results = new ArrayList<>();
        scanner.reset();
        while (scanner.hasNext()) {
            String word = normalizeWord(scanner.next());
            if (!word.isEmpty()) {
                results.add(word);
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://textfiles.com/stories/antcrick.txt");
        int i = 0;
        for (String word : words(url)) {
            System.out.printf("[%d]: %s\n", i++, word);
        }
    }
}

Answer #2

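In splitTextStringIntoWordList, I find myself having to split the textString parameter into a String[] array, and then immediately add the array's elements one at a time, parsed with a regex, to a List<String>. Is there a better way that might not require so much manipulation?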

In Java 8...

List<String> result = Pattern.compile(allWhitespace).splitAsStream(textString.toUpperCase())
                                .map(TextFileWordSplitter::removePunctuationFromString)
                                .collect(Collectors.toList());

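I'm also not too sure about the need for the try { } catch (Exception e) { } statement... Usually, one should check for a more specific type of Exception rather than the all-encompassing Exception. If you can't identify an Exception to check for, it's better to remove the try-catch statement, unless there are specific runtime Exceptions that need special handling (e.g. prompting the user when a NullPointerException occurs, and the like).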

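Building on the Java 8 suggestion, a stream-oriented way of reading from a BufferedReader is its lines() method (edit: as pointed out by @200_success and @Boris the Spider, a better name and method parameter can be used here, so I have updated this to take a URL argument):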

private static List<String> parseContent(URL url) {
    try (Stream<String> lines = new BufferedReader(
                    new InputStreamReader(url.openStream())).lines()) {
        return lines.flatMap(Pattern.compile(allWhitespace)::splitAsStream)
                .map(TextFileWordSplitter::removePunctuationFromString)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

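Here, flatMap() is used to "convert" each stream element (i.e. each line) into a separate stream of words and join those streams together. All the words are then mapped with your TextFileWordSplitter::removePunctuationFromString method reference.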
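To see the flatMap() step in isolation, here is a small self-contained sketch. The class name FlatMapWords is hypothetical, a StringReader stands in for the URL stream so that it runs offline, and the filter() call is an addition that drops the empty strings that blank lines and leading whitespace would otherwise produce.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FlatMapWords {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Splits every line of the reader on whitespace and collects the words. */
    public static List<String> words(BufferedReader reader) {
        try (Stream<String> lines = reader.lines()) {
            return lines.flatMap(WHITESPACE::splitAsStream) // one stream of words per line
                    .filter(word -> !word.isEmpty())        // drop empty tokens
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        BufferedReader reader =
                new BufferedReader(new StringReader("once upon\n\n  a time"));
        System.out.println(words(reader));
    }
}
```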

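A function with the signature parseContent(String) is likely to be mistakenly called with the text content rather than a URL as the argument. It would be better to have it accept a URL instead of a String, so that this mistake cannot happen.
– 200_success
Sep 3, 2015 at 10:03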

@200_success Sure, I've edited my answer to include that tidbit. :)
– h.j.k.
Sep 3, 2015 at 10:10

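@h.j.k. I think what 200_success is saying is that the method should accept a URL rather than a String to avoid confusion. In general, it's a good idea to use method parameter types to document the required input.
– Boris the Spider
Sep 4, 2015 at 9:00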

@BoristheSpider Gotcha. :)
– h.j.k.
Sep 4, 2015 at 9:04

Answer #3

Have you heard of Turkey?

Try inserting the following line at the top of main:

java.util.Locale.setDefault(java.util.Locale.forLanguageTag("tr-TR"));

Now your output will look a bit different:

[0] THE
[1] ANT
[2] AND
[3] THE
[4] CRICKET
[5] ONCE
[6] UPON
[7] A
[8] T
...
[351] WELL
[352] TRY
[353] DANC
[354] NOW

In Turkish, "i".toUpperCase() is İ (U+0130 "LATIN CAPITAL LETTER I WITH DOT ABOVE").

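Whenever you manipulate strings for any purpose other than display, you should always specify a Locale. Better still, never use toUpper or toLower at all; instead normalize first, and then use case folding or case-insensitive comparison where needed.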
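A minimal sketch of the difference follows. The class name LocaleSafeUpperCase and the helper upper() are hypothetical, and Locale.ROOT is one reasonable choice of locale-neutral rules, not the only one.

```java
import java.util.Locale;

class LocaleSafeUpperCase {
    /** Uppercases with locale-neutral rules, independent of the default Locale. */
    public static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules the i becomes U+0130, which [A-Za-z] does not match.
        System.out.println("time".toUpperCase(turkish));
        // Locale.ROOT gives the plain ASCII capital I.
        System.out.println(upper("time"));
        // A case-insensitive comparison avoids converting case at all.
        System.out.println("TIME".equalsIgnoreCase("time"));
    }
}
```

This explains the truncated words in the output above: once the I is replaced by U+0130, the ASCII-only pattern stops matching at that character.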

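In your regular expression, you likewise use [A-Za-z], which doesn't handle Unicode letters. You can use [\p{L}] to match all letters. For other ways of handling Unicode in Java regular expressions, see the Unicode support section of the Pattern javadoc.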
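As a hedged illustration, the sketch below compares an ASCII-only word pattern with a \p{L} variant. The class name UnicodeWordPattern and its helper are hypothetical, and the pattern shape is borrowed from the WORD_PATTERN in answer #1, not from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordPattern {
    // ASCII-only letters, in the same shape as WORD_PATTERN from answer #1.
    private static final Pattern ASCII_WORD =
            Pattern.compile("[A-Za-z]*[-']?[A-Za-z]+");
    // The same shape, with \p{L} matching any Unicode letter.
    private static final Pattern UNICODE_WORD =
            Pattern.compile("\\p{L}*[-']?\\p{L}+");

    /** Returns the first match of the chosen pattern in s, or "" if none. */
    public static String firstWord(String s, boolean unicode) {
        Matcher m = (unicode ? UNICODE_WORD : ASCII_WORD).matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "(caf\u00e9)"; // "café" in parentheses, e written as U+00E9
        System.out.println(firstWord(input, false)); // stops before the accented letter
        System.out.println(firstWord(input, true));  // keeps the whole word
    }
}
```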

Answer #4

A few things:

    if (matcher.find()) {
        return matcher.group();
    } else {
        return "";
    }

You could swap that for a ternary:

return matcher.find() ? matcher.group() : "";

You could consider adding the case-insensitive i flag to the pattern, to avoid blocks like a-zA-Z:

Pattern.compile("([A-Za-z]?[\\-']?[A-Za-z])+");

With the i flag (either Pattern.CASE_INSENSITIVE or (?i)), this can be improved to:

([a-z]?['-]?[a-z])+

See the regex101.com link for an example.
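In Java, the two ways of setting the flag could look like the sketch below. The class name CaseInsensitiveWord and its helper are hypothetical; both patterns are the simplified expression above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveWord {
    // The flag passed as a compile() argument.
    private static final Pattern WITH_FLAG =
            Pattern.compile("([a-z]?['-]?[a-z])+", Pattern.CASE_INSENSITIVE);
    // The same flag embedded in the expression itself.
    private static final Pattern INLINE =
            Pattern.compile("(?i)([a-z]?['-]?[a-z])+");

    /** Returns the first word in s, using either the flag or the inline form. */
    public static String firstWord(String s, boolean inline) {
        Matcher m = (inline ? INLINE : WITH_FLAG).matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstWord("Can't!", false));
        System.out.println(firstWord("LOW-budget,", true));
    }
}
```

By default the flag only applies to US-ASCII characters, which is all this pattern needs.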

From the Javadoc on the case-insensitive flag: "Specifying this flag may impose a slight performance penalty." (Just a note, for future reference.)
– h.j.k.
Sep 3, 2015 at 10:44

How about: regex101.com/r/tR1oN4/2? (It finishes in 1179 steps versus 2895 steps.)
– Ismael Miguel
Sep 3, 2015 at 11:39

Post that as your own answer.
– Quill
Sep 3, 2015 at 12:19
