在文本文件中搜索字符串的快速方法

我已编写此代码以在.txt文件中搜索字符串。是否可以优化代码，使其以最快的方式搜索字符串？假设文本文件很大（500MB-1GB）

我不想使用正则表达式。

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;


public class StringFinder {

public static void main(String[] args)
{
    double count = 0,countBuffer=0,countLine=0;
    String lineNumber = "";
    String filePath = "C:\Users\allen\Desktop\TestText.txt";
    BufferedReader br;
    String inputSearch = "are";
    String line = "";

    try {
        br = new BufferedReader(new FileReader(filePath));
        try {
            while((line = br.readLine()) != null)
            {
                countLine++;
                //System.out.println(line);
                String[] words = line.split(" ");

                for (String word : words) {
                  if (word.equals(inputSearch)) {
                    count++;
                    countBuffer++;
                  }
                }

                if(countBuffer > 0)
                {
                    countBuffer = 0;
                    lineNumber += countLine + ",";
                }

            }
            br.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    System.out.println("Times found at--"+count);
    System.out.println("Word found at--"+lineNumber);
}
}

您是否对使用哪种算法进行过研究？例如，当搜索固定的字符串（即没有通配符）时，Knuth-Morris-Pratt的速度明显快于朴素的匹配。

这可能会有所作为，请使用StringBuilderstackoverflow.com/questions/4645020/…

#1 楼

快速付出了代价。...代码复杂性和可读性。

假设您的代码现在可以产生正确的结果....这是一个很大的假设，因为：

期望单词在行的开头/结尾，或者被空格包围（不是逗号，标点符号等）。
它不会在另一个字符串中查找单词，因此将匹配'are'，但不匹配'bare'。

OK，一种更快的方法（将其保持为Java）是执行以下操作：

将搜索字符串（'are'）转换为与文件相同编码的字节数组。
从文件上的文件通道中打开内存映射的字节缓冲区。
扫描ByteBuffer，查找与搜索字节数组匹配的内容。
关闭字节缓冲区。
关闭ByteBuffer

如果文件大于内存，则必须偶尔重新定位字节缓冲区。我建议使用约4MB的Emopry映射大小加上搜索字符串的大小。这样，您可以搜索4MB窗口，然后在下一个4mb边界处开始下一个窗口。

一旦进入该窗口，这将很有意义。

此系统将很快，因为您将不必将文件的数据复制到Java中。一切实际上都会发生在事物的本机方面。

要使它起作用，需要阅读很多东西。

我将从教程开始....

StudyTrails
Yaldix
LinuxTopia

当然，如果您想真正快速，请使用grep。

这里以下是一些示例代码，可以帮助您入门：

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;


public class NIOGrep {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            throw new IllegalArgumentException();
        }
        String grepfor = args[0];
        Path path = Paths.get(args[1]);

        String report = searchFor(grepfor, path);

        System.out.println(report);

    }

    private static final int MAPSIZE = 4 * 1024 ; // 4K - make this * 1024 to 4MB in a real system.

    private static String searchFor(String grepfor, Path path) throws IOException {
        final byte[] tosearch = grepfor.getBytes(StandardCharsets.UTF_8);
        StringBuilder report = new StringBuilder();
        int padding = 1; // need to scan 1 character ahead in case it is a word boundary.
        int linecount = 0;
        int matches = 0;
        boolean inword = false;
        boolean scantolineend = false;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            final long length = channel.size();
            int pos = 0;
            while (pos < length) {
                long remaining = length - pos;
                // int conversion is safe because of a safe MAPSIZE.. Assume a reaosnably sized tosearch.
                int trymap = MAPSIZE + tosearch.length + padding;
                int tomap = (int)Math.min(trymap, remaining);
                // different limits depending on whether we are the last mapped segment.
                int limit = trymap == tomap ? MAPSIZE : (tomap - tosearch.length);
                MappedByteBuffer buffer = channel.map(MapMode.READ_ONLY, pos, tomap);
                System.out.println("Mapped from " + pos + " for " + tomap);
                pos += (trymap == tomap) ? MAPSIZE : tomap;
                for (int i = 0; i < limit; i++) {
                    final byte b = buffer.get(i);
                    if (scantolineend) {
                        if (b == '\n') {
                            scantolineend = false;
                            inword = false;
                            linecount ++;
                        }
                    } else if (b == '\n') {
                        linecount++;
                        inword = false;
                    } else if (b == '\r' || b == ' ') {
                        inword = false;
                    } else if (!inword) {
                        if (wordMatch(buffer, i, tomap, tosearch)) {
                            matches++;
                            i += tosearch.length - 1;
                            if (report.length() > 0) {
                                report.append(", ");
                            }
                            report.append(linecount);
                            scantolineend = true;
                        } else {
                            inword = true;
                        }
                    }
                }
            }
        }
        return "Times found at--" + matches + "\nWord found at--" + report;
    }

    private static boolean wordMatch(MappedByteBuffer buffer, int pos, int tomap, byte[] tosearch) {
        //assume at valid word start.
        for (int i = 0; i < tosearch.length; i++) {
            if (tosearch[i] != buffer.get(pos + i)) {
                return false;
            }
        }
        byte nxt = (pos + tosearch.length) == tomap ? (byte)' ' : buffer.get(pos + tosearch.length); 
        return nxt == ' ' || nxt == '\n' || nxt == '\r';
    }
}

\ $ \ begingroup \ $
使用map-reduce样式算法可以更快地实现此想法，该算法可以同时搜索文件的多个区域。
\ $ \ endgroup \ $
–mauhiz
2014年11月11日下午5:13

\ $ \ begingroup \ $
“该系统将很快，因为您将不必将文件的数据复制到Java中。所有事情实际上都将发生在事物的本机方面。”并非完全正确，您正在使用buffer.get（i）一次将数据复制到Java。
\ $ \ endgroup \ $
–艾米莉·L。
2014年3月11日17:08

#2 楼

如果要提高性能，可以尝试使用其他算法。这就是grep的作用：

GNU grep使用著名的Boyer-Moore算法，该算法首先查找目标字符串的最后一个字母，并使用查找表来告诉它

为什么GNU grep快速（您将在此页面上找到其他聪明的想法）。

它会在输入中跳过多远。 br />
您可以在相应的Wikipedia页面上找到更多详细信息。

#3 楼

如果您希望将“ are”与其周围的空格匹配，只需添加如下空格：“ are”，然后查看该行是否包含该字符串（考虑一些边缘情况）。

        String paddedInput = " " + inputSearch + " ";
        String paddedInputStart = inputSearch + " ";
        String paddedInputEnd = " " +inputSearch ;

        while((line = br.readLine()) != null)
        {
            countLine++;

            if(line.equals(inputSearch) || 
               line.startsWith(paddedInputStart) ||
               line.endsWith(paddedInputEnd) || 
               (line.contains(paddedInput)) {
                 lineNumber += countLine + ",";
            }
        }

首先以最少的费用进行失败的检查。等于将首先检查字符串的长度，因此这是一种快速检查，除非该行到搜索空间的长度相等（不是那么频繁）。功能startsWith和endsWith是快速检查，因为它们不执行搜索。 contains最后一次完成，因为这是最昂贵的。

以上内容避免了单词列表的拆分（可能很慢）和迭代。而是让最有可能在本机代码中实现的字符串API为您完成工作。必须确保在循环之前构造使用的字符串，以避免重复的字符串操作，尽管我不确定Java编译器会对此进行优化，但我不确定。

String.contains()的良好实现将使用Boyer-Moore，但没有必要这样做。 Java并没有规定它是哪种算法。如果您想确定，请参见答案中的链接：https://codereview.stackexchange.com/a/44042/36120

\ $ \ begingroup \ $
是的@Emily你是对的。现在唯一困扰我的是我必须处理所有标点符号。例如：标点如“，。！）'-> /
\ $ \ endgroup \ $
–艾伦·萨维奥（Allen Savio）
2014年11月11日17:38

\ $ \ begingroup \ $
在这种情况下，更容易使用int i = String.indexOf（inputSearch），如果（i！= -1），则查看line.charAt（i-1）和line.charAt（i + inputSearch）。长度）（用于标点符号或空格）（明显检查行的范围）。如果匹配失败，请使用String.indexOf（inputSearch，i + inputSearch.length）搜索下一个重复出现并重复。
\ $ \ endgroup \ $
–艾米莉·L。
2014年11月11日17:53

#4 楼

由于您想找到匹配成功的行号，因此，我将尝试根据您基于BufferedReader.readLine()的当前策略进行改进，并仅在必要时才采用NIO等更奇特的方式。

有两个昂贵的操作是字符串拆分和字符串连接。

将行拆分为单词时，它必须为每个单词分配并复制新的String的字符，以及用于存储结果的数组相反，您可以在行中搜索，然后检查匹配的开始和结尾是否出现在单词边界处。

int len = inputSearch.length();
int countInLine = 0;
int pos = -1;
while ((pos = line.indexOf(inputSearch, pos + 1)) >= 0) {
    if ((pos == 0 || Character.isWhitespace(line.charAt(pos - 1))) &&
        (pos + len == line.length() || Character.isWhitespace(line.charAt(pos + len))) {
        countInLine++;
    }
}

您应该将标点符号视为空格作为单词边界。

另一个低效率是重复的字符串连接。 Java中的字符串是不可变的。每当为字符串a + b和a编写b时，代码实际上都会编译为new StringBuilder(a).append(b).toString()。因此，lineNumber应该改为StringBuilder，以便您可以高效地附加到其后。

请注意，FileNotFoundException是IOException的一种。您可以使用一个catch-block来处理两者。但是，如果出现IOException，您可能不应该尝试报告字数统计，这可能是无效的。为此，您可以从main()中消除所有try-catch块，然后声明main(String[] args) throws IOException。然后，如果发生错误，它将只打印堆栈跟踪并退出。

\ $ \ begingroup \ $
现在我们处理空白。会有大量标点符号要处理。
\ $ \ endgroup \ $
–艾伦·萨维奥（Allen Savio）
2014年11月11日17:42

#5 楼

我不想使用正则表达式。

也许您应该使用。

一个鲜为人知的事实是，Matcher不会占用String作为参数，但CharSequence。并且String实现了该接口。

由于您正在处理大型文本文件，所以，我只为您提供一个库：largetext。它在大型文本文件上实现CharSequence，这意味着LargeText实例可直接与Matcher一起使用：

private static final Pattern PATTERN = ...
private static final LargeTextFactory = LargeTextFactory.defaultFactory();

// in code;

final Path path = Paths.get("...");

try (
    final LargeText largeText = factory.fromPath(...);
) {
    final Matcher m = PATTERN.matcher(largeText);
    // do stuff with m
}

\ $ \ begingroup \ $
我尝试了您的库并使其正常运行-很好！-但问题是仅返回匹配项的索引而不是行号，因此尽管LargeText很酷，但我不确定它是不是非常适合此任务。
\ $ \ endgroup \ $
–james.garriss
15年7月17日在14:38

\ $ \ begingroup \ $
@ james.garriss嗯，对不起，Matcher也有其局限性;）
\ $ \ endgroup \ $
–fge
15年7月17日在15:55

#6 楼

您可以将字符串搜索为字节数组：在https://stackoverflow.com/questions/22234021/search-for-a-string-as-an-byte-in-a-binary-stream/中检查我的public static int search(byte[] input, byte[] searchedFor)版本22236277＃22236277

当然，在进行字节搜索时，您必须捕获所有“换行符”字符并对其进行计数，以便向用户提供找到匹配项的行号。 br />

编程黑洞网