Parsing a file with millions of words

Question

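I have implemented some code that finds the anagrams in a text file, sample.txt, and prints them to the console. The text file contains one String (word) per line.

If I wanted to find the anagrams in a text file containing a million or two billion words, would this be the right approach? If not, which technique should I use in that case?

I appreciate any help.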

Sample

abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi

Output

abac aabc abca
xyz zyx yxz

Code

package org.reader;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {
    // To store the anagram words
    static List<String> match = new ArrayList<String>();
    // Flag to check whether checkWord1InMatch() was invoked.
    static boolean flagCheckWord1InMatch;

    public static void main(String[] args) {
        String fileName = "G:\\test\\sample2.txt";
        StringBuilder sb = new StringBuilder();
        // In case of matching, this flag is used to append the first word to
        // the StringBuilder once.
        boolean flag = true;

        BufferedReader br = null;
        try {
            // convert the data in the sample.txt file to list
            List<String> list = Files.readAllLines(Paths.get(fileName));

            for (int i = 0; i < list.size(); i++) {

                flagCheckWord1InMatch = true;
                String word1 = list.get(i);

                for (int j = i + 1; j < list.size(); j++) {

                    String word2 = list.get(j);

                    boolean isExist = false;

                    if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
                        isExist = checkWord1InMatch(word1);

                    }

                    if (isExist) {
                        // A word with the same characters was checked before
                        // and there is no need to check it again. Therefore, we
                        // jump to the next word in the list.
                        // flagCheckWord1InMatch = true;
                        break;
                    } else {
                        boolean result = isAnagram(word1, word2);
                        if (result) {

                            if (flag) {
                                sb.append(word1 + " ");
                                flag = false;
                            }

                            sb.append(word2 + " ");

                        }
                        if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
                            match.add(sb.toString().trim());
                            sb.setLength(0);
                            flag = true;

                        }

                    }

                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }

        for (String item : match) {
            System.out.println(item);
        }

        // System.out.println("Sihwail");

    }

    private static boolean checkWord1InMatch(String word1) {
        flagCheckWord1InMatch = false;
        boolean isAvailable = false;
        for (String item : match) {
            String[] content = item.split(" ");
            for (String word : content) {
                if (word1.equals(word)) {
                    isAvailable = true;
                    break;

                }
            }
        }
        return isAvailable;
    }

    public static boolean isAnagram(String firstWord, String secondWord) {
        char[] word1 = firstWord.toCharArray();
        char[] word2 = secondWord.toCharArray();
        Arrays.sort(word1);
        Arrays.sort(word2);
        return Arrays.equals(word1, word2);
    }

}

Answer 1

MrS*_*h42 6

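With 20 billion words you will not be able to hold all of them in RAM, so you need a different way to process them.

That is 20,000,000,000 words. Java needs quite a lot of memory to store a String, so you can count on 2 bytes per character plus at least 38 bytes of overhead per String.

This means that 20,000,000,000 one-character words would need 800,000,000,000 bytes, or 800 GB, which is more RAM than any computer I know of has.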
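The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the class name MemoryEstimate and the method estimateBytes are made up for illustration, and the 2 bytes per character and 38 bytes of overhead are the rough figures from this answer, not values measured on a particular JVM.

```java
public class MemoryEstimate {
    // Rough figures taken from the answer; real numbers depend on the JVM.
    static final long BYTES_PER_CHAR = 2;
    static final long OVERHEAD_PER_STRING = 38;

    // Estimated heap bytes needed to hold wordCount Strings of charsPerWord characters each.
    static long estimateBytes(long wordCount, int charsPerWord) {
        return wordCount * (charsPerWord * BYTES_PER_CHAR + OVERHEAD_PER_STRING);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(20_000_000_000L, 1);
        // Prints: 800000000000 bytes = 800 GB
        System.out.println(bytes + " bytes = " + bytes / 1_000_000_000L + " GB");
    }
}
```

Longer words only make this worse, since every extra character adds another 2 bytes per word under the same assumptions.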

Your file will contain fewer than 20,000,000,000 distinct words, so if you store each word only once (for example, in a Set), you can avoid the memory problem.
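One way this could look is sketched below. It goes a step beyond what the answer states: besides storing each distinct word once in a Set, it groups words under a key made of their sorted letters, so each word is read once instead of being compared against every other word. The names AnagramGroups, signature and findGroups are hypothetical, and the sketch assumes that all distinct words fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

public class AnagramGroups {
    // Anagrams share the same letters, so their sorted letters form a common key.
    static String signature(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Consumes the words one at a time and keeps each distinct word only once.
    static List<String> findGroups(Stream<String> lines) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        lines.map(String::trim)
             .filter(word -> !word.isEmpty())
             .forEachOrdered(word -> groups
                     .computeIfAbsent(signature(word), key -> new LinkedHashSet<>())
                     .add(word));

        // Only keys with more than one distinct word are anagram groups.
        List<String> result = new ArrayList<>();
        for (Set<String> group : groups.values()) {
            if (group.size() > 1) {
                result.add(String.join(" ", group));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The sample words from the question, supplied in memory for the demo.
        Stream<String> sample = Stream.of("abac", "aabc", "hddgfs", "fjhfhr", "abca",
                "rtup", "iptu", "xyz", "oifj", "zyx", "toeiut", "yxz", "jrgtoi");
        findGroups(sample).forEach(System.out::println);
    }
}
```

For a real file, the stream could come from Files.lines(Paths.get(fileName)), which reads lines lazily rather than loading the whole file the way Files.readAllLines does. Note that, unlike the code in the question, repeated occurrences of the same word collapse into a single entry here.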

Archived:	9 years, 8 months ago
Viewed:	836 times
Last activity:	9 years, 4 months ago