Sou*_*uad 1 java whitespace file removing-whitespace
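I am trying to process text by first removing stop words and applying a stemming algorithm, then splitting the result into individual words and saving them to a file. All of that works. The problem is that the file contains blank entries between the words, like this: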
Hi
teacher
mother
sister
father .... and so on
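The problem is the blank space between teacher and mother. I want to remove it, but I cannot figure out what causes it.
Here is the relevant portion of the code.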
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
    File[] allfiles = new File(filePath).listFiles();
    BufferedReader in = null;
    for (File f : allfiles) {
        if (f.getName().endsWith(".txt")) {
            fileNameList.add(f.getName());
            Reader fstream = new InputStreamReader(new FileInputStream(f), "UTF-8");
            in = new BufferedReader(fstream);
            StringBuilder sb = new StringBuilder();
            String s = null;
            String word = null;
            while ((s = in.readLine()) != null) {
                s = s.trim().replaceAll("[^A-Za-z0-9]", " "); // remove all punctuation for English text
                Scanner input = new Scanner(s);
                while (input.hasNext()) {
                    word = input.next();
                    word = word.trim().toLowerCase();
                    if (stopword.isStopword(word) == true) {
                        word = word.replace(word, "");
                    }
                    String stemmed = stem.stem(word);
                    sb.append(stemmed + "\t");
                }
                //System.out.print(sb);
            }
            String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); // to get individual terms (English)
            for (String term : tokenizedTerms) {
                if (!allTerms.contains(term)) { // avoid duplicate entry
                    allTerms.add(term);
                    System.out.print(term + "\t");
                }
            }
            termsDocsArray.add(tokenizedTerms);
        }
    }
    //System.out.print("file names=" + fileNameList);
}
Please help. Thanks.
Why not use an if to check whether the line is empty?
while ((s = in.readLine()) != null) {
    if (!s.trim().isEmpty()) {
        ...
    }
}
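The check suggested above can be sketched as a small self-contained program. This is only an illustration under stated assumptions: the class name BlankTermFilter, the method extractTerms, and the tiny STOPWORDS set are hypothetical stand-ins for the question's stopword object, and stemming is left out entirely. Besides skipping blank lines, the sketch also skips stop words outright rather than replacing them with an empty string, since an empty string appended to the buffer may be one source of the blank entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BlankTermFilter {

    // Hypothetical stand-in for the question's stopword.isStopword(word).
    private static final Set<String> STOPWORDS = Set.of("the", "and", "is", "my");

    /** Returns the lowercase terms of the given lines, skipping blank lines, stop words and empty tokens. */
    public static List<String> extractTerms(List<String> lines) {
        List<String> terms = new ArrayList<>();
        for (String line : lines) {
            // Replace punctuation with spaces first, then trim, so a punctuation-only line counts as blank.
            String cleaned = line.replaceAll("[^A-Za-z0-9]", " ").trim();
            if (cleaned.isEmpty()) {
                continue; // the check from the answer: ignore empty lines
            }
            for (String token : cleaned.split("\\s+")) {
                String word = token.toLowerCase();
                if (word.isEmpty() || STOPWORDS.contains(word)) {
                    continue; // skip the word instead of appending an empty string
                }
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hi, teacher!", "", "   ", "My mother and the sister", "father.");
        for (String term : extractTerms(lines)) {
            System.out.println(term);
        }
    }
}
```

Running main prints hi, teacher, mother, sister and father, one per line, with no blank entries in between. It requires Java 9 or later because of List.of and Set.of.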