srp*_*srp 0 java regex nlp opennlp
I want to extract the nouns from a sentence and get the original sentence back from the POS tags
// Extract the words tagged _NNP and _NN from the text below. Also, how do I get the original sentence back from the POS tags?
Original sentence: Hi. How are you? This is Mike.
POS tags: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN
I tried something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
String re1 = "((?:[a-z][a-z0-9_]*))"; // Group 1: a word (a letter, then letters, digits, or underscores)
String re2 = ".*?"; // Non-greedy match on filler
String re3 = "(_)"; // Group 2: the literal underscore before the tag
String re4 = "(NNP)"; // Group 3: the tag NNP
Pattern p = Pattern.compile(re1 + re2 + re3 + re4, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
// find() is called only once, so only the first match is reported
if (m.find()) {
    String var1 = m.group(1);
    System.out.print(var1);
}
Output: Hi. But I need a list of all the nouns in the sentence.
To extract the nouns, you can do this:
public static String[] extractNouns(String sentenceWithTags) {
    // Split the string wherever there is a tag that starts with "_NN",
    // followed by zero, one, or two more word characters (like "_NNP", "_NNPS", or "_NNS").
    // Caveat: this assumes the input ends with a noun tag; otherwise the final piece
    // (which holds no noun) is also returned.
    String[] nouns = sentenceWithTags.split("_NN\\w?\\w?\\b");
    // Keep only the last word (the noun) of every string in the array
    for (int index = 0; index < nouns.length; index++) {
        nouns[index] = nouns[index].substring(nouns[index].lastIndexOf(" ") + 1)
                // Remove all non-word characters from the extracted nouns
                .replaceAll("[^\\p{L}\\p{Nd}]", "");
    }
    return nouns;
}
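If the tagged text may not end with a noun, a split-based approach can return a stray last element. As a rough alternative, here is a minimal sketch that loops over `Matcher.find()` and collects only tokens whose tag starts with `_NN`. The class name `NounMatcherSketch` and the method name `extractNounsWithMatcher` are made up for illustration, and the sketch assumes every token has the form `word_TAG` with tokens separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NounMatcherSketch {
    // Group 1: the word part of a token (no whitespace, no underscore),
    // followed by "_NN" plus up to two more word characters (NN, NNS, NNP, NNPS).
    // (?!\S) requires the tag to end at whitespace or at the end of the input.
    private static final Pattern NOUN = Pattern.compile("([^\\s_]+)_NN\\w{0,2}(?!\\S)");

    // Hypothetical helper: returns every noun-tagged word, stripped of punctuation.
    static List<String> extractNounsWithMatcher(String sentenceWithTags) {
        List<String> nouns = new ArrayList<>();
        Matcher m = NOUN.matcher(sentenceWithTags);
        while (m.find()) { // loop so that every match is visited, not just the first
            String noun = m.group(1).replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!noun.isEmpty()) {
                nouns.add(noun);
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        // prints [Hi, Mike]
        System.out.println(extractNounsWithMatcher(
                "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN"));
        // prints [Mike]
        System.out.println(extractNounsWithMatcher("Mike_NNP runs_VBZ"));
    }
}
```

Returning a `List<String>` instead of an array is only a convenience here; call `toArray(new String[0])` on the result if an array is needed.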
To recover the original sentence, you can do this:
public static String extractOriginal(String sentenceWithTags) {
    // Remove every underscore followed by uppercase letters (the tag)
    return sentenceWithTags.replaceAll("_([A-Z]*)\\b", "");
}
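The pattern above only removes tags made of uppercase letters. Tags that contain other characters, such as the Penn Treebank tag `PRP$`, would leave a remnant behind (for example `his_PRP$` would become `his$`). A minimal sketch of a token-based alternative follows; it cuts each whitespace-separated token at its last underscore. The names `TagStripSketch` and `stripTags` are invented for this example, and it assumes the tag is always whatever follows the last underscore of a token. Note that it also collapses runs of whitespace into single spaces.

```java
class TagStripSketch {
    // Hypothetical helper: drops everything from the last underscore of each token onward.
    static String stripTags(String sentenceWithTags) {
        StringBuilder out = new StringBuilder();
        for (String token : sentenceWithTags.trim().split("\\s+")) {
            int cut = token.lastIndexOf('_');
            // Tokens without an underscore are kept unchanged
            String word = cut >= 0 ? token.substring(0, cut) : token;
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints Hi. How are you? This is Mike.
        System.out.println(stripTags(
                "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN"));
        // prints his book
        System.out.println(stripTags("his_PRP$ book_NN"));
    }
}
```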
Proof that it works:
public static void main(String[] args) {
    String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
    System.out.println(java.util.Arrays.toString(extractNouns(sentence)));
    System.out.println(extractOriginal(sentence));
}
Output:
[Hi, Mike]
Hi. How are you? This is Mike.
Note: for the regex that removes all non-word characters (such as punctuation) from the extracted nouns, I used this Stack Overflow question/answer.
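To make that regex concrete: `[^\p{L}\p{Nd}]` matches any character that is neither a Unicode letter nor a decimal digit, so replacing every match with an empty string leaves only letters and digits. A small sketch, with `NonWordCharsSketch` and `lettersAndDigitsOnly` as illustrative names only:

```java
class NonWordCharsSketch {
    // Hypothetical helper: keeps Unicode letters (\p{L}) and decimal digits (\p{Nd}), drops the rest.
    static String lettersAndDigitsOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]", "");
    }

    public static void main(String[] args) {
        System.out.println(lettersAndDigitsOnly("Mike.")); // prints Mike
        System.out.println(lettersAndDigitsOnly("you?"));  // prints you
        System.out.println(lettersAndDigitsOnly("R2-D2")); // prints R2D2
    }
}
```

One consequence worth keeping in mind is that hyphens and apostrophes inside a word are removed as well, so a hyphenated noun comes back without its hyphen.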