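I am running a UIMA application on Apache Spark. UIMA Ruta has to process millions of pages for the computation, but from time to time I hit an out-of-memory exception. Sometimes the exception is thrown after 2,000 pages have been processed successfully, and sometimes the job fails after only 500 pages.
Application log: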
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
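at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) …
I am new to UIMA Ruta. I have written some annotators in the scripting language, and I can run them in the Eclipse IDE. I want to write a Java API that runs the supplied input scripts automatically.
I am using the same example project that is provided in the UIMA documentation.
So far I have been able to do this: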
try {
    File taeDescriptor = null;
    File inputDir = null;
    // Read and validate command line arguments
    boolean validArgs = false;
    if (args.length == 2) {
        taeDescriptor = new File(args[0]);
        inputDir = new File(args[1]);
        validArgs = taeDescriptor.exists()
                && !taeDescriptor.isDirectory()
                && inputDir.isDirectory();
    }
    if (!validArgs) {
        printUsageMessage();
    } else {
        // get Resource Specifier from XML file
        XMLInputSource in = new XMLInputSource(taeDescriptor);
        ResourceSpecifier specifier = UIMAFramework.getXMLParser()
                .parseResourceSpecifier(in);
        // for debugging, output the Resource Specifier
        // System.out.println(specifier); …
I cannot run a UIMA Ruta script in my simple pipeline. I am using the following libraries:
And I am using org.apache.uima.fit.pipeline.SimplePipeline:
SimplePipeline.runPipeline(
UriCollectionReader.getCollectionReaderFromDirectory(filesDirectory), //directory with text files
UriToDocumentTextAnnotator.getDescription(),
StanfordCoreNLPAnnotator.getDescription(),//stanford tokenize, ssplit, pos, lemma, ner, parse, dcoref
AnalysisEngineFactory.createEngineDescription(RUTA_ANALYSIS_ENGINE),//RUTA script
AnalysisEngineFactory.createEngineDescription(//
XWriter.class,
XWriter.PARAM_OUTPUT_DIRECTORY_NAME, outputDirectory,
XWriter.PARAM_FILE_NAMER_CLASS_NAME, ViewURIFileNamer.class.getName())
);
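What I am trying to do is use the StanfordNLP annotator (from ClearTK) and then apply a Ruta script. Currently everything runs without errors and the default Ruta annotations are added to the CAS, but the annotations created by my rules are not added to the CAS.
My script is: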
PACKAGE edu.isistan.carcha.concern;
TYPESYSTEM org.cleartk.ClearTKTypeSystem;
DECLARE persistence;
Token{FEATURE("lemma","storage") -> MARK(persistence)};
Looking at the annotated file:

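The basic Ruta annotations such as "SPACE" or "SW" are there, so the RutaEngine is being created and added to the pipeline...
How do I correctly create an AnalysisEngineDescriptor to run a Ruta script?
Note: RUTA_ANALYSIS_ENGINE is the engine descriptor that I copied from the Ruta Workbench.
Thanks a lot for your support!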
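To make the intent of the rule in the script above concrete, here is a minimal plain-Java sketch of what `Token{FEATURE("lemma","storage") -> MARK(persistence)}` is meant to select: every token whose lemma equals "storage". It does not use UIMA, ClearTK, or Ruta; the `Token` class and the `markByLemma` helper are hypothetical stand-ins invented for this illustration, and the sample tokens are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: mimics the selection made by the Ruta rule, without UIMA. */
final class LemmaRuleSketch {

    /** Hypothetical stand-in for a token annotation carrying a lemma feature. */
    static final class Token {
        final int begin;
        final int end;
        final String lemma;

        Token(int begin, int end, String lemma) {
            this.begin = begin;
            this.end = end;
            this.lemma = lemma;
        }
    }

    /** Returns the [begin, end) offsets of every token whose lemma equals the given value. */
    static List<int[]> markByLemma(List<Token> tokens, String lemma) {
        List<int[]> marked = new ArrayList<>();
        for (Token t : tokens) {
            if (lemma.equals(t.lemma)) {
                marked.add(new int[] {t.begin, t.end});
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        // Made-up tokens for the text "We need storages now"
        String text = "We need storages now";
        List<Token> tokens = Arrays.asList(
                new Token(0, 2, "we"),
                new Token(3, 7, "need"),
                new Token(8, 16, "storage"),
                new Token(17, 20, "now"));
        for (int[] span : markByLemma(tokens, "storage")) {
            System.out.println(text.substring(span[0], span[1]));
        }
    }
}
```

If the rule produces no annotations even though tokens with that lemma exist, one thing worth checking is whether the `Token` type the script resolves is the same type that the upstream annotator actually writes to the CAS.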
I have some text like the following:
aaaaa aaaa aaaaa aaaaaa
bbbbb bbbbb bbbb bbbbbb
cccccc ccccc ccccc cccccc
I want to use Ruta to create annotations that match every string between line breaks. I expect my annotation to produce the following three matches:
1. aaaaa aaaa aaaaa aaaaaa
2. bbbbb bbbbb bbbb bbbbbb
3. cccccc ccccc ccccc cccccc
I tried matching everything between the line breaks as follows:
BREAK #{-> MARK(Stuff)} BREAK;
But no luck. Can anyone suggest something?
Thank you very much!
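For reference, the expected result described above (one span per line of text) can be sketched in plain Java. This is not a Ruta solution and makes no claim about how Ruta handles `BREAK`; it only pins down the target offsets that a working rule should reproduce. The `LineSpans` class and its `Span` type are hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: computes the offsets of each non-empty line in a text. */
final class LineSpans {

    /** A half-open character range [begin, end) in the source text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Splits on '\n' and '\r' and skips empty lines, so "\r\n" yields no empty span. */
    static List<Span> find(String text) {
        List<Span> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (atEnd || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                if (i > start) {
                    spans.add(new Span(start, i));
                }
                start = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "aaaaa aaaa aaaaa aaaaaa\n"
                + "bbbbb bbbbb bbbb bbbbbb\n"
                + "cccccc ccccc ccccc cccccc\n";
        int n = 1;
        for (Span s : find(text)) {
            System.out.println(n++ + ". " + text.substring(s.begin, s.end));
        }
    }
}
```

Note that the first and last lines of a document are bounded by the document start or end rather than by a line break, which is why the sketch treats the end of the text as a boundary too.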
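My use case is this: I have a list of words to match in the WORDLIST "MonthNames.txt".
Now I want to mark all occurrences of these words in a given document, regardless of letter case.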
PACKAGE uima.ruta.example;
WORDLIST MonthNameList = 'MonthNames.txt';
DECLARE MonthNames;
DECLARE MonthNameValue;
// Regex to be used in finding dates
STRING monthNameValueRegex = "(?i)(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|jun|jul|aug|sept|oct|nov|dec)";
// Mark month name
Document{-> MARKFAST(MonthNames, MonthNameList)};
Document{CONTAINS(MonthNames) -> MARK(MonthNameValue)};
Document{REGEXP(monthNameValueRegex) -> MARK(MonthNameValue)};
Is there any way to do this?
I tried:
Document{-> MARKFAST(MonthNames, MonthNameList,true)};
But this only ignores whitespace, not letter case.
Please help.
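As a point of comparison outside Ruta, the case-insensitive behaviour being asked for can be sketched with `java.util.regex`, reusing the alternation from `monthNameValueRegex` in the script above. This is only an illustration of the desired matches, not a statement about `MARKFAST` parameters. The `MonthNameFinder` class is a hypothetical name, and the `\b` word boundaries are an addition of this sketch (an assumption) so that, for example, "may" inside "maybe" is not reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: finds month names in a text regardless of letter case. */
final class MonthNameFinder {

    // Same alternation as monthNameValueRegex; the \b boundaries are an assumption of this sketch.
    private static final Pattern MONTH = Pattern.compile(
            "\\b(january|february|march|april|may|june|july|august|september"
                    + "|october|november|december|jan|feb|mar|apr|jun|jul|aug"
                    + "|sept|oct|nov|dec)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns every matched month name exactly as it is written in the text. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = MONTH.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : find("Due in MARCH, paid on 3 jan; maybe extended to Sept.")) {
            System.out.println(hit);
        }
    }
}
```

The `Pattern.CASE_INSENSITIVE` flag plays the same role as the inline `(?i)` prefix used in the script's regular expression.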