>>> import boilerpipe
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\lib\site-packages\boilerpipe\__init__.py", line 10, in <module>
    jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % os.pathsep.join(jars))
  File "C:\Anaconda\lib\site-packages\jpype\_core.py", line 50, in startJVM
    _jpype.startup(jvm, tuple(args), True)
RuntimeError: Unable to load DLL [C:\Program Files\Java\jre7\bin\client\jvm.dll], error = The specified module could not be found.
 at native\common\include\jp_platform_win32.h:58
Tried: reinstalling the JVM.
>>> import ctypes
>>> import os
>>> os.chdir(r"<path to Java bin client folder>")
>>> ctypes.CDLL("jvm.dll")
Still unable to fix it.
Edit: I tried the code below and I am still stuck:
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
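It gives the same error as before.
I am trying to run boilerpipe with Python multiprocessing. I am doing this to parse RSS feeds from multiple sources. The problem is that it hangs in one of the threads after processing some links. If I remove the pool and run it in a loop, the whole flow works.
Here is my multiprocessing code: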
proc_pool = Pool(processes=4)
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
Here is the code where I call boilerpipe inside process_link_for_feeds():
def parse_using_bp(in_url):
    extracted_html = ""
    if ContentParser.url_skip_p.match(in_url):
        return extracted_html
    try:
        extractor = Extractor(extractor='ArticleExtractor', url=in_url)
        extracted_html = extractor.getHTML()
        del extractor
    except BaseException as e:
        print "Something's wrong at Boilerpipe -->", in_url, "-->", e
        extracted_html = ""
    finally:
        return extracted_html
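I have no idea why it hangs. Is there something wrong with the proc_pool code?
Does anyone know of a .NET port of the boilerpipe library?
I am using boilerpipe and it looks great, but I want to output JSON. I am using the Java version and testing it in NetBeans as follows: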
final URL url = new URL("http://mashable.com/2012/09/26/worlds-best-father-kickstarter-calendar");
System.out.println(ArticleExtractor.INSTANCE.getText(url));
Can anyone tell me how I can do this?
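One hedged way to approach the JSON part, assuming the goal is simply to wrap the extracted string: the text that getText returns can be placed into a JSON object by hand or with any JSON library. The sketch below uses only the JDK. The class JsonOutputSketch, the helpers escapeJson and toJson, the field names url and text, and the sample string are all made up for illustration; the sample string stands in for the result of ArticleExtractor.INSTANCE.getText(url).

```java
public class JsonOutputSketch {
    // Hypothetical helper: escape a string so it can sit inside a JSON string literal.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become four-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: build a two-field JSON object from a URL and its extracted text.
    static String toJson(String url, String text) {
        return "{\"url\":\"" + escapeJson(url) + "\",\"text\":\"" + escapeJson(text) + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for: String text = ArticleExtractor.INSTANCE.getText(url);
        String text = "He said \"hi\".\nSecond line.";
        System.out.println(toJson("http://example.com/article", text));
    }
}
```

In a real project a JSON library would normally handle the escaping; the manual version is shown only to keep the sketch free of extra dependencies.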
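I just want to know how to use Tika to extract the main content and the plain text from HTML.
Maybe one possible solution is to use BoilerPipeContentHandler, but do you have some sample/demo code to show it?
Thanks very much in advance.
I want to host my own version of the boilerpipe web API (http://code.google.com/p/boilerpipe/). The appspot site is http://boilerpipe-web.appspot.com/
I want to host it myself. Can someone tell me how to create a web page using the Boilerpipe JAR?
I have an HTML file on my local disk and want to extract the text from it using BoilerPipe.
The "getText" method of the ExtractorBase class accepts a Reader, so I wrote: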
FileReader fr = new FileReader("D:/myHTMLfile");
System.out.println(ArticleExtractor.INSTANCE.getText(fr));
But then I get an error pointing to the second line of code.
Any clues? Thanks!
Edit: The whole error message is:
Exception in thread "pool-1-thread-1" java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLConfiguration
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:50)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:41)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:101)
at neuromarket.BoilerPlateExtractor.run(BoilerPlateExtractor.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException: org.cyberneko.html.HTMLConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 9 more
Exception in thread "pool-1-thread-2" java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLConfiguration
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:50)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:41)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:101)
at neuromarket.BoilerPlateExtractor.run(BoilerPlateExtractor.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
BUILD …
I am very new to boilerpipe, and I am trying the following basic code:
package contentExtraction;
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class ContentExtractor {
    public static void main(String[] args) throws Exception {
        final URL url = new URL(
            // "http://www.l3s.de/web/page11g.do?sp=page11g&link=ln104g&stu1g.LanguageISOCtxParam=en"
            "http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik"
        );
        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }
}
But I get the following error when trying to run the code above:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xerces/parsers/AbstractSAXParser
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
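at …
This is the third time I have installed it. I got it working on Windows, and until a few days ago it was working on Linux. I have done everything I can, and I do not understand how to run this Java program.
The source is a folder containing lib, src, some jars, and a classpath file and a project file. The classpath file makes declarations such as classpathentry = src/main and path = lib, path = src.
All of that makes sense. There is a 'main' folder inside 'src'.
The small file I am trying to run begins with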
import de.l3s.boilerpipe.demo
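I am trying to run 'Oneliner.java'. I cannot compile it.
No matter what or where the class file is, I cannot run it. It results in a NoClassDefFound error. I have run it in main, src, root, demo, ... I have tried compiling it in different directories and running it with the various recommended java command-line switches. Supposedly you can make it "search" for files, which I have yet to experience. The sheer obstinacy of this Java environment is appalling, and hugely frustrating to me.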
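A NoClassDefFoundError generally means that a class was not found on the classpath at run time. As a minimal sketch (the class name ClasspathCheck and the helper isLoadable are made up for illustration), the following prints the classpath the JVM actually sees and checks whether a few classes can be located. The names probed are ArticleExtractor plus the two classes reported missing in the stack traces quoted above.

```java
public class ClasspathCheck {
    // Returns true if the named class can be located by this class's class loader.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The effective classpath of the running JVM.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));

        String[] names = {
            "de.l3s.boilerpipe.extractors.ArticleExtractor",
            "org.cyberneko.html.HTMLConfiguration",
            "org.apache.xerces.parsers.AbstractSAXParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same -cp value as the failing program. Any class reported as MISSING points at a jar that still has to be added to the classpath.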
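I am fairly excited about this utility, but I have run into some implementation issues. I installed it, but I get no result when executing the HTML file on the server. There is a syntax error on the URL code.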
<script LANGUAGE="JavaScript" SRC="boilerpipe-1.1.0.jar">
</script>
<script type="text/javascript">
URL url = new URL("http://www.mywebsite.com");
var text = ArticleExtractor.INSTANCE.getText(url);
document.write(text);
</script>
Edit =====> This code seems to work.
<?php
$html = file_get_contents("http://www.google.com");
?>
<script language="JavaScript" src="boilerpipe-1.1.0.jar"></script>
<script language="javascript" type="text/javascript">
var sStr = "<?php echo $html?>";
var text = ArticleExtractor.INSTANCE.getText(sStr);
document.write(text);
?>