Is there a clean Wikipedia API just for retrieving a content summary?

spa*_*kle 141 api wikipedia wikipedia-api

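I only need to retrieve the first paragraph of a Wikipedia page. The content must be HTML-formatted, ready to be displayed on my website (so no BBCode or Wikipedia-specific markup!).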

Mik*_*das 195

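There is a way to get the entire "intro section" without any HTML parsing! Similar to AnthonyS's answer, but with the additional explaintext parameter, you can get the intro section text as plain text.

Query

Getting Stack Overflow's intro in plain text: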

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

JSON response

(warnings stripped)

{
    "query": {
        "pages": {
            "21721040": {
                "pageid": 21721040,
                "ns": 0,
                "title": "Stack Overflow",
                "extract": "Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open alternative to earlier Q&A sites such as Experts Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.\nIt features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. Users of Stack Overflow can earn reputation points and \"badges\"; for example, a person is awarded 10 reputation points for receiving an \"up\" vote on an answer given to a question, and can receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site or forum. All user-generated content is licensed under a Creative Commons Attribute-ShareAlike license. Questions are closed in order to allow low quality questions to improve. Jeff Atwood stated in 2010 that duplicate questions are not seen as a problem but rather they constitute an advantage if such additional questions drive extra traffic to the site by multiplying relevant keyword hits in search engines.\nAs of April 2014, Stack Overflow has over 2,700,000 registered users and more than 7,100,000 questions. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML."
            }
        }
    }
}

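Documentation: API: query/prop=extracts


Edit: added &redirects=1 as recommended in the comments.

  • It is highly recommended to use **&redirects=1**, which automatically redirects to the content of synonyms (27 upvotes)
  • How do I get the information out of this JSON response if I don't know the page ID? I can't access the JSON object that contains "extract" (5 upvotes)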
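
On the comment about not knowing the page ID: the "pages" object is keyed by page ID, so one option is to ignore the keys and read the values. The following is a minimal JavaScript sketch, not part of the original answer. The helper names buildExtractUrl and getExtracts are made up for illustration, and it assumes the response has the shape shown above.

```javascript
// Hypothetical helpers (the names are illustrative, not part of the MediaWiki API).

// Build the same query as in the answer for a given article title.
function buildExtractUrl(title) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "",      // flag parameter: only the intro section
    explaintext: "",  // flag parameter: plain text instead of HTML
    redirects: "1",
    titles: title,
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// Collect the extracts without knowing the page ID in advance.
function getExtracts(response) {
  const pages = (response.query && response.query.pages) || {};
  return Object.values(pages)
    .filter((page) => typeof page.extract === "string")
    .map((page) => ({ title: page.title, extract: page.extract }));
}

// Usage with a response shaped like the one above (extract shortened here).
const sample = {
  query: {
    pages: {
      "21721040": {
        pageid: 21721040,
        ns: 0,
        title: "Stack Overflow",
        extract: "Stack Overflow is a privately held website...",
      },
    },
  },
};
console.log(getExtracts(sample)[0].title); // prints "Stack Overflow"
```

Pages without an "extract" field (for example a missing title) are simply skipped, so the caller should be prepared for an empty array.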

Ant*_*nyS 75

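There is actually a very nice prop called extracts that can be used with queries and is designed specifically for this purpose. Extracts let you get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text of section 0 (without additional assets such as images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Please note that if you want the first paragraph specifically, you still need to do some additional parsing as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other API queries suggested, because there are no additional assets such as images in the API response to parse.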
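
The "additional parsing" for the first paragraph could look roughly like the sketch below. It is an illustration only: firstParagraph is a hypothetical helper, the sample HTML is invented, and the empty leading paragraph in it is an assumption about what an HTML extract may contain. A regular expression is a fragile way to read HTML, so in a browser a real DOM parser is likely the sturdier choice.

```javascript
// Hypothetical helper: return the first non-empty <p>...</p> block of an HTML
// string, or null if there is none.
function firstParagraph(html) {
  const paragraphPattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let match;
  while ((match = paragraphPattern.exec(html)) !== null) {
    // Skip paragraphs that hold only whitespace or empty nested tags.
    const textOnly = match[1].replace(/<[^>]*>/g, "").trim();
    if (textOnly.length > 0) {
      return match[0];
    }
  }
  return null;
}

// Invented sample input, shaped like an HTML extract.
const html =
  '<p class="mw-empty-elt"></p>\n' +
  "<p><b>Stack Overflow</b> is a question and answer site.</p>\n" +
  "<p>It was created in 2008.</p>";

console.log(firstParagraph(html));
// prints "<p><b>Stack Overflow</b> is a question and answer site.</p>"
```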

  • You're the bomb. (2 upvotes)

lw1*_*.at 52

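Since 2017, Wikipedia has provided a REST API with better caching. In the documentation you can find the following API, which fits your use case perfectly (it is used by the new Page Previews feature).

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow returns the following data, which can be used to display a summary with a small thumbnail: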

{
  "type": "standard",
  "title": "Stack Overflow",
  "displaytitle": "Stack Overflow",
  "extract": "Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.",
  "extract_html": "<p><b>Stack Overflow</b> is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of <i>Coding Horror</i>, Atwood's popular programming blog.</p>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q549037",
  "titles": {
    "canonical": "Stack_Overflow",
    "normalized": "Stack Overflow",
    "display": "Stack Overflow"
  },
  "pageid": 21721040,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/en/thumb/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png/320px-Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 320,
    "height": 149
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/en/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 462,
    "height": 215
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "902900099",
  "tid": "1a9cdbc0-949b-11e9-bf92-7cc0de1b4f72",
  "timestamp": "2019-06-22T03:09:01Z",
  "description": "website hosting questions and answers on a wide range of topics in computer programming",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.wikipedia.org/wiki/Stack_Overflow?action=history",
      "edit": "https://en.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Stack_Overflow"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Stack_Overflow",
      "edit": "https://en.m.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Stack_Overflow"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow",
    "metadata": "https://en.wikipedia.org/api/rest_v1/page/metadata/Stack_Overflow",
    "references": "https://en.wikipedia.org/api/rest_v1/page/references/Stack_Overflow",
    "media": "https://en.wikipedia.org/api/rest_v1/page/media/Stack_Overflow",
    "edit_html": "https://en.wikipedia.org/api/rest_v1/page/html/Stack_Overflow",
    "talk_page_html": "https://en.wikipedia.org/api/rest_v1/page/html/Talk:Stack_Overflow"
  }
}

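By default it follows redirects (so /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false.

If you need to access the API from another domain, you can set the CORS header with &origin= (e.g. &origin=*).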
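
A rough JavaScript sketch of how this endpoint might be used to produce display-ready HTML follows. Only the endpoint path and the field names ("type", "extract_html", "thumbnail") come from the response above; the helper names, the wrapper markup, and the choice to treat every "type" other than "standard" as not displayable are assumptions made for illustration.

```javascript
// Hypothetical helpers; not an official client for the REST API.

function summaryUrl(title, lang = "en") {
  // The path segment uses underscores instead of spaces.
  const path = encodeURIComponent(title.trim().replace(/ /g, "_"));
  return `https://${lang}.wikipedia.org/api/rest_v1/page/summary/${path}`;
}

// Turn a summary response into a small HTML snippet, or null when the page
// is not a regular article (assumption: only "standard" is displayable).
function renderSummary(summary) {
  if (summary.type !== "standard" || !summary.extract_html) {
    return null;
  }
  const thumb = summary.thumbnail
    ? `<img src="${summary.thumbnail.source}" width="${summary.thumbnail.width}" height="${summary.thumbnail.height}" alt="">`
    : "";
  return `<div class="wiki-summary">${thumb}${summary.extract_html}</div>`;
}

// Network call, shown for completeness and not invoked here.
async function fetchSummary(title) {
  const response = await fetch(summaryUrl(title));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return renderSummary(await response.json());
}

console.log(summaryUrl("Stack Overflow"));
// prints "https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow"
```

Returning null for non-standard pages leaves it to the caller to decide what to show in that case, for instance a link to the page instead of a summary.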

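  • You saved my life (2 upvotes)
  • This also includes the "type", which is very useful if you need to know whether what you searched for has a "disambiguation". (2 upvotes)
  • To avoid CORS errors, add `&origin=*` to the query. (2 upvotes)
  • Is it also possible to query by Wikidata ID? I have some JSON data that I extracted, and it looks like `"other_tags" : "\"addr:country\"=>\"CW\",\"historic\"=>\"ruins\",\"name:nl\"=>\"Riffort\",\"wikidata\"=>\"Q4563360\",\"wikipedia\"=>\"nl:Riffort\""`. Can we now get the extract by QID? (2 upvotes)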

Vai*_*urt 39

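This code allows you to retrieve the content of the first paragraph of the page as plain text.

Parts of this answer come from here and thus here. See the MediaWiki API documentation for more information.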

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in json format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'https://en.wikipedia.org/w/api.php?format=json&action=parse&page=Baseball&prop=text&section=0'; // https: this cURL handle does not follow redirects
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // get the main text content of the query (it's parsed HTML)

// pattern for first match of a paragraph
$pattern = '#<p>(.*)</p>#Us'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match($pattern, $content, $matches))
{
    // print $matches[0]; // content of the first paragraph (including wrapping <p> tag)
    print strip_tags($matches[1]); // Content of the first paragraph without the HTML tags.
}

  • Of course, your question states _I only need to retrieve the first paragraph_. (9 upvotes)

svi*_*ick 30

Yes, there is. For example, if you want to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

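The parts mean this:

  • format=xml: Return the result as XML. Other options (like JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.

  • action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.

  • titles=Stack%20Overflow: Get information about the page Stack Overflow. If you separate their names with |, you can get the text of several pages in one request.

  • rvprop=content: Return the content (or text) of the revision.

  • rvsection=0: Return only the content of section 0.

  • rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section, including things like hatnotes ("For other uses ..."), infoboxes or images.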
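
The parameter list above can be assembled programmatically. The sketch below is illustrative only: buildSectionZeroUrl is a made-up helper, it uses https rather than the http URL shown in the answer, and joining several titles with | simply follows the description above. Newer MediaWiki versions may flag rvparse as deprecated, so treat it as a reflection of the answer rather than current guidance.

```javascript
// Hypothetical helper: assemble the revisions query described above
// for one or more page titles.
function buildSectionZeroUrl(titles, format = "xml") {
  const params = new URLSearchParams({
    format: format,            // only the envelope format, not the page content
    action: "query",
    prop: "revisions",
    titles: titles.join("|"),  // several pages are separated by |
    rvprop: "content",
    rvsection: "0",            // section 0 only
    rvparse: "",               // flag parameter: content parsed as HTML
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

console.log(buildSectionZeroUrl(["Stack Overflow"]));
// prints "https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack+Overflow&rvprop=content&rvsection=0&rvparse="
```

URLSearchParams writes the space as "+" where the answer's URL uses "%20"; both are standard encodings of a space in a query string.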

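There are several libraries available for various languages that make working with the API easier; you may be better off using one of them.

  • It's been a while since this answer was posted, but I wanted to let you know that it helped me a lot! Thanks! (4 upvotes)
  • I don't want the content parsed as HTML, I just want "plain text" (and not Wikipedia markup either) (3 upvotes)
  • Adding "&redirects=true" to the end of the link ensures that you reach the destination article, if one exists. (2 upvotes)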

01A*_*key 15

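This is the code I'm using right now for a website I'm making that needs to get the leading paragraphs / summary / section 0 of Wikipedia articles, and it is all done within the browser (client-side JavaScript) thanks to the magic of JSONP! -> http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) in HTML, like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips the HTML and other unwanted data, giving you a clean string with the article summary. If you want, you can, with a little tweaking, get a "p" HTML tag around the leading paragraphs, but right now there is just a newline character between them.

Code: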

var url = "http://en.wikipedia.org/wiki/Stack_Overflow";
var title = url.split("/").slice(4).join("/");

//Get Leading paragraphs (section 0)
$.getJSON("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=" + title + "&prop=text&section=0&callback=?", function (data) {
    for (text in data.parse.text) {
        var text = data.parse.text[text].split("<p>");
        var pText = "";

        for (p in text) {
            //Remove html comment
            text[p] = text[p].split("<!--");
            if (text[p].length > 1) {
                text[p][0] = text[p][0].split(/\r\n|\r|\n/);
                text[p][0] = text[p][0][0];
                text[p][0] += "</p> ";
            }
            text[p] = text[p][0];

            //Construct a string from paragraphs
            if (text[p].indexOf("</p>") == text[p].length - 5) {
                var htmlStrip = text[p].replace(/<(?:.|\n)*?>/gm, '') //Remove HTML
                var splitNewline = htmlStrip.split(/\r\n|\r|\n/); //Split on newlines
                for (newline in splitNewline) {
                    if (splitNewline[newline].substring(0, 11) != "Cite error:") {
                        pText += splitNewline[newline];
                        pText += "\n";
                    }
                }
            }
        }
        pText = pText.substring(0, pText.length - 2); //Remove extra newline
        pText = pText.replace(/\[\d+\]/g, ""); //Remove reference tags (e.x. [1], [4], etc)
        document.getElementById('textarea').value = pText
        document.getElementById('div_text').textContent = pText
    }
});


Ami*_*arg 8

This URL will return the summary in XML format.

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Agra&MaxHits=1

I have created a function to fetch the description of a keyword from Wikipedia.

function getDescription($keyword){
    $url='http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString='.urlencode($keyword).'&MaxHits=1';
    $xml=simplexml_load_file($url);
    return $xml->Result->Description;
}
echo getDescription('agra');


Ruf*_*ock 5

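You can also get content such as the first paragraph via DBPedia, which takes Wikipedia content, creates structured information from it (RDF), and makes it available via an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and is pretty easy to wrap.

As an example, here is a super simple JS library named WikipediaJS that can extract structured content, including a summary first paragraph: http://okfnlabs.org/wikipediajs/

You can read more about it in this blog post: http://okfnlabs.org/blog/2012/09/10/wikipediajs-a-javascript-library-for-accessing-wikipedia-article-information.html

The JS library code can be found here: https://github.com/okfn/wikipediajs/blob/master/wikipedia.js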
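
For readers who would rather query the SPARQL API directly than use the library, a minimal sketch might look like the following. It is not taken from WikipediaJS. The endpoint URL (https://dbpedia.org/sparql), the dbo:abstract property, the "format" parameter, and the helper names are all assumptions here. Note also that an abstract may be longer than the first paragraph.

```javascript
// Hypothetical helpers; endpoint, property, and parameter names are assumptions.

// Build a query URL asking for the abstract of one resource in one language.
// resourceName must be a trusted name with underscores, e.g. "Stack_Overflow".
function buildAbstractQueryUrl(resourceName, lang = "en") {
  const sparql = [
    "PREFIX dbo: <http://dbpedia.org/ontology/>",
    "SELECT ?abstract WHERE {",
    `  <http://dbpedia.org/resource/${resourceName}> dbo:abstract ?abstract .`,
    `  FILTER (lang(?abstract) = "${lang}")`,
    "}",
  ].join("\n");
  const params = new URLSearchParams({
    query: sparql,
    format: "application/sparql-results+json",
  });
  return "https://dbpedia.org/sparql?" + params.toString();
}

// Pull the abstract text out of a SPARQL JSON results document.
function readAbstract(results) {
  const bindings = (results.results && results.results.bindings) || [];
  return bindings.length > 0 ? bindings[0].abstract.value : null;
}

// Invented sample in the SPARQL JSON results layout.
const sampleResults = {
  head: { vars: ["abstract"] },
  results: {
    bindings: [
      { abstract: { type: "literal", "xml:lang": "en", value: "Example abstract text." } },
    ],
  },
};
console.log(readAbstract(sampleResults)); // prints "Example abstract text."
```

The resource name is pasted into the query text as-is, so this sketch is only suitable for names the caller controls, not for raw user input.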