For*_*vin 5 javascript regex facebook app-secret cheerio
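To be clear up front: I don't have a Facebook account and have no intention of creating one. Also, what I'm trying to do is perfectly legal in my country and in the USA.
Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a GET request directly to the page URL (e.g. this page) and extract the posts from the HTML source.
(I want to get each post's text and creation time.)
When I run this in the web console: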
document.getElementsByClassName('userContent')
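I get a list of elements containing the text of the latest posts.
But I want to extract that information from a Node.js script. I could probably do this easily with a headless browser such as puppeteer, but that would add a lot of unnecessary overhead. I'd really like a simple approach: download the HTML, pass it to Cheerio, and use Cheerio's jQuery-like API to extract the posts.
Here is my attempt: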
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then(postsHtml => {
  const $ = cheerio.load(postsHtml);
  const timeLinePostEls = $('.userContent');
  console.log(timeLinePostEls.html()); // should NOT be null
  // .first() keeps the Cheerio wrapper; .get(0) would return a raw DOM node without .html()/.text()
  const newestPostEl = timeLinePostEls.first();
  console.log(newestPostEl.html()); // should NOT be null
  const newestPostText = newestPostEl.text();
  console.log(newestPostText);
  //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
  //console.log(newestPostTime);
}).catch(console.error);
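Unfortunately, $('.userContent') doesn't work. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML.
But I can't come up with a good regex (or similar) approach to extract it.
The number of HTML tags inside a post varies widely depending on its content.
Here is a simple example of a post containing one link: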
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}"><p>We're proud to be named one of Built In NYC's Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&h=AT29h2HyDsEk0rHRWqJA-Fa4M1qi3nJT1NBi95othaR3qeFuFAMNiVS2Dgtv5KR5m0xqjw6kfwZdhZt0_D3UQT1Oel2UhxRql-KwkA1xqWvrql4u1jDhzrkGVT_XxoUd8_w8_fzLZzzhz23a8yPCK6IPfWKB76_CEFjG3b78y4dFJvY9Z08AYlR01dmi5_FvWVEVytkN-123u6alYE8pqL6Jb6dtIQUTWGXYJPaNMrtxkCUZniEVXEcILkwHGSuHqCTAarboyMP55F1vhYO3OAiVMkvjbN274fVq92YvbK3bi90bU9T-5ADWHDUJ-CwcofSBTW47chstQeY0n_UluD_rBIPLsfXVSnCtpRkR2kXi9zzHLnNeIYeNssv3i7UKS_f5Z2pnVT6xe3zJbNpB68doH1Z__I9nsTCNIyFyKx2VxabecoL03DIawbRrzBoxLAwzNPLACBjTkpEQhdVn4_wdAIjXRL4cLQDcZkLEoG_sspBgRePH23TFbNufQOBly-FNtLHnkUDO2Ca-FYvAGXpcu6J4B1aH3XFPB803lsz-GRdACyOFOgXDXJfwr4WtWzUHxfiOPULWiI43yI5L4aU6wYRhPjxua3RuRZ8oj9fXa1w4Jrht94Ue2wfKtz8" target="_blank" data-ft="{&quot;tn&quot;:&quot;-U&quot;}" rel="noopener nofollow" data-lynx-mode="async">http://*******/2H3Kbr2</a></p></div>
Formatted more readably, it looks roughly like this:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}">
  <p>
    We're proud to be named one of Built In NYC's Best Places to Work in
    2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for
    Best Perks and Benefits. See what it took to make the list and check out our
    profile to see some of our job openings.
    <a href="VERY_LONG_URL.........." target="_blank" data-ft="{&quot;tn&quot;:&quot;-U&quot;}" rel="noopener nofollow" data-lynx-mode="async">SHORT_LINK.....</a>
  </p>
</div>
This regex seems to work, but I don't think it is very reliable:
/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g
For example, it breaks if a post contains another div element. Besides that, this approach gives me no way to find out the time/date a post was created.
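To make the nested-div problem concrete, here is a minimal, dependency-free sketch. The sample markup is made up for illustration (it is not real Facebook output); only the regex is taken from above. It shows the lazy `(.+?)` group stopping at the first closing `</div>`, so everything after the inner div is lost.

```javascript
// Hypothetical sample markup: a post whose body contains a nested <div>.
const sampleHtml =
  '<div class="_5pbx userContent _3576" data-ft="x">' +
  '<p>Intro</p><div class="quote">Quoted text</div><p>Outro</p>' +
  '</div>';

// Same pattern as in the question. The lazy (.+?) stops at the first </div> it meets.
const POST_PATTERN = /<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g;

function extractWithRegex(html) {
  // matchAll works on a clone of the regex, so POST_PATTERN.lastIndex is not disturbed.
  return Array.from(html.matchAll(POST_PATTERN), match => match[1]);
}

const extracted = extractWithRegex(sampleHtml);
// The capture ends inside the nested div; the trailing <p>Outro</p> is missing.
console.log(extracted[0]);
```

A real HTML parser tracks nesting and does not have this problem, which is why a parser-based approach is generally the safer choice here.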
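Any ideas how I can extract the 2-3 most recent posts (including creation date/time) relatively reliably?
OK, I finally figured it out. I hope this is useful to others. This function extracts the 20 most recent posts, including their creation times: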
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

function GetFbPosts(pageUrl) {
  const requestOptions = {
    url: pageUrl,
    headers: {
      'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
    }
  };
  return rp.get(requestOptions).then(postsHtml => {
    const $ = cheerio.load(postsHtml);
    const timeLinePostEls = $('.userContent').map((i, el) => $(el)).get();
    const posts = timeLinePostEls.map(post => {
      return {
        message: post.html(),
        created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
      };
    });
    return posts;
  });
}

GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts => {
  // Log all posts
  for (const post of posts) {
    console.log(post.created_at, post.message);
  }
});
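Because Facebook messages can have complex formatting, message is HTML rather than plain text. You can strip the formatting and get plain text by replacing message: post.html() with message: post.text().
Edit: If you want more than the 20 most recent posts, it gets more complicated. The first 20 posts are served statically in the initial HTML page. All subsequent posts are retrieved via ajax in batches of 8. This can be implemented like this: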
// make sure your node.js version supports async/await (v10 and above should be fine)
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

class FbScrape {
  constructor(options = {}) {
    this.headers = options.headers || {
      'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' // you may have to update this at some point
    };
  }

  async getPosts(pageUrl, limit = 20) {
    const staticPostsHtml = await rp.get({ url: pageUrl, headers: this.headers });
    const staticPosts = this._parsePostsHtml(staticPostsHtml);
    if (limit <= 20) {
      return staticPosts;
    }
    const nextResultsUrl = this._getNextPageAjaxUrl(staticPostsHtml);
    if (!nextResultsUrl) return staticPosts; // no further pages available
    const ajaxPosts = await this._getAjaxPosts(nextResultsUrl, limit - 20);
    return staticPosts.concat(ajaxPosts);
  }

  _parsePostsHtml(postsHtml) {
    const $ = cheerio.load(postsHtml);
    const timeLinePostEls = $('.userContent').map((i, el) => $(el)).get();
    const posts = timeLinePostEls.map(post => {
      return {
        message: post.html(),
        created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
      };
    });
    return posts;
  }

  async _getAjaxPosts(resultsUrl, limit = 8, posts = []) {
    const responseBody = await rp.get({ url: resultsUrl, headers: this.headers });
    // Skip the 9-character prefix (Facebook's `for (;;);` guard) that precedes the JSON
    const extractedJson = JSON.parse(responseBody.slice(9));
    const postsHtml = extractedJson.domops[0][3].__html;
    const newPosts = this._parsePostsHtml(postsHtml);
    const allPosts = posts.concat(newPosts);
    const nextResultsUrl = this._getNextPageAjaxUrl(postsHtml);
    // Stop once enough posts are collected or no further page is linked
    if (allPosts.length >= limit || !nextResultsUrl)
      return allPosts;
    return this._getAjaxPosts(nextResultsUrl, limit, allPosts);
  }

  _getNextPageAjaxUrl(html) {
    const match = /"(\/pages_reaction_units\/more[^"]+)"/.exec(html);
    if (!match) return null; // no "more" link found
    // The URL is HTML-escaped in the page source, so decode &amp; back to &
    return 'https://www.facebook.com' + match[1].replace(/&amp;/g, '&') + '&__a=1';
  }
}

const fbScrape = new FbScrape();
const minimum = 28; // minimum number of posts to request (gets rounded up to 20, 28, 36, 44, 52, 60, 68 etc. because of page sizes: page1=20; all_following_pages=8)
fbScrape.getPosts('https://www.facebook.com/pg/officialstackoverflow/posts/', minimum).then(posts => { // get at least the 28 latest posts
  // Log all posts
  for (const post of posts) {
    console.log(post.created_at, post.message);
  }
});
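The least obvious step in the code above is unwrapping the ajax response, so here is a small standalone sketch of just that part. The sample body is invented for illustration: only the `for (;;);` prefix and the `domops[0][3].__html` path come from the code above, while the other array entries and the helper name `extractPostsHtml` are assumptions.

```javascript
const GUARD_PREFIX = 'for (;;);'; // 9 characters, matching the offset used above

// Hypothetical response body; the first three domops entries are placeholders.
const sampleBody = GUARD_PREFIX + JSON.stringify({
  domops: [['replace', '#placeholder', false, { __html: '<div class="userContent">Hello</div>' }]]
});

function extractPostsHtml(responseBody) {
  // Strip the guard prefix only if it is present, then parse the remaining JSON.
  const json = responseBody.startsWith(GUARD_PREFIX)
    ? responseBody.slice(GUARD_PREFIX.length)
    : responseBody;
  const payload = JSON.parse(json);
  const html = payload?.domops?.[0]?.[3]?.__html;
  if (typeof html !== 'string') {
    // Fail loudly if the undocumented response shape has changed.
    throw new Error('Unexpected response shape: domops[0][3].__html not found');
  }
  return html;
}

console.log(extractPostsHtml(sampleBody));
```

Checking the shape explicitly gives a readable error instead of a bare TypeError if Facebook changes this undocumented format.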
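As a rough illustration of the rounding described in the comment on `minimum`, the sketch below computes how many posts come back for a given request. It assumes every page is returned full (20 on the first page, 8 on each ajax page); the function name `expectedPostCount` is made up for this example.

```javascript
const FIRST_PAGE_SIZE = 20; // posts embedded in the initial HTML page
const AJAX_PAGE_SIZE = 8;   // posts per subsequent ajax page

// Number of posts returned for a requested minimum, assuming all pages are full.
function expectedPostCount(minimum) {
  if (minimum <= FIRST_PAGE_SIZE) return FIRST_PAGE_SIZE;
  const extraPages = Math.ceil((minimum - FIRST_PAGE_SIZE) / AJAX_PAGE_SIZE);
  return FIRST_PAGE_SIZE + extraPages * AJAX_PAGE_SIZE;
}

console.log([1, 20, 21, 28, 29, 50].map(expectedPostCount)); // [ 20, 20, 28, 28, 36, 52 ]
```

If you need an exact number of posts, slice the returned array, e.g. `posts.slice(0, minimum)`.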