The following works fine when copied and pasted directly into R:
> character_test <- function() print("R??????GNU S????????????????????????????????????...")
> character_test()
[1] "R??????GNU S??????????????,???????,?????????????..."
However, if I create a file called character_test.R containing exactly the same code and save it with UTF-8 encoding (so that the special Chinese characters are retained), I get the following error when I source() it in R:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving this, and in helping me understand what is going on here, would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 …

I can source an R script hosted on GitHub (using the 'raw' text link) as follows:
# load package
require(RCurl)
# check 1
ls()
#character(0)
# read script lines from website
u <- "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R"
script <- getURL(u, ssl.verifypeer = FALSE)
eval(parse(text = script))
# clean-up
rm("script", "u")
# check 2
ls()
#[1] "bingSearchXScraper"
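However, what I really want to do is wrap this in a function. This is where I run into trouble, and I suspect it is because the functions defined by the script exist only locally, inside the function that evaluates it. For example, this is what I am aiming for: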
source_github <- function(u) {
# load package
require(RCurl)
# read script lines from website and evaluate
script <- getURL(u, ssl.verifypeer = FALSE)
eval(parse(text = script))
}
source_github("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")
Many thanks for your time.
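The scoping suspicion above can be checked without any network access. The sketch below is only an illustration: `script`, `source_text_local`, `source_text_in` and `target` are hypothetical names, and the one-line script text stands in for whatever getURL() would return. By default eval() evaluates in the frame of the function that calls it, so anything the script defines vanishes when that function returns; passing an explicit `envir` keeps the definitions.

```r
# Hypothetical stand-in for the script text that getURL() would return.
script <- "hello_from_script <- function() 'hello'"

# Same shape as source_github(): eval() uses the calling function's own frame,
# so the definitions are discarded when the function returns.
source_text_local <- function(txt) {
  eval(parse(text = txt))
  invisible(NULL)
}

# Variant: evaluate in an explicit environment (the global one by default).
source_text_in <- function(txt, envir = globalenv()) {
  eval(parse(text = txt), envir = envir)
  invisible(NULL)
}

target <- new.env()                      # used here to keep the demo tidy
source_text_local(script)                # defines nothing that survives
source_text_in(script, envir = target)   # definition lands in `target`

exists("hello_from_script", envir = target, inherits = FALSE)
target$hello_from_script()
```

Applied to source_github(), the same idea would mean changing the last line of the function body to `eval(parse(text = script), envir = globalenv())`, assuming the global environment is where the downloaded functions are wanted.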
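Goal
I want to use R to download the HTML of a Google search web page exactly as it is shown in a web browser.
Problem
When I download the HTML of a Google search page in R, using exactly the same URL as in the web browser, the HTML that R downloads differs from the HTML in the browser. For example, with an advanced Google search URL the date parameters are ignored in the HTML read in by R, whereas they are honoured in the web browser.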
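One possible explanation, offered as an assumption rather than a confirmed diagnosis: everything after a `#` in a URL is a fragment, which a browser keeps on the client side and does not send to the server. If the date range sits in the fragment, a plain HTTP download would never transmit it. The sketch below uses a shortened, hypothetical version of such a URL and simply checks on which side of the `#` the `cd_min`/`cd_max` parameters end up.

```r
# Shortened, hypothetical URL with the same shape as a browser-copied one.
url <- paste0(
  "http://www.google.co.uk/search?q=west+end+theatre&client=firefox-a",
  "#q=west+end+theatre&tbs=cdr:1%2Ccd_min%3A1%2F1%2F2012%2Ccd_max%3A31%2F1%2F2012"
)

# Split once at the fragment marker.
parts        <- strsplit(url, "#", fixed = TRUE)[[1]]
request_part <- parts[1]                                   # sent to the server
fragment     <- if (length(parts) > 1) parts[2] else ""    # stays in the browser

grepl("cd_min", request_part, fixed = TRUE)   # is the date range requested?
grepl("cd_min", fragment, fixed = TRUE)       # or is it only in the fragment?

# Untested idea: move the fragment's parameters into the query string.
candidate <- paste0("http://www.google.co.uk/search?", fragment)
```

Whether Google honours `candidate` when it is requested from R is not something this sketch verifies.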
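Example
I do a Google search in my web browser for "West End Theatre" and specify a date range of 1 January 2012 to 31 January 2012. I then copy the resulting URL and paste it into R.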
# Google Search URL from Firefox web browser
url <- "http://www.google.co.uk/search?q=west+end+theatre&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a#q=west+end+theatre&hl=en&client=firefox-a&hs=z7I&rls=org.mozilla:en-GB%3Aofficial&prmd=imvns&sa=X&ei=rJE7T8fwM82WhQe_6eD2CQ&ved=0CGoQpwUoBw&source=lnt&tbs=cdr:1%2Ccd_min%3A1%2F1%2F2012%2Ccd_max%3A31%2F1%2F2012&tbm=&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=6f92152f78004c6d&biw=1600&bih=810"
u <- URLdecode(url)
# Webpage as seen in browser
browseURL(u)
# Webpage as seen from R
HTML <- paste(readLines(u), collapse = "\n")
cat(HTML, file = "output01.html")
shell.exec("output01.html")
# Webpage as seen from R through RCurl
library(RCurl)
cookie = 'cookiefile.txt'
curl = getCurlHandle(cookiefile = cookie,
useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
header = FALSE, …

Let's say I have a few years' worth of data that looks like the following:
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 …

(Apologies, I wasn't sure what the best title for this post would be; feel free to edit it.)
Let's say I have the following relational structure between words and their types (i.e. a dictionary):
dictionary <- data.frame(level1=c(rep("Positive", 3), rep("Negative", 3)), level2 = c("happy", "fantastic", "great", "sad", "rubbish", "awful"))
# level1 level2
# 1 Positive happy
# 2 Positive fantastic
# 3 Positive great
# 4 Negative sad
# 5 Negative rubbish
# 6 Negative awful
And we have counted the number of occurrences of each word in seven documents (i.e. a term-document matrix):
set.seed(42)
range = 0:3
df <- data.frame(row.names = c("happy", "fantastic", "great", "sad", "rubbish", "awful"), doc1 = sample(x=range, size=6, replace=TRUE), doc2 = sample(x=range, size=6, replace=TRUE), doc3 = sample(x=range, size=6, replace=TRUE), doc4 = sample(x=range, size=6, replace=TRUE), …

What is an efficient way to replace a substring at a fixed position with another string of equal or greater length?
For example, the following replaces the substring "abc" by first finding the position of "abc" and then replacing it:
sub("abc", "123", "iabc.def", fixed = TRUE)
#[1] "i123.def"
sub("abc", "1234", "iabc.def", fixed = TRUE)
#[1] "i1234.def"
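However, we know that the substring "abc" is always at character positions 2, 3 and 4. In that case, is there a way to specify those positions so that no string matching is needed and character indexing is used instead?
I did try substr(), but it did not work as I had hoped when the replacement string is longer than the substring being replaced: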
x <- "iabc.def"
substr(x, 2, 4) <- "123"
#[1] "i123.def"
x <- "iabc.def"
substr(x, 2, 4) <- "1234"
#[1] "i123.def"
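The truncation above is how the replacement form of substr() behaves: it overwrites characters in place and never changes the length of the string. One position-based alternative is to cut the string around the known positions and paste the pieces back together. This is a minimal sketch; `replace_at` is a hypothetical helper name, and no claim is made that it is faster than sub().

```r
# Replace characters start..stop of x with `replacement`, whatever its length.
# Uses only character positions; no pattern matching is involved.
replace_at <- function(x, start, stop, replacement) {
  paste0(
    substr(x, 1, start - 1),        # everything before the target span
    replacement,                    # the new text, of any length
    substr(x, stop + 1, nchar(x))   # everything after the target span
  )
}

replace_at("iabc.def", 2, 4, "123")
replace_at("iabc.def", 2, 4, "1234")
```

Because substr(), nchar() and paste0() are vectorised, the helper also accepts a character vector for `x`.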
Many thanks for your time,
Tony Breyal
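P.S. The above may already be the most efficient way of doing what I want, but I thought I'd ask just in case there is a better way :)
===== Timings =====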
# test elapsed relative
# 7 francois.fx_wb(x, replacement) 0.94 1.000000
# 1 f(x) 1.56 1.659574
# 6 francois.fx(x, replacement) 2.23 2.372340
# 5 Sobala(x) 3.89 4.138298
# 2 Hong.Ooi(x) 4.41 4.691489
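# 3 …

EDIT: Updated. Thanks to @daroczig for the lovely answer below. However, test 2 still feels like it takes longer than test 1, which is what I am wondering about.
UPDATE: On second reading, @daroczig's answer does explain my confusion; the problem was that I had not thought through the system.time(expr) line of code properly.
I wanted to make a version of the system.time function that would be more informative for me in terms of understanding run-to-run fluctuations in timing: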
system.time.summary <- function(N, expr) {
t.mat <- replicate(N, system.time(expr))
as.data.frame(apply(t.mat[1:3,], 1, summary))
}
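A sketch of why a wrapper like the one above can report near-zero timings: R passes `expr` as a lazily evaluated promise, and a promise is forced only once. The first system.time(expr) does the real work, while every later replicate times a value that has already been computed. The code below shows the effect with a counter and then one possible variant that captures the unevaluated expression with substitute() and re-evaluates it on every run. `time_summary_once`, `time_summary_each` and the counter environment `e` are hypothetical names for this illustration.

```r
# Original-style wrapper: `expr` is a promise, forced on its first use only.
time_summary_once <- function(N, expr) {
  t.mat <- replicate(N, system.time(expr))
  as.data.frame(apply(t.mat[1:3, ], 1, summary))
}

# Variant: keep the expression unevaluated and re-run it in the caller's
# environment on each replicate (needs N >= 2 so that t.mat is a matrix).
time_summary_each <- function(N, expr) {
  ex  <- substitute(expr)
  env <- parent.frame()
  t.mat <- replicate(N, system.time(eval(ex, env)))
  as.data.frame(apply(t.mat[1:3, ], 1, summary))
}

# Count how many times the timed expression really runs.
e <- new.env()
e$n <- 0
res_once <- time_summary_once(5, e$n <- e$n + 1)
runs_once <- e$n   # the expression ran once

e$n <- 0
res_each <- time_summary_each(5, e$n <- e$n + 1)
runs_each <- e$n   # the expression ran on every replicate
```

This is consistent with the test 1 output, where the median is 0 and only the maximum shows a real timing.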
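The problem, however, is that in the self-contained code below, test.2 feels like it takes a lot longer than test.1 (and I have run them several times to check), even though the code is almost exactly the same (test.1 uses the wrapper function, whereas test.2 is just the raw code):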
# set up number of runs
N <- 100
# test 1
system.time.summary(N, (1:1e8)^2 + 1)
user.self sys.self elapsed
Min. 0.000 0.000 0.000
1st Qu. 0.000 0.000 0.000
Median 0.000 0.000 0.000
Mean 0.058 0.031 0.089
3rd Qu. 0.000 0.000 0.000
Max. 0.580 0.310 0.890
# test 2
t.mat …

Is there a way to tell R or the RCurl package to give up trying to download a webpage if it exceeds a specified period of time, and move on to the next line of code? For example:
> library(RCurl)
> u = "http://photos.prnewswire.com/prnh/20110713/NY34814-b"
> getURL(u, followLocation = TRUE)
> print("next line") # programme does not get this far
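This hangs on my system and never gets to the final line.
EDIT: Based on @Richie Cotton's answer, although I can "achieve what I want", I don't understand why it takes longer than expected. For example, if I do the following, the system hangs until I select/unselect the "Misc >> Buffered Output" option in the RGui: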
> system.time(getURL(u, followLocation = TRUE, .opts = list(timeout = 1)))
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
Operation timed out after 1000 milliseconds with 0 out of 0 bytes received
Timing stopped at: 0.02 0.08 ***6.76***
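As the output above shows, a timeout surfaces as an ordinary R error, which stops a script unless it is caught. A minimal sketch of the "give up and carry on" part, assuming a hypothetical helper `fetch_or_na` and a stand-in function `slow_fetch` that merely simulates the timeout error, so that the example runs without RCurl or a network connection:

```r
# Run a download function; on any error, report it and return NA instead.
fetch_or_na <- function(fetch) {
  tryCatch(
    fetch(),
    error = function(err) {
      message("Giving up: ", conditionMessage(err))
      NA_character_
    }
  )
}

# Stand-in for the real call; it only simulates the timeout error.
slow_fetch <- function() {
  stop("Operation timed out after 1000 milliseconds with 0 out of 0 bytes received")
}

result <- fetch_or_na(slow_fetch)
print("next line")   # now reached even though the download failed
```

With RCurl loaded, the real call from this question would be wrapped the same way: `fetch_or_na(function() getURL(u, followLocation = TRUE, .opts = list(timeout = 1)))`.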
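SOLUTION: Based on @Duncan's post, and after then looking at the curl documentation, I found the solution by using the maxredirs option as follows:
> getURL(u, followLocation = TRUE, .opts = list(timeout = 1, …

Question: Is there a way to avoid manually entering the PIN when performing the OAuth handshake?
Context: When performing the ROAuth handshake, I am asked to enter a PIN that I obtain by following a link: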
上下文:在进行ROAuth握手时,我被要求输入我通过以下链接获得的PIN:
rm(list=ls())
library("twitteR")
library("ROAuth")
Credentials <- OAuthFactory$new(
consumerKey = "...",
consumerSecret = "...",
oauthKey = "...",
oauthSecret = "...",
requestURL = "https://api.twitter.com/oauth/request_token",
authURL = "https://api.twitter.com/oauth/authorize",
accessURL = "https://api.twitter.com/oauth/access_token")
Credentials$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
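Which outputs:
> Credentials$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=...
When complete, record the PIN given to you and provide it here:
I then enter the PIN.
I would like to avoid this step, because every time I run the script in a new R session I have to open a browser manually to retrieve the PIN. I am the only person who uses this script.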
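One general way to avoid repeating an interactive step in every session is to do it once and cache the result on disk. The sketch below shows only that caching pattern. `get_credentials` is a hypothetical helper and `fake_handshake` is a stand-in for the real interactive handshake. Whether a stored ROAuth credential object stays valid across sessions is an assumption that would need checking against Twitter's token lifetime.

```r
# Return cached credentials if present; otherwise run the handshake once
# and store the result for later sessions.
get_credentials <- function(cache_file, do_handshake) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  cred <- do_handshake()
  saveRDS(cred, cache_file)
  cred
}

# Stand-in for the interactive handshake; counts how often it is invoked.
calls <- new.env()
calls$n <- 0
fake_handshake <- function() {
  calls$n <- calls$n + 1
  list(token = "hypothetical-token")
}

cache  <- tempfile(fileext = ".rds")
first  <- get_credentials(cache, fake_handshake)  # performs the handshake
second <- get_credentials(cache, fake_handshake)  # read back from disk
```

The cache file holds secrets, so it should be stored somewhere private.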

Let's say I have the following:
library(XML)
my.xml <- '
<tv>
<show>
<name>Star Trek TNG</name>
<rating>1.0</rating>
</show>
<show>
<name>Doctor Who</name>
</show>
<show>
<name>Babylon 5</name>
<rating>2.0</rating>
</show>
</tv>
'
doc <- xmlParse(my.xml)
xpathSApply(doc, "/tv/show/rating", xmlValue)
# [1] "1.0" "2.0"
There are three 'show' nodes. How can I get the output to be:
# [1] "1.0" NULL "2.0"
so that it accounts for Doctor Who, which has no rating in the XML, while the result still has length 3?
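One way to keep one result per show is to iterate over the `show` nodes rather than the `rating` nodes, and look the rating up inside each one. This is a sketch, not necessarily the intended answer. It relies on xmlValue() returning NA for a missing child, so the gap appears as NA, not NULL; a character vector cannot hold NULL.

```r
library(XML)

my.xml <- '
<tv>
  <show><name>Star Trek TNG</name><rating>1.0</rating></show>
  <show><name>Doctor Who</name></show>
  <show><name>Babylon 5</name><rating>2.0</rating></show>
</tv>'
doc <- xmlParse(my.xml)

# One call per <show>; a show without <rating> yields NA.
ratings <- xpathSApply(doc, "/tv/show", function(show) {
  xmlValue(xmlChildren(show)$rating)
})
ratings
length(ratings)
```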
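
This is a follow-up to a very similar question I have already asked, but this time I am trying to get xmlAttrs instead of xmlValue. So let's suppose we have the following: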
my.xml <- '
<tv>
<show>
<name>Star Trek TNG</name>
<rating>1.0</rating>
<a href="http://www.google.com">google</a>
</show>
<show>
<name>Doctor Who</name>
<a href="http://www.google.com">google</a>
</show>
<show>
<name>Babylon 5</name>
<rating>2.0</rating>
</show>
</tv>
'
library(XML)
doc <- xmlParse(my.xml)
xpathSApply(doc, '/tv/show', function(x) xmlValue(xmlChildren(x)$a))
# [1] "google" "google" NA
I would rather the output were:
# [1] "http://www.google.com" "http://www.google.com" NA
But I can't figure it out. I thought it might be something like this, but I was wrong:
xpathSApply(doc, '/tv/show', function(x) xmlAttrs(xmlChildren(x)$a))
# Error in UseMethod("xmlAttrs", node) :
# no applicable method for 'xmlAttrs' applied to an object of class "NULL"
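The error arises because the third show has no `a` child, so xmlAttrs() receives NULL. A minimal sketch of one workaround is to test for the missing node first and read the attribute with xmlGetAttr() only when the node exists. The variable names are illustrative.

```r
library(XML)

my.xml <- '
<tv>
  <show><name>Star Trek TNG</name><rating>1.0</rating>
    <a href="http://www.google.com">google</a></show>
  <show><name>Doctor Who</name>
    <a href="http://www.google.com">google</a></show>
  <show><name>Babylon 5</name><rating>2.0</rating></show>
</tv>'
doc <- xmlParse(my.xml)

# One call per <show>; a show without an <a> child yields NA.
hrefs <- xpathSApply(doc, "/tv/show", function(show) {
  link <- xmlChildren(show)$a
  if (is.null(link)) NA_character_ else xmlGetAttr(link, "href")
})
hrefs
```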
The closest I have got is:
xpathSApply(doc, '/tv/show', function(x) xmlChildren(x)$a)
# [[1]]
# <a …

Let's say I have the following character vector:
x <- c("this!", "is!", "not my name[!]!", "Understrand[!] Mate!",
"Because!I[!] said so!")
I need a way to replace the exclamation mark "!" with "!\n", but only when the exclamation mark is not enclosed in square brackets. So the output would look like this:
"this!\n"
"is!\n"
"not my name[!]!\n"
"Understrand[!] Mate!\n"
"Because!\nI[!] said so!\n"
I've been playing around with this and just can't figure it out.
Many thanks for any help.
Tony B.
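One possible approach, sketched on the assumption that "enclosed" means the exact form `[!]`, uses Perl-style lookarounds. The pattern matches a "!" that is either not preceded by "[" or not followed by "]", so only a "!" with both brackets around it is left alone. The names `pattern` and `out` are illustrative.

```r
x <- c("this!", "is!", "not my name[!]!", "Understrand[!] Mate!",
       "Because!I[!] said so!")

# "!" not preceded by "[", OR "!" not followed by "]".
pattern <- "(?<!\\[)!|!(?!\\])"
out <- gsub(pattern, "!\n", x, perl = TRUE)

# Show the results with the newline escaped, one element per line.
print(out)
```

Text such as `[abc!def]`, where the "!" is inside brackets but not directly adjacent to them, would still be replaced by this pattern; handling that case would need a different approach.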