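I have a character vector of scraped dates. My problem: when I use as.Date(), every date containing the month name "März" (German for "March") comes back as NA. Why is that?
Here is a (hopefully reproducible) example: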
require(RCurl)
require(XML)
doc <- htmlParse(getURL("http://www.amazon.de/product-reviews/3836218984/?ie=UTF8&pageNumber=5&showViewpoints=0&sortBy=byRankDescending"), 
                 encoding="UTF-8")
(dates <- xpathSApply(doc, "//div/span[2]/nobr", xmlValue))
# [1] "12. Februar 2009"   "12. November 2006"  "19. März 2010"     
# [4] "30. Juni 2007"      "7. März 2006"       "19. März 2007"     
# [7] "22. Januar 2006"    "24. September 2005" "15. Februar 2012"  
# [10] "28. März 2007" 
Sys.setlocale("LC_TIME", "German") # on Windows, see ?Sys.setlocale
as.Date(dates,  "%d. %B %Y")
# [1] "2009-02-12" "2006-11-12" NA           "2007-06-30" NA          
# [6] NA           "2006-01-22" "2005-09-24" "2012-02-15" NA 
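Any ideas on what to try next?
Note that if I apply the same call to the character vector after running dput() on it and copy/pasting the result, everything works fine: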
dates <- c("12. Februar 2009", "12. November 2006", "19. März 2010", "30. Juni 2007", 
           "7. März 2006", "19. März 2007", "22. Januar 2006", "24. September 2005", 
           "15. Februar 2012", "28. März 2007")
as.Date(dates,  "%d. %B %Y")
# [1] "2009-02-12" "2006-11-12" "2010-03-19" "2007-06-30"
# [5] "2006-03-07" "2007-03-19" "2006-01-22" "2005-09-24"
# [9] "2012-02-15" "2007-03-28"
For completeness, my session info:
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
[1] tools_3.0.2
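I can reproduce this on Windows 7 x64. There are many issues with how R and Windows interact with character encodings, and I won't pretend to understand them. In your case, simply convert the strings to latin1 before converting them to Date.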
as.Date(iconv(dates,from='UTF-8',to='latin1'),'%d. %B %Y')
#  [1] "2009-02-12" "2006-11-12" "2010-03-19" "2007-06-30" "2006-03-07" "2007-03-19"
#  [7] "2006-01-22" "2005-09-24" "2012-02-15" "2007-03-28"
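To see why the conversion might matter, it can help to compare the bytes behind "März" in the two encodings. The sketch below is only an illustration: the names x_utf8 and x_latin1 are made up here, and the idea that the scraped strings hold UTF-8 bytes while the German_Germany.1252 locale expects single-byte month names is an assumption, not something confirmed above.

```r
# Build the string with a \u escape so the example does not depend on
# how this snippet itself is saved.
x_utf8 <- enc2utf8("19. M\u00e4rz 2010")
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# Declared encodings of the two copies
Encoding(x_utf8)
Encoding(x_latin1)

# Raw bytes: the a-umlaut is two bytes (c3 a4) in UTF-8, one byte (e4) in latin1
charToRaw(x_utf8)
charToRaw(x_latin1)

# On the scraped vector from the question, the same checks would show
# which form it actually holds (left commented out, since it needs `dates`):
# Encoding(dates); charToRaw(dates[3])
```

If the scraped "März" turns out to contain the bytes c3 a4, that would be consistent with the iconv() call above being what makes the month name match.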
There may be a way to make as.Date recognize different encodings on Windows, but I don't know of one.
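Short of that, one possible variation is to wrap the conversion in a small helper and let iconv() target the current locale's encoding (to = "") instead of hard-coding latin1. This is a sketch only: parse_scraped_dates is a hypothetical name, not part of base R, and it assumes the input really is UTF-8 and that LC_TIME has already been set to a German locale.

```r
# Hypothetical helper: convert scraped UTF-8 text to the session's native
# encoding, then parse it as a date.
parse_scraped_dates <- function(x, format = "%d. %B %Y", from = "UTF-8") {
  # to = "" asks iconv() for the current locale's encoding;
  # elements that cannot be converted become NA
  native <- iconv(x, from = from, to = "")
  as.Date(native, format = format)
}

# Usage with the vector from the question (commented out because it needs
# the scraped `dates` and a Windows-style German locale name):
# Sys.setlocale("LC_TIME", "German")
# parse_scraped_dates(dates)

# A locale-independent check with a numeric month:
parse_scraped_dates("19.03.2010", format = "%d.%m.%Y")
```

Targeting the native encoding rather than latin1 only helps if the locale's character set can represent every character in the input, so this is a convenience and not a general fix.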