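Here is my use case: I enabled Streams on DynamoDB with new and old images. I created a Kinesis Firehose delivery stream with Redshift as the destination (the intermediate S3 bucket is already configured).
From DynamoDB my stream reaches Firehose, and from there it lands in the S3 bucket as gzipped JSON, shown below. My problem is that I cannot COPY this JSON to Redshift.
Things I am not able to get:
The JSON loaded into S3 looks like this: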
{
"Keys": {
"vehicle_id": {
"S": "x011"
}
},
"NewImage": {
"heart_beat": {
"N": "0"
},
"cdc_id": {
"N": "456"
},
"latitude": {
"N": "1.30951"
},
"not_deployed_counter": {
"N": "1"
},
"reg_ind": {
"N": "0"
},
"operator": {
"S": …
amazon-s3 amazon-dynamodb amazon-redshift amazon-dynamodb-streams amazon-kinesis-firehose
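My goal is to automatically route feedback emails to the appropriate department.
My fields are FNUMBER, CATEGORY, SUBCATEGORY, Description.
I have the last 6 months of data in the above format - the entire email is stored in Description, along with CATEGORY and SUBCATEGORY.
I have to analyze the DESCRIPTION column and find keywords for each category/subcategory, so that when the next feedback email comes in, it is automatically classified into a category and subcategory based on the keywords generated from the historical data.
I have imported an XML file into R for text classification, and then converted the XML into a data frame with the required fields. I have 23017 records for a particular month - I have listed only the first 20 records as the data frame below.
I have more than 100 categories and subcategories.
I am new to text mining concepts - but with the help of SO and the tm package, I tried the code below: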
step1 <- structure(list(FNUMBER = structure(1:20, .Label = c(" 20131202-0885 ",
"20131202-0886 ", "20131202-0985 ", "20131202-1145 ", "20131202-1227 ",
"20131202-1228 ", "20131202-1235 ", "20131202-1236 ", "20131202-1247 ",
"20131202-1248 ", "20131202-1249 ", "20131222-0157 ", "20131230-0668 ",
"20131230-0706 ", "20131230-0776 ", "20131230-0863 ", …
I am trying to classify text documents into multiple categories. My code below works fine:
matrix[[i]] <- create_matrix(trainingdata[[i]][,1], language="english",removeNumbers=FALSE,stemWords=FALSE,weighting=weightTf,minWordLength=3)
container[[i]] <- create_container(matrix[[i]],trainingdata[[i]][,2],trainSize=1:50,testSize=51:100)
models[[i]] <- train_models(container[[i]], algorithms=c("MAXENT","SVM"))
results[[i]] = classify_models(container[[i]],models[[i]])
When I try the code below to get the precision, recall and accuracy values:
analytic[[i]] <- create_analytics(container[[i]], results[[i]])
I get the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = c(NA_real_, NA_real_ :
duplicate 'row.names' are not allowed
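My categories are in text format. If I convert the categories to numeric, the above code works fine.
Is there a workaround that keeps the categories in text format and still gives the precision, recall and accuracy values?
My goal is to get the precision, recall and accuracy values and the confusion matrix for a multi-class classifier. Is there any other package that gives these values for a multi-class text classifier (one vs. all)?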
precision r text-mining document-classification confusion-matrix
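For the multi-class metrics asked about above, one package-free option is to build a confusion matrix with base R's table(), which works directly on text labels. This is only a minimal sketch and not the RTextTools method; the `actual` and `predicted` vectors are made-up illustrative data standing in for the true categories and the classifier output.

```r
# Illustrative labels only (assumption); replace with real categories and predictions.
actual    <- c("Box", "Box", "Office", "Office", "Ball", "Ball")
predicted <- c("Box", "Office", "Office", "Office", "Ball", "Box")

# Share one set of levels so the table stays square even if a class is never predicted.
lv <- sort(unique(c(actual, predicted)))
cm <- table(Predicted = factor(predicted, levels = lv),
            Actual    = factor(actual, levels = lv))

tp        <- diag(cm)            # correctly classified documents per class
precision <- tp / rowSums(cm)    # per class: correct / all predicted as that class
recall    <- tp / colSums(cm)    # per class: correct / all that truly are that class
accuracy  <- sum(tp) / sum(cm)   # overall share of correct predictions

print(cm)
print(round(cbind(precision, recall), 3))
print(accuracy)
```

Because the labels stay as text, no numeric recoding is needed. A class that is never predicted yields NaN for precision (division by zero), which may need separate handling.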
> dc1
V1 V2
1 20140211-0100 |Box
2 20140211-1782 |Office|Ball
3 20140211-1783 |Office
4 20140211-1784 |Office
5 20140221-0756 |Box
6 20140203-0418 |Box
> strsplit(as.character(dc1[,2]),"^\\|")
[[1]]
[1] "" "Box"
[[2]]
[1] "" "Office" "Ball"
[[3]]
[1] "" "Office"
[[4]]
[1] "" "Office"
[[5]]
[1] "" "Box"
[[6]]
[1] "" "Box"
How do I remove the blanks ("") from the strsplit result? The result should look like this:
[[1]]
[1] "Box"
[[2]]
[1] "Office" "Ball"
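One way to get that result, sketched here under the assumption that the second column holds pipe-delimited strings like those shown, is to split on a literal "|" and then drop the empty strings from each element. The vector `v2` is a small stand-in for `as.character(dc1[,2])`.

```r
# Stand-in for as.character(dc1[, 2]) (illustrative subset of the data shown above).
v2 <- c("|Box", "|Office|Ball", "|Office")

# fixed = TRUE treats "|" as a literal character rather than regex alternation.
parts <- strsplit(v2, "|", fixed = TRUE)

# Keep only the non-empty pieces in each list element.
parts <- lapply(parts, function(x) x[nzchar(x)])

print(parts)
```

An alternative with the same effect is to strip the leading pipe first, for example with `sub("^\\|", "", v2)`, and split afterwards.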
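I am working on a flow map of commuter travel patterns (origin - destination) in R. The data I have are the commuters' daily transactions (Date, Card, Entry_lat, Entry_Long, Exit_Lat, Exit_Long). The travel paths may be similar (since they commute to work).
I need to plot this on a map (great circles). If the origin and destination are the same, the thickness of the connecting line should reflect the repeated origin - destination pair.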
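A possible starting point for the requirement above, as a base R sketch: count identical origin - destination pairs first, then interpolate points along the great circle for each pair. The column names follow the question, but the coordinates in `trips` are invented for illustration, and `great_circle()` is a hypothetical helper written here, not a function from any package.

```r
# Invented coordinates (assumption); real data would come from the transactions table.
trips <- data.frame(
  Entry_lat  = c(1.31509, 1.31509, 1.33261),
  Entry_Long = c(103.765, 103.765, 103.847),
  Exit_Lat   = c(1.28425, 1.28425, 1.31812),
  Exit_Long  = c(103.843, 103.843, 103.893)
)

# One row per distinct origin-destination pair, with n = number of trips.
trips$n <- 1
flows <- aggregate(n ~ Entry_lat + Entry_Long + Exit_Lat + Exit_Long,
                   data = trips, FUN = sum)

# Hypothetical helper: n points along the great circle between two positions.
# Inputs and outputs are in degrees; start and end must differ (otherwise NaN).
great_circle <- function(lat1, lon1, lat2, lon2, n = 20) {
  rad <- pi / 180
  p1 <- lat1 * rad; l1 <- lon1 * rad
  p2 <- lat2 * rad; l2 <- lon2 * rad
  # Angular distance between the two points (haversine formula).
  d <- 2 * asin(sqrt(sin((p2 - p1) / 2)^2 +
                     cos(p1) * cos(p2) * sin((l2 - l1) / 2)^2))
  f <- seq(0, 1, length.out = n)
  a <- sin((1 - f) * d) / sin(d)
  b <- sin(f * d) / sin(d)
  x <- a * cos(p1) * cos(l1) + b * cos(p2) * cos(l2)
  y <- a * cos(p1) * sin(l1) + b * cos(p2) * sin(l2)
  z <- a * sin(p1) + b * sin(p2)
  data.frame(lat = atan2(z, sqrt(x^2 + y^2)) / rad, lon = atan2(y, x) / rad)
}

# Path for the first flow; its trip count can drive the line width when plotting.
path <- great_circle(flows$Entry_lat[1], flows$Entry_Long[1],
                     flows$Exit_Lat[1], flows$Exit_Long[1])
```

When drawing, the count in `flows$n` can be mapped to line width, for example `lines(path$lon, path$lat, lwd = flows$n[1])` in base graphics. Over distances as short as those within one city, the great-circle path is nearly a straight line.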
structure(list(business_date = structure(c(17245, 17245, 17245,
17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245,
17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245,
17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245,
17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245,
17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245,
17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245, 17245,
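17245, 17245, 17245, 17245, 17245, 17245, 17245, …
I am creating a DocumentTermMatrix using create_matrix() from RTextTools, and then a container and model based on it. This is for an extremely large dataset.
I do this for each category (factor level). So for each category it has to run the matrix, container and model steps. When I run the code below (on, for example, 16 cores / 64 GB), it runs on only one core and uses less than 10% of the memory.
Is there a way to speed up this process? Perhaps with doParallel and foreach? Any information would certainly help.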
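Since each category is processed independently, the per-category work is a natural unit to distribute. The sketch below uses the `parallel` package that ships with R rather than doParallel/foreach, which is a swapped-in technique. `train_one()` is a hypothetical placeholder for the matrix, container and model steps, and `all_data` is toy data.

```r
library(parallel)

# Hypothetical placeholder for the per-category work (matrix, container, model).
# Real code would call library(RTextTools) inside this function so that each
# worker process has the package loaded.
train_one <- function(category_data) {
  list(n_docs = nrow(category_data),
       mean_chars = mean(nchar(category_data$Description)))
}

# Toy data (assumption); one list element per category after split().
all_data <- data.frame(
  CATEGORY = c("A", "A", "B"),
  Description = c("late bus", "bus was late again", "card not working"),
  stringsAsFactors = FALSE
)
data_by_category <- split(all_data, all_data$CATEGORY)

# Two workers for the demo; detectCores() - 1 is a common choice on a larger machine.
cl <- makeCluster(2)
results <- parLapply(cl, data_by_category, train_one)
stopCluster(cl)
```

Each worker receives its own copy of the data it processes, so memory use grows with the number of workers. With very large per-category matrices, fewer workers than cores may be the safer setting.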
#import the required libraries
library("RTextTools")
library("hash")
library(tm)
for ( n in 1:length(folderaddress)){
#Initialize the variables
traindata = list()
matrix = list()
container = list()
models = list()
trainingdata = list()
results = list()
classifiermodeldiv = 0.80
#Create the directory to place the models and the output files
pradd = paste(combinedmodelsaveaddress[n],"SelftestClassifierModels",sep="")
if (!file.exists(pradd)){
dir.create(file.path(pradd))
}
Data$CATEGORY <- as.factor(Data$CATEGORY)
#Read the …
I have two timestamps in character form that I want to convert to POSIX format in R. The timestamps are:
1: "2013-03-30 17:45:00"
2: "2013-03-31 02:05:00"
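The first one converts fine, but the second one converts to NA. The timestamps were downloaded as characters from a SQL Server. Does anyone have any idea what is going wrong?
I do not have the reputation to attach a screenshot, so here is a screenshot of my R console showing the result: http://emillarsen.com/r%20console.jpg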
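One plausible cause, assuming the R session runs in a European time zone: daylight saving time began on 31 March 2013 in the EU, with clocks jumping from 02:00 to 03:00, so 02:05 on that date does not exist as local time and may be returned as NA. The question does not state the time zone, so this is a guess. A minimal sketch that sidesteps the gap by parsing in UTC:

```r
stamps <- c("2013-03-30 17:45:00", "2013-03-31 02:05:00")

# UTC has no daylight saving transitions, so every wall-clock time is valid.
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Elapsed time between the two stamps when both are read as UTC.
gap_minutes <- as.numeric(difftime(parsed[2], parsed[1], units = "mins"))

print(parsed)
print(gap_minutes)
```

If the values really are local wall-clock times, parsing them as UTC only relabels them. In that case the second timestamp is genuinely ambiguous, and the fix belongs where the data is exported from the database.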
I have a dataset as follows:
> head(resultsclassifiedfinal_MC_TC_P1)
FEEDBACK_NUMBER Biz_Div_Num ACCURACY Category_Num CLASSIFIED_BY ACTIVE_IND CRT_BY_USR_NUM
1 20140211-1173 556 99.48% 2303 CMC 1 SYSTEM
2 20140211-1886 556 99.6% 2232 CMC 1 SYSTEM
3 20140209-0115 556 66.09% 2232 CMC 1 SYSTEM
4 20140202-0337 556 93.7% 2232 CMC 1 SYSTEM
5 20140203-0418 552 50% 2232 CMC 1 SYSTEM
6 20140303-1339 552 54.45% 2232 CMC 1 SYSTEM
I am able to insert these records into a table that already exists in the Oracle DB:
> library(RODBC)
> channel <- odbcConnect("R", uid="xxx", pwd="xxx@123")
> sqlSave(channel,resultsclassifiedfinal_MC_TC_P1, tablename="table1", rownames=FALSE, append=TRUE,fast = FALSE,nastring = NULL)
> odbcClose(channel)
To …
I am trying to draw a flow map (for Singapore). I have Entry (Lat, Long) and Exit (Lat, Long). I am trying to map the flow from entry to exit on a map of Singapore.
structure(list(token_id = c(1.12374e+19, 1.12374e+19, 1.81313e+19,
1.85075e+19, 1.30752e+19, 1.30752e+19, 1.32828e+19, 1.70088e+19,
1.70088e+19, 1.70088e+19, 1.05536e+19, 1.44818e+19, 1.44736e+19,
1.44736e+19, 1.44736e+19, 1.44736e+19, 1.89909e+19, 1.15795e+19,
1.15795e+19, 1.15795e+19, 1.70234e+19, 1.70234e+19, 1.44062e+19,
1.21512e+19, 1.21512e+19, 1.95909e+19, 1.95909e+19, 1.50179e+19,
1.50179e+19, 1.24174e+19, 1.36445e+19, 1.98549e+19, 1.92068e+19,
1.18468e+19, 1.18468e+19, 1.92409e+19, 1.92409e+19, 1.21387e+19,
1.9162e+19, 1.9162e+19, 1.40385e+19, 1.40385e+19, 1.32996e+19,
1.32996e+19, 1.69103e+19, 1.69103e+19, 1.57387e+19, 1.40552e+19,
1.40552e+19, 1.00302e+19), Entry_Station_Lat = c(1.31509, 1.33261,
1.28425, 1.31812, 1.33858, 1.29287, 1.39692, 1.37773, 1.33858,
1.33322, 1.28179, 1.30036, 1.43697, 1.39752, 1.27637, 1.39752,
1.41747, 1.35733, 1.28405, 1.37773, 1.35898, 1.42948, 1.32774,
1.42948, 1.349, …
formattable has some simple options for formatting tables, for example:
library(shiny)
library(DT)
library(formattable)
df <- formattable(iris, lapply(1:4, function(col){
area(col = col) ~ color_tile("red", "green")
}))
This can later be converted to a DT datatable:
df <- as.datatable(df)
For me, this is perfect to view in the RStudio Viewer. However, I would like to deploy it somehow as a Shiny app. Full code:
library(DT)
library(shiny)
ui <- fluidPage(
DT::dataTableOutput("table1"))
server <- function(input, output){
df <- formattable(iris, lapply(1:4, function(col){
area(col = col) ~ color_tile("red", "green")
}))
df <- as.datatable(df)
output$table1 <- DT::renderDataTable(DT::datatable(df))
}
shinyApp(ui, server)
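This does not work. Is there any workaround? I like the conditional formatting of formattable, but I would also like to use some of the options DT offers, such as filtering, searching, colvis, etc.
To deploy it as a formattable only, there is a thread:
I have an Rscript file (Main_Script.R) that runs as a scheduled job in Windows Task Scheduler every 30 minutes. In Main_Script.R I have about 13 scripts that run every 30 minutes.
I want to send an email from R whenever an iteration fails or gets stuck. I am using the sendmailR package. I have seen a post on SO, "how to send email with attachment from R in windows", about how to send email from R on Windows.
But I am not sure how to send an email automatically with the error message when a scheduled task iteration fails or gets stuck.
My Main_Script.R sources the 13 scripts: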
source(paste(rootAddress,"Scripts/Part1.R",sep =''))
source(paste(rootAddress,"Scripts/Part2.R",sep =''))
:
:
:
:
source(paste(rootAddress,"Scripts/Part13.R",sep =''))
My scheduled task looks like the following, with a log file:
"D:\xxx\R-3.0.2\bin\x64\Rscript.exe" "D:\xx\Batch_Processing\Batch_Processing_Run\Scripts\Main_Test.R" >> "D:\XXX\Batch_Processing\Batch_Processing_Run\error.txt" 2>&1
Update:
When a script encounters an error, it should trigger an email containing the error message and the script name or number, to indicate which of the 13 scripts failed, and send it to a mail ID.
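The behaviour described in the update can be sketched by wrapping each source() call in tryCatch(), so that the failing script's name and error message are available to whatever sends the mail. `run_scripts()` and `collect()` are hypothetical names introduced for illustration. In a real setup the `notify` argument would be a function that sends the message, for instance through the sendmailR package mentioned above; the demo just records the failure in a list.

```r
# Hypothetical helper: sources each script in turn and reports the first failure.
# notify(script, message) is supplied by the caller (e.g. a mail-sending wrapper).
run_scripts <- function(paths, notify) {
  for (p in paths) {
    ok <- tryCatch({
      source(p)
      TRUE
    }, error = function(e) {
      notify(script = basename(p), message = conditionMessage(e))
      FALSE
    })
    if (!ok) return(invisible(FALSE))  # stop at the first failing script
  }
  invisible(TRUE)
}

# Demo with two throwaway scripts; the second one fails on purpose.
script_dir <- tempfile("scripts")
dir.create(script_dir)
writeLines("x <- 1", file.path(script_dir, "Part1.R"))
writeLines("stop('boom')", file.path(script_dir, "Part2.R"))

# Stand-in for the mail step: record what would have been sent.
failures <- list()
collect <- function(script, message) {
  failures[[length(failures) + 1]] <<- list(script = script, message = message)
}

finished <- run_scripts(file.path(script_dir, c("Part1.R", "Part2.R")),
                        notify = collect)
```

tryCatch() only sees errors that R actually raises. A script that hangs without erroring is not caught this way and would need a time limit, for example via base R's setTimeLimit(), or an external watchdog.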
I am trying to write a new column, duration_probablity, that gives the probability of the value falling between 6 and 12 hours: P(6 < Origin_Duration ≤ 12)
dput(df)
structure(list(CRD_NUM = c(1000120005478330, 1000130009109199,
1000140001635234, 1000140002374747, 1000140003618308, 1000140007236959,
1000140015078086, 1000140026268650, 1000140027281272, 1000148000012215
), Origin_Duration = c("10:48:38", "07:41:34", "11:16:41", "09:19:35",
"17:09:19", "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05"
)), .Names = c("CRD_NUM", "Origin_Duration"), class = c("data.table",
"data.frame"), row.names = c(NA, -10L))
CRD_NUM Origin_Duration
1: 1000120005478330 10:48:38
2: 1000130009109199 07:41:34
3: 1000140001635234 11:16:41
4: 1000140002374747 09:19:35
5: 1000140003618308 17:09:19
6: 1000140007236959 08:59:05
7: 1000140015078086 11:27:28
8: 1000140026268650 12:17:41
9: 1000140027281272 10:45:42
10: 1000148000012215 12:19:05
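I am not sure how to do this in R. I am trying to get the cumulative distribution function of the standard normal distribution: the probability that a commuter stays at a given station for between 6 and 12 hours. For example, the output for a duration of 11:16:41 would be 0.96 …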
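One reading of the question, sketched here under stated assumptions: convert the "HH:MM:SS" strings to hours, fit a normal distribution using the sample mean and standard deviation, and take the difference of the CDF at 12 and at 6. The question is truncated, so the intended model may differ. `to_hours()` is a hypothetical helper, and the durations are the ten values from the dput output above.

```r
# Hypothetical helper: "HH:MM:SS" strings to decimal hours.
to_hours <- function(hms) {
  parts <- do.call(rbind, lapply(strsplit(hms, ":", fixed = TRUE), as.numeric))
  parts[, 1] + parts[, 2] / 60 + parts[, 3] / 3600
}

durations <- c("10:48:38", "07:41:34", "11:16:41", "09:19:35", "17:09:19",
               "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05")
hours <- to_hours(durations)

# Assumption: durations are roughly normal with these sample parameters.
mu    <- mean(hours)
sigma <- sd(hours)

# P(6 < X <= 12) under the fitted normal distribution.
p_normal <- pnorm(12, mean = mu, sd = sigma) - pnorm(6, mean = mu, sd = sigma)

# Model-free comparison: share of observed durations inside the interval.
p_empirical <- mean(hours > 6 & hours <= 12)
```

This yields one probability for the whole distribution rather than one per row. A per-row column would need a different definition, such as `pnorm(hours, mu, sigma)` for the probability of staying no longer than each observed duration.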
I have a dataset, Data, as follows:
dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L,
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567",
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL",
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"),
IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN",
"BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN",
"Values", "IND"), class …