Posts by Tak*_*aka

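Best strategy for processing large CSV files in Apache Camel

I'd like to develop a route that polls a directory containing CSV files and, for each file, unmarshals each row using Bindy and queues it in ActiveMQ.

The problem is that the files can be very large (a million rows), so I'd prefer to queue one row at a time. Instead, at the end of the Bindy step I get all the rows in a single java.util.ArrayList, which causes memory problems.

So far I have a small test and the unmarshalling works, so the annotation-based Bindy configuration is OK.

Here is the route: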

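// Poll data/inbox every 5 s, at most one file per poll; noop=true leaves the file in place.
from("file://data/inbox?noop=true&maxMessagesPerPoll=1&delay=5000")
  // Bindy unmarshals the whole file body, which yields one java.util.ArrayList of all rows.
  .unmarshal()
  .bindy(BindyType.Csv, "com.ess.myapp.core")
  // The complete list is then sent to the queue as a single message.
  .to("jms:rawTraffic");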
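
The one-row-at-a-time behaviour asked for above can be illustrated outside Camel. This is a minimal Python sketch, not Camel or Bindy code: `queue.Queue` stands in for the ActiveMQ queue, and `enqueue_rows` and the sample columns are hypothetical names chosen for the example. It only shows that reading lazily keeps one row in memory at a time instead of building a full list first.

```python
import csv
import io
import queue


def enqueue_rows(csv_file, out_queue):
    """Read csv_file lazily and put one parsed row at a time on out_queue.

    csv.reader pulls lines from the file object on demand, so no list of
    all rows is ever built. Returns the number of rows enqueued.
    """
    count = 0
    for row in csv.reader(csv_file):
        out_queue.put(row)  # stand-in for sending one message per row
        count += 1
    return count


# Hypothetical sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("id,value\n1,alpha\n2,beta\n")
rows = queue.Queue()  # in-memory stand-in for the ActiveMQ queue
sent = enqueue_rows(sample, rows)
print(sent)
```

In Camel the analogous idea would be to split the file into lines before unmarshalling each one; the Splitter EIP has a streaming mode that appears intended for this kind of case, though whether it fits this exact route is not confirmed here.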


Environment: Eclipse Indigo, Maven 3.0.3, Camel 2.8.0

Thanks

apache-camel

Tak*_*aka

lucky-day

14
votes

1
answer

10k
views

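Error configuring Bitbucket/GitHub as an external repository in Google Cloud Source Repositories: "Failed to connect repository"

I have been trying, without success, to configure a repository I created on Bitbucket (BB) as an external repository on GCP.

I created a new repo on BB, granted admin rights to the same user I use for GCP, and logged in to both sites with the same browser.

On GCP, under "Connect external repository", I followed the steps in [1] and I can see my BB account and its repositories on GCP. I select the right one and, as the last step, press "Connect selected repository". I get "Connecting repository..." for a while, but eventually "Failed to connect repository", with no further information.

I read [2] but still no luck.

I suspect I may be missing something that has to be done on the BB side.

Any help? Thanks a lot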

[1] https://cloud.google.com/source-repositories/docs/mirroring-a-bitbucket-repository

[2] Problems connecting to BitBucket using Cloud Build and Source Repositories

bitbucket google-cloud-platform

Tak*_*aka

2019-09-03

6
votes

2
answers

1762
views

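What GridSearchCV.best_score_ means when scoring is set to 'accuracy' and CV is used

I'm trying to find the best neural network model for classifying breast cancer samples on the well-known Wisconsin Cancer dataset (569 samples, 31 features + target). I'm using sklearn 0.18.1. I'm not using normalization so far; I'll add it once I have this question sorted out.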

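# some init code omitted
from sklearn.model_selection import train_test_split
# Default test_size=0.25: 75% of the rows for training, 25% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y)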


Define the NN params for GridSearchCV

tuned_params = [{'solver': ['sgd'], 'learning_rate': ['constant'], "learning_rate_init" : [0.001, 0.01, 0.05, 0.1]},
                {"learning_rate_init" : [0.001, 0.01, 0.05, 0.1]}]
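
For reference, a list of dicts like `tuned_params` is expanded one dict at a time, and the candidates from each dict are then concatenated. The sketch below re-creates that expansion with the standard library only; `expand_grid` is a hypothetical helper written for this example (scikit-learn ships its own `ParameterGrid` for the same purpose). It shows where the figure of 8 configurations comes from: 4 from the first dict plus 4 from the second.

```python
from itertools import product

tuned_params = [
    {"solver": ["sgd"], "learning_rate": ["constant"],
     "learning_rate_init": [0.001, 0.01, 0.05, 0.1]},
    {"learning_rate_init": [0.001, 0.01, 0.05, 0.1]},
]


def expand_grid(param_grid):
    """Return every parameter combination described by a list of dict grids."""
    candidates = []
    for grid in param_grid:
        keys = sorted(grid)
        # Cartesian product of the value lists within this one dict.
        for values in product(*(grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


candidates = expand_grid(tuned_params)
print(len(candidates))
for params in candidates:
    print(params)
```

Parameters not named in a dict (for example `solver` in the second one) are left at the estimator's defaults for those candidates.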


CV method and model

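from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
cv_method = KFold(n_splits=4, shuffle=True)  # 4 shuffled folds, each held out once
model = MLPClassifier()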
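
To make the fold structure concrete, here is a rough standard-library sketch of a shuffled 4-fold split. `kfold_indices` is a hypothetical helper, not scikit-learn's `KFold`, and the figure of 426 training rows is an assumption: it is what a `test_size=0.25` split of 569 samples would leave if the test size is rounded up.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 426 is the assumed training-set size (569 samples, test_size=0.25).
folds = list(kfold_indices(426, 4))
print([len(test) for _, test in folds])
```

Every sample lands in exactly one held-out fold, so each candidate model is fitted four times and scored on data it did not see during that fit.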


Apply the grid

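from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=tuned_params, cv=cv_method, scoring='accuracy')
grid.fit(X_train, y_train)
# predict() uses the best estimator refit on all of X_train (refit=True is the default).
y_pred = grid.predict(X_test)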


If I run:

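from sklearn.metrics import accuracy_score
print(grid.best_score_)                # cross-validated score computed on the training data
print(accuracy_score(y_test, y_pred))  # accuracy on the held-out test set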


the results are 0.746478873239 and 0.902097902098
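
As a quick sanity check on these two numbers, the arithmetic below tests whether they are consistent with whole counts of correct predictions. It assumes the default `test_size=0.25` with the test size rounded up, which would give 143 test rows and 426 training rows out of 569; that rounding rule is an assumption of this sketch, not something stated in the post.

```python
import math

n_samples = 569
n_test = math.ceil(0.25 * n_samples)  # assumed rounding rule for the test split
n_train = n_samples - n_test

checks = []
for score, n in [(0.746478873239, n_train), (0.902097902098, n_test)]:
    correct = round(score * n)  # nearest whole number of correct predictions
    checks.append((correct, n, correct / n))
    print(correct, n, correct / n)
```

Under that assumption the first figure matches a fraction of the 426 training rows and the second a fraction of the 143 test rows, which fits the two numbers being measured on different data: one during cross-validation on the training set, the other on the held-out test set.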

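According to the docs: "best_score_ : float, Score of best_estimator on the left out data." I assume this is the best accuracy among those obtained by running the 8 different configurations specified in tuned_params, each the number of times specified by KFold, on the left-out data as specified by KFold. Am I right?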
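
One way to picture the quantity being asked about: for each candidate, the held-out scores of its folds are averaged, and `best_score_` is that average for the winning candidate, not the single best fold. The sketch below imitates this with the standard library only. Everything in it is hypothetical: the toy data, the one-feature midpoint classifier, the `shift` hyperparameter and the unshuffled fold assignment are stand-ins for MLPClassifier, its parameters and KFold.

```python
from statistics import mean


def fit_midpoint(xs, ys, shift):
    """'Train' a 1-D classifier: midpoint of the two class means plus shift."""
    mean0 = mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2 + shift


def accuracy(threshold, xs, ys):
    """Fraction of samples where (x > threshold) matches the label."""
    hits = sum(int(x > threshold) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def cross_val_mean(xs, ys, shift, n_splits):
    """Mean held-out accuracy of one candidate over n_splits folds."""
    scores = []
    for k in range(n_splits):
        test = [i for i in range(len(xs)) if i % n_splits == k]
        train = [i for i in range(len(xs)) if i % n_splits != k]
        # Fit on the training folds only, then score on the held-out fold.
        threshold = fit_midpoint([xs[i] for i in train], [ys[i] for i in train], shift)
        scores.append(accuracy(threshold, [xs[i] for i in test], [ys[i] for i in test]))
    return mean(scores)


# Hypothetical toy data: one feature, binary label.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.45, 0.6, 0.55, 0.4, 0.15, 0.85]
ys = [0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1]

mean_scores = {shift: cross_val_mean(xs, ys, shift, n_splits=4) for shift in (-0.2, 0.0, 0.2)}
best_shift = max(mean_scores, key=mean_scores.get)
best_score = mean_scores[best_shift]  # analogue of grid.best_score_
print(mean_scores, best_shift, best_score)
```

In scikit-learn the per-candidate averages are exposed as `grid.cv_results_['mean_test_score']`, so comparing that array with `grid.best_score_` is a direct way to check this reading on the real data.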

One more question. Is there a way to find the best test data size to use in train_test_split, which defaults to 0.25?

Thanks a lot


python pandas scikit-learn cross-validation grid-search

Tak*_*aka

2018-12-29

1
vote

1
answer

6960
views

