Tag: large-data-volumes

MySQL: running against many rows using a long list of composite primary keys

What is a good way to operate on many rows in MySQL when I have a long list of keys in a client application connected via ODBC?

Note: my experience is mostly with SQL Server, so I know a fair amount in general, just not MySQL specifically.

The task is to delete some rows from 9 tables, but I may have more than 5,000 key pairs.

I started with the naive approach of looping through all the keys and issuing one statement per key against each table, e.g.:

DELETE FROM Table WHERE Key1 = 123 AND Key2 = 567 -- and 8 more tables
DELETE FROM Table WHERE Key1 = 124 AND Key2 = 568 -- and 8 more tables
DELETE FROM Table WHERE Key1 = 125 AND Key2 = 569 -- and 8 more tables
...

That adds up to 45,000 individual statements, and as you can imagine it is rather slow.

So, without worrying about which programming language I'm using on the front end, what is a good way to submit the list so that I can join against it and perform the operation all at once, or at least in large batches? Here is what I've thought of:

  • Create a temporary table and insert into it, then join against it. I'm happy to look up MySQL's syntax for creating a temporary table, but is this a good route to take?

  • Assuming I use a temp table, what is the best way to populate it? 5,000 `INSERT INTO Table VALUES ()` statements? `SELECT 123, 456 UNION ALL SELECT 124, 457`? I've just tested that MySQL allows this kind of SELECT that isn't issued against a table. But SQL Server eventually blows up if the list gets too long, so is this a good approach in MySQL? Should I keep the list to a few hundred at a time?

    --CREATE …
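A set-based alternative that avoids the temp table entirely is to batch the key pairs into multi-row tuple `IN` predicates, which MySQL supports via row constructors. A minimal sketch that only builds the SQL text, under the assumption of hypothetical table/column names `Table1`, `Key1`, `Key2`:

```python
def batched_deletes(table, keys, chunk=500):
    """Yield DELETE statements that each cover up to `chunk` key pairs.

    MySQL accepts row constructors in IN lists:
      DELETE FROM t WHERE (Key1, Key2) IN ((1, 2), (3, 4), ...)
    so 45,000 single-key statements collapse into a handful of batches.
    """
    for i in range(0, len(keys), chunk):
        tuples = ", ".join(f"({k1}, {k2})" for k1, k2 in keys[i:i + chunk])
        yield f"DELETE FROM {table} WHERE (Key1, Key2) IN ({tuples})"

pairs = [(123, 567), (124, 568), (125, 569)]
stmts = list(batched_deletes("Table1", pairs, chunk=2))
# two statements: one covering two pairs, one covering the last pair
```

In a real client you would send these through parameterized queries rather than interpolated strings; the string building here is only to show the batch shape, and each batch would be repeated for the 9 tables.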

mysql stored-procedures list large-data-volumes set-based

5 votes · 1 answer · 549 views

Practical size limits of an RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never worked on a project that needed tables this large. I have proven that at least one development environment cannot cope, at the database tier, with the processing required by the complex queries against the views the application tier generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).

The RDBMS I have tested is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured the production hardware is superior to the dev and staging hardware, but I just don't believe it will cope with the volume of data and the complexity of the queries.

Before the dev environment failed, it was taking more than five minutes to return a small dataset (several hundred rows) produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.

My gut feeling is that the database architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.

Now for my question. I am assured by people who claim to have experience with this sort of thing (which I do not) that my concerns are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware), or are we in the territory of technologies like Google's BigTable?

I'm hoping for answers from people who have actually had to work with this sort of volume at a non-theoretical level.

The nature of the data is financial transactions (dates, amounts, geographic locations, businesses), so almost all data types are represented. All the reference data is normalized, hence the multiple joins.

sql rdbms large-data-volumes

5 votes · 1 answer · 2628 views

Creating a large sitemap on Google App Engine?

I have a site with roughly 100,000 unique pages.

(1) How do I create a Sitemap for all of these links? Should I just list them in one large sitemap-protocol-compliant file?

(2) This needs to be implemented on Google App Engine, which has a 1,000-item query limit, and all of my individual site URLs are stored as separate entries. How do I get around this?
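On (1): the sitemap protocol caps a single file at 50,000 URLs, so 100,000 pages need at least two sitemap files plus a sitemap index file that lists them. A minimal sketch of the splitting and index generation (example.com URLs and file names are hypothetical):

```python
def chunked(seq, size):
    """Split a list into consecutive slices of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def sitemap_file(urls):
    """Render one sitemap file (protocol limit: 50,000 URLs per file)."""
    body = "\n".join(f"  <url><loc>{u}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{body}\n</urlset>")

def sitemap_index(sitemap_urls):
    """Render the sitemap index that points at the individual files."""
    body = "\n".join(f"  <sitemap><loc>{u}</loc></sitemap>" for u in sitemap_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{body}\n</sitemapindex>")

urls = [f"http://example.com/page/{i}" for i in range(100_000)]
files = [sitemap_file(c) for c in chunked(urls, 50_000)]
index = sitemap_index(f"http://example.com/sitemap-{n}.xml"
                      for n in range(len(files)))
```

On (2): the usual workaround for the old 1,000-result limit was key-based paging (repeatedly query for entities with `__key__` greater than the last key fetched), which pairs naturally with emitting one 50,000-URL chunk at a time.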

sitemap google-app-engine large-data-volumes

4 votes · 1 answer · 1355 views

Common Lisp: what are the downsides of using this filter function on very large lists?

I want to filter out all elements of list 'a from list 'b and return the filtered 'b. Here is my function:

(defun filter (a b)
  "Filters out all items in a from b"
    (if (= 0 (length a)) b
      (filter (remove (first a) a) (remove (first a) b))))

I am new to Lisp and don't know how `remove` works under the hood; what is the running time of this filter?
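Each call to `remove` walks its whole list, and the function recurses once per distinct element of `a`, so the cost is roughly proportional to len(a) × (len(a) + len(b)), i.e. quadratic on large inputs. A hash-set version does the same job in linear time; sketched here in Python (several other questions on this page use Python) rather than Lisp:

```python
def filter_out(a, b):
    """Return the elements of b that do not appear in a.

    Building a set makes each membership test O(1) on average,
    so the whole filter is O(len(a) + len(b)) instead of quadratic.
    """
    exclude = set(a)
    return [x for x in b if x not in exclude]

filter_out([2, 4], [1, 2, 3, 4, 5])  # -> [1, 3, 5]
```

Common Lisp itself offers `set-difference` for this, though like the recursive version it compares pairwise unless given a hash-backed strategy.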

lisp large-data-volumes common-lisp filter

4 votes · 1 answer · 976 views

Fetching only N rows at a time (MySQL)

I am looking for a way to fetch all the data from a large table in smaller chunks.

Please advise.
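The two common patterns are `LIMIT n OFFSET m` (simple, but the server still scans past the skipped rows, so later pages get slower) and keyset pagination, where each query resumes from the last primary key seen. A sketch of the keyset approach, using sqlite3 as a stand-in for MySQL and a hypothetical table `big_table` with an integer primary key `id`; the same `WHERE id > ? ORDER BY id LIMIT ?` shape applies in MySQL:

```python
import sqlite3

def fetch_in_chunks(conn, chunk=2):
    """Keyset pagination: resume each query from the last PK seen."""
    last_id = 0  # assumes a monotonically increasing integer PK
    while True:
        rows = conn.execute(
            "SELECT id, val FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk)).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # remember where this chunk ended

# toy data: five rows fetched two at a time
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 6)])
rows = list(fetch_in_chunks(conn))
```

Because each chunk is bounded by the index on the primary key, the cost per chunk stays flat no matter how deep into the table you are.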

mysql sql large-data-volumes query-optimization

4 votes · 1 answer · 2419 views

Storing a large number of graph data structures in a database

This question asks about storing a single graph in a relational database. In that case the solution is clear: one table for nodes, one table for edges.

I have a graph data structure that evolves over time, so I would like to store "snapshots" of the graph in the database. I expect to have hundreds of such snapshots.

One solution is to create a brand-new pair of node and edge tables (as above) for each snapshot. Is there a better solution?

Edit: I was asked what I want to do with the database. I believe I won't run any queries other than dumping all the graphs from C++ into MySQL and later loading them all back into C++ data structures. So I want MySQL for storage rather than for efficient random access/search.
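One common alternative to a table pair per snapshot is a single shared pair of tables with a `snapshot_id` column, so the schema stays fixed as snapshots accumulate. A sketch using sqlite3 as a stand-in for MySQL (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (snapshot_id INTEGER, node_id INTEGER,
                    PRIMARY KEY (snapshot_id, node_id));
CREATE TABLE edges (snapshot_id INTEGER, src INTEGER, dst INTEGER,
                    PRIMARY KEY (snapshot_id, src, dst));
""")

def save_snapshot(conn, snap, nodes, edges):
    """Dump one graph snapshot into the shared tables."""
    conn.executemany("INSERT INTO nodes VALUES (?, ?)",
                     [(snap, n) for n in nodes])
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                     [(snap, s, d) for s, d in edges])

def load_snapshot(conn, snap):
    """Load one snapshot back as (node list, edge list)."""
    nodes = [n for (n,) in conn.execute(
        "SELECT node_id FROM nodes WHERE snapshot_id = ?", (snap,))]
    edges = conn.execute(
        "SELECT src, dst FROM edges WHERE snapshot_id = ?", (snap,)).fetchall()
    return nodes, edges

save_snapshot(conn, 1, [1, 2, 3], [(1, 2), (2, 3)])
save_snapshot(conn, 2, [1, 2], [(1, 2)])
```

Since the primary keys lead with `snapshot_id`, a whole-snapshot dump or load is one contiguous index range, which suits the store-everything/load-everything access pattern described in the edit.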

mysql database computer-science graph-theory large-data-volumes

4 votes · 1 answer · 1469 views

How do I count the number of lines in a large CSV file with Perl?

I have to use Perl in a Windows environment at work, and I need to be able to find out how many lines a large CSV file contains (around 1.4 GB). Any idea how to do this with minimal waste of resources?

Thanks

PS: This must be done within a Perl script; we are not allowed to install any new modules on the system.
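The usual low-memory approach is to read the file in fixed-size binary blocks and count newline characters, so memory use stays at one block regardless of file size. Sketched here in Python for testability; the same loop translates directly to core Perl (no modules) as roughly `$count += ($buf =~ tr/\n//)` inside a `while (read($fh, $buf, 1 << 20))` loop:

```python
import os
import tempfile

def count_lines(path, block_size=1 << 20):
    """Count lines by scanning fixed-size blocks; memory use is
    bounded by block_size no matter how large the file is."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                return count
            count += block.count(b"\n")

# quick self-check on a small temporary file
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".csv") as tmp:
    tmp.write(b"a,b\nc,d\ne,f\n")
n = count_lines(tmp.name)
os.unlink(tmp.name)
```

One caveat either way: a final line without a trailing newline is not counted by this method, so add one if the last block doesn't end in a newline and that case matters.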

csv perl large-data-volumes

3 votes · 2 answers · 8010 views

What is the best way to transfer a large dataset over an ASMX web service?

I have inherited a C#.NET application that talks to a web service, and the web service talks to an Oracle database. I need to add an export feature to the UI to produce an Excel spreadsheet of some of the data.

I have created a web service function that runs a database query, loads the data into a DataTable and returns it, and this works fine for small numbers of rows. However, a full run has enough data that the client application locks up for a few minutes and then returns a timeout error. Clearly this is not the best way to retrieve such a large dataset.

Before I go off and come up with some dodgy way of splitting the call, I'd like to know whether something already exists that can handle this. At the moment I'm thinking of a startExport function followed by repeated calls to a next50Rows function until there is no data left, but because the web service is stateless that would mean I'd have to keep some kind of ID number around and deal with the associated permissions. It would mean I wouldn't have to load the whole dataset into the web server's memory, which is one good thing.

So if anyone knows a better way to retrieve large amounts of data (in a tabular format) over an ASMX web service, please let us know!
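One way to keep the service stateless without a server-side export ID is to have the client pass back the last row key it received, keyset-style, so the server holds nothing between calls. The paging logic is language-independent; a sketch in Python with a toy in-memory source standing in for the Oracle query (the function names `next_page` and `fetch_after` are hypothetical, not ASMX API):

```python
def next_page(fetch_after, last_key, page_size=50):
    """Stateless page fetch: the client supplies the last key it saw,
    so the server keeps no per-export session state.

    fetch_after(key, limit) stands in for a query of the shape
    'SELECT ... WHERE pk > :key ORDER BY pk FETCH FIRST :limit ROWS ONLY'.
    """
    rows = fetch_after(last_key, page_size)
    new_last = rows[-1][0] if rows else last_key
    return rows, new_last

# toy data source with integer primary keys 1..7
data = [(i, f"row{i}") for i in range(1, 8)]
def fetch_after(key, limit):
    return [r for r in data if r[0] > key][:limit]

pages, last = [], 0
while True:
    rows, last = next_page(fetch_after, last, page_size=3)
    if not rows:
        break
    pages.append(rows)
# pages of sizes 3, 3, 1
```

Because each call is self-describing, a retry after a timeout simply resends the same last key; permissions are checked per call like any other request rather than per export session.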

.net web-services large-data-volumes asmx

3 votes · 1 answer · 5186 views

Updating a column from another table in a large MySQL database (7 million rows)

Description

I have 2 tables with the following structure (irrelevant columns removed):

mysql> explain parts;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| code        | varchar(32)  | NO   | PRI | NULL    |       |
| slug        | varchar(255) | YES  |     | NULL    |       |
| title       | varchar(64)  | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> explain details;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra …

mysql large-data-volumes

3 votes · 1 answer · 4122 views

Need to compare very large files (around 1.5 GB) in Python

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"

The above is sample data. The data is sorted by email address and the file is very large, around 1.5 GB.

I want output like this in another csv file:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH@GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days

That is, on the first occurrence I need to append 1, on the second occurrence 2, and so on; in other words, I need to count the number of occurrences of each email address in the file, and if an email appears two or more times I want the difference between the dates. Bear in mind that the dates are not sorted, so we have to sort them for each particular email address. I'm looking for a solution in Python, using the numpy or pandas libraries or any other library that can handle this kind of large data without throwing an out-of-memory exception. I have a dual-core processor running CentOS 6.3 with 4 GB of RAM.
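Because the file is already sorted by email address, it can be processed one email group at a time with `itertools.groupby`, so peak memory is bounded by the largest single group rather than the 1.5 GB file. A sketch under the assumptions visible in the sample above: email in column 2, a DDMONYYYY date in column 4, dates sorted per group before numbering:

```python
import csv
import io
from datetime import datetime
from itertools import groupby

def annotate(src, dst):
    """Append occurrence number and day gap to each row, streaming one
    email group at a time (input must already be sorted by email)."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for email, rows in groupby(reader, key=lambda r: r[1]):
        # sort this email's rows by date (dates within a group are unsorted)
        group = sorted(rows, key=lambda r: datetime.strptime(r[3], "%d%b%Y"))
        prev = None
        for n, row in enumerate(group, 1):
            d = datetime.strptime(row[3], "%d%b%Y")
            gap = (d - prev).days if prev else 0
            writer.writerow(row + [n, f"{gap} days"])
            prev = d

# demo on the three 0001HARISH rows from the sample above
sample = '''"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","09DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF252022031180","22DEC2010","B2C","3439"
'''
out = io.StringIO()
annotate(io.StringIO(sample), out)
```

In production the `io.StringIO` objects would be the open input and output files; nothing beyond one email's rows is ever held in memory, which should fit comfortably in 4 GB without needing numpy or pandas at all.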

python csv numpy large-data-volumes pandas

3 votes · 2 answers · 2905 views