I am trying to migrate some Oracle DB tables to the cloud (Snowflake), and I'd like to know the best way to produce .csv files from those tables. I have about 200 tables, some with more than 30M records, and I want to export the data in bulk.
So my scenario: quickly get CSV exports of a 300 GB Oracle DB and store them in S3 for Spark/Hive analysis. SQL*Plus spool is very slow, and SQL Developer is very slow too. OK, so what next?
https://github.com/hyee/OpenCSV
Super fast. Here is an example of how to use it; you need to register the Oracle JDBC driver jar for your Oracle DB:
```java
package com.company;

import com.opencsv.CSVWriter;
import com.opencsv.ResultSetHelperService;

import java.sql.*;

public class Main {

    public static void main(String[] args) throws Exception {
        // Step 1: load the Oracle JDBC driver class
        Class.forName("oracle.jdbc.driver.OracleDriver");

        // Step 2: create the connection object
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:port:service_name",
                "ora_user", "password");

        // Step 3: create the statement object
        Statement stmt = con.createStatement();

        // Step 4: execute the query
        ResultSet rs = stmt.executeQuery("select c1, c2, c3 from my_table");

        String fileName = "C:\\Temp\\output.csv";
        boolean async = true;

        // Step 5: stream the ResultSet into the CSV file
        try (CSVWriter writer = new CSVWriter(fileName)) {
            // Fetch size (default 30000 rows); higher is faster but takes more memory
            ResultSetHelperService.RESULT_FETCH_SIZE = 50000;
            // Max rows to extract; -1 means unlimited
            ResultSetHelperService.MAX_FETCH_ROWS = -1;
            writer.setAsyncMode(async);
            int result = writer.writeAll(rs, true);
            System.out.println("Result: " + (result - 1));
        }
        con.close();
    }

    // Extract a ResultSet to a CSV file; auto-compresses if the file name
    // ends in ".zip" or ".gz". Returns the number of records extracted.
    public static int ResultSet2CSV(final ResultSet rs, final String fileName,
                                    final String header, final boolean async) throws Exception {
        try (CSVWriter writer = new CSVWriter(fileName)) {
            // Fetch size (default 30000 rows); higher is faster but takes more memory
            ResultSetHelperService.RESULT_FETCH_SIZE = 10000;
            // Max rows to extract; -1 means unlimited
            ResultSetHelperService.MAX_FETCH_ROWS = 20000;
            writer.setAsyncMode(async);
            int result = writer.writeAll(rs, true);
            return result - 1;
        }
    }
}
```

Another quick solution, though I still think it is slower than the one above, is to use Spark directly:
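The core idea in the Java code above is simply "fetch in batches, write compressed as you go". As a minimal, database-free sketch of that same pattern in Python (using an in-memory SQLite table as a stand-in for the Oracle source, and gzip to mimic OpenCSV's auto-compression; the function name `resultset_to_csv_gz` is my own):

```python
import csv
import gzip
import sqlite3

def resultset_to_csv_gz(cursor, file_name, fetch_size=10000):
    """Stream a DB cursor into a gzip-compressed CSV in fetch_size chunks."""
    rows_written = 0
    with gzip.open(file_name, "wt", newline="") as f:
        writer = csv.writer(f)
        # Header row from the cursor metadata
        writer.writerow([d[0] for d in cursor.description])
        while True:
            batch = cursor.fetchmany(fetch_size)
            if not batch:
                break
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written

# Stand-in for the Oracle source: an in-memory SQLite table
con = sqlite3.connect(":memory:")
con.execute("create table emp (empno integer, ename text)")
con.executemany("insert into emp values (?, ?)",
                [(i, f"name{i}") for i in range(100000)])
cur = con.execute("select empno, ename from emp")
n = resultset_to_csv_gz(cur, "/tmp/emp.csv.gz", fetch_size=30000)
print(n)  # → 100000
```

The fetch size is the same knob as `RESULT_FETCH_SIZE` above: larger batches mean fewer round trips but more memory per batch.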
```python
query = "(select empno, ename, dname from emp, dept where emp.deptno = dept.deptno) emp"
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
empDF.printSchema()
empDF.show()

# Write to S3 (pick one format: orc, parquet, or csv with gzip compression)
empDF.write.format("parquet").save("s3://bucketname/key/")
```

Of course you can repartition and apply some other optimizations.
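The main Spark-side optimization for a big table is to parallelize the JDBC read itself with the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options, which split the scan into per-partition range predicates. A simplified illustration of how such ranges can be generated (my own sketch of the idea, not Spark's exact internal algorithm; the column name `empno` is just an example):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into num_partitions WHERE-clause ranges,
    roughly mimicking how a partitioned JDBC reader parallelizes a scan."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also catches NULLs
            preds.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended so no rows are dropped
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

for p in partition_predicates("empno", 1, 10000, 4):
    print(p)
```

Each predicate becomes a separate query run by a separate task, so a 30M-row table can be pulled over several connections at once instead of through a single cursor.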