Importing XML files into PostgreSQL

Tom*_*eif 12 xml postgresql bash

I want to import a lot of XML files into the table xml_data:

create table xml_data(result xml);

For this I have a simple bash script with a loop:

#!/bin/sh
FILES=/folder/with/xml/files/*.xml
for f in $FILES
do
  psql -d mydb -h myhost -U usr -c "\copy xml_data from '$f'"
done

However, this tries to import each line of each file as a separate row, which leads to this error:

ERROR:  invalid XML content
CONTEXT:  COPY address_results, line 1, column result: "<?xml version="1.0" encoding="UTF-8"?>"

I understand why it fails, but I cannot figure out how to make \copy import a whole file at once into a single row.
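
One possible client-side approach, sketched under a few assumptions: psql can capture a file's contents in a variable with \set and backticks, and :'content' then interpolates that variable as a properly quoted SQL literal (this interpolation applies to scripts read from stdin or a file, not to -c commands). The function name import_xml_files, the PSQL override, and the XML_FILE environment variable are hypothetical names introduced here; the connection flags are the ones from the question.

```shell
#!/bin/sh
# Hypothetical helper: insert every *.xml file in a directory as ONE row each.
# PSQL may be overridden (e.g. for a dry run); it defaults to the real psql.
import_xml_files() {
    dir=$1
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # XML_FILE is passed through the environment so the path is never
        # pasted into SQL text; the backticks are run by psql, not by this shell.
        XML_FILE=$f "${PSQL:-psql}" -d mydb -h myhost -U usr -v ON_ERROR_STOP=1 <<'SQL'
\set content `cat "$XML_FILE"`
INSERT INTO xml_data(result) VALUES (XMLPARSE(DOCUMENT :'content'));
SQL
    done
}
```

Calling import_xml_files /folder/with/xml/files would then run one INSERT per file. The file is read on the client, so no superuser rights or server-side file access should be needed, but each document is held in memory as a single value.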

Ste*_*ger 14

Necromancing: for those who need a working example:

DO $$
   DECLARE myxml xml;
BEGIN

myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'));

DROP TABLE IF EXISTS mytable;
CREATE TEMP TABLE mytable AS 

SELECT 
     (xpath('//ID/text()', x))[1]::text AS id
    ,(xpath('//Name/text()', x))[1]::text AS Name 
    ,(xpath('//RFC/text()', x))[1]::text AS RFC
    ,(xpath('//Text/text()', x))[1]::text AS Text
    ,(xpath('//Desc/text()', x))[1]::text AS Desc
FROM unnest(xpath('//record', myxml)) x
;

END$$;


SELECT * FROM mytable;

Or, with less noise:

SELECT 
     (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
    ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
    ,myTempTable.myXmlColumn as myXmlElement
FROM unnest(
    xpath
    (    '//record'
        ,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'))
    )
) AS myTempTable(myXmlColumn)
;

Using this example XML file (MyData.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
    <record>
        <ID>1</ID>
        <Name>A</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Address record</Text>
        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
    </record>
    <record>
        <ID>2</ID>
        <Name>NS</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Name server record</Text>
        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
    </record>
</data-set>

Note:
MyData.xml needs to be in the PG_Data directory (the parent directory of the pg_stat directory).
For example, /var/lib/postgresql/9.3/main/MyData.xml
This requires PostgreSQL 9.1+.

You can also do it without a file at all, like this:

SELECT 
     (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
    ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
    ,myTempTable.myXmlColumn as myXmlElement 
    -- Source: https://en.wikipedia.org/wiki/List_of_DNS_record_types
FROM unnest(xpath('//record', 
 CAST('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
    <record>
        <ID>1</ID>
        <Name>A</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Address record</Text>
        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
    </record>
    <record>
        <ID>2</ID>
        <Name>NS</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Name server record</Text>
        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
    </record>
</data-set>
' AS xml)   
)) AS myTempTable(myXmlColumn)
;


Erw*_*ter 12

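I would try a different approach: read the XML file directly into a variable in a plpgsql function and proceed from there. That should be faster and more robust. You need superuser privileges, though.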

CREATE OR REPLACE FUNCTION f_sync_from_xml()
  RETURNS boolean AS
$BODY$
DECLARE
    myxml    xml;
    datafile text := 'path/to/my_file.xml';
BEGIN
   myxml := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.

   CREATE TEMP TABLE tmp AS
   SELECT (xpath('//some_id/text()', x))[1]::text AS id
   FROM   unnest(xpath('/xml/path/to/datum', myxml)) x;
   ...

A complete code example with explanation and links can be found in this closely related answer:


Vic*_*art 5

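Expanding on @stefan-steiger's excellent answer, here is an example that extracts XML elements from child nodes that contain multiple siblings (e.g., multiple <synonym> elements under a particular <synonyms> parent node).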

I ran into this issue with my data and searched quite a bit for a solution; his answer was the most helpful to me.

Example data file, hmdb_metabolites_test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<hmdb>
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <synonyms>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
  </synonyms>
</metabolite>
<metabolite>
  <accession>HMDB0000002</accession>
  <name>1,3-Diaminopropane</name>
  <synonyms>
    <synonym>1,3-Propanediamine</synonym>
    <synonym>1,3-Propylenediamine</synonym>
    <synonym>Propane-1,3-diamine</synonym>
    <synonym>1,3-diamino-N-Propane</synonym>
  </synonyms>
</metabolite>
<metabolite>
  <accession>HMDB0000005</accession>
  <name>2-Ketobutyric acid</name>
  <synonyms>
    <synonym>2-Ketobutanoic acid</synonym>
    <synonym>2-Oxobutyric acid</synonym>
    <synonym>3-Methyl pyruvic acid</synonym>
    <synonym>alpha-Ketobutyrate</synonym>
  </synonyms>
</metabolite>
</hmdb>

Aside: the original XML file has a URL (a default namespace declaration) in its document element:

<hmdb xmlns="http://www.hmdb.ca">

That prevented xpath from parsing the data. The script runs (without any error message), but the relation/table is empty:

[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/sql/hmdb_test.sql
DO
 accession | name | synonym 
-----------+------+---------
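
An alternative to editing the file may be to leave the namespace in place and tell xpath about it: PostgreSQL's xpath() accepts an optional third argument, an array of [alias, namespace URI] pairs. The sketch below is untested against the real HMDB data and makes several assumptions: the shell function hmdb_ns_query is a hypothetical wrapper that only prints the SQL, the alias h is arbitrary, and the file in the data directory is assumed to still carry xmlns="http://www.hmdb.ca" on its root element.

```shell
#!/bin/sh
# Hypothetical wrapper: prints a namespace-aware query so it can be piped to psql.
hmdb_ns_query() {
    cat <<'SQL'
WITH doc AS (
    SELECT XMLPARSE(DOCUMENT convert_from(
               pg_read_binary_file('hmdb_metabolites_test.xml'), 'UTF8')) AS myxml
), ns AS (
    -- one [alias, URI] pair; the alias only has to match the prefixes used below
    SELECT ARRAY[ARRAY['h', 'http://www.hmdb.ca']] AS nsmap
)
SELECT (xpath('//h:accession/text()', x, nsmap))[1]::text AS accession
      ,(xpath('//h:name/text()', x, nsmap))[1]::text AS name
FROM doc, ns, unnest(xpath('//h:metabolite', myxml, nsmap)) AS x;
SQL
}
```

It could be run as hmdb_ns_query | psql -d mydb (the database name is again only a placeholder). Every element in the path needs the prefix, because a default namespace applies to all descendants of the element that declares it.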

Since the source file is 3.4 GB, I decided to edit that line in place using sed:

sed -i '2s/.*hmdb xmlns.*/<hmdb>/' hmdb_metabolites.xml

[In this case, adding the 2 (which tells sed to edit only line 2) coincidentally also doubled the speed of the sed command.]
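
To illustrate what the line address does, here is a small sketch (the function name strip_hmdb_ns is made up for this example). It writes to stdout instead of editing in place, since -i behaves differently between GNU and BSD sed; the substitution itself is the one from the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the file with the namespace removed from line 2 only.
strip_hmdb_ns() {
    # The leading "2" is a line address: the s command is attempted on line 2
    # and nowhere else, so matching text on other lines is left untouched.
    sed '2s/.*hmdb xmlns.*/<hmdb>/' "$1"
}
```

For example, strip_hmdb_ns hmdb_metabolites.xml | head -n 3 would show the rewritten root element without modifying the file.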


My postgres data folder (in psql: SHOW data_directory;):

/mnt/Vancouver/Programming/RDB/postgres/postgres/data

So, using sudo, I needed to copy my XML data file there and chown it so that PostgreSQL can use it:

sudo chown postgres:postgres /mnt/Vancouver/Programming/RDB/postgres/postgres/data/hmdb_metabolites_test.xml

Script (hmdb_test.sql):

DO $$DECLARE myxml xml;

BEGIN

myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('hmdb_metabolites_test.xml'), 'UTF8'));

DROP TABLE IF EXISTS mytable;

-- CREATE TEMP TABLE mytable AS 
CREATE TABLE mytable AS 
SELECT 
    (xpath('//accession/text()', x))[1]::text AS accession
    ,(xpath('//name/text()', x))[1]::text AS name 
    -- The "synonym" child/subnode has many sibling elements, so we need to
    -- "unnest" them; otherwise we only retrieve the first synonym per record:
    ,unnest(xpath('//synonym/text()', x))::text AS synonym
FROM unnest(xpath('//metabolite', myxml)) x
;

END$$;

-- select * from mytable limit 5;
SELECT * FROM mytable;

Execution and output (in psql):

[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/hmdb_test.sql

  accession  |        name        |                         synonym
-------------+--------------------+----------------------------------------------------------
 HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid
 HMDB0000001 | 1-Methylhistidine  | 1-Methylhistidine
 HMDB0000001 | 1-Methylhistidine  | Pi-methylhistidine
 HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate
 HMDB0000002 | 1,3-Diaminopropane | 1,3-Propanediamine
 HMDB0000002 | 1,3-Diaminopropane | 1,3-Propylenediamine
 HMDB0000002 | 1,3-Diaminopropane | Propane-1,3-diamine
 HMDB0000002 | 1,3-Diaminopropane | 1,3-diamino-N-Propane
 HMDB0000005 | 2-Ketobutyric acid | 2-Ketobutanoic acid
 HMDB0000005 | 2-Ketobutyric acid | 2-Oxobutyric acid
 HMDB0000005 | 2-Ketobutyric acid | 3-Methyl pyruvic acid
 HMDB0000005 | 2-Ketobutyric acid | alpha-Ketobutyrate

[hmdb_test]#