Split large text into chunks on separate rows

stk*_*flw 0 sql google-bigquery

I have a table where some of the text values are mind-bogglingly large. I want to make sure that no row in the query output exceeds 100,000 characters. How can I do that?

Here's a quick example:

WITH large_texts AS (
  (SELECT 'humongous text goes here' AS text ,'1' AS id)
  UNION ALL 
  (SELECT 'small one' AS text ,'2' AS id)
  UNION ALL
  (SELECT 'and another big one over here' AS text ,'3' AS id)
)

SELECT * FROM large_texts

Let's say I want the text column in the output to be less than 10 characters long. So I need this result:

+----+------------+
| id | text       |
+----+------------+
| 1  | humongous  |
+----+------------+
| 1  | text goes  |
+----+------------+
| 1  | here       |
+----+------------+
| 2  | small one  |
+----+------------+
| 3  | and anothe |
+----+------------+
| 3  | r big one  |
+----+------------+
| 3  | over here  |
+----+------------+

It would be even better if I could also avoid splitting in the middle of a word.

Mik*_*ant 5

It would be even better if I could also avoid splitting in the middle of a word.

Consider the following approach:

create temp function split_with_limit(text STRING, len FLOAT64)
returns ARRAY<STRING>
language js AS r"""
    // Greedily pack whole words into chunks shorter than `len` characters.
    let input = text.trim().split(' ');
    let [index, output] = [0, []];
    output[index] = '';
    input.forEach(word => {
        // Candidate chunk: the current chunk plus the next word.
        let temp = `${output[index]} ${word}`.trim();
        if (temp.length < len) {
            output[index] = temp;   // word still fits in the current chunk
        } else {
            index++;                // start a new chunk with this word
            output[index] = word;
        }
    });
    return output;
""";
select id, small_chunk
from yourtable_with_large_texts, 
unnest(split_with_limit(text, 10)) small_chunk with offset
order by id, offset

When applied to the sample data from the question, the output is:

(screenshot of the query output)