What does langchain CharacterTextSplitter's chunk_size param even do?

Question


Max*_*wer 7 python text nlp machine-learning langchain

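My default assumption was that the chunk_size parameter would set an upper bound on the size of the chunks/splits that the split_text method produces, but that is clearly not right: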

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text = 'abcdefghijklmnopqrstuvwxyz'

c_splitter.split_text(text)


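This prints ['abcdefghijklmnopqrstuvwxyz'], i.e. a single chunk that is larger than chunk_size=6.

So I understand that it did not split the text into chunks because it never encountered the separator. But then the question is: what is chunk_size even doing?

I checked the documentation page for langchain.text_splitter.CharacterTextSplitter, but did not see an answer to this question. I also asked the "mendable" chat-with-langchain-docs search feature and got the answer "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text." ...which is not true, as the code sample above shows.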

Answer 1

DMc*_*McC 14

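CharacterTextSplitter only splits on the separator (which defaults to "\n\n"). chunk_size is the maximum chunk size that will be produced when a split is possible. If a string starts with n characters, then has a separator, and then has m more characters before the next separator, the first chunk will have size n if chunk_size < n + m + len(separator).

Your example string has no matching separator, so there is nothing to split on.

Basically, it tries to create chunks that are <= chunk_size, but it will still produce chunks > chunk_size if the smallest chunks it can create are > chunk_size.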
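
To make the rule above concrete, here is a minimal pure-Python sketch of a split-then-merge strategy. It is a simplified model written for illustration, not LangChain's actual implementation: the function name `split_then_merge` is hypothetical, and chunk_overlap is ignored entirely.

```python
def split_then_merge(text, separator="\n\n", chunk_size=6):
    """Simplified model (an assumption, not LangChain's code):
    split on the separator, then greedily merge neighbouring pieces
    while the merged chunk still fits within chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # A piece that starts a chunk is always accepted, even if it
            # is longer than chunk_size: there is nowhere else to cut.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# No separator in the text: one oversized chunk, as in the question.
print(split_then_merge("abcdefghijklmnopqrstuvwxyz", chunk_size=6))
# ['abcdefghijklmnopqrstuvwxyz']

# n = 3, m = 3, len(separator) = 2, so merging needs chunk_size >= 8.
print(split_then_merge("abc\n\ndef\n\nghijklmnop", chunk_size=7))
# ['abc', 'def', 'ghijklmnop']
print(split_then_merge("abc\n\ndef\n\nghijklmnop", chunk_size=8))
# ['abc\n\ndef', 'ghijklmnop']
```

In this model chunk_size only decides whether neighbouring pieces get merged; it never cuts inside a piece, which is why 'ghijklmnop' survives as a 10-character chunk.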

Answer 2

Yil*_*maz 6

CharacterTextSplitter does not behave the way you expect.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=6,
)


It first looks at the first 6 characters, then makes the cut for the next chunk at the nearest separator (not at the 7th character).

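As mentioned in the docs, the default separator is "\n\n":

This is the simplest method. This splits based on characters (by default "\n\n") and measures chunk length by number of characters.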
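
As a rough illustration of "cut at a separator, not at a fixed character position", the sketch below compares naive fixed-width slicing with a separator-aware grouping. Both helpers are hypothetical and written only for this comparison; `chunk_lines` is a simplified model that ignores chunk_overlap and is not LangChain's code.

```python
def fixed_width(text, chunk_size):
    # Naive slicing: cuts exactly every chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_lines(text, separator="\n", chunk_size=6):
    # Hypothetical helper (simplified model, no overlap): group whole
    # pieces together as long as the joined length fits in chunk_size.
    chunks = []
    current = []      # pieces collected for the chunk being built
    current_len = 0   # length of those pieces once joined by separator
    for piece in text.split(separator):
        if not piece:
            continue
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # Adding this piece would overflow: close the chunk first.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "ab\ncd\nefghijklmn\nop"

# Fixed-width slicing cuts through the middle of the long line.
print(fixed_width(text, 6))

# Separator-aware grouping only cuts at "\n", so the long line stays whole.
print(chunk_lines(text, separator="\n", chunk_size=6))
# ['ab\ncd', 'efghijklmn', 'op']
```

Every boundary in the second result falls on a separator, and the 10-character line is emitted as a chunk larger than chunk_size because it contains no separator to cut at.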

You can test the behavior with some sample code. First create a file test.txt with this content:

1.Respect for Others: Treat others with kindness.
2.Honesty and Integrity: Be truthful and act with integrity in your interactions with others.
3.Fairness and Justice: Treat people equitably.
4.Respect for Property: Respect public and private property.
5.Good Citizenship: Contribute positively to your community by obeying laws, voting, volunteering, and supporting communal well-being.

Then write this code:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# It first looks at the first 20 characters, then makes the cut for the next chunk at the closest separator
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=20,
    chunk_overlap=0
)

loader = TextLoader("test.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter
)

for doc in docs:
    print(doc.page_content)
    print("\n")


The output looks like this:

This should be the accepted answer. (2 upvotes)
