Ada*_*phy 5 python langchain py-langchain
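I am trying to create chunks that are (at most) 350 characters long with a 100-character chunk overlap.
I understand that chunk_size is an upper limit, so I may get chunks shorter than that. But why am I not getting any chunk_overlap?
Is it because the overlap also has to split on one of the separators? So is it a 100-character chunk_overlap only if there is a separator to split on within 100 characters of the split?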
from langchain.text_splitter import RecursiveCharacterTextSplitter
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=350,
    chunk_overlap=100,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
x = r_splitter.split_text(some_text)
print(x)
for thing in x:
    print(len(thing))
Output
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
248
243
小智 3
I found that RecursiveCharacterTextSplitter will not overlap chunks that are split apart by a separator, like the ones you have:
separators=["\n\n", "\n", "(?<=\. )", " ", ""]
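What is happening is that each of your two paragraphs becomes its own whole chunk because of the \n\n separator. These chunks are therefore treated as separate and no overlap is generated between them. If you had a paragraph larger than your chunk size of 350 (or if your chunk size were smaller), that paragraph would be split into multiple chunks, and those chunks would overlap.
I think the package's logic is that, since you are deliberately separating those paragraphs semantically, you would not want their content to overlap. If overlap across paragraphs is what you want, I would recommend removing the relevant separator.
Note: my answer breaks down a bit when you consider that " " is also a separator. You would think that would make every word its own chunk. I don't understand that part yet.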
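The behaviour described above, including the " " question, can be illustrated with a simplified pure-Python model. This is only a sketch of the general idea and not LangChain's actual code: the names recursive_split and merge_with_overlap are hypothetical, and it handles plain-string separators only (no regex). It assumes that the coarsest separator found in the text is tried first, that finer separators are used only on pieces that are still too large, and that overlap is built by carrying whole trailing pieces whose combined length fits in chunk_overlap.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, joiner):
    """Greedily pack pieces into chunks, carrying small trailing pieces as overlap."""
    chunks, current, total = [], [], 0  # total == len(joiner.join(current))
    for piece in pieces:
        extra = len(joiner) if current else 0
        if current and total + extra + len(piece) > chunk_size:
            chunks.append(joiner.join(current))
            # Drop pieces from the front until the remainder fits in chunk_overlap
            # and still leaves room for the incoming piece.
            while current and (
                total > chunk_overlap
                or total + len(joiner) + len(piece) > chunk_size
            ):
                total -= len(current[0]) + (len(joiner) if len(current) > 1 else 0)
                current.pop(0)
        current.append(piece)
        total += len(piece) + (len(joiner) if len(current) > 1 else 0)
    if current:
        chunks.append(joiner.join(current))
    return chunks


def recursive_split(text, separators, chunk_size, chunk_overlap):
    """Simplified model: split on the first separator present, recurse on big pieces."""
    sep, rest = "", []
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, rest = candidate, separators[i + 1:]
            break
    pieces = [p for p in (text.split(sep) if sep else list(text)) if p]

    chunks, small = [], []
    for piece in pieces:
        if len(piece) < chunk_size:
            small.append(piece)  # small enough: merge with its neighbours later
            continue
        if small:
            chunks += merge_with_overlap(small, chunk_size, chunk_overlap, sep)
            small = []
        # Only oversized pieces are handed to the finer separators.
        if rest:
            chunks += recursive_split(piece, rest, chunk_size, chunk_overlap)
        else:
            chunks.append(piece)
    if small:
        chunks += merge_with_overlap(small, chunk_size, chunk_overlap, sep)
    return chunks


text = "aaa bbb ccc ddd\n\neee fff ggg hhh"
seps = ["\n\n", "\n", " ", ""]

# Each paragraph (15 chars) fits in chunk_size but exceeds chunk_overlap,
# so nothing can be carried over from one paragraph to the next.
no_overlap = recursive_split(text, seps, chunk_size=20, chunk_overlap=8)

# With a smaller chunk_size the paragraphs are split on " ", and single
# words are small enough to be carried into the next chunk as overlap.
with_overlap = recursive_split(text, seps, chunk_size=10, chunk_overlap=4)

print(no_overlap)
print(with_overlap)
```

In this model " " does not turn every word into its own chunk for two reasons: it is only reached for pieces that the earlier separators left too large, and the resulting words are merged back together up to chunk_size. Overlap appears only when the trailing pieces of a chunk are short enough to fit inside chunk_overlap, which a 248-character paragraph is not when chunk_overlap is 100.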