我有大约 30 GB 的 JSON 数据和多个文件,想要在此基础上构建查询机器人。我已经用文本文件构建了相同的内容,但我不确定它如何适用于 JSON 数据。
我已经探索过 JSONLoader,但不知道如何使用它将 JSON 数据转换为向量并将其存储到 ChromaDB 中,以便我可以查询它们。 https://python.langchain.com/docs/modules/data_connection/document_loaders/json
示例 JSON 文件:http://jsonblob.com/1147948130921996288
文本数据代码:
# Loading and Splitting the Documents
from langchain.document_loaders import DirectoryLoader
directory = '/content/drive/MyDrive/Data Science/LLM/docs/text files'
def load_docs(directory):
loader = DirectoryLoader(directory)
documents = loader.load()
return documents
documents = load_docs(directory)
len(documents)
from langchain.text_splitter import RecursiveCharacterTextSplitter
def split_docs(documents,chunk_size=1000,chunk_overlap=20):
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
docs = text_splitter.split_documents(documents)
return docs
docs = split_docs(documents)
print(len(docs))
# Embedding Text Using Langchain
from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2") …Run Code Online (Sandbox Code Playgroud)