I tried using UnstructuredURLLoader as follows:
from langchain.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=urls)
data = loaders.load()

But some pages report:
libmagic is unavailable but assists in filetype detection on file-like objects. Please consider installing libmagic for better results.
Error fetching or processing https://wellfound.com/company/chorus-one, exception: Invalid file. The FileType.UNK file type is not supported in partition.

Yet in my conda environment I appear to have it:
%pip list | grep libmagic
libmagic 1.0

But I do not have python-libmagic. When I try to install it:
pip install python-libmagic
I keep getting an error:
Collecting python-libmagic
  Using cached python_libmagic-0.4.0-py3-none-any.whl
Collecting cffi==1.7.0 (from python-libmagic)
  Using …

I am trying to reproduce the code provided in the LangChain documentation (URL - LangChain 0.0.167) to load HTML files from a list of URLs into Document format, so that they can then be processed by a downstream natural language processing model. However, I am running into an issue where the call url_data = url_loader.load() hangs for more than half an hour without loading a single HTML file.
I also hit a stack trace with an error message I cannot interpret: TP_NUM_C_BUFS too small: 50. This error was previously reported in the LangChain repository as a resolved issue (link). The author of that issue reported that running the script that had triggered the TP_NUM_C_BUFS too small: 50 error from the Windows command prompt fixed it for them. However, running my script from the Windows command prompt did not resolve the issue.
Can anyone identify the root cause of this problem and suggest a solution?
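I cannot confirm the root cause of TP_NUM_C_BUFS, but while diagnosing it helps to put a hard deadline around the blocking call so a hang surfaces quickly instead of blocking for half an hour. A minimal standard-library sketch (run_with_deadline and slow_load are my own hypothetical names, not LangChain API):

```python
# Hypothetical diagnostic helper (not LangChain API): run a blocking call
# under a hard deadline so a hang raises TimeoutError instead of blocking
# the interpreter indefinitely.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_with_deadline(fn, seconds):
    # Submit fn to a worker thread and wait at most `seconds` for a result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn).result(timeout=seconds)

def slow_load():
    # Stand-in for url_loader.load(); sleeps to simulate a hang.
    time.sleep(2)
    return ["document"]

try:
    docs = run_with_deadline(slow_load, seconds=0.5)
except TimeoutError:
    print("load() exceeded the deadline")
```

Note that Python cannot kill the worker thread, so this only makes the hang observable and bounded for debugging; the underlying TP_NUM_C_BUFS issue still needs a real fix.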
from langchain.document_loaders import UnstructuredURLLoader
import session_info
session_info.show()
urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]
print(urls)
loader = UnstructuredURLLoader(urls=urls)
print(loader)
data = loader.load()
print(data)
D:\path>C:/Python310/python.exe d:/path/src/langchain-url-mwe.py
-----
langchain 0.0.157
session_info 1.0.0
-----
Python 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 …

I am experimenting with LangChain's AgentType.CHAT_ZERO_SHOT_REACT agent. Judging by its name, I assumed it was an agent meant for chat, and I have given it memory, but it does not seem to be able to access that memory. What else do I need to do so that it can access its memory? Or am I wrong in thinking that this agent can handle chat?
Here is my code and sample output:
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model_name="gpt-4", temperature=0)
tools = load_tools(["llm-math", "wolfram-alpha", "wikipedia"], llm=llm)
memory = ConversationBufferMemory(memory_key="chat_history")
agent_test = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    handle_parsing_errors=True,
    memory=memory,
    verbose=True
)
>>> agent_test.run("What is the height of the empire state building?")
'The Empire State Building stands a total of 1,454 feet tall, including its antenna.'
>>> agent_test.run("What was the last question I asked?")
"I'm sorry, but I can't provide the information you're looking for."
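The forgetting can be modelled without LangChain at all: if the agent's prompt template has no slot for the memory, the buffer fills up but is never rendered into what the model sees. A toy sketch (both template strings are my own illustration, not LangChain's actual prompts):

```python
# Toy model of two prompt styles (illustrative strings, not the real
# LangChain templates). Memory only helps if the template renders it.
zero_shot = "Answer the question.\nQuestion: {input}"
conversational = (
    "Previous conversation:\n{chat_history}\n"
    "Question: {input}"
)

history = ("Human: What is the height of the empire state building?\n"
           "AI: 1,454 feet, including the antenna.")

# The zero-shot template ignores the history entirely:
print(zero_shot.format(input="What was the last question I asked?"))

# A conversational template renders it, so the model can answer:
print(conversational.format(chat_history=history,
                            input="What was the last question I asked?"))
```

If this model is right, the fix is an agent type whose prompt includes a chat_history placeholder; in LangChain that would be AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION together with ConversationBufferMemory(memory_key="chat_history", return_messages=True), though I have only verified the toy sketch above.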
import os
from langchain.llms import OpenAI
import bs4
import langchain
from langchain import hub
from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
os.environ["OPENAI_API_KEY"] = "KEY"
loader = UnstructuredFileLoader(
    'path_to_file'
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.get_relevant_documents(
    "What is X?"
)
This returns:
[Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
Document(page_content="...", metadata={'source': …

I am new to Langchain and I have run into a problem. My end goal is to read the contents of a file and create a vector store of the data, which I can query later.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
loader = TextLoader("elon_musk.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
There seems to be some issue with my data file such that the loader cannot read its contents. Is it possible to load a file in utf-8 format? My assumption was that with utf-8 encoding I should not run into this problem.
Here is the error I get from my code:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:41, in TextLoader.load(self)
40 with open(self.file_path, encoding=self.encoding) as f:
---> 41 text = f.read()
42 except UnicodeDecodeError as e:
File ~\anaconda3\envs\langchain-test\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0] …
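The traceback itself points at the cause: TextLoader opened the file with the platform default codec (cp1252 on Windows, per the cp1252.py frame), not utf-8. A minimal standard-library reproduction of the failure mode (the sample string is my own):

```python
# UTF-8 text decoded with Windows' default cp1252 codec: bytes such as
# 0x9d (part of a curly closing quote in UTF-8) have no cp1252 mapping,
# which raises exactly the UnicodeDecodeError seen in the traceback.
utf8_bytes = "a \u201cquoted\u201d sentence".encode("utf-8")

try:
    utf8_bytes.decode("cp1252")   # what open() without encoding= does on Windows
except UnicodeDecodeError as exc:
    print("cp1252 failed on byte:", hex(utf8_bytes[exc.start]))

print(utf8_bytes.decode("utf-8"))  # decoding with the right codec succeeds
```

Given that, passing the codec explicitly, TextLoader("elon_musk.txt", encoding="utf-8"), should fix the load; TextLoader also accepts an autodetect_encoding flag in recent versions, though I would treat that as a fallback rather than the first resort.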