What if your agent needs to scrape web pages? Set up a browserless service quickly with Docker

The project is on GitHub, and the Docker setup is straightforward:

docker run -p 6000:3000 ghcr.io/browserless/chrome
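If you prefer Docker Compose, an equivalent setup might look like the sketch below (the service name and restart policy are my own choices, not from the project):

```yaml
services:
  browserless:
    image: ghcr.io/browserless/chrome
    restart: unless-stopped
    ports:
      - "6000:3000"   # host port 6000 -> container port 3000
```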

The corresponding docs are served at:

http://ip:6000/docs

Under "Browser REST APIs" you can then find the individual endpoints, e.g. http://ip:6000/chrome/content.

A minimal request in Python:

import requests

# The browserless content endpoint
url = 'http://ip:6000/chrome/content'

# The page to render; browserless returns its fully rendered HTML
base_url = 'https://example.com'
data = {"url": base_url}
# Send the POST request
response = requests.post(url, json=data)

# Check the response status code
if response.status_code == 200:
    # On success, print the rendered HTML
    print(response.text)
else:
    # On failure, print the status code
    print(f'Error: {response.status_code}')
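The /chrome/content endpoint also accepts Puppeteer-style navigation options in the request body. A sketch of building such a payload (field names like gotoOptions follow the browserless docs, but verify the exact schema against your version's /docs page):

```python
import json

# Wait for the network to go quiet before the page is captured;
# gotoOptions mirrors Puppeteer's page.goto() options.
payload = {
    "url": "https://example.com",
    "gotoOptions": {"waitUntil": "networkidle2", "timeout": 30000},
}

# This JSON string would be sent as the POST body
body = json.dumps(payload)
print(body)
```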

Wrapped as a function:

import json

import requests
from bs4 import BeautifulSoup


def scrape(url: str):
    # Scrape a website; if the content is too large, summarize it.
    # url is the address of the page to be scraped.

    print("Scraping website...")
    # Define the headers for the request
    headers = {
        'Cache-Control': 'no-cache',
        'Content-Type': 'application/json',
    }

    # Define the data to be sent in the request
    data = {"url": url}

    # Convert the Python object to a JSON string
    data_json = json.dumps(data)

    # Send the POST request to the browserless content endpoint
    post_url = "http://ip:6000/chrome/content"
    response = requests.post(post_url, headers=headers, data=data_json)

    # Check the response status code
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        text = soup.get_text()
        print("CONTENT:", text)
        if len(text) > 8000:
            # summary() must be defined elsewhere, e.g. an LLM call
            output = summary(text)
            return output
        else:
            return text
    else:
        print(f"HTTP request failed with status code {response.status_code}")
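The summary() helper used above is not defined in this post; in practice it would typically be an LLM call. As a purely hypothetical placeholder, a trivial truncating version could look like this:

```python
def summary(text: str, limit: int = 8000) -> str:
    # Placeholder only: a real implementation would call an LLM
    # to condense the text. Here we simply truncate to `limit`
    # characters and mark the cut with an ellipsis.
    if len(text) <= limit:
        return text
    return text[:limit] + "..."
```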
Copyright notice: This is an original article from this site, published by admin on 2024-02-22, 1273 characters in total.
Reproduction: Unless otherwise noted, articles on this site are published under the CC-4.0 license; for reprints, contact tensortimes@gmail.com.