What is OpenClaw?
OpenClaw is an open-source data-scraping and automation tool. It can be used for web crawling, API calls, data processing, and similar tasks, and provides a simple interface for fetching and processing web data.

Installing OpenClaw
System requirements
- Python 3.7 or later
- pip (the Python package manager)
Installation steps
Install with pip:
pip install openclaw
Or install from source:
git clone https://github.com/openclaw/openclaw.git
cd openclaw
pip install -e .
Quick start
Basic crawler example
from openclaw import Claw

# Create a crawler instance
claw = Claw()

# Fetch a single page
result = claw.fetch("https://example.com")
print(result.text)

# Fetch multiple pages
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]
for url in urls:
    result = claw.fetch(url)
    # Process the result...
Configuration options
from openclaw import Claw

# A crawler with custom configuration
claw = Claw(
    user_agent="MyCustomBot/1.0",  # custom User-Agent
    timeout=10,                    # request timeout in seconds
    retries=3,                     # number of retries
    delay=2,                       # delay between requests in seconds
    proxy=None,                    # proxy settings
    headers={                      # custom request headers
        "Accept": "text/html",
        "Accept-Language": "en-US"
    }
)
Core features
Data extraction
from openclaw import Claw
import json

claw = Claw()

# Parse and extract from HTML
result = claw.fetch("https://example.com")

# Extract with CSS selectors
titles = result.css("h1.title")
for title in titles:
    print(title.text)

# Extract with XPath
links = result.xpath("//a[@href]")
for link in links:
    print(link.get("href"))

# Extract JSON data
json_data = result.json()
print(json.dumps(json_data, indent=2))
Concurrent fetching
from openclaw import Claw
from concurrent.futures import ThreadPoolExecutor

claw = Claw()

def fetch_url(url):
    result = claw.fetch(url)
    return result.text[:100]  # return the first 100 characters

urls = ["https://example.com/page{}".format(i) for i in range(1, 11)]

# Fetch concurrently with a thread pool
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch_url, urls))

for i, result in enumerate(results, 1):
    print(f"Page {i}: {result}")
Data storage
import csv
import json
from openclaw import Claw

claw = Claw()
result = claw.fetch("https://api.example.com/data")

# Save as JSON
data = result.json()
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Save as CSV
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "URL", "Content"])  # header row
    # Write data rows...
Advanced usage
Handling JavaScript-rendered pages
from openclaw import Claw

claw = Claw(
    enable_js=True,   # enable JavaScript rendering
    js_timeout=5000   # JavaScript execution timeout in milliseconds
)

# Fetch a page that requires JS rendering
result = claw.fetch("https://example.com/spa-page")
print(result.text)
Using middleware
from openclaw import Claw

# A custom middleware
def custom_middleware(request):
    # Modify the outgoing request
    request.headers["X-Custom-Header"] = "MyValue"
    return request

claw = Claw()
claw.add_middleware(custom_middleware)
result = claw.fetch("https://example.com")
Best practices
Respecting robots.txt
from openclaw import Claw
from urllib.robotparser import RobotFileParser

claw = Claw()

def can_fetch(url):
    rp = RobotFileParser()
    base_url = "/".join(url.split("/")[:3])  # scheme + host
    rp.set_url(base_url + "/robots.txt")
    rp.read()
    return rp.can_fetch("*", url)

if can_fetch("https://example.com/page"):
    result = claw.fetch("https://example.com/page")
Error handling
from openclaw import Claw
from openclaw.exceptions import RequestError, ParseError

claw = Claw()

try:
    result = claw.fetch("https://example.com")
    if result.status_code == 200:
        data = result.json()
    else:
        print(f"HTTP Error: {result.status_code}")
except RequestError as e:
    print(f"Request failed: {e}")
except ParseError as e:
    print(f"Parse failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Rate limiting
import time
from openclaw import Claw

claw = Claw(delay=1)  # wait 1 second between requests

urls = [...]  # a large list of URLs
for url in urls:
    result = claw.fetch(url)
    # Process the result...
    # Adjust the delay dynamically
    if result.status_code == 429:  # Too Many Requests
        time.sleep(10)  # back off for longer
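The dynamic-delay idea above generalizes to exponential backoff: double the wait after each consecutive 429. A plain-Python sketch (the base delay, growth factor, and retry cap are illustrative choices, not OpenClaw parameters):

```python
def backoff_delays(base=1.0, factor=2.0, max_retries=5):
    """Yield the wait time before each retry: base, base*factor, base*factor**2, ..."""
    for attempt in range(max_retries):
        yield base * factor ** attempt

# The schedule for the defaults: 1, 2, 4, 8, 16 seconds
print(list(backoff_delays()))
```

In a real crawl you would time.sleep(delay) on each 429 response and re-issue the request, giving up once the schedule is exhausted.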
Example project structure
my_crawler/
├── crawler.py          # main crawler script
├── config.py           # configuration
├── items/
│   ├── __init__.py
│   ├── models.py       # data models
│   └── pipelines.py    # data-processing pipelines
├── middlewares/
│   └── custom.py       # custom middleware
├── utils/
│   └── helpers.py      # helper functions
└── data/               # scraped data
    ├── raw/
    └── processed/
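For the layout above, config.py can keep all crawler settings in one place. The names and values below are illustrative, mirroring the constructor options shown earlier:

```python
# config.py - central settings for the crawler (values are examples)
USER_AGENT = "MyCustomBot/1.0"
TIMEOUT = 10                 # seconds per request
RETRIES = 3                  # retry attempts per URL
DELAY = 2                    # seconds between requests
RAW_DATA_DIR = "data/raw"
PROCESSED_DATA_DIR = "data/processed"
```

crawler.py can then import these (from config import USER_AGENT, DELAY), so settings change without touching crawl logic.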
FAQ
Q1: How do I handle logins?
from openclaw import Claw

claw = Claw()

# Option 1: use a session
session = claw.create_session()
session.post("https://example.com/login", data={
    "username": "your_username",
    "password": "your_password"
})

# Option 2: use cookies
claw = Claw(cookies={"session_id": "your_session_id"})
Q2: How do I avoid getting blocked?
- Rotate proxies
- Set a reasonable delay between requests
- Randomize the User-Agent
- Follow the site's crawling policy
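Randomizing the User-Agent needs nothing beyond the standard library. A sketch (the UA strings are examples; passing the result to Claw(user_agent=...) follows the constructor shown earlier):

```python
import random

# A small pool of common desktop User-Agent strings (examples only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def pick_user_agent():
    """Pick a User-Agent at random for the next request."""
    return random.choice(USER_AGENTS)

print(pick_user_agent())  # one of the strings above
```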
Q3: How do I debug?
import logging
from openclaw import Claw

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
claw = Claw(debug=True)  # enable debug mode
Learning resources
- Official documentation: https://docs.openclaw.org
- Example projects: https://github.com/openclaw/examples
- API reference: https://docs.openclaw.org/api
Notes
⚠️ Important:
- Respect the target site's terms of service
- Do not put excessive load on target sites
- Comply with applicable laws and regulations
- Handle personal data responsibly
- Prefer an official API over scraping when one is available
This tutorial should get you started with OpenClaw. Depending on your needs, you may also want to consult the full documentation and examples.