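Contents:

- Basic: requests + BeautifulSoup
- Advanced: the Scrapy framework
- Lightweight: Selenium (for JavaScript-rendered pages)
- Simple and practical: PyQuery (jQuery-like syntax)
- Best practices
- Notes
- Quick selection guide

This article introduces several easy-to-use Python web scraping tools and shows how to write a basic scraper with each.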
Basic: requests + BeautifulSoup
Install the dependencies
pip install requests beautifulsoup4 lxml
A simple example: scraping the page title and links
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

# Parse the HTML
soup = BeautifulSoup(response.text, 'lxml')
title = soup.title.get_text(strip=True) if soup.title else ''
print(f"Page title: {title}")

# Extract all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')  # None when the <a> tag has no href
    text = link.text.strip()
    print(f"Link: {text} -> {href}")
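The href values printed above are often relative (for example `/about`), so they usually need to be resolved against the page URL before they can be requested. Below is a minimal sketch using only the standard library's `urllib.parse`. The helper name `normalize_links` and the sample hrefs are illustrative assumptions, not part of the original example.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url; keep unique http(s) links in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs, as they might come from link.get('href')
sample = ["/about", "contact.html", "https://example.com/about", None,
          "mailto:hi@example.com"]
print(normalize_links("https://example.com/docs/", sample))
```

In the loop above, the same helper could be fed with `[link.get('href') for link in links]` and `response.url` as the base.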
Advanced: the Scrapy framework
Installation
pip install scrapy
Create a project
scrapy startproject my_crawler
cd my_crawler
Write the spider (my_crawler/spiders/my_spider.py)
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        yield {
            'title': title,
            'url': response.url
        }
        # Follow links and parse each linked page with the same callback
        for link in response.css('a::attr(href)'):
            yield response.follow(link, self.parse)
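Because the spider above follows every link it finds, an unconfigured run can grow quickly. The fragment below sketches settings that could go in the project's `settings.py` to keep the crawl polite and bounded. The setting names are standard Scrapy settings; the specific values are illustrative assumptions rather than recommendations from the original article.

```python
# settings.py (fragment): values are illustrative assumptions
ROBOTSTXT_OBEY = True               # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server response times
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
DEPTH_LIMIT = 2                     # stop following links 2 hops from start_urls
```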
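Run the spider
scrapy crawl my_spider -o output.json
Lightweight: Selenium (for JavaScript-rendered pages)
Installation
pip install selenium  # then download the WebDriver matching your browser (Selenium 4.6+ can fetch it automatically)
Sample code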
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Start the browser
driver = webdriver.Chrome()  # requires Chrome and a matching ChromeDriver
try:
    # Open the page
    driver.get("https://example.com")
    # Wait for the page to load (a fixed sleep is the simplest option)
    time.sleep(2)
    # Extract information
    title = driver.find_element(By.TAG_NAME, "h1").text
    print(f"Title: {title}")
    # Get all paragraphs
    paragraphs = driver.find_elements(By.TAG_NAME, "p")
    for p in paragraphs:
        print(p.text)
finally:
    # Close the browser even if an error occurred
    driver.quit()
Simple and practical: PyQuery (jQuery-like syntax)
Installation
pip install pyquery requests
Example
from pyquery import PyQuery as pq
import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
doc = pq(response.text)

# Extract content with jQuery-like selectors
title = doc('title').text()
links = doc('a')
print(f"Title: {title}")

# Iterate over the links
for link in links.items():
    print(f"{link.text()} -> {link.attr('href')}")
Best practices
A general-purpose crawler template
import requests
from bs4 import BeautifulSoup
import time
import random


class SimpleCrawler:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        self.session = requests.Session()

    def fetch(self, url):
        """Fetch a page and return its HTML, or None if the request fails."""
        try:
            response = self.session.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return None

    def parse(self, html):
        """Parse the HTML; return None when there is nothing to parse."""
        if not html:
            return None
        return BeautifulSoup(html, 'lxml')

    def extract_data(self, soup):
        """Extract the fields you need."""
        data = {}
        # Custom extraction logic goes here
        title_tag = soup.find('title')
        data['title'] = title_tag.text if title_tag else ''
        return data

    def run(self, urls):
        """Run the crawler over a list of URLs."""
        results = []
        for url in urls:
            html = self.fetch(url)
            soup = self.parse(html)
            if soup is not None:  # skip pages that failed to download
                results.append(self.extract_data(soup))
            # Polite delay between requests
            time.sleep(random.uniform(1, 3))
        return results


# Usage example
crawler = SimpleCrawler()
urls = ['https://example.com/page1', 'https://example.com/page2']
data = crawler.run(urls)
print(data)
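Notes
⚠️ Legal and ethical guidelines
- Follow the site's robots.txt file
- Set a reasonable interval between requests so you do not put pressure on the server
- Respect copyright and data usage terms
- Do not scrape sensitive or protected data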
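The robots.txt rule in the list above can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical robots.txt from a string so that it runs without network access. The rules, the `my-crawler` user agent, and the helper name `is_allowed` are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would instead call
# parser.set_url("https://example.com/robots.txt") and then parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-crawler"):
    """Return True if the parsed robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/index.html"))
print(is_allowed("https://example.com/private/data.html"))
# Requested delay in seconds, or None if the file does not set one
print(parser.crawl_delay("my-crawler"))
```

A crawler could call `is_allowed` before each request and use the crawl delay, when present, as the minimum pause between requests.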
💡 Practical tips
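# 1. Use a proxy (pass it as requests.get(url, proxies=proxies))
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080"
}

# 2. Handle basic anti-scraping checks by sending browser-like headers
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.google.com",
    "Cookie": "your_cookie_here"
}

# 3. Retry on errors
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

retry_strategy = Retry(
    total=3,                                # at most 3 retries per request
    backoff_factor=1,                       # growing pause between retries
    status_forcelist=[500, 502, 503, 504]   # retry on these server errors
)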
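The Retry object above only takes effect once it is attached to a session through an HTTPAdapter. The following is a minimal sketch of that wiring, assuming requests and urllib3 are installed. The helper name `build_session` is hypothetical.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total=3, backoff_factor=1):
    """Return a requests.Session that retries transient server errors."""
    retry_strategy = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    # Apply the same retry policy to both URL schemes
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    return session


session = build_session()
# Example call (needs network access):
# response = session.get("https://example.com", timeout=10)
```

Proxies can be passed per request with `session.get(url, proxies=proxies)`, so the three tips can be combined on one session.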
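Quick selection guide
| Tool | Best for | Difficulty |
|---|---|---|
| requests + BeautifulSoup | Simple static pages | |
| Scrapy | Large-scale crawling | |
| Selenium | Dynamic JavaScript pages | |
| PyQuery | Developers who like jQuery-style selectors | |

Which tool to choose depends on your specific needs, but requests + BeautifulSoup is a good place to start learning: it is the most basic combination and also the most broadly useful.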