How to Write a Simple Web Crawler with Python Tools

联启电脑工具 2026-07-04

Contents:

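Basic: requests + BeautifulSoup
Advanced: the Scrapy framework
Browser automation: Selenium (for JavaScript-rendered pages)
Simple and practical: PyQuery (jQuery-like syntax)
Best practices
Caveats
Quick selection guide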

This article introduces several easy-to-use Python scraping tools and shows how to write a basic crawler with each.

Basic: requests + BeautifulSoup

Install dependencies

pip install requests beautifulsoup4 lxml

Simple example: scraping the page title and links

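import requests
from bs4 import BeautifulSoup
# Fetch the page (the timeout keeps a slow server from hanging the script)
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
# Parse the HTML
soup = BeautifulSoup(response.text, 'lxml')
title = soup.title.get_text(strip=True) if soup.title else ''
print(f"Page title: {title}")
# Extract every link
links = soup.find_all('a')
for link in links:
    href = link.get('href')  # None when the <a> tag has no href attribute
    text = link.get_text(strip=True)
    print(f"Link: {text} -> {href}")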
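The href values printed by the loop above are often relative paths, in-page anchors, or missing entirely. A minimal sketch of one way to turn them into absolute URLs before crawling further, using only the standard library's urllib.parse; the helper name normalize_links and the sample values are hypothetical, not part of the original example:

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(base_url, hrefs):
    """Return unique absolute http(s) URLs, preserving first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # <a> tags without an href yield None
            continue
        absolute = urljoin(base_url, href.strip())
        absolute, _fragment = urldefrag(absolute)  # drop "#section" anchors
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip non-web schemes such as mailto:
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical href values, shaped like what link.get('href') returns
sample = ["/about", "news.html#top", None, "mailto:a@example.com",
          "https://example.com/about", "#footer"]
links = normalize_links("https://example.com/blog/", sample)
print(links)
```

Deduplicating here keeps a crawler from requesting the same page twice when several anchors point to it.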

Advanced: the Scrapy framework

Install

pip install scrapy

Create a project

scrapy startproject my_crawler
cd my_crawler

Write the spider (my_crawler/spiders/my_spider.py)

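import scrapy
class MySpider(scrapy.Spider):
    name = 'my_spider'
    # Restrict the crawl to this domain so followed links stay on-site
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']
    def parse(self, response):
        # Extract data from the current page
        title = response.css('title::text').get()
        yield {
            'title': title,
            'url': response.url
        }
        # Follow every link on the page and parse it the same way
        for link in response.css('a::attr(href)'):
            yield response.follow(link, self.parse)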

Run the spider

scrapy crawl my_spider -o output.json
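Because the spider follows every link it finds, it is worth throttling it before a real run. A sketch of a fragment one might add to the project's settings.py: the setting names are standard Scrapy settings, while the values are illustrative assumptions rather than recommendations from the original article.

```python
# settings.py (fragment); values are illustrative assumptions
ROBOTSTXT_OBEY = True               # honor each site's robots.txt rules
DOWNLOAD_DELAY = 1.0                # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
DEPTH_LIMIT = 2                     # stop following links beyond two hops from start_urls
```

Without a depth limit, a follow-all spider keeps going until it runs out of unseen links on the site.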

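Browser automation: Selenium (for JavaScript-rendered pages)

Install

pip install selenium
# Selenium 4.6+ downloads a matching driver automatically (Selenium Manager); older versions need a manually installed WebDriver

Example code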

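from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Start the browser (older Selenium versions need ChromeDriver on PATH)
driver = webdriver.Chrome()
try:
    # Open the page
    driver.get("https://example.com")
    # Wait up to 10 seconds for the heading to appear instead of sleeping a fixed time
    heading = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "h1"))
    )
    # Extract information
    title = heading.text
    print(f"Title: {title}")
    # Get all paragraphs
    paragraphs = driver.find_elements(By.TAG_NAME, "p")
    for p in paragraphs:
        print(p.text)
finally:
    # Always close the browser, even if a step above fails
    driver.quit()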

Simple and practical: PyQuery (jQuery-like syntax)

Install

pip install pyquery requests

Example

from pyquery import PyQuery as pq
import requests
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
doc = pq(response.text)
# Extract content with jQuery-like selectors
title = doc('title').text()
links = doc('a')
print(f"Title: {title}")
# Iterate over the links
for link in links.items():
    print(f"{link.text()} -> {link.attr('href')}")

Best practices

General-purpose crawler template

import requests
from bs4 import BeautifulSoup
import time
import random
class SimpleCrawler:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        self.session = requests.Session()
    def fetch(self, url):
        """Fetch a page and return its HTML, or None on failure."""
        try:
            response = self.session.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return None
    def parse(self, html):
        """Parse HTML into a BeautifulSoup tree."""
        if not html:
            return None
        return BeautifulSoup(html, 'lxml')
    def extract_data(self, soup):
        """Extract the fields you need."""
        data = {}
        # Customize the extraction logic here
        title_tag = soup.find('title')
        data['title'] = title_tag.text if title_tag else ''
        return data
    def run(self, urls):
        """Crawl each URL in turn."""
        results = []
        for url in urls:
            html = self.fetch(url)
            soup = self.parse(html)
            # Skip pages that failed to download; extract_data cannot handle None
            if soup is not None:
                results.append(self.extract_data(soup))
            # Polite delay between requests
            time.sleep(random.uniform(1, 3))
        return results
# Usage example
crawler = SimpleCrawler()
urls = ['https://example.com/page1', 'https://example.com/page2']
data = crawler.run(urls)
print(data)
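The template only prints its results. A minimal sketch of persisting them as JSON Lines (one JSON object per line), assuming the results are a list of dicts like the one returned by run(); the helper names save_jsonl and load_jsonl and the sample records are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def save_jsonl(records, path):
    """Write one JSON object per line, skipping empty records. Returns the count written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            if not record:
                continue
            # ensure_ascii=False keeps non-ASCII titles readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def load_jsonl(path):
    """Read the file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical results, shaped like the list returned by crawler.run(urls)
results = [{"title": "Page 1"}, {}, {"title": "页面 2"}]
out_path = Path(tempfile.gettempdir()) / "crawl_results.jsonl"
written = save_jsonl(results, out_path)
print(written, load_jsonl(out_path))
```

JSON Lines suits crawls because each record is written independently, so a run that stops partway still leaves a readable file.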

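Caveats

⚠️ Legal and ethical guidelines

Follow the site's robots.txt file
Use a reasonable interval between requests so you do not put pressure on the server
Respect copyright and the site's terms for data use
Do not scrape sensitive or protected data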
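The robots.txt rule can be checked in code before each request. A minimal sketch using the standard library's urllib.robotparser; the robots.txt content and the crawler name my-crawler are hypothetical, and the rules are parsed from a string here so the example runs offline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-crawler"  # hypothetical crawler name


def is_allowed(url):
    """Return True if the parsed rules permit USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://example.com/index.html"))    # allowed: no rule matches
print(is_allowed("https://example.com/private/data"))  # blocked by Disallow: /private/
print(parser.crawl_delay(USER_AGENT))                  # delay requested by the site, if any
```

When the site declares a Crawl-delay, that value is a reasonable lower bound for the pause between requests.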

💡 Practical tips

# 1. Route requests through a proxy
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080"
}
# 2. Send browser-like headers for sites with basic anti-scraping checks
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.google.com",
    "Cookie": "your_cookie_here"
}
# 3. Retry failed requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504]
)