Getting Started with Web Crawlers

Introduction to Web Crawlers

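What Is a Crawler

A crawler is a program that simulates a browser visiting the web and then fetches data from the pages it requests.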

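Types of Crawlers

General-purpose crawler (fetches entire pages)
Focused crawler (fetches specific parts of a page)
Incremental crawler (fetches only content that has been updated)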
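
To make the incremental category more concrete, here is a minimal sketch, assuming that "updated" means the page text differs from what was seen on the previous visit. The ChangeTracker class and its method name are hypothetical, and the content is passed in as a string rather than fetched, so nothing here touches the network.

```python
import hashlib


class ChangeTracker:
    """Remembers a fingerprint per URL so unchanged pages can be skipped."""

    def __init__(self):
        # Maps each URL to the SHA-256 hex digest of the content last seen there.
        self._fingerprints = {}

    def is_new_or_updated(self, url, content):
        """Return True if the URL is unseen or its content has changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._fingerprints.get(url) == digest:
            return False  # Same content as last time: nothing to re-crawl.
        self._fingerprints[url] = digest
        return True


tracker = ChangeTracker()
print(tracker.is_new_or_updated("https://example.com/news", "version 1"))  # True (first visit)
print(tracker.is_new_or_updated("https://example.com/news", "version 1"))  # False (unchanged)
print(tracker.is_new_or_updated("https://example.com/news", "version 2"))  # True (updated)
```

A real incremental crawler would probably persist the fingerprints to disk or a database between runs; the in-memory dictionary is only for illustration.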

The Spear and Shield of Crawling

Crawlers
Anti-crawling mechanisms
Counter-anti-crawling strategies

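The robots.txt Protocol

A gentlemen's agreement: a site uses it to declare which content crawlers may and may not fetch. Compliance is voluntary.
Append /robots.txt to a site's root URL to view it.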
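
A crawler can check these rules in code before fetching a page. The sketch below uses urllib.robotparser from the Python standard library. The robots.txt content, the crawler name "my-crawler", and the example.com URLs are all made up for illustration, and the rules are parsed from an inline string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so no network access is needed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# A path that no rule disallows is permitted.
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))  # True
# A path under /private/ is disallowed for every user agent.
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))  # False
```

Against a live site, one would typically call set_url() with the site's /robots.txt address followed by read() instead of parsing an inline string.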

The HTTP Protocol

Concept: a protocol through which a server and a client exchange data.

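Common Request Headers

User-Agent: identifies the client that is making the request
Connection: whether to close the connection or keep it alive once the request completes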
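
As a rough illustration of how these two headers are attached to a request, the sketch below builds a request object with urllib.request from the standard library (chosen here only so the example runs without extra installs). The helper build_request, the User-Agent string, and the example.com URL are assumptions for illustration, and no request is actually sent.

```python
import urllib.request


def build_request(url, user_agent, keep_alive):
    """Create a request object carrying User-Agent and Connection headers."""
    headers = {
        "User-Agent": user_agent,
        "Connection": "keep-alive" if keep_alive else "close",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical values; building the object does not contact the server.
req = build_request("https://example.com/", "Mozilla/5.0 (example crawler)", keep_alive=False)

# urllib stores header names in capitalized form, hence "User-agent" below.
print(req.get_header("User-agent"))  # Mozilla/5.0 (example crawler)
print(req.get_header("Connection"))  # close
```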

Common Response Headers

Content-Type: the type of data the server sends back to the client
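
A Content-Type value usually carries a media type and, for text, often a charset as well. The sketch below splits a header value into those two parts using the standard library's email.message module. The helper name parse_content_type and the sample header values are assumptions for illustration.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


# Sample header values, not taken from any real response.
print(parse_content_type("text/html; charset=UTF-8"))  # ('text/html', 'utf-8')
print(parse_content_type("application/json"))          # ('application/json', None)
```

Knowing the charset matters when saving a page, because the bytes of the response must be decoded with the encoding the server actually used.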

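The HTTPS Protocol

A secure version of the Hypertext Transfer Protocol

Encryption Methods

Symmetric-key encryption
Asymmetric-key encryption
Certificate-based key encryption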
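
On the certificate side, Python's standard ssl module gives a small view of what a client checks by default when it opens an HTTPS connection. This is only a sketch using the standard library; it creates a context object and inspects its settings without connecting to any server.

```python
import ssl

# The default client-side context is configured to validate server certificates.
context = ssl.create_default_context()

# The server must present a certificate that chains to a trusted authority.
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
# The certificate must also match the host name being contacted.
print(context.check_hostname)  # True
```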

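Making Requests with requests

The requests module: a third-party Python library for sending HTTP requests (it is installed with pip rather than shipped with Python). It is powerful, simple, and convenient to use.

Purpose:

Simulate a browser sending requests.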

How to use it (the requests coding workflow):

Specify the URL
Send the request
Get the response data
Persist the data

Installation:
pip install requests

Hands-On Coding

Requirement: crawl the page data of the Sogou home page

import requests

if __name__ == "__main__":
    # Specify the URL
    url = 'https://www.sogou.com/'
    # Send the request
    response = requests.get(url=url)
    # Get the response data; .text returns the body as a string
    page_text = response.text
    print(page_text)
    # Persist the data
    with open("./sogou.html", 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Finished crawling")

https://blog.csdn.net/m0_46778548/article/details/121201868
File read and write operations in Python

https://blog.csdn.net/qiqicos/article/details/79200089
How to use with ... as ... in Python

Requirement: crawl the Sogou search results page for a specified term

Key point: UA spoofing

import requests
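# UA: User-Agent (identifies the client making the request)
# UA detection: a site's server checks the User-Agent of each incoming request;
# if it does not look like a normal browser, the server will likely reject the request

# UA spoofing
if __name__ == "__main__":
    # Specify the URL
    url = 'https://www.sogou.com/web'

    # UA spoofing: put a browser's User-Agent string into a dictionary
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}

    # Collect the query parameter and wrap it in a dictionary
    ws = input("Please enter a word:")
    param = {'query': ws}

    # Send the request; the URL carries the query parameters, which requests encodes for us
    response = requests.get(url=url, params=param, headers=headers)

    # Get the response data and persist it
    page_text = response.text
    print(page_text)
    with open("./" + ws + ".html", 'w', encoding='utf-8') as fp:
        fp.write(page_text)

    print("Finished crawling")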
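
To see roughly what a URL that carries parameters looks like, the sketch below builds the query string by hand with urllib.parse.urlencode from the standard library. The helper build_search_url is hypothetical, and the exact URL that requests produces may differ in minor details, so treat this as an approximation of what happens when params is passed.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append percent-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# An ASCII search term passes through unchanged.
print(build_search_url("https://www.sogou.com/web", {"query": "python"}))
# https://www.sogou.com/web?query=python

# Non-ASCII terms are percent-encoded as UTF-8 bytes.
print(build_search_url("https://www.sogou.com/web", {"query": "爬虫"}))
# https://www.sogou.com/web?query=%E7%88%AC%E8%99%AB
```

Passing the dictionary through params, as the script above does, avoids having to do this encoding manually.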

https://www.runoob.com/python3/python-requests.html
A detailed guide to the requests module

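This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. Please credit the source when reposting.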