Crawlers usually parse pages in one of three ways: regular expressions, XPath, and bs4 (BeautifulSoup).

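Regular expressions work directly on the raw HTML string, which generally makes them the fastest option. XPath and bs4 cannot work on the raw string: lxml and bs4 respectively must first parse it into a document tree before any data can be extracted.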

I. BS4 parsing

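In bs4, a soup object is created with soup = BeautifulSoup(html, 'lxml'). Four parsers are available: html.parser, lxml, xml, and html5lib. As a rough rule of thumb, lxml handles about 90% of pages, and html5lib, which is slower but more lenient, handles the remaining 10%; when lxml cannot parse a page, switch to html5lib. Both are third-party packages: run pip install lxml before using 'lxml', and pip install html5lib before using soup = BeautifulSoup(html, 'html5lib').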
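One way to act on the parser advice above is to try parsers in order of preference. The sketch below is only an illustration: make_soup is a hypothetical helper name, not part of bs4, and the sample HTML is made up. It falls back only when a parser is not installed; a page that lxml parses badly still needs a manual switch to html5lib.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first installed parser."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup = make_soup("<html><body><p>hi</p></body></html>")
first_paragraph = soup.p.string
```

Because html.parser ships with Python, the last entry in the tuple should always be available.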

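# Import the libraries
from bs4 import BeautifulSoup
import requests

# Fetch the page; header is a dict that carries the User-Agent
response = requests.get(url, headers=header)
html = response.text

# Instantiate the parser
soup = BeautifulSoup(html, 'lxml')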

# Extract data

1. Get all div tags
divs = soup.find_all('div')

   
2. Get a specific div

div = soup.find_all('div')[1]  # index 1 is the second div

   
3. Get the second through tenth divs

divs = soup.find_all('div')[1:10]

   
4. Get div tags with id=even

divs = soup.find_all('div', id='even')
divs = soup.find_all('div', id='even', class_='123')  # multiple attributes
divs = soup.find_all('div', attrs={'id': 'even', 'class': '123'})  # multiple attributes
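The div lookups in items 1 to 4 can be tried on a small document. This is a minimal sketch: the HTML snippet, its id and class values, and the variable names are invented for illustration, and html.parser is used only so that no extra install is needed.

```python
from bs4 import BeautifulSoup

html = """
<div id="odd" class="row">first</div>
<div id="even" class="row highlight">second</div>
<div id="last" class="row">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_divs = soup.find_all("div")            # every div, in document order
second_div = soup.find_all("div")[1]       # indexing works like a list
tail = soup.find_all("div")[1:10]          # slicing past the end is safe
by_id = soup.find_all("div", id="even")    # filter on one attribute
by_both = soup.find_all("div", attrs={"id": "even", "class": "highlight"})
```

Note that class_ (or the 'class' key in attrs) matches when the given value is one of the tag's classes, so 'highlight' matches class="row highlight".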
   
5. Get a tag's attribute values

alist = soup.find_all('a')
for a in alist:
    # Option 1
    href = a['href']
    # Option 2
    href = a.attrs['href']
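Both forms in item 5 raise KeyError when a tag has no href. A hedged alternative, shown here on made-up HTML, is the tag's get() method, which returns None for a missing attribute.

```python
from bs4 import BeautifulSoup

html = '<a href="/one">One</a><a>No link</a><a href="/two">Two</a>'
soup = BeautifulSoup(html, "html.parser")

# get() returns None instead of raising when the attribute is absent
hrefs = [a.get("href") for a in soup.find_all("a")]

# Keep only the links that actually exist
valid_hrefs = [h for h in hrefs if h is not None]
```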
 
   
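6. Get specific information

divs = soup.find_all('div')[1:]
# Get the text of a particular tag
for div in divs:
    a = div.find_all('a')[0]  # first a tag inside the div
    name = a.string
    age = div.find_all('span')[1].string
    ...
# Get all the text inside a div tag
for div in divs:
    # The list may contain '\n' and other whitespace-only strings
    info = list(div.strings)
    # Drop the meaningless whitespace
    info = list(div.stripped_strings)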
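The difference between strings and stripped_strings in item 6 is easiest to see on a small example. The snippet below is a sketch on invented HTML; the name and age fields are assumptions that mirror the loop above.

```python
from bs4 import BeautifulSoup

html = "<div>\n  <a>Alice</a>\n  <span>x</span>\n  <span>30</span>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

name = div.find("a").string              # text of the first a tag
age = div.find_all("span")[1].string     # text of the second span

raw = list(div.strings)                  # includes whitespace-only strings
clean = list(div.stripped_strings)       # whitespace trimmed, empties dropped
print(clean)  # ['Alice', 'x', '30']
```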

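(1) Look up by tag name

soup.a returns only the first matching tag

(2) Get attributes

soup.a.attrs returns all attributes of a and their values as a dictionary

soup.a.attrs['href'] returns the href attribute

soup.a['href'] is the shorthand form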

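(3) Get content

soup.a.string

soup.a.text

soup.a.get_text()

[Note] If a tag contains more than one child node, for example text mixed with nested tags, string returns None, while the other two still return the text content.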
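The note above can be checked directly. This is a small sketch on made-up HTML that compares the three accessors on a tag with nested content and on a tag with plain text.

```python
from bs4 import BeautifulSoup

html = "<a href='/x'>Hello <b>world</b></a><p>plain</p>"
soup = BeautifulSoup(html, "html.parser")

attrs = soup.a.attrs                 # {'href': '/x'}
nested_string = soup.a.string        # None: a has two children
nested_text = soup.a.text            # 'Hello world'
nested_get_text = soup.a.get_text()  # 'Hello world'
plain_string = soup.p.string         # 'plain': p has a single text child
```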

(4) find: returns the first matching tag

soup.find('a') returns the first match

soup.find('a', ...) also accepts attribute filters as keyword arguments:

soup.find('a', class_="xxx")

soup.find('a', id="xxx")

(5) find_all: returns all matching tags

soup.find_all('a')

soup.find_all(['a', 'b']) returns all a and b tags

soup.find_all('a', limit=2) returns only the first two
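The find and find_all calls from (4) and (5) can be compared side by side. The HTML and the class and id values below are invented for the illustration.

```python
from bs4 import BeautifulSoup

html = (
    "<a class='nav' id='home'>Home</a>"
    "<b>bold</b>"
    "<a class='nav'>About</a>"
    "<a>Contact</a>"
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                  # first a tag only
home = soup.find("a", id="home")        # first a tag with id="home"
missing = soup.find("table")            # None when nothing matches

a_and_b = soup.find_all(["a", "b"])     # all a and b tags, document order
limited = soup.find_all("a", limit=2)   # stop after two matches
```

Because find returns None when nothing matches, it is worth checking the result before reading attributes from it.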

II. Regular-expression parsing

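Single characters:

. : any character except a newline

[] : [aoe] [a-w] matches any one character in the set

\d : a digit, [0-9]

\D : a non-digit

\w : a digit, letter, underscore, or Chinese character

\W : anything \w does not match

\s : any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].

\S : any non-whitespace character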
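The character classes above can be tried with Python's re module. This is a minimal sketch; the sample strings are made up.

```python
import re

text = "id_42: 张三\t2024"

digits = re.findall(r"\d", text)             # each digit separately
words = re.findall(r"\w+", text)             # runs of word characters
spaces = re.findall(r"\s", text)             # the space and the tab
vowels = re.findall(r"[aoe]", "hello world")  # any one of a, o, e
no_newline = re.findall(r".", "a\nb")        # . skips the newline

print(words)  # ['id_42', '张三', '2024']
```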

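Quantifiers:

* : any number of times, >= 0

+ : at least once, >= 1

? : optional, 0 or 1 times

{m} : exactly m times, e.g. hello{3}

{m,} : at least m times, e.g. hello{3,}

{m,n} : between m and n times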
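A quantifier applies only to the item immediately before it, so hello{3} repeats the final o, not the whole word. The checks below are a small sketch using re.fullmatch, which succeeds only if the entire string matches.

```python
import re

star_ok = re.fullmatch(r"ab*", "a") is not None             # b zero times
plus_fail = re.fullmatch(r"ab+", "a") is None               # b needed once
optional_ok = re.fullmatch(r"colou?r", "color") is not None

exact_ok = re.fullmatch(r"hello{3}", "hellooo") is not None
exact_fail = re.fullmatch(r"hello{3}", "hello") is None
at_least_ok = re.fullmatch(r"hello{3,}", "helloooo") is not None
range_fail = re.fullmatch(r"\d{2,4}", "12345") is None      # five digits
```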

Boundaries:

$ : matches at the end of the string

^ : matches at the start of the string

Grouping:

(ab) : treats ab as a single unit and captures what it matches

Greedy mode: .*

Non-greedy (lazy) mode: .*?
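The difference matters when extracting tag contents from HTML. In this sketch, on a made-up string, the greedy pattern runs to the last closing tag while the lazy one stops at the first; the parentheses are a capturing group, so findall returns only the captured part.

```python
import re

html = "<b>one</b><b>two</b>"

greedy = re.findall(r"<b>(.*)</b>", html)   # ['one</b><b>two']
lazy = re.findall(r"<b>(.*?)</b>", html)    # ['one', 'two']
```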

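re.I : ignore case

re.M : multi-line mode, ^ and $ match at the start and end of every line

re.S : single-line mode, . also matches newlines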
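The three flags can be combined with |. The following sketch, on invented sample text, shows what each one changes.

```python
import re

text = "Title: A\ntitle: B"

plain = re.findall(r"^title", text)                    # case-sensitive
ignore_case = re.findall(r"^title", text, re.I)        # start of string only
multi_line = re.findall(r"^title", text, re.I | re.M)  # start of each line

block = "<p>line1\nline2</p>"
without_s = re.search(r"<p>(.*)</p>", block)           # . stops at \n
with_s = re.search(r"<p>(.*)</p>", block, re.S)        # . crosses \n
```

re.S is the flag most often needed when scraping, since HTML elements frequently span several lines.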

re.sub(pattern, replacement, string)
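Two typical uses of re.sub when cleaning scraped text are sketched below; the input strings are made up, and the tag-stripping pattern is a rough illustration, not a robust HTML cleaner.

```python
import re

# Collapse any run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "a \n\t b")

# Remove anything that looks like a tag
stripped = re.sub(r"<[^>]+>", "", "<p>Hi <b>there</b></p>")

print(collapsed)  # a b
print(stripped)   # Hi there
```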

III. XPath parsing

from lxml import etree
tree = etree.HTML(html)  # parses an HTML string; etree.parse() expects a file or path
tree.xpath("XPath expression")
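A minimal sketch of the XPath workflow on an in-memory string follows. The HTML and the XPath expressions are invented for illustration; etree.HTML is used because the input is a string of HTML, not a file.

```python
from lxml import etree

html = (
    "<html><body>"
    "<div id='even'><a href='/one'>One</a><a href='/two'>Two</a></div>"
    "<div id='odd'><a href='/three'>Three</a></div>"
    "</body></html>"
)
tree = etree.HTML(html)

# Attribute values of the a tags directly under the div with id="even"
hrefs = tree.xpath("//div[@id='even']/a/@href")

# Text of every a tag in the document
texts = tree.xpath("//a/text()")

print(hrefs)  # ['/one', '/two']
```

xpath() always returns a list here, so an expression that matches nothing yields an empty list, not None.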