Scraping the Sunshine Procurement Platform in Python with a Simulated User Login

Published: 2019-06-09 21:55:20

    Original content; please credit the source when reposting: https://www.cnblogs.com/Lucy151213/p/10968868.html

    The Sunshine Procurement platform publishes the current month's prices at the beginning of each month. This script simulates a user login, saves the needed data to a CSV file and a database, and emails the file to the designated recipients. I am a Python beginner and hit plenty of pitfalls along the way; this post records them.

    Environment: Python 2.7
    IDE: PyCharm
    Runtime: CentOS 7
    Scheduling: a cron job runs this script at 1 a.m. on the 1st of every month
    Features: logs in automatically with the account, password, and the decoded captcha; parses the required data and saves it to a CSV file and a MySQL database; emails the CSV file to the designated recipients once the crawl finishes; reconnects automatically if a request is dropped.

    Setting up the development environment:

    There are plenty of tutorials online, so I won't repeat them. After installing Python, install the required libraries:

    bs4 (HTML parsing)

    csv (writing the CSV file; part of the standard library)

    smtplib (sending email; part of the standard library)

    mysql.connector (connecting to MySQL)

    Some of the downloads are shared on my network drive, including leptonica-1.72.tar.gz, Tesseract 3.04.00.tar.gz, and the language packs:

    链接:https://pan.baidu.com/s/1J4SZDgmn6DpuQ1EHxE6zkw
    提取码:crbl

    Image recognition:

    There are many tutorials online as well; below is a set of steps, verified on CentOS 7, for installing the OCR libraries.

    • Since we build from source, first install the compiler toolchain:

    yum install gcc gcc-c++ make

    yum install autoconf automake libtool

    • Install the image-format support libraries; without them the later Tesseract commands will fail:

    yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

    • Install leptonica, a library that Tesseract requires. Download it, copy it to the server, then unpack and build it:

    Download: http://www.leptonica.org/

    # run in the leptonica source directory

    ./configure

    make

    make install

    • Download the matching Tesseract release

    Download: https://link.jianshu.com/?t=https://github.com/tesseract-ocr/tesseract/wiki/Downloads

    # run in the tesseract-3.04.00 directory

    ./autogen.sh

    ./configure

    make

    make install

    ldconfig

    • Download the language packs

    Download: https://github.com/tesseract-ocr/tessdata

    Put the downloaded files into the tessdata directory.

    • Environment configuration

    Copy tessdata: cp -r tessdata /usr/local/share

    Edit the environment variables:

    Open the profile: vi /etc/profile

    Add this line: export TESSDATA_PREFIX=/usr/local/share/tessdata

    Apply it: source /etc/profile

    • Test

    tesseract -v prints Tesseract's version information. If it runs without errors, the installation succeeded.

    Put an image image.png in the current directory and run: tesseract image.png 123

    This generates 123.txt in the current directory, containing the recognized text.

    • Install pytesseract

    This library lets Python code call tesseract.

    Command: pip install pytesseract

    Test code:

    import pytesseract
    from PIL import Image

    im1 = Image.open('image.png')
    print(pytesseract.image_to_string(im1))

    The code:

    The data I want to fetch looks like this:

    First get the total number of pages, then visit each page in a loop, saving its data to the CSV file and the database. If visiting a page throws an exception, record the number of the broken page, log in again, and resume crawling from that page.
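    The resume-from-broken-page strategy described above can be sketched as follows. This is a minimal, self-contained illustration (Python 3 syntax); `fetch_page`, `save_page`, and `login` are hypothetical stand-ins for the script's real helpers, not code from the article:

    ```python
    def crawl_all(total_pages, fetch_page, save_page, login, max_retries=5):
        """Loop over every page; if a page raises, re-login and resume
        from the page that broke."""
        page, retries = 1, 0
        while page <= total_pages and retries < max_retries:
            try:
                save_page(fetch_page(page))
                page += 1  # advance only after a successful save
            except Exception:
                retries += 1
                login()  # re-login, then retry the same ("broken") page

    # toy run: page 3 fails once before succeeding
    calls = {"login": 0}
    def fetch(p):
        if p == 3 and calls["login"] == 0:
            raise IOError("connection dropped")
        return p
    saved = []
    crawl_all(5, fetch, saved.append, lambda: calls.update(login=calls["login"] + 1))
    ```

    After the toy run, all five pages end up saved even though page 3 failed once, because the loop re-logs in and retries the same page instead of skipping it.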

     

    gl.py holds the global variables:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import time

    timeStr = time.strftime('%Y%m%d', time.localtime(time.time()))
    monthStr = time.strftime('%m', time.localtime(time.time()))
    yearStr = time.strftime('%Y', time.localtime(time.time()))
    LOG_FILE = "log/" + timeStr + '.log'
    csvFileName = "csv/" + timeStr + ".csv"
    fileName = timeStr + ".csv"
    fmt = '%(asctime)s - %(filename)s:%(lineno)s  - %(message)s'
    loginUrl = "http://yourpath/Login.aspx"
    productUrl = 'http://yourpath/aaa.aspx'
    username = 'aaaa'
    password = "aaa"
    preCodeurl = "yourpath"
    host = "yourip"
    user = "aaa"
    passwd = "aaa"
    db = "mysql"
    charset = "utf8"
    postData = {
        '__VIEWSTATE': '',
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        'btnLogin': "登录",
        'txtUserId': 'aaaa',
        'txtUserPwd': 'aaa',
        'txtCode': '',
        'hfip': 'yourip'
    }
    tdd = {
        '__VIEWSTATE': '',
        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$AspNetPager1',
        'ctl00$ContentPlaceHolder1$AspNetPager1_input': '1',
        'ctl00$ContentPlaceHolder1$AspNetPager1_pagesize': '50',
        'ctl00$ContentPlaceHolder1$txtYear': '',
        'ctl00$ContentPlaceHolder1$txtMonth': '',
        '__EVENTARGUMENT': '',
    }
    vs = {
        '__VIEWSTATE': ''
    }

    The main script sets up logging, the CSV writer, the database connection, and the cookie jar:

    handler = logging.handlers.RotatingFileHandler(gl.LOG_FILE, maxBytes=1024 * 1024, backupCount=5)
    formatter = logging.Formatter(gl.fmt)
    handler.setFormatter(formatter)
    logger = logging.getLogger('tst')
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    csvFile = codecs.open(gl.csvFileName, 'w+', 'utf_8_sig')
    writer = csv.writer(csvFile)
    conn = mysql.connector.connect(host=gl.host, user=gl.user, passwd=gl.passwd, db=gl.db, charset=gl.charset)
    cursor = conn.cursor()

    cookiejar = cookielib.MozillaCookieJar()
    cookieSupport = urllib2.HTTPCookieProcessor(cookiejar)
    httpsHandLer = urllib2.HTTPSHandler(debuglevel=0)
    opener = urllib2.build_opener(cookieSupport, httpsHandLer)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
    urllib2.install_opener(opener)

    The login routine:

    First decode the captcha into digits, then submit the username, password, and captcha to the login endpoint. This can fail, because the captcha is sometimes recognized incorrectly; on failure, fetch a new captcha, decode it, and log in again until it succeeds.

    def get_logined_Data(opener, logger, views):
        print "get_logined_Data"
        indexCount = 1
        retData = None
        while indexCount <= 15:
            print "begin login ", str(indexCount), " time"
            logger.info("begin login " + str(indexCount) + " time")
            vrifycodeUrl = gl.preCodeurl + str(random.random())
            text = get_image(vrifycodeUrl)  # helper that fetches the captcha URL and returns the recognized digits
            postData = gl.postData
            postData["txtCode"] = text
            postData["__VIEWSTATE"] = views

            data = urllib.urlencode(postData)
            try:
                headers22 = {
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                    'Accept-Encoding': 'gzip, deflate, br',
                    'Accept-Language': 'zh-CN,zh;q=0.9',
                    'Connection': 'keep-alive',
                    'Content-Type': 'application/x-www-form-urlencoded',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
                }
                request = urllib2.Request(gl.loginUrl, data, headers22)
                opener.open(request)
            except Exception as e:
                print "catch Exception when login"
                print e

            request = urllib2.Request(gl.productUrl)
            response = opener.open(request)
            dataPage = response.read().decode('utf-8')

            bsObj = BeautifulSoup(dataPage, 'html.parser')
            tabcontent = bsObj.find(id="tabcontent")  # this element only exists after login, so its presence means login succeeded
            if tabcontent is not None:
                print "login successfully"
                logger.info("login successfully")
                retData = bsObj
                break
            else:
                print "login failed,try again"
                logger.info("login failed,try again")
                time.sleep(3)
                indexCount += 1
        return retData

    Inspecting the pages shows that every data request must carry the '__VIEWSTATE' parameter, which is embedded in the page itself, so '__VIEWSTATE' has to be extracted from each response and sent along with the request for the next page.
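    Extracting the hidden field can be sketched with just the standard library (Python 3's html.parser here; the script itself does this with BeautifulSoup's find(id="__VIEWSTATE"), and the sample page string below is made up for illustration):

    ```python
    from html.parser import HTMLParser

    class ViewStateParser(HTMLParser):
        """Collect the value of the hidden <input id="__VIEWSTATE"> field."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.viewstate = None

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs for the current tag
            a = dict(attrs)
            if tag == "input" and a.get("id") == "__VIEWSTATE":
                self.viewstate = a.get("value")

    page = '<form><input type="hidden" id="__VIEWSTATE" value="dDwtMTI3OTMz" /></form>'
    parser = ViewStateParser()
    parser.feed(page)
    # parser.viewstate now holds the token to post back with the next request
    ```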

     

    Captcha decoding:

    Download the captcha image from its URL and save it locally. The captcha is in color, so it first has to be converted to grayscale before being passed to OCR. The captcha is four digits, but OCR sometimes returns letters, so commonly confused letters are mapped back to digits by hand; after that, the recognition rate is acceptable.

    # Return the digits in the captcha; only a 4-digit result counts as valid
    def get_image(codeurl):
        print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + " begin get code num")
        index = 1
        while index <= 15:
            file = urllib2.urlopen(codeurl).read()
            im = cStringIO.StringIO(file)
            img = Image.open(im)
            imgName = "vrifycode/" + gl.timeStr + "_" + str(index) + ".png"
            print 'begin get vrifycode'
            text = convert_image(img, imgName)
            print "vrifycode", index, ":", text
            # logger.info('vrifycode' + str(index) + ":" + text)

            if len(text) != 4 or text.isdigit() == False:  # a captcha that is not exactly 4 digits must be wrong
                print 'vrifycode:', index, ' is wrong'
                index += 1
                time.sleep(2)
                continue
            return text

    # Convert the captcha image to digits
    def convert_image(image, impName):
        print "enter convert_image"
        image = image.convert('L')  # convert to grayscale
        image2 = Image.new('L', image.size, 255)
        for x in range(image.size[0]):
            for y in range(image.size[1]):
                pix = image.getpixel((x, y))
                if pix < 90:  # pixels with a gray value below 90 become black
                    image2.putpixel((x, y), 0)
        print "begin save"
        image2.save(impName)  # save the thresholded image so the result can be inspected
        print "begin convert"
        text = pytesseract.image_to_string(image2)
        print "end convert"
        snum = ""
        for j in text:  # map letters that OCR commonly confuses back to digits
            if j == 'Z':
                snum += "2"
            elif j == 'T':
                snum += "7"
            elif j == 'b':
                snum += "5"
            elif j == 's':
                snum += "8"
            elif j == 'S':
                snum += "8"
            elif j == 'O':
                snum += "0"
            elif j == 'o':
                snum += "0"
            else:
                snum += j
        return snum
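    The letter-to-digit cleanup in convert_image can also be written as a translation table. This is a behavior-equivalent sketch in Python 3 syntax (str.maketrans/str.translate), not the article's original code:

    ```python
    # Same mapping as the if/elif chain above, expressed as a table
    OCR_FIXES = str.maketrans({"Z": "2", "T": "7", "b": "5",
                               "s": "8", "S": "8", "O": "0", "o": "0"})

    def fix_ocr_digits(text):
        """Map characters that Tesseract commonly confuses back to digits."""
        return text.translate(OCR_FIXES)
    ```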

    Data conversion:

    Convert the HTML rows into a list of tuples, used both when writing the CSV file and when inserting into the database:

    def paras_data(nameList, logger):
        data = []
        mainlist = nameList
        rows = mainlist.findAll("tr", {"class": {"row", "alter"}})
        try:
            if len(rows) != 0:
                for name in rows:
                    tds = name.findAll("td")
                    if tds is None:
                        print "get tds is null"
                        logger.info("get tds is null")
                    else:
                        item = []
                        for index in range(len(tds)):
                            s_span = (tds[index]).find("span")
                            if s_span is not None:
                                tmp = s_span["title"]
                            else:
                                tmp = (tds[index]).get_text()
                            item.append(tmp.encode('utf-8'))
                        item.append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))  # timestamp of when this row was fetched
                        data.append(tuple(item))

        except Exception as e:
            print "catch exception when parsing data", e
            logger.info("catch exception when parsing data" + e.message)
        return data

    Saving the CSV file:

    def save_to_csv(data ,writer):
        for d in data:
            if d is not None:
                writer.writerow(d)
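    The writer passed in above comes from the setup code (codecs.open with 'utf_8_sig'). A self-contained sketch of the same pattern in Python 3, where open(..., encoding='utf-8-sig') writes the byte-order mark that lets Excel detect the encoding (file name and rows are made up for illustration):

    ```python
    import csv

    rows = [("item", "price"), ("steel", "3500")]
    # utf-8-sig prepends a BOM so Excel opens the file with the right encoding
    with open("demo.csv", "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if row is not None:
                writer.writerow(row)
    ```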

    Saving to the database:

    def save_to_mysql(data, conn, cursor):
        try:
            cursor.executemany(
                "INSERT INTO `aaa`(aaa,bbb) VALUES (%s,%s)",
                data)
            conn.commit()
        except Exception as e:
            print "catch exception when save to mysql", e

    Fetching the data of a given page:

    def get_appointed_page(snum, opener, vs, logger):
        tdd = get_tdd()
        tdd["__VIEWSTATE"] = vs['__VIEWSTATE']
        tdd["__EVENTARGUMENT"] = snum
        tdd = urllib.urlencode(tdd)
        op = opener.open(gl.productUrl, tdd)
        if op.getcode() != 200:
            print("the " + str(snum) + " page,state not 200,try connect again")
            return None
        data = op.read().decode('utf-8', 'ignore')
        bsObj = BeautifulSoup(data, "lxml")
        nameList = bsObj.find("table", {"class": "mainlist"})
        if nameList is None:  # find() returns None when the table is missing
            return None
        viewState = bsObj.find(id="__VIEWSTATE")
        if viewState is None:
            logger.info("the other page,no viewstate,try connect again")
            print("the other page,no viewstate,try connect again")
            return None
        vs['__VIEWSTATE'] = viewState["value"]
        return nameList

    The main loop:

    flag = True      # initial values, implied by the checks below
    logintime = 1
    snum = 1
    totalNum = -1
    views = ''
    while flag and logintime < 50:
        try:
            print "global login the ", str(logintime), " times"
            logger.info("global login the " + str(logintime) + " times")
            bsObj = get_logined_Data(opener, logger, views)
            if bsObj is None:
                print "tried login 15 times,but failed,exit"
                logger.info("tried login 15 times,but failed,exit")
                exit()
            else:
                print "global login the ", str(logintime), " times successfully!"
                logger.info("global login the " + str(logintime) + " times successfully!")
                viewState_Source = bsObj.find(id="__VIEWSTATE")
                if totalNum == -1:
                    totalNum = get_totalNum(bsObj)
                    print "totalNum:", str(totalNum)
                    logger.info("totalnum:" + str(totalNum))
                vs = gl.vs
                if viewState_Source is not None:
                    vs['__VIEWSTATE'] = viewState_Source["value"]

                # fetch the pages one by one, resuming from snum
                while snum <= totalNum:
                    print "begin get the ", str(snum), " page"
                    logger.info("begin get the " + str(snum) + " page")
                    nameList = get_appointed_page(snum, opener, vs, logger)
                    if nameList is None:
                        print "get the nameList failed,connect again"
                        logger.info("get the nameList failed,connect again")
                        raise Exception
                    else:
                        print "get the ", str(snum), " successfully"
                        logger.info("get the " + str(snum) + " successfully")

                        mydata = paras_data(nameList, logger)
                        # save to the CSV file (two arguments, matching the definition above)
                        save_to_csv(mydata, writer)
                        # save to the database
                        save_to_mysql(mydata, conn, cursor)

                        snum += 1
                        time.sleep(3)

            flag = False
        except Exception as e:
            logintime += 1
            print "catch exception", e
            logger.error("catch exception" + e.message)

    Setting up the cron job:

    cd /var/spool/cron/

    crontab -e    # edit the cron table

    Entry: 1 1 1 * * /yourpath/normal_script.sh>>/yourpath/cronlog.log  2>&1

    (This cron entry runs normal_script.sh at 01:01 on the 1st of every month and appends its output to cronlog.log.)

    Directory layout:

     Source download: helloworld.zip
