
Scraping A-level taxpayer credit information from the 国家税务总局 (State Taxation Administration) with POST requests

Posted: 2019-04-22 07:53:12


Result preview

As the preview shows, the goal is to scrape the lists of A-level taxpayer credit ratings (纳税信用A级纳税人) that the 国家税务总局 (State Taxation Administration) publishes for each region.

Basic code

import requests

# NOTE: the host in the original article is truncated; fill in the full domain of the 国家税务总局 query service.
URL = 'http://hd./service/findCredit.do'
HEADER = {
    'Cookie': 'yfx_c_g_u_id_10003701=_ck20010211232618635509545356418; yfx_f_l_v_t_10003701=f_t_1577935406837__r_t_1577935406837__v_t_1577935406837__r_c_0; _Jo0OQK=21D020D4328410D73BDFA09A917AEB40E7167BB9651465E23A1380D81E8442706A3AA19408E6AD7127D826C47C034D0B2FB18F11F307B478FB63F657E29B5865DD71B918CCA8FE3BB9470EE0D297309F84070EE0D297309F840F2431DD9ED637E4DF76A79A067B4GJ1Z1OA==; JSESSIONID=F82521F2DFB764BB975730AD95DEE54B',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
}
# POST form fields: result page number, region code (110000 = 北京), plus empty taxpayer-code/name/year filters.
findCredit = {
    'page': 0,
    'location': '110000',
    'code': '',
    'name': '',
    'evalyear': '',
}
r = requests.post(URL, data=findCredit, headers=HEADER)
print(r)
print(r.text)
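If the request goes through, r.text is a JSON document; the refined scripts below read its totalPages field and its content list of records. As a quick check before moving on, a minimal sketch appended to the basic script above (assuming the response really is JSON with those fields) could look like this:

if r.status_code == requests.codes.ok:
    payload = r.json()
    # total number of result pages; each page holds 15 records
    print('totalPages:', payload['totalPages'])
    # peek at the first few records and the fields the scripts below extract
    for record in payload['content'][:3]:
        print(record['code'], record['name'], record['evalyear'], record['location'])
else:
    print('request failed with status', r.status_code)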

Further refinements

Method 1: search and download by entering a region name and a year

import requests
import pandas as pd


def getData(pageNum, placecode, year, return_total_count=False):
    # NOTE: the host in the original article is truncated; fill in the full domain of the query service.
    URL = 'http://hd./service/findCredit.do'
    HEADER = {
        'Cookie': 'yfx_c_g_u_id_10003701=_ck20010211232618635509545356418; yfx_f_l_v_t_10003701=f_t_1577935406837__r_t_1577935406837__v_t_1577935406837__r_c_0; _Jo0OQK=21D020D4328410D73BDFA09A917AEB40E7167BB9651465E23A1380D81E8442706A3AA19408E6AD7127D826C47C034D0B2FB18F11F307B478FB63F657E29B5865DD71B918CCA8FE3BB9470EE0D297309F84070EE0D297309F840F2431DD9ED637E4DF76A79A067B4GJ1Z1OA==; JSESSIONID=F82521F2DFB764BB975730AD95DEE54B',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    }
    # POST form fields: result page number, region code and evaluation year; taxpayer code/name are left empty.
    findCredit = {
        'page': pageNum,
        'location': placecode,
        # 'cPage': 5,
        'code': '',
        'name': '',
        'evalyear': year,
    }
    r = requests.post(URL, data=findCredit, headers=HEADER)
    print(r)
    # print(r.text)
    if r.status_code == requests.codes.ok:
        my_query = r.json()
        if return_total_count:
            # only the total page count is needed
            return my_query['totalPages']
        # collect the records of this result page into a DataFrame
        rows = []
        for each in my_query['content']:
            rows.append({'code': str(each['code']),
                         'name': str(each['name']),
                         'evalyear': str(each['evalyear']),
                         'location': str(each['location'])})
        return pd.DataFrame(rows, columns=['code', 'name', 'evalyear', 'location'])
    else:
        # non-200 response, usually because the IP has been throttled
        return ''


if __name__ == '__main__':
    key = input('请输入地方名称:')
    if key == "北京": code = 110000
    elif key == "天津": code = 120000
    elif key == "河北": code = 130000
    elif key == "内蒙古": code = 150000
    elif key == "辽宁": code = 210000
    elif key == "大连": code = 210200
    elif key == "黑龙江": code = 230000
    elif key == "上海": code = 310000
    elif key == "江苏": code = 320000
    elif key == "宁波": code = 330200
    elif key == "安徽": code = 340000
    elif key == "福建": code = 350000
    elif key == "江西": code = 360000
    elif key == "山东": code = 370000
    elif key == "青岛": code = 370200
    elif key == "湖北": code = 420000
    elif key == "湖南": code = 430000
    elif key == "广东": code = 440000
    elif key == "广西": code = 450000
    elif key == "海南": code = 460000
    elif key == "重庆": code = 500000
    elif key == "贵州": code = 520000
    elif key == "云南": code = 530000
    elif key == "西藏": code = 540000
    elif key == "甘肃": code = 620000
    elif key == "青海": code = 630000
    elif key == "宁夏": code = 640000
    else:
        print('没有此数据')
        raise SystemExit  # stop here, otherwise `code` is undefined below

    year = input('请输入下载年份-(eg:):')
    tpages = getData(0, code, year, True)
    if tpages == '':
        print('访问过于频繁1,暂停访问!')
    else:
        x = tpages * 15
        print(f'{year}年{key}共有{tpages}页,每页15条记录,共有{x}行')
        df = pd.DataFrame({'code': [], 'name': [], 'evalyear': [], 'location': []})
        i = 0
        while i < tpages:
            df1 = getData(i, code, year)
            if len(df1) == 0:
                # blocked or empty response: save what has been downloaded so far and stop
                print('访问过于频繁2,下载中断,临时保存')
                df.to_csv(f'{year}年{key}tax.csv')
                break
            else:
                print(f'正在下载第{i}页数据')
                df = pd.concat([df, df1], ignore_index=True)
                i = i + 1
        df.to_csv(f'{year}年{key}tax.csv')
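The long if/elif chain that turns a region name into its code can also be written as a dictionary lookup. The sketch below is only a suggested variant, not part of the original script; the codes are the ones listed above.

# Sketch: dictionary-based replacement for the name-to-code branch chain.
PLACE_CODES = {
    '北京': 110000, '天津': 120000, '河北': 130000, '内蒙古': 150000,
    '辽宁': 210000, '大连': 210200, '黑龙江': 230000, '上海': 310000,
    '江苏': 320000, '宁波': 330200, '安徽': 340000, '福建': 350000,
    '江西': 360000, '山东': 370000, '青岛': 370200, '湖北': 420000,
    '湖南': 430000, '广东': 440000, '广西': 450000, '海南': 460000,
    '重庆': 500000, '贵州': 520000, '云南': 530000, '西藏': 540000,
    '甘肃': 620000, '青海': 630000, '宁夏': 640000,
}

key = input('请输入地方名称:')
code = PLACE_CODES.get(key)          # None when the name is not in the table
if code is None:
    print('没有此数据')
    raise SystemExit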

Method 2: search and download by entering a region code and a year (for when typing Chinese characters is inconvenient)

import requests
import pandas as pd


def getData(pageNum, placecode, year, return_total_count=False):
    # NOTE: the host in the original article is truncated; fill in the full domain of the query service.
    URL = 'http://hd./service/findCredit.do'
    HEADER = {
        'Cookie': 'yfx_c_g_u_id_10003701=_ck20010211232618635509545356418; yfx_f_l_v_t_10003701=f_t_1577935406837__r_t_1577935406837__v_t_1577935406837__r_c_0; _Jo0OQK=21D020D4328410D73BDFA09A917AEB40E7167BB9651465E23A1380D81E8442706A3AA19408E6AD7127D826C47C034D0B2FB18F11F307B478FB63F657E29B5865DD71B918CCA8FE3BB9470EE0D297309F84070EE0D297309F840F2431DD9ED637E4DF76A79A067B4GJ1Z1OA==; JSESSIONID=F82521F2DFB764BB975730AD95DEE54B',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    }
    # POST form fields: result page number, region code and evaluation year; taxpayer code/name are left empty.
    findCredit = {
        'page': pageNum,
        'location': placecode,
        # 'cPage': 5,
        'code': '',
        'name': '',
        'evalyear': year,
    }
    r = requests.post(URL, data=findCredit, headers=HEADER)
    print(r)
    # print(r.text)
    if r.status_code == requests.codes.ok:
        my_query = r.json()
        if return_total_count:
            # only the total page count is needed
            return my_query['totalPages']
        # collect the records of this result page into a DataFrame
        rows = []
        for each in my_query['content']:
            rows.append({'code': str(each['code']),
                         'name': str(each['name']),
                         'evalyear': str(each['evalyear']),
                         'location': str(each['location'])})
        return pd.DataFrame(rows, columns=['code', 'name', 'evalyear', 'location'])
    else:
        # non-200 response, usually because the IP has been throttled
        return ''


if __name__ == '__main__':
    code = input('请输入地方代码(北京—110000;天津—120000;河北—130000;内蒙古—150000;辽宁—210000;'
                 '大连—210200;黑龙江—230000;上海—310000;江苏—320000;宁波—330200;安徽—340000;'
                 '福建—350000;江西—360000;山东—370000;青岛—370200;湖北—420000;湖南—430000;'
                 '广东—440000;广西—450000;海南—460000;重庆—500000;贵州—520000;云南—530000;'
                 '西藏—540000;甘肃—620000;青海—630000;宁夏—640000):')
    if code == "110000": key = "北京"
    elif code == "120000": key = "天津"
    elif code == "130000": key = "河北"
    elif code == "150000": key = "内蒙古"
    elif code == "210000": key = "辽宁"
    elif code == "210200": key = "大连"
    elif code == "230000": key = "黑龙江"
    elif code == "310000": key = "上海"
    elif code == "320000": key = "江苏"
    elif code == "330200": key = "宁波"
    elif code == "340000": key = "安徽"
    elif code == "350000": key = "福建"
    elif code == "360000": key = "江西"
    elif code == "370000": key = "山东"
    elif code == "370200": key = "青岛"
    elif code == "420000": key = "湖北"
    elif code == "430000": key = "湖南"
    elif code == "440000": key = "广东"
    elif code == "450000": key = "广西"
    elif code == "460000": key = "海南"
    elif code == "500000": key = "重庆"
    elif code == "520000": key = "贵州"
    elif code == "530000": key = "云南"
    elif code == "540000": key = "西藏"
    elif code == "620000": key = "甘肃"
    elif code == "630000": key = "青海"
    elif code == "640000": key = "宁夏"
    else:
        print('输入有误,请重新运行')
        raise SystemExit  # stop here, otherwise `key` is undefined below

    year = input('请输入下载年份-(eg:):')
    tpages = getData(0, code, year, True)
    if tpages == '':
        print('访问过于频繁1,暂停访问!')
    else:
        x = tpages * 15
        print(f'{year}年{key}共有{tpages}页,每页15条记录,共有{x}行')
        df = pd.DataFrame({'code': [], 'name': [], 'evalyear': [], 'location': []})
        i = 0
        while i < tpages:
            df1 = getData(i, code, year)
            if len(df1) == 0:
                # blocked or empty response: save what has been downloaded so far and stop
                print('访问过于频繁2,下载中断,临时保存')
                df.to_csv(f'{year}年{key}tax.csv')
                break
            else:
                print(f'正在下载第{i}页数据')
                df = pd.concat([df, df1], ignore_index=True)
                i = i + 1
        df.to_csv(f'{year}年{key}tax.csv')
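Method 2 is the mirror image of Method 1, so the code-to-name branch chain can likewise be replaced by inverting a name-to-code table. The sketch below is a suggested variant, not part of the original script; only a few entries are filled in, and PLACE_CODES stands for the full table from the earlier sketch.

# Sketch: invert the name-to-code table so an entered code maps back to its region name.
PLACE_CODES = {'北京': 110000, '天津': 120000, '河北': 130000}  # remaining codes as listed above
CODE_TO_PLACE = {str(v): k for k, v in PLACE_CODES.items()}

code = input('请输入地方代码:')
key = CODE_TO_PLACE.get(code)        # None when the code is not in the table
if key is None:
    print('输入有误,请重新运行')
    raise SystemExit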

Method 3: no manual input; read the regions and their codes from a CSV file and loop over the years (for large volumes, consider splitting the download into several CSV files or switching to a TXT format)

import requests
import pandas as pd


def getData(pageNum, placecode, year, return_total_count=False):
    # NOTE: the host in the original article is truncated; fill in the full domain of the query service.
    URL = 'http://hd./service/findCredit.do'
    HEADER = {
        'Cookie': 'yfx_c_g_u_id_10003701=_ck20010211232618635509545356418; yfx_f_l_v_t_10003701=f_t_1577935406837__r_t_1577935406837__v_t_1577935406837__r_c_0; _Jo0OQK=21D020D4328410D73BDFA09A917AEB40E7167BB9651465E23A1380D81E8442706A3AA19408E6AD7127D826C47C034D0B2FB18F11F307B478FB63F657E29B5865DD71B918CCA8FE3BB9470EE0D297309F84070EE0D297309F840F2431DD9ED637E4DF76A79A067B4GJ1Z1OA==; JSESSIONID=F82521F2DFB764BB975730AD95DEE54B',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    }
    # POST form fields: result page number, region code and evaluation year; taxpayer code/name are left empty.
    findCredit = {
        'page': pageNum,
        'location': placecode,
        # 'cPage': 5,
        'code': '',
        'name': '',
        'evalyear': year,
    }
    r = requests.post(URL, data=findCredit, headers=HEADER)
    print(r)
    # print(r.text)
    if r.status_code == requests.codes.ok:
        my_query = r.json()
        if return_total_count:
            # only the total page count is needed
            return my_query['totalPages']
        # collect the records of this result page into a DataFrame
        rows = []
        for each in my_query['content']:
            rows.append({'code': str(each['code']),
                         'name': str(each['name']),
                         'evalyear': str(each['evalyear']),
                         'location': str(each['location'])})
        return pd.DataFrame(rows, columns=['code', 'name', 'evalyear', 'location'])
    else:
        # non-200 response, usually because the IP has been throttled
        return ''


if __name__ == '__main__':
    # read the regions to download (columns 地方 and 代码) from a CSV file
    df_code = pd.read_csv('placecode2.csv', encoding='GBK')
    # print(df_code)
    df4 = pd.DataFrame({'code': [], 'name': [], 'evalyear': [], 'location': []})
    for idx in range(len(df_code)):
        key = df_code.地方[idx]
        code = df_code.代码[idx]
        # print(key, code)
        df3 = pd.DataFrame({'code': [], 'name': [], 'evalyear': [], 'location': []})
        # the year range was lost in the source; the values below are placeholders, adjust as needed
        for year in range(2014, 2019):
            print(code, year)
            totalpages = getData(0, code, year, True)
            x = totalpages * 15
            print(f'{year}年{key}共有{totalpages}页,每页15条记录,共有{x}行')
            df2 = pd.DataFrame({'code': [], 'name': [], 'evalyear': [], 'location': []})
            # only the first 3 pages per year are fetched here; replace 3 with totalpages to download everything
            for i in range(0, 3):
                # print(f'正在下载第{i}页数据')
                df1 = getData(i, code, year)
                print(df1)
                df2 = pd.concat([df2, df1], ignore_index=True)
            print(f'下载完{year}年{key}数据')
            df3 = pd.concat([df3, df2], ignore_index=True)
        print(f'下载完-{key}数据')
        df4 = pd.concat([df4, df3], ignore_index=True)
    print('下载完所有地方-数据')
    df4.to_csv('total_tax.csv')
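Method 3 expects a placecode2.csv file with 地方 and 代码 columns, read with GBK encoding. The article does not show that file, so the sketch below only illustrates one way to generate a compatible sample; the three rows are placeholders, not a complete region list.

# Sketch: write a sample placecode2.csv in the layout Method 3 reads (columns 地方 and 代码, GBK-encoded).
import pandas as pd

sample = pd.DataFrame({
    '地方': ['北京', '天津', '上海'],      # placeholder rows; list whichever regions you actually need
    '代码': [110000, 120000, 310000],
})
sample.to_csv('placecode2.csv', index=False, encoding='GBK')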

Notes: all three methods have to contend with anti-scraping measures. If requests come too frequently, the IP is temporarily blocked and scraping cannot continue, so consider using a proxy pool. In addition, the Cookie in the code above must be refreshed periodically, otherwise the scripts will also start to fail. The code above is for reference only.
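As a loose illustration of those two mitigations (pacing the requests and routing them through a proxy pool), a wrapper around the POST call might look like the sketch below; the proxy addresses and the sleep interval are placeholders, not tested values.

# Sketch: pace requests and rotate proxies to soften the rate limiting described above.
import random
import time

import requests

PROXIES = [
    {'http': 'http://127.0.0.1:8000'},   # placeholder proxy entries; replace with your own pool
    {'http': 'http://127.0.0.1:8001'},
]

def polite_post(url, data, headers):
    time.sleep(random.uniform(2, 5))     # wait a few seconds between requests
    proxy = random.choice(PROXIES)       # pick a proxy from the pool at random
    return requests.post(url, data=data, headers=headers, proxies=proxy, timeout=30)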
