Getting Started with Node Crawlers

Posted on 2021-02-22 Edited on 2023-07-31

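Node basics

End each require statement with a semicolon. JavaScript usually inserts it for you, but if the next line begins with ( or [, the two lines are parsed as one expression and an error is thrown.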
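A minimal sketch of when the missing semicolon actually bites. The two source strings and the run helper are made up for illustration; new Function is used only so that the failing variant can be compiled and caught instead of crashing the script.

```javascript
// Source text WITHOUT a semicolon after require(...): the parser joins the
// first two lines into require("os")(() => "ran")(), i.e. it calls the module.
const withoutSemicolon = 'const os = require("os")\n(() => "ran")()\nreturn "ok"';

// Same source WITH the semicolon: two separate statements.
const withSemicolon = 'const os = require("os");\n(() => "ran")()\nreturn "ok"';

// Compile the text as a function body and run it, passing require in.
// Returns the function's result, or the name of the error it threw.
function run(source) {
  try {
    return new Function("require", source)(require);
  } catch (err) {
    return err.constructor.name;
  }
}

console.log(run(withSemicolon)); // prints "ok"
console.log(run(withoutSemicolon)); // prints "TypeError"
```

The error only appears when the following line starts with ( or [, which is why many codebases without semicolons never run into it.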

Common modules

Request modules

request

npm page
Although request is often the first module people meet when learning Node, it has been deprecated, so superagent is recommended instead.

superagent

npm page
Simple calls, concise syntax, and a companion set of plugins.

const superagent = require("superagent");

// callback
superagent
  .post("/api/pet")
  .send({ name: "Manny", species: "cat" }) // sends a JSON post body
  .set("X-API-Key", "foobar")
  .set("accept", "json")
  .end((err, res) => {
    // Calling the end function will send the request
  });

// promise with then/catch
superagent.post("/api/pet").then(console.log).catch(console.error);

// promise with async/await
(async () => {
  try {
    const res = await superagent.post("/api/pet");
    console.log(res);
  } catch (err) {
    console.error(err);
  }
})();

Parsing modules

cheerio

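cheerio is the jQuery of Node: it makes it easy to traverse DOM nodes and read the values you need.
Its syntax follows jQuery's.

const cheerio = require("cheerio");
const $ = cheerio.load(body); // body is the HTML string to parse
$("div");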

Puppeteer

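Puppeteer is commonly called a headless browser. More precisely, it is a Node library that drives a Chrome browser running without a visible window, so anything you can do in the browser can also be done through Puppeteer.
The drawback is that it is slower than a conventional crawler.

Differences between cheerio and Puppeteer

cheerio is essentially just a library for manipulating HTML documents with jQuery-like syntax. Crawling with cheerio only retrieves the static HTML document, so if the page loads its data dynamically via ajax, that data cannot be scraped.
Puppeteer, by contrast, emulates a browser runtime: it requests the site and runs the site's own logic, then reads data from the page dynamically over the WebSocket (WS) protocol. It can simulate any interaction (click, swipe, hover, and so on) and supports page navigation and managing multiple pages.
It can even inject scripts from Node into the browser environment and run them there. In short, whatever you can do to a web page, it can do too, along with some things you cannot do by hand.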

Other modules

iconv-lite

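A transcoding module. If Chinese text comes out garbled, use it to convert the encoding; once the text displays correctly, parse the source and extract the URLs you need.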
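A minimal sketch of why the text comes out garbled and what transcoding fixes. To stay dependency-free it uses Node's built-in TextDecoder rather than iconv-lite, and it assumes a Node build with full ICU (the default in official releases); with iconv-lite the equivalent call is iconv.decode(buffer, "gbk"). The sample bytes are made up for illustration.

```javascript
// "中文" encoded as GBK: 中 = 0xD6 0xD0, 文 = 0xCE 0xC4.
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Decoding the bytes as UTF-8 yields replacement characters, not the text.
const wrong = gbkBytes.toString("utf8");

// Decoding with the charset the page was actually written in restores it.
const right = new TextDecoder("gbk").decode(gbkBytes);

console.log(wrong === "中文"); // false
console.log(right); // 中文
```

The key point is to keep the response body as raw bytes until the correct charset is known; once it has been decoded as UTF-8, the original bytes are lost.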

superagent-charset

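If you would rather avoid that hassle, use this plugin directly; it handles the decoding inside the request, so no separate transcoding step is needed.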

const request = require("superagent");
require("superagent-charset")(request);

request
  .get("http://www.xxx.com/")
  .charset("gbk")
  .end((err, res) => {});

Tesseract

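An OCR module that can be used to recognize CAPTCHA images.

gm

gm is a Node.js wrapper around GraphicsMagick and ImageMagick, two long-established image processing tools. Here it is used to remove noise from CAPTCHA images.

Combining them

Puppeteer captures images from the page, Tesseract performs the image recognition, and gm removes the noise. Each of the three tools has its own clearly defined job.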

Node crawler frameworks

ppspider

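GitHub page

Approach

A crawler comes down to two words: "crawl" and "extract".
"Crawl" means sending requests and fetching data.
"Extract" means parsing that data.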
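The two halves can be sketched as two small functions. This is only an illustration under stated assumptions: crawl uses the global fetch available in Node 18+ instead of superagent, and extractLinks uses a regular expression instead of cheerio so that the sketch has no dependencies. Both function names and the sample HTML are hypothetical.

```javascript
// "Extract": pull href values out of an HTML string.
// A regular expression keeps the sketch dependency-free; a real parser
// such as cheerio copes with malformed markup far better.
function extractLinks(html) {
  const links = [];
  const pattern = /<a\s[^>]*href="([^"]*)"/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// "Crawl": send the request and return the response body as text.
async function crawl(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.text();
}

// Offline demonstration of the "extract" half.
const sampleHtml =
  '<ul><li><a href="/a.html">A</a></li><li><a class="x" href="/b.html">B</a></li></ul>';
console.log(extractLinks(sampleHtml)); // [ '/a.html', '/b.html' ]
```

In a real run the two are chained, for example crawl(url).then(extractLinks), and the extracted links feed the next round of requests.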

Error handling

Node code performs a lot of I/O, so error handling is needed often.
A common pattern:

superagent.get(url).end(function (err, res) {
  // Intercept the error
  if (err) {
    throw err;
  }
  // rest of the code
});

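Be careful with try/catch around asynchronous code.

The book 《深入浅出 Nodejs》 puts it this way: "Wrapping an asynchronous method in try/catch only catches exceptions thrown within the current tick of the event loop; it can do nothing about exceptions thrown when the callback runs."
See this article for reference.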
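A minimal sketch of the point made in the quote. failLater is a hypothetical callback-style operation that fails on a later tick; the first try/catch never sees the failure, while wrapping the callback in a Promise and using await makes try/catch effective again.

```javascript
// A hypothetical callback-style operation that fails on a later tick.
function failLater(callback) {
  setImmediate(() => callback(new Error("boom")));
}

// This try/catch only covers the synchronous call to failLater, which
// returns normally; the error arrives later as the callback's first argument.
let caughtSynchronously = false;
try {
  failLater((err) => {
    if (err) console.error("callback received:", err.message);
  });
} catch (e) {
  caughtSynchronously = true;
}
console.log(caughtSynchronously); // false

// Wrapping the callback in a Promise lets try/catch work again via await.
function failLaterAsync() {
  return new Promise((resolve, reject) => {
    failLater((err, value) => (err ? reject(err) : resolve(value)));
  });
}

async function run() {
  try {
    await failLaterAsync();
    return "no error";
  } catch (err) {
    return `caught: ${err.message}`;
  }
}

run().then(console.log); // caught: boom
```

With plain callbacks, the error has to be checked inside the callback itself, as in the superagent snippet above.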

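Anti-crawling

Identifying crawlers by User-Agent.

UA stands for User-Agent, the identity string of the browser making the request. Many sites use it to spot crawlers: a request whose headers carry no UA is judged to be a crawler. Because this check is very easy to defeat with randomized UAs, it is rarely relied on.

Identifying crawlers by access frequency.

To stay efficient, crawlers usually hit the target site many times within a short period, so the request frequency of a single IP can be used to decide whether it is a crawler. This defense is also hard to counter; the only way to keep up efficiency is to rotate proxy IPs.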
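To make the frequency check concrete, here is a rough sketch of how a site might count requests per IP in a sliding time window. The limits (5 requests per 10 seconds), the function name, and the sample IP are all hypothetical; real sites use their own thresholds and usually combine several signals.

```javascript
// Hypothetical limits: more than 5 requests within 10 seconds looks automated.
const WINDOW_MS = 10000;
const MAX_REQUESTS = 5;

const hitsByIp = new Map(); // ip -> array of request timestamps (ms)

// Record one request and report whether this IP is now over the limit.
function looksLikeCrawler(ip, now = Date.now()) {
  const recent = (hitsByIp.get(ip) || []).filter((t) => now - t < WINDOW_MS);
  recent.push(now);
  hitsByIp.set(ip, recent);
  return recent.length > MAX_REQUESTS;
}

// Six requests 100 ms apart from one IP: the sixth trips the limit.
const verdicts = [];
for (let i = 0; i < 6; i++) {
  verdicts.push(looksLikeCrawler("203.0.113.7", 1000 + i * 100));
}
console.log(verdicts); // [ false, false, false, false, false, true ]

// The same IP after a long pause is no longer flagged.
console.log(looksLikeCrawler("203.0.113.7", 60000)); // false
```

Seen from the crawler's side, this is why slowing down or spreading requests over several IPs keeps each address under the threshold.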

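Identifying crawlers by cookies and CAPTCHAs.

Cookies here refer to login verification with an account and password on membership-based sites, which lets the site throttle crawling by limiting how often a single account can fetch. CAPTCHAs are random, so a crawler script cannot reliably recognize them, which also limits crawlers.

Countering anti-crawling

Crawling too frequently triggers anti-crawling measures such as IP bans and CAPTCHA prompts, so some counter-strategies are needed.
Common ones are using proxies and lowering the request rate.
Another is to keep several User-Agent strings and rotate them at random.

Dynamic User-Agent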

//userAgent
const userAgents = [
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0) ,Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
  "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
  "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
  "Opera/9.25 (Windows NT 5.1; U; en), Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
  "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
];

module.exports = userAgents;

// app.js
const request = require("superagent");
const userAgents = require("../src/userAgent");

function doRequest() {
  // Pick a random User-Agent for each request
  const userAgent = userAgents[Math.floor(Math.random() * userAgents.length)];
  request
    .get("http://www.xxx.com")
    .set({ "User-Agent": userAgent })
    .timeout({ response: 5000, deadline: 60000 })
    .end((err, res) => {
      // handle the data
    });
}

superagent-cache-plugin

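Use this plugin for endpoints that need cookies to work normally. It caches superagent responses, so repeated calls are served from the cache.

npm install superagent-cache-plugin --save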

var superagent = require("superagent");
var cacheModule = require("cache-service-cache-module");
var cache = new cacheModule({ storage: "session" });

// Require superagent-cache-plugin and pass your cache module
var superagentCache = require("superagent-cache-plugin")(cache);

superagent
  .get(uri)
  .use(superagentCache)
  .end(function (err, response) {
    // response is now cached!
    // subsequent calls to this superagent request will now fetch the cached response
  });

Throttling module: superagent-throttle

Sets rate limits to throttle requests.
GitHub page

const request     = require('superagent')
const Throttle    = require('superagent-throttle')

let throttle = new Throttle({
  active: true,     // turns the plugin on or off
  rate: 5,          // how many requests can be sent every `ratePer`
  ratePer: 10000,   // number of ms in which `rate` requests may be sent
  concurrent: 2     // number of concurrent requests
})

request
.get('http://placekitten.com/100/100')
.use(throttle.plugin())
.end((err, res) => { /* ... */ })

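Concurrency limiting with async

const async = require("async");

// spider is a fetch-and-parse helper defined elsewhere; it returns a Promise
function fetchContents(urls) {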
  return new Promise((resolve, reject) => {
    var results = [];
    async.eachLimit(
      urls,
      3,
      (url, callback) => {
        spider(
          { url: url, decoding: "gb2312" },
          {
            url: {
              selector: "#Zoom table td a!text",
            },
            title: {
              selector: ".title_all h1!text",
            },
          }
        ).then(
          (d) => {
            results.push(d);
            callback();
          },
          () => {
            callback();
          }
        );
      },
      () => {
        resolve(results);
      }
    );
  });
}
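For readers who want to see what a concurrency limit does without the async library, here is a rough Promise-based sketch of the same idea. collectWithLimit and fakeFetch are hypothetical names; this is not how async.eachLimit is implemented, only an illustration of keeping at most N tasks in flight.

```javascript
// Run worker(item) for every item, with at most `limit` calls in flight.
// Failed items are skipped, mirroring the error branch in the code above.
async function collectWithLimit(items, limit, worker) {
  const results = [];
  let next = 0; // index of the next item to hand out

  // Each runner pulls items one at a time until none are left.
  async function runner() {
    while (next < items.length) {
      const item = items[next++];
      try {
        results.push(await worker(item));
      } catch (err) {
        // ignore this item and keep going
      }
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Demonstration with a fake worker that tracks peak concurrency.
let active = 0;
let peak = 0;
async function fakeFetch(n) {
  active++;
  peak = Math.max(peak, active);
  await new Promise((resolve) => setTimeout(resolve, 10));
  active--;
  if (n === 3) throw new Error("simulated failure");
  return n * 2;
}

const demo = collectWithLimit([1, 2, 3, 4, 5, 6, 7], 3, fakeFetch).then((results) => {
  console.log(results.length, peak); // 6 3
  return results;
});
```

Results arrive in completion order, not input order, which is also true of pushing into an array from eachLimit callbacks.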

Dynamic IP

Building a personal proxy pool

npm install ip-proxy-pool

Reposted article

Avoiding duplicate fetches

var fs = require("fs-extra");
var path = require("path");
// Assumption: the get(data, "items") call below matches lodash's _.get
var _ = require("lodash");
var uniqueArray = [];
const UNIQUE_ARRAY_URL = "./_fetchedList.json";
try {
  // Load the list of URLs fetched on previous runs, if it exists
  uniqueArray = require(UNIQUE_ARRAY_URL);
} catch (e) {}

function dealListData(data) {
  return new Promise((resolve, reject) => {
    var urls = _.get(data, "items");
    if (urls) {
      urls = urls
        .map((url) => {
          return "http://www.dytt8.net" + url;
        })
        .filter((url) => {
          // Keep only URLs that have not been fetched before
          return uniqueArray.indexOf(url) === -1;
        });
      // Reject if nothing new is left
      urls.length ? resolve(urls) : reject("empty urls");
    } else {
      reject(urls);
    }
  });
}

function addUniqueArray(url) {
  uniqueArray.push(url);
  if (uniqueArray.length > 300) {
    // Drop the oldest entry once the list grows too long
    uniqueArray.shift();
  }
}

fetchList()
  .then(dealListData)
  .then(fetchContents)
  .then((d) => {
    console.log(d, d.length);
    // Persist the list as JSON
    fs.writeJson(path.join(__dirname, UNIQUE_ARRAY_URL), uniqueArray);
  })
  .catch((e) => {
    console.log(e);
  });
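The capped "already fetched" list can also be sketched with a Set, which makes the membership check constant-time instead of scanning an array with indexOf. createSeenList and the sample URLs are hypothetical; the cap of 3 is only there to keep the demonstration short (the code above uses 300).

```javascript
// Keep at most `limit` URLs; when full, forget the oldest one first.
function createSeenList(limit) {
  const seen = new Set(); // a Set iterates in insertion order

  return {
    has: (url) => seen.has(url),
    add(url) {
      if (seen.has(url)) return;
      seen.add(url);
      if (seen.size > limit) {
        // the first value in iteration order is the oldest
        seen.delete(seen.values().next().value);
      }
    },
    toJSON: () => [...seen], // lets JSON.stringify persist it as an array
  };
}

const seenUrls = createSeenList(3);
["/a", "/b", "/a", "/c", "/d"].forEach((u) => seenUrls.add(u));

console.log(seenUrls.has("/a")); // false, evicted when "/d" arrived
console.log(JSON.stringify(seenUrls)); // ["/b","/c","/d"]
```

Because toJSON returns a plain array, the list can be written to disk and loaded again in the same JSON shape as _fetchedList.json.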

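Writing files

Node offers two ways to read and write files: fs.readFile/fs.writeFile and fs.createReadStream/fs.createWriteStream.

Differences between the two

fs.readFile loads the entire file into memory before fs.writeFile writes it out. For small text files this is not a problem; grunt-file-copy, for example, is implemented this way. For large binary files such as audio or video, which can easily run to several GB, this approach can quickly exhaust memory. The ideal approach is to read a part and write a part: however large the file is, it will eventually be processed given enough time. This is where streams come in.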

fs.createReadStream("/path/to/source").pipe(
  fs.createWriteStream("/path/to/dest")
);
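A bare .pipe() does not report errors from either stream, so a slightly more defensive sketch uses pipeline from stream/promises (available in Node 15+). The copyFile helper and the temporary file names are made up for the demonstration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");
const { pipeline } = require("stream/promises");

// Copy a file chunk by chunk. pipeline rejects if either stream errors
// and destroys both streams, which a bare .pipe() does not do.
async function copyFile(source, dest) {
  await pipeline(fs.createReadStream(source), fs.createWriteStream(dest));
}

// Demonstration with throwaway files in the OS temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "copy-demo-"));
const source = path.join(dir, "source.txt");
const dest = path.join(dir, "dest.txt");
fs.writeFileSync(source, "hello stream\n".repeat(1000));

const copied = copyFile(source, dest).then(() => {
  const same = fs.readFileSync(source).equals(fs.readFileSync(dest));
  console.log(same); // true
  return same;
});
```

Memory use stays bounded by the stream buffer size regardless of how large the source file is.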

I made all of the above up.