Node.js Package

Todo

Publish the package and add a link to the npm package, together with the package docs, here.

We provide a helper library to ease the development of external spiders written in JavaScript using Node.js.

It’s recommended to read the Quickstart - Dmoz Streaming Spider before using this package.

Installation

To install the scrapystreaming package, run:

npm install scrapystreaming

and load it using:

var scrapy = require('scrapystreaming');

scrapystreaming

The scrapystreaming Node package provides the following commands:

Tip

Scrapy Streaming and your spider communicate using the process’s stdin, stdout, and stderr. Therefore, don’t write any data that is not a JSON message to stdout or stderr.

These commands write to and read from stdin, stdout, and stderr when necessary, so you don’t need to handle the communication channel manually.
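For example, if you need to print debugging information while developing your spider, send it through sendLog() (or write it to a file) instead of calling console.log, which writes plain text to stdout. A minimal sketch:

// DON'T: console.log writes plain text to stdout and corrupts the protocol
// console.log('fetched page');

// DO: send it as a log message instead
scrapy.sendLog('fetched page', 'DEBUG');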

createSpider(name, startUrls, callback[, allowedDomains, customSettings])
Parameters:
  • name (string) – name of the spider
  • startUrls (array) – list of initial URLs
  • callback (Function) – callback to handle the responses from the start URLs
  • allowedDomains (array) – list of allowed domains (optional)
  • customSettings (object) – custom settings to be used in Scrapy (optional)

This command is used to create and run a Spider, sending the spider message.

Usages:

var callback = function(response) {
    // handle the response message
};

// usage 1, all parameters
scrapy.createSpider('sample', ['http://example.com'], callback,
                    ['example.com'], {some_setting: 'some value'});
// usage 2, empty spider
scrapy.createSpider('sample', [], callback);
closeSpider()

Closes the spider, sending the close message.

Usage:

scrapy.closeSpider();
sendLog(message, level)
Parameters:
  • message (string) – log message
  • level (string) – log level, must be one of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, or ‘DEBUG’

Sends a log message to Scrapy Streaming’s logger output. This command sends the log message.

Usages:

// logging some error
scrapy.sendLog('something wrong', 'ERROR');
sendRequest(url, callback, config)
Parameters:
  • url (string) – request url
  • callback (function) – response callback
  • config (object) – object with extra request parameters (optional)
  • config.base64 (boolean) – if true, converts the response body to base64 (optional)
  • config.method (string) – request method (optional)
  • config.meta (object) – request extra data (optional)
  • config.body (string) – request body (optional)
  • config.headers (object) – request headers (optional)
  • config.cookies (object) – request extra cookies (optional)
  • config.encoding (string) – default encoding (optional)
  • config.priority (int) – request priority (optional)
  • config.dont_filter (boolean) – if true, the request is not filtered by the duplicate request filter (optional)

Opens a new request, sending the request message.

Usages:

var callback = function(response) {
    // parse the response
};

scrapy.sendRequest('http://example.com', callback);

// base64 encoding, used to request binary content, such as files
var config = {base64: true};
scrapy.sendRequest('http://example.com/some_file.xyz', callback, config);
sendFromResponseRequest(url, callback, fromResponseRequest, config)
Parameters:
  • url (string) – request url
  • callback (Function) – response callback
  • fromResponseRequest (object) – parameters used to create the new request from the response
  • fromResponseRequest.base64 (boolean) – if true, converts the response body to base64. (optional)
  • fromResponseRequest.method (string) – request method (optional)
  • fromResponseRequest.meta (object) – request extra data (optional)
  • fromResponseRequest.body (string) – request body (optional)
  • fromResponseRequest.headers (object) – request headers (optional)
  • fromResponseRequest.cookies (object) – request extra cookies (optional)
  • fromResponseRequest.encoding (string) – default encoding (optional)
  • fromResponseRequest.priority (int) – request priority (optional)
  • fromResponseRequest.dont_filter (boolean) – if true, the request is not filtered by the duplicate request filter (optional)
  • fromResponseRequest.formname (string) – FormRequest.formname parameter (optional)
  • fromResponseRequest.formxpath (string) – FormRequest.formxpath parameter (optional)
  • fromResponseRequest.formcss (string) – FormRequest.formcss parameter (optional)
  • fromResponseRequest.formnumber (int) – FormRequest.formnumber parameter (optional)
  • fromResponseRequest.formdata (object) – FormRequest.formdata parameter (optional)
  • fromResponseRequest.clickdata (object) – FormRequest.clickdata parameter (optional)
  • fromResponseRequest.dont_click (boolean) – FormRequest.dont_click parameter (optional)
  • config (object) – object with extra request parameters (optional)
  • config.base64 (boolean) – if true, converts the response body to base64. (optional)
  • config.method (string) – request method (optional)
  • config.meta (object) – request extra data (optional)
  • config.body (string) – request body (optional)
  • config.headers (object) – request headers (optional)
  • config.cookies (object) – request extra cookies (optional)
  • config.encoding (string) – default encoding (optional)
  • config.priority (int) – request priority (optional)
  • config.dont_filter (boolean) – if true, the request is not filtered by the duplicate request filter (optional)

This function creates a request and then uses its response to open a new request, sending the from_response_request message.

Usages:

var callback = function(response) {
    // parse the response
};

// submit a login form, first requesting the login page, and then submitting the form

// we first create the form data to be sent
var fromResponseRequest = {
    formcss: '#login_form',
    formdata: {user: 'admin', pass: '1'}
};

// and open the request
scrapy.sendFromResponseRequest('http://example.com/login', callback, fromResponseRequest);
runSpider([exceptionHandler])
Parameters:
  • exceptionHandler (function) – function to handle exceptions. Must receive a single parameter, the received JSON with the exception. (optional)

Starts the spider execution. This will bind the process stdin to read data from Scrapy Streaming, and process each message received.

If you want to handle the exceptions generated by Scrapy, pass a function that receives a single parameter as an argument.

By default, any exception will stop the spider execution and throw an Error.

Usage:

// create the spider

...
scrapy.createSpider('sample', ['http://example.com'], parse);

// and start listening to the process stdin
scrapy.runSpider();

// with exception listener
scrapy.runSpider(function(error){
    // ignores the exception
});
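
If you prefer not to ignore errors silently, the handler can report the exception through Scrapy Streaming’s log and then stop the spider. The sketch below simply serializes the received JSON object; the exact fields it contains depend on the exception message sent by Scrapy Streaming:

// report the exception and close the spider instead of throwing an Error
scrapy.runSpider(function(exception) {
    scrapy.sendLog('Exception from Scrapy Streaming: ' + JSON.stringify(exception), 'ERROR');
    scrapy.closeSpider();
});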

Dmoz Streaming Spider with Node.js

In this section, we’ll implement the same spider developed in Quickstart - Dmoz Streaming Spider using the scrapystreaming package. It’s recommended that you have read the quickstart section before following this topic, to get more details about Scrapy Streaming and the spider being developed.

We’ll be using the cheerio package to parse the HTML content; feel free to use any other library.
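
Assuming you manage dependencies with npm, the extra packages used by this example can be installed with:

npm install cheerio jsonfile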

We start by loading the required libraries and defining two global variables:

#!/usr/bin/env node

var scrapy = require('scrapystreaming');
var jsonfile = require('jsonfile');
var cheerio = require('cheerio');

var pendingRequests = 0;
var result = {};

Then, we define two functions:

  • parse - parses the initial page, and then opens a new request for each subcategory
  • parse_cat - parses the subcategory page, extracting the links and saving them to the result variable.
// function to parse the response from the startUrls
var parse = function(response) {
    // loads the html page
    var $ = cheerio.load(response.body);

    // extract subcategories
    $('#subcategories-div > section > div > div.cat-item > a').each(function(i, item) {
        scrapy.sendRequest('http://www.dmoz.org' + $(this).attr('href'), parse_cat);
        pendingRequests++;
    });
};

// parse the response from subcategories
var parse_cat = function(response) {
    var $ = cheerio.load(response.body);

    // extract results
    $('div.title-and-desc a').each(function(i, item) {
        result[$(this).text().trim()] = $(this).attr('href');
    });

    pendingRequests--;
    // if there are no pending requests, save the result and close the spider
    if (pendingRequests == 0) {
        jsonfile.writeFile('outputs/dmoz_data.json', result);
        scrapy.closeSpider();
    }
};

Notice that when using sendRequest(), we pass the parse_cat function as the callback. Therefore, each response coming from this request will be handled by the parse_cat function.

Finally, we start and run the spider, using:

scrapy.createSpider('dmoz', ["http://www.dmoz.org/Computers/Programming/Languages/Python/"], parse);
scrapy.runSpider();

Then, just save your spider and execute it using:

scrapy streaming name_of_script.js

or:

scrapy streaming node -a name_of_script.js