How do you use Scrapy in CMD?

William Yang

Aug 27, 2012, 10:08:19 PM

Command-line arguments are passed to the spider using the -a switch. Then you also need to override the __init__ of your spider, I think. Sorry, can't paste any code, I am on mobile. :(

happy googling
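(For reference, a minimal sketch of what is described above, assuming the scrapy.Spider API; the spider name, argument and URL are made up for illustration:)

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'someblog'

    # Arguments passed on the command line with -a arrive here as keyword
    # arguments, e.g.:  scrapy crawl someblog -a category=python
    def __init__(self, category=None, *args, **kwargs):
        super(BlogSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://example.com/category/%s' % category]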

On 2012-08-27, 3:49 AM, "Иван Клешнин" <> wrote:

How can I access parsed command-line arguments within a spider module?

-----------------------------------------------------------------------------------------------------------------------------

I wrote something like this:

import argparse

from scrapy.contrib.spiders import CrawlSpider  # import path used by Scrapy 0.x releases

parser = argparse.ArgumentParser()
parser.add_argument('name', help='The name for the spider')
args = parser.parse_args()

class LivejournalSpider(CrawlSpider):

    name = args.name

    ...

Which works well with 

$ scrapy crawl someblog

But to support other Scrapy commands, e.g. "parse", and command-line flags, I need to manually handle all possible situations right there...

So I need access to Scrapy's native arguments, which are already parsed, or (even better) to the command context.

I tried to debug the whole application to find the answer by myself, but my Python skills are not enough. My brain is ready to explode when I see those managers and recursions ^_^


Pablo Hoffman

Sep 5, 2012, 1:54:54 AM

See this section about spider arguments which I've just added to the doc:

Provided by: python-scrapy_0.14.4-1_all

NAME

       scrapy - the Scrapy command-line tool

SYNOPSIS

       scrapy [command] [OPTIONS] ...

DESCRIPTION

       Scrapy  is  controlled  through  the scrapy command-line tool. The script provides several
       commands, for different purposes. Each command supports  its  own  particular  syntax.  In
       other words, each command supports a different set of arguments and options.

OPTIONS

   fetch [OPTION] URL
       Fetch a URL using the Scrapy downloader

       --headers
              Print response HTTP headers instead of body

   runspider [OPTION] spiderfile
       Run a spider

       --output=FILE
              Store scraped items to FILE in XML format

   settings [OPTION]
       Query Scrapy settings

       --get=SETTING
              Print raw setting value

       --getbool=SETTING
              Print setting value, interpreted as a boolean

       --getint=SETTING
              Print setting value, interpreted as an integer

       --getfloat=SETTING
              Print setting value, interpreted as a float

       --getlist=SETTING
              Print setting value, interpreted as a list

       --init Print initial setting value (before loading extensions and spiders)

   shell URL | file
       Launch the interactive scraping console

   startproject projectname
       Create new project with an initial project template

   --help, -h
       Print command help and options

   --logfile=FILE
       Log file. If omitted, stderr will be used

   --loglevel=LEVEL, -L LEVEL
       Log level (default: None)

   --nolog
       Disable logging completely

   --spider=SPIDER
       Always use this spider when arguments are urls

   --profile=FILE
       Write python cProfile stats to FILE

   --lsprof=FILE
       Write lsprof profiling stats to FILE

   --pidfile=FILE
       Write process ID to FILE

   --set=NAME=VALUE, -s NAME=VALUE
       Set/override setting (may be repeated)

AUTHOR

       Scrapy was written by the Scrapy Developers <>.

       This  manual  page  was  written by Ignace Mouzannar <>, for the Debian
       project (but may be used by others).

                                         October 17, 2009                               SCRAPY(1)

In this article, we will discuss Scrapy's command-line tool.

Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to differentiate it from its sub-commands.

The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.

# Configuration Settings

Scrapy's configuration is saved in scrapy.cfg files. Currently there are three levels where configuration settings can live: first the system-wide (root) level, second the user (account) level, and third the project level. Settings defined at the project level override those defined at the user level, which in turn override the system-wide ones (for the project they are set for, of course).
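For example, the project-level scrapy.cfg generated by startproject is essentially just a pointer to the project's settings module (the project name here is illustrative):

[settings]
default = myproject.settings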

Scrapy also understands a number of environment variables.

1. SCRAPY_SETTINGS_MODULE
2. SCRAPY_PROJECT
3. SCRAPY_PYTHON_SHELL

# Using the scrapy tool

We can start by running the scrapy command without passing any sub-command; it prints a help listing:

Scrapy 1.5.0 - project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  [...]

# Creating projects

The first thing one typically does after downloading and installing Scrapy is create a project. So start by creating one:

scrapy startproject <project_name> [project_dir]

That will create a Scrapy project under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as project_name.

Next you go inside the new project directory:

cd project_dir

# Controlling projects

You use the scrapy tool from inside your project to control and manage it.

For example, to create a new spider:

scrapy genspider mydomain mydomain.com

# Available tool commands

This part discusses the list of built-in tool commands.

There are two kinds of commands: those that work only from inside a Scrapy project (project-specific commands) and those that also work outside a project (global commands).

Global commands

  • startproject
  • genspider
  • settings
  • runspider
  • shell
  • fetch
  • view
  • version

Project only commands

  • crawl
  • check
  • list
  • edit
  • parse
  • bench

startproject

  • Syntax: scrapy startproject <project_name> [project_dir]

Creates a new Scrapy project named project_name.

genspider

  • Syntax: scrapy genspider [-t template] <name> <domain>

Creates a new spider in the current folder or in the current project's spiders folder, if called from inside a project. The <name> parameter sets the name of the spider, while <domain> is used to generate the allowed_domains and start_urls spider attributes.

Usage example:

$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

Scrapy also provides templates to create spiders from, while you are free to write your spiders from your own source files.
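For reference, the spider generated from the basic template in the example above looks roughly like this (the exact contents vary a little between Scrapy versions):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass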

crawl

  • Syntax: scrapy crawl <spider>

Start crawling using a spider.

check

  • Syntax: scrapy check [-l] <spider>

Runs contract checks.

list

  • Syntax: scrapy list

Lists all the available spiders in the current project.

edit

  • Syntax: scrapy edit <spider>

Edit the given spider using the editor defined in the EDITOR setting.

fetch

  • Syntax: scrapy fetch <url>

Downloads the given URL using the Scrapy downloader and writes the content to standard output.

This command is generally used to check how a spider fetches a page.

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of a specific spider
  • --headers: print the response's HTTP headers instead of the response's body
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

view

  • Syntax: scrapy view <url>

Opens the given URL in a browser, as your Scrapy spider would see it.

  • --spider=SPIDER: bypass spider autodetection and force use of a specific spider
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

shell

  • Syntax: scrapy shell [url]

Starts the Scrapy shell for the given URL, or with no URL if none is given.

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of a specific spider
  • -c code: evaluate the code in the shell, print the result and exit
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

parse

  • Syntax: scrapy parse <url> [options]

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.

Supported options:

  • --spider=SPIDER: bypass spider autodetection
  • -a NAME=VALUE: set a spider argument (may be repeated)
  • --callback or -c: spider method to use as callback for parsing the response (see the example after this list)
  • --meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
  • --pipelines: process items through pipelines
  • --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
  • --noitems: don't show scraped items
  • --nolinks: don't show extracted links
  • --nocolour: avoid using Pygments to colorize the output
  • --depth or -d: depth level for which the requests should be followed recursively
  • --verbose or -v: display information for each depth level
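For instance, given a spider with a callback like the one below (the spider and method names are illustrative), scrapy parse lets you test just that method against a single URL without running a full crawl:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    # Try it with:  scrapy parse --spider=myspider -c parse_item <some-url>
    def parse_item(self, response):
        yield {'title': response.css('title::text').extract_first()}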

settings

  • Syntax: scrapy settings [options]

Get the value of a Scrapy setting.

runspider

  • Syntax: scrapy runspider <spider_file.py>

Run a spider self-contained in a Python file, without having to create a project.
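A self-contained spider file for runspider can be as small as this (the file name, spider name and URL are placeholders):

# standalone_spider.py -- run with:  scrapy runspider standalone_spider.py
import scrapy

class StandaloneSpider(scrapy.Spider):
    name = 'standalone'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Emit one item with the page title; no project or settings module is needed
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}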

version

  • Syntax: scrapy version [-v]

Prints the Scrapy version. If used with -v it also prints Python, Twisted and platform info, which is useful for bug reports.

bench

  • Syntax: scrapy bench

Run a quick benchmark test.

You can also add custom project commands. For more details, please refer to the Scrapy documentation.

How do you use a Scrapy tool?

While working with Scrapy, one first needs to create a Scrapy project.
To select anchor tags: response.css('a').
To extract the data: links = response.css('a').extract().
To get the href attribute, use an attribute selector: links = response.css('a::attr(href)').extract().
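Put together in a spider's parse() method, that looks roughly like this (the spider name and start URL are placeholders):

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Select all anchor tags and extract their href attributes
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield {'link': link}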

How do I get into the Scrapy shell?

The scraping code is written using selectors, with XPath or CSS expressions. As shown above, we can get the HTML code of the entire page by writing response.text at the shell. Let us see how we can test scraping code, using the response object, with XPath or CSS expressions.
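For example (the URL is just a placeholder), start the shell from the command line and then try expressions interactively:

# Start the shell with:  scrapy shell 'http://example.com'
# Then, at the prompt:
response.text                                      # raw HTML of the whole page
response.css('title::text').extract_first()        # test a CSS expression
response.xpath('//title/text()').extract_first()   # the equivalent XPath expression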

How do you set up Scrapy?

When you use Scrapy, you have to tell it which settings you're using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE. The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings.
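The module named by SCRAPY_SETTINGS_MODULE is an ordinary Python module; a freshly generated project's settings file looks roughly like this (the project name is illustrative):

# myproject/settings.py -- referenced as SCRAPY_SETTINGS_MODULE=myproject.settings
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'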

How do you use the Scrapy shell in Python?

Available shortcuts:
shelp() - print a help with the list of available objects and shortcuts.
fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly.
fetch(request) - fetch a new response from the given request and update all related objects accordingly.
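A quick illustration of those shortcuts inside the shell (the URL is a placeholder):

shelp()                                       # list the available objects and shortcuts
fetch('http://example.com')                   # download the URL and rebind `response`
fetch('http://example.com', redirect=False)   # same, but do not follow redirects
response.status                               # inspect the result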