How do you use Scrapy in CMD?

William Yang

Aug 27, 2012, 10:08:19 PM

Command-line arguments are passed to the spider using the -a switch. Then you also need to override the __init__ of your spider, I think. Sorry, can't paste any code, I am on mobile. :(

happy googling
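(For reference, a minimal sketch of what is described above, assuming the scrapy.Spider API; the spider name, argument and URL are made up for illustration:)

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'someblog'

    # Arguments passed on the command line with -a arrive here as keyword
    # arguments, e.g.:  scrapy crawl someblog -a category=python
    def __init__(self, category=None, *args, **kwargs):
        super(BlogSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://example.com/category/%s' % category]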

On 2012-08-27, 3:49 AM, "Иван Клешнин" <> wrote:

How can I access parsed command-line arguments within a spider module?

-----------------------------------------------------------------------------------------------------------------------------

I wrote something like this:

import argparse

from scrapy.contrib.spiders import CrawlSpider  # import path used by Scrapy 0.x releases

parser = argparse.ArgumentParser()
parser.add_argument('name', help='The name for the spider')
args = parser.parse_args()

class LivejournalSpider(CrawlSpider):

    name = args.name

    ...

Which works well with 

$ scrapy crawl someblog

But to support other Scrapy commands, e.g. "parse", and command-line flags, I need to manually handle all possible situations right there...

So I need access to Scrapy's native arguments, which are already parsed, or (even better) to the command context.

I tried to debug the whole application to find the answer by myself, but my Python skills are not enough. My brain is ready to explode when I see those managers and recursions ^_^


Pablo Hoffman

Sep 5, 2012, 1:54:54 AM

See this section about spider arguments which I've just added to the doc:

Provided by: python-scrapy_0.14.4-1_all

NAME

       scrapy - the Scrapy command-line tool

SYNOPSIS

       scrapy [command] [OPTIONS] ...

DESCRIPTION

       Scrapy  is  controlled  through  the scrapy command-line tool. The script provides several
       commands, for different purposes. Each command supports  its  own  particular  syntax.  In
       other words, each command supports a different set of arguments and options.

OPTIONS

   fetch [OPTION] URL
       Fetch a URL using the Scrapy downloader

       --headers
              Print response HTTP headers instead of body

   runspider [OPTION] spiderfile
       Run a spider

       --output=FILE
              Store scraped items to FILE in XML format

   settings [OPTION]
       Query Scrapy settings

       --get=SETTING
              Print raw setting value

       --getbool=SETTING
              Print setting value, interpreted as a boolean

       --getint=SETTING
              Print setting value, interpreted as an integer

       --getfloat=SETTING
              Print setting value, interpreted as a float

       --getlist=SETTING
              Print setting value, interpreted as a list

       --init Print initial setting value (before loading extensions and spiders)

   shell URL | file
       Launch the interactive scraping console

   startproject projectname
       Create new project with an initial project template

   --help, -h
       Print command help and options

   --logfile=FILE
       Log file. If omitted, stderr will be used

   --loglevel=LEVEL, -L LEVEL
       Log level (default: None)

   --nolog
       Disable logging completely

   --spider=SPIDER
       Always use this spider when arguments are urls

   --profile=FILE
       Write python cProfile stats to FILE

   --lsprof=FILE
       Write lsprof profiling stats to FILE

   --pidfile=FILE
       Write process ID to FILE

   --set=NAME=VALUE, -s NAME=VALUE
       Set/override setting (may be repeated)

AUTHOR

       Scrapy was written by the Scrapy Developers <>.

       This  manual  page  was  written by Ignace Mouzannar <>, for the Debian
       project (but may be used by others).

                                         October 17, 2009                               SCRAPY(1)

In this article, we will discuss Scrapy's command-line tool.

Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to differentiate it from its sub-commands.

The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.

# Configuration Settings

Scrapy's configuration is saved in scrapy.cfg files. Currently there are three levels where configuration settings can live: first the system-wide (root) level, second the user (account) level, and third the project level. Settings defined at the project level override those defined at the user level, which in turn override the system-wide ones (for the project they are set for, of course).
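For example, the project-level scrapy.cfg generated by startproject is essentially just a pointer to the project's settings module (the project name here is illustrative):

[settings]
default = myproject.settings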

Scrapy also understands a number of environment variables.

1. SCRAPY_SETTINGS_MODULE
2. SCRAPY_PROJECT
3. SCRAPY_PYTHON_SHELL

# Using the scrapy tool

We can start by running the scrapy command without passing any sub-command; it prints a help listing:

Scrapy 1.5.0 - project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  [...]

# Creating projects

The first thing one typically does after downloading and installing Scrapy is create a project. So start by creating one:

scrapy startproject <project_name> [project_dir]

That will create a Scrapy project under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as project_name.

Next you go inside the new project directory:

cd project_dir

# Controlling projects

You use the scrapy tool from inside your project to control and manage it.

For example, to create a new spider:

scrapy genspider mydomain mydomain.com

# Available tool commands

This part discusses the list of built-in tool commands.

There are two kinds of commands: those that work only from inside a Scrapy project (project-specific commands) and those that also work outside a project (global commands).

Global commands

  • startproject
  • genspider
  • settings
  • runspider
  • shell
  • fetch
  • view
  • version

Project only commands

  • crawl
  • check
  • list
  • edit
  • parse
  • bench

startproject

  • Syntax: scrapy startproject <project_name> [project_dir]

Creates a new Scrapy project named project_name.

genspider

  • Syntax: scrapy genspider [-t template] <name> <domain>

Creates a new spider in the current folder or in the current project's spiders folder, if called from inside a project. The <name> parameter sets the name of the spider, while <domain> is used to generate the allowed_domains and start_urls spider attributes.

Usage example:

$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

Scrapy also provides templates to create spiders from, while you are free to write your spiders from your own source files.
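For reference, the spider generated from the basic template in the example above looks roughly like this (the exact contents vary a little between Scrapy versions):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass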

crawl

  • Syntax: scrapy crawl <spider>

Start crawling using a spider.

check

  • Syntax: scrapy check [-l] <spider>

Runs contract checks.

list

  • Syntax: scrapy list

Lists all the available spiders in the current project.

edit

  • Syntax: scrapy edit <spider>

Edit the given spider using the editor defined in the EDITOR setting.

fetch

  • Syntax: scrapy fetch <url>

Downloads the given URL using the Scrapy downloader and writes the content to standard output.

This command is generally used to check how a spider fetches a page.

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of a specific spider
  • --headers: print the response's HTTP headers instead of the response's body
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

view

  • Syntax: scrapy view <url>

Opens the given URL in a browser, as your Scrapy spider would see it.

  • --spider=SPIDER: bypass spider autodetection and force use of a specific spider
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

shell

  • Syntax: scrapy shell [url]

Starts the Scrapy shell for the given URL, or with no URL if none is given.

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of a specific spider
  • -c code: evaluate the code in the shell, print the result and exit
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

parse

  • Syntax: scrapy parse <url> [options]

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.

Supported options:

  • --spider=SPIDER: bypass spider autodetection
  • -a NAME=VALUE: set a spider argument (may be repeated)
  • --callback or -c: spider method to use as callback for parsing the response (see the example after this list)
  • --meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
  • --pipelines: process items through pipelines
  • --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
  • --noitems: don't show scraped items
  • --nolinks: don't show extracted links
  • --nocolour: avoid using Pygments to colorize the output
  • --depth or -d: depth level for which the requests should be followed recursively
  • --verbose or -v: display information for each depth level
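For instance, given a spider with a callback like the one below (the spider and method names are illustrative), scrapy parse lets you test just that method against a single URL without running a full crawl:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    # Try it with:  scrapy parse --spider=myspider -c parse_item <some-url>
    def parse_item(self, response):
        yield {'title': response.css('title::text').extract_first()}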

settings

  • Syntax: scrapy settings [options]

Get the value of a Scrapy setting.

runspider

  • Syntax: scrapy runspider <spider_file.py>

Run a spider self-contained in a Python file, without having to create a project.
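A self-contained spider file for runspider can be as small as this (the file name, spider name and URL are placeholders):

# standalone_spider.py -- run with:  scrapy runspider standalone_spider.py
import scrapy

class StandaloneSpider(scrapy.Spider):
    name = 'standalone'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Emit one item with the page title; no project or settings module is needed
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}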

version

  • Syntax: scrapy version [-v]

Prints the Scrapy version. If used with -v it also prints Python, Twisted and platform info, which is useful for bug reports.

bench

  • Syntax: scrapy bench

Run a quick benchmark test.

You can also add custom project commands. For more details, please refer to the Scrapy documentation.

How do you use a Scrapy tool?

While working with Scrapy, one first needs to create a Scrapy project.
To select anchor tags: response.css('a').
To extract the data: links = response.css('a').extract().
To get the href attribute, use an attribute selector: links = response.css('a::attr(href)').extract().
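Put together in a spider's parse() method, that looks roughly like this (the spider name and start URL are placeholders):

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Select all anchor tags and extract their href attributes
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield {'link': link}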

How do I get into the Scrapy shell?

The scraping code is written using selectors, with XPath or CSS expressions. As shown above, we can get the HTML code of the entire page by writing response.text at the shell. Let us see how we can test scraping code, using the response object, with XPath or CSS expressions.
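For example (the URL is just a placeholder), start the shell from the command line and then try expressions interactively:

# Start the shell with:  scrapy shell 'http://example.com'
# Then, at the prompt:
response.text                                      # raw HTML of the whole page
response.css('title::text').extract_first()        # test a CSS expression
response.xpath('//title/text()').extract_first()   # the equivalent XPath expression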

How do you set up Scrapy?

When you use Scrapy, you have to tell it which settings you're using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE. The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings.
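The module named by SCRAPY_SETTINGS_MODULE is an ordinary Python module; a freshly generated project's settings file looks roughly like this (the project name is illustrative):

# myproject/settings.py -- referenced as SCRAPY_SETTINGS_MODULE=myproject.settings
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'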

How do you use the Scrapy shell in Python?

Available shortcuts:
shelp() - print a help with the list of available objects and shortcuts.
fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly.
fetch(request) - fetch a new response from the given request and update all related objects accordingly.
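A quick illustration of those shortcuts inside the shell (the URL is a placeholder):

shelp()                                       # list the available objects and shortcuts
fetch('http://example.com')                   # download the URL and rebind `response`
fetch('http://example.com', redirect=False)   # same, but do not follow redirects
response.status                               # inspect the result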