Scrapy is a web-scraping framework for Python. It's pretty popular and at the time of writing it has over 16,000 stars on GitHub. In terms of codebase scrapy is pretty simple; however, there are a few things that are not as explicit as they could be, in favor of abstraction and development simplicity.
Not to mention millions of websites that provide their own unique scraping challenges.
So if you end up not understanding something, or you run into one of scrapy's quirks, how do you go about getting help?
The first thing you should do is read how to ask a good question on Stack Overflow.
It's a brilliant guide by what is, without a doubt, the biggest Q&A website on the web, and it focuses on how to ask a good question regardless of the topic. Following these guidelines not only makes it easy for people to help you, but also makes it easier for you to formulate your question and understand the issue you are facing!
There are a few places you can go with your scrapy-related questions and issues:
The issue with Stack Overflow is that it has a general rule that questions have to be generic, which means that asking how to get the price of a specific item on Amazon is not a fitting question. However, the user base on the
scrapy tag seems to be quite understanding of this and tends to be lenient with reports and down-votes, but don't be surprised if your post gets down-voted or put on hold. All you can do is try to make your issue more generic and hope for the best!
IRC! @ irc.freenode.org #scrapy
Good old IRC has been around for decades, and even though its popularity has dropped significantly, it's still a great place to get help on any subject, and scrapy is no exception. Feel free to join the channel and ask questions about anything scrapy related; you can find me there too!
Scrapinghub is the company behind scrapy and they have a user forum, so naturally it's a great place to look for help when it comes to your scrapy issues!
There's an official scrapy subreddit, which isn't very active, but I can tell you for a fact that a lot of people who are involved with scrapy keep an eye on it. It's a great place for discussions that might not fit Stack Overflow or IRC.
To debug an issue and get the help you need, you have to provide information about your problem:
Once you have these bits you can easily formulate your question and I'm sure someone will help you out!
To save a log of your spider run, you can use UNIX output-redirection syntax:
scrapy crawl myspider > mylog.log 2>&1 # or: scrapy crawl myspider &> mylog.log
scrapy crawl myspider - is the scrapy command that starts crawling with the spider called myspider.
2>&1 - is UNIX syntax for redirecting standard error to standard output. In UNIX there are two output streams (stdout and stderr), and in your log you want both of them in one file. Note that redirections are processed left to right, so 2>&1 must come after the file redirection.
> mylog.log - is another UNIX output redirection; this time we redirect the output to a file called mylog.log.
Tip: points 2 and 3 can be summarized as &> in bash version 4 and up.
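To see what the two redirections actually do, here's a small shell sketch with a hypothetical stand-in command (the emit function below just simulates a program that writes to both streams, like scrapy crawl myspider does):

```shell
# Hypothetical stand-in for "scrapy crawl myspider": writes to both streams.
emit() {
    echo "item scraped"            # goes to stdout
    echo "error: page failed" >&2  # goes to stderr
}

# Redirect stdout to the file, then point stderr at the same place.
# Order matters: "emit 2>&1 > mylog.log" would leave stderr on the terminal.
emit > mylog.log 2>&1

# bash 4+ shorthand for the same thing:
emit &> mylog.log

cat mylog.log
```

Either way, mylog.log ends up containing both the regular output and the error line.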
For logging, scrapy uses Python's built-in logging module, which by itself is pretty awesome! If you look into it, it might appear quite daunting, but to get simple logging in your spider you can just import logging and log a message to the root logger: logging.warning("this page has no next page").
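As a minimal sketch, logging from spider code might look like this (the handle_page callback and its logic are made up for illustration, not part of scrapy's API; in a real run scrapy configures logging for you, so basicConfig here just makes the example self-contained):

```python
import logging

# Scrapy normally configures logging itself; this makes the sketch standalone.
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger("myspider")  # hypothetical spider name

def handle_page(has_next_page):
    # Hypothetical callback: warn when a listing page has no "next" link.
    if not has_next_page:
        logger.warning("this page has no next page")
        return False
    return True

handle_page(True)   # logs nothing
handle_page(False)  # emits the warning
```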
Scrapy can automatically produce output in one of these formats:
'xml', 'jsonlines', 'jl', 'json', 'csv', 'pickle', 'marshal'
To do that, simply run the crawl command with the --output flag (-o for short) and provide a file name, with the extension of the format you want, as an argument:
scrapy crawl myspider --output output.json
This will output all items your spider spews out to output.json.
For readability purposes you probably want to use either json or xml, since those are the most readable and, as described in the section below, parsing-friendly formats.
Tip: You can actually tell scrapy to write output directly to stdout by setting the output argument to - (a dash):
scrapy crawl myspider -t json -o -
There are a few tools for parsing json and xml content on the command line, similar to how you'd use grep in UNIX. The most popular and widely known is probably jq, which I believe stands for "json query".
I personally really dislike that jq uses its own mini-language instead of the XPath or CSS selectors we all know, love, and use daily.
So in response to this I made PQ! It uses XPath and CSS selectors and supports both json and xml parsing.
In short, with the tools described above you can find specific field values really easily.
Let's imagine we have a bunch of products with two fields: name and price. Now, for some reason, Samsung items have weird pricing, and we want to find out whether that's still the case every time we update the code.
For example using pq we can navigate the prices of items that have some keywords in their names:
cat output.json | pq "//item[contains(@name,'samsung')]/price/text()"
This will find all items that contain "samsung" in the name and output their price values. If you change your spider and run this command again, you can easily check whether the values have changed.
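Conceptually, that query does something like the following plain-Python sketch (the sample data here is made up, shaped like the name/price items described above):

```python
import json

# Made-up sample data in the shape the spider might output.
raw = '[{"name": "samsung tv", "price": 299}, {"name": "lg tv", "price": 250}]'

items = json.loads(raw)
# Equivalent of //item[contains(@name,'samsung')]/price/text():
prices = [item["price"] for item in items if "samsung" in item["name"]]
print(prices)  # -> [299]
```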
You can combine this with scrapy's stdout output redirection to have everything in one line:
scrapy crawl spider --nolog -t json -o - | pq "//item[contains(@name,'samsung')]/price/text()"
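For comparison, the same filter written in jq's own mini-language might look like this (assuming output.json holds a flat list of objects with name and price fields, which is a simplification of real spider output):

```shell
# Create some sample data in the assumed shape.
echo '[{"name": "samsung tv", "price": 299}, {"name": "lg tv", "price": 250}]' > output.json

# Select items whose name contains "samsung" and print their prices.
jq '.[] | select(.name | contains("samsung")) | .price' output.json
```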
Scrapy is a lovely framework, and web crawling is a tricky subject with a lot of hidden issues, quirks, and complexities. Because it is a rather big subject and every spider has its own challenges, it might be difficult to find help. However, I feel that if you follow the steps and ideas described in this blog post, you'll have a really good chance of getting help, whether on Stack Overflow or IRC!
Do you have any places you go to with your scrapy or web-crawling related questions? Did I miss something important? Leave a comment below :)