Beginner's Guide to Scrapy for Python

5 min read


Scrapy is a high-level web scraping framework with use cases ranging from data mining to automated testing.  Similar to automating user interaction with Selenium, Scrapy can crawl and interact with webpages.  However, Scrapy is considered a better choice for working with larger datasets, and it also has a larger collection of related projects and plugins.  Let's get started.

1. Create a virtual environment

Windows:

C:\Users\Owner> cd desktop
C:\Users\Owner\desktop> py -m venv scrap
C:\Users\Owner\desktop> cd scrap
C:\Users\Owner\desktop\scrap> Scripts\activate
(scrap)C:\Users\Owner\desktop\scrap>
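
If you are on macOS or Linux, the equivalent steps would look something like this (assuming python3 is on your PATH):

$ python3 -m venv scrap
$ cd scrap
$ source bin/activate
(scrap) $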


2. Install scrapy

  • Install scrapy within your activated virtual environment
(scrap)C:\Users\Owner\desktop\scrap>pip install scrapy
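
To confirm the install worked, you can print the installed version from the same prompt:

(scrap)C:\Users\Owner\desktop\scrap>scrapy version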


3. Create a scrapy project

  • Create a scrapy project named "myproject"
scrapy startproject myproject

At this point, Scrapy will have set up our project structure as follows:

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            #empty until we add a spider
    


4. Create a basic spider

In this example, we will scrape Quicken Loans mortgage reviews from creditkarma.com. Open myproject in your text editor.  We recommend Sublime Text. Create a new file spider1.py in the myproject/spiders folder.  The following example scrapes data by selecting elements via CSS.  

  • Our spider subclasses scrapy.Spider
  • name must be a unique identifier among your spiders
  • start_urls are the URLs where the spider begins crawling
  • get() returns the first element that matches the CSS selector
  • getall() returns all elements that match the CSS selector
  • parse() extracts data from a response containing page content
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "quicken"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        reviews = response.css('.readmoreInner p::text').getall()
        yield {"text": reviews}

We defined the content to scrape with a CSS query called on the response object.  To identify the actual reviews, we inspected the webpage with the browser's developer tools and found that each review is in a paragraph element nested within a div with a class attribute of "readmoreInner".  Then we just write the corresponding CSS query as if we wanted to add a CSS property to that text.

https://d2gdtie5ivbdow.cloudfront.net/articles/creditkarma-quicken.PNG
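
As a quick illustration of the difference between get() and getall() with this selector (assuming response is a Scrapy response for the reviews page; the comments are placeholders, not real output):

first_review = response.css('.readmoreInner p::text').get()      # first matching text node, or None
all_reviews = response.css('.readmoreInner p::text').getall()    # list of every matching text node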


5. Run the spider

  • cd into the spiders folder from your command line
  • Run the following command to start the spider
(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken

You will see a good amount of information output in your command prompt/terminal.  Some of this data includes which pages were scraped, the number of requests sent, and the request methods.  You will also see a dictionary of all the text we scraped; in this example, that is every review on the page.  Instead of just printing the data to your command prompt, let's save it as a JSON file with the following command:


(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken -o reviews.json

This will output all scraped information into a JSON file located in the spiders folder.
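
The export format is inferred from the file extension, so the same command can also write CSV or JSON Lines, for example:

(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken -o reviews.csv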


6. Create a more advanced spider

This time, let's loop through multiple pages by identifying the next-page button and adding it to our spider.  Note that this is a new spider, saved as spider2.py.

https://d2gdtie5ivbdow.cloudfront.net/articles/cssquicken.PNG

import scrapy

class ReviewSpider(scrapy.Spider):
    name = "quicken2"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        reviews = response.css('.readmoreInner p::text').getall()
        yield {"text": reviews}

        # CSS selector for the href of the next-page arrow link
        NEXT_PAGE_SELECTOR = '.pagination-link.next-page ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
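
Each time a next-page link is found, the spider yields a new Request back into the same parse() callback, so the crawl keeps following pages until no next-page link is left.  response.urljoin() converts the relative href into an absolute URL before the request is made.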


7. Using item containers

You can also create items for larger data sets to keep your data organized.

  • add the following code to your items.py file
import scrapy

class ReviewItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    date = scrapy.Field()
  
  • add the following code to your spider file (spider2.py in this example)
import scrapy
from ..items import ReviewItem


class ReviewSpider(scrapy.Spider):
    name = "quicken2"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        items = ReviewItem()

        items['text'] = response.css('.readmoreInner p::text').getall()
        items['date'] = response.css('.review-date::text').getall()
        yield items

        # Uncomment to keep following the pagination, as in spider2.py:
        # NEXT_PAGE_SELECTOR = '.pagination-link.next-page ::attr(href)'
        # next_page = response.css(NEXT_PAGE_SELECTOR).get()
        # if next_page:
        #     yield scrapy.Request(
        #         response.urljoin(next_page),
        #         callback=self.parse
        #     )
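
From the spiders folder, you can export the results the same way (for example scrapy crawl quicken2 -o reviews.json), and each item will now contain both the review text and the review date.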


8. Running your spider in a shell prompt

You can also use the Scrapy shell to test a few lines of your code before running the spider in full; an example session is shown after the steps below:

  • cd to the spiders folder
  • type scrapy shell "(URL of the page)" to open an interactive shell in your terminal
    • type in response.css('(CSS_selector)::text').get()
    • check the result
  • to quit the shell prompt, type quit()
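
A quick session against the reviews page might look something like this (the returned text is a placeholder, not a real review):

(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy shell "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/"
>>> response.css('.readmoreInner p::text').get()
'...first review text...'
>>> quit()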


9. Rotating proxies

As you continue to web scrape, check out these resources to help you stay anonymous and avoid getting blocked:

  1.