Jun 08, 2020
Python · 5 min read
Scrapy is a high-level web scraping framework with use cases ranging from data mining to automated testing. Like Selenium's automation of user interactions, Scrapy can crawl and interact with webpages. However, Scrapy is considered the better choice for larger datasets, and it has a larger collection of related projects and plugins. Let's get started.
1. Create a virtual environment
Windows:
C:\Users\Owner> cd desktop
C:\Users\Owner\desktop> py -m venv scrap
C:\Users\Owner\desktop> cd scrap
C:\Users\Owner\desktop\scrap> Scripts\activate
(scrap)C:\Users\Owner\desktop\scrap>
(scrap)C:\Users\Owner\desktop\scrap>pip install scrapy
(scrap)C:\Users\Owner\desktop\scrap>scrapy startproject myproject
At this point, Scrapy will set up our project structure as follows:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            # empty until we add a spider
In this example, we will scrape Quicken Loans mortgage reviews from creditkarma.com. Open myproject in your text editor (we recommend Sublime Text) and create a new file spider1.py in the myproject/spiders folder. The following example scrapes data by selecting elements via CSS.
scrapy.Spider: the base class that every spider subclasses
name: must be a unique identifier between spiders
start_urls: the URLs to be scraped
get(): returns one element matching the CSS selector
getall(): returns all elements that match the CSS selector

import scrapy
class ReviewSpider(scrapy.Spider):
    name = "quicken"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        reviews = response.css('.readmoreInner p::text').getall()
        yield {"text": reviews}
cd into the spiders folder from your command line:

(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken
(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken -o reviews.json
This will output all scraped information into a JSON file located in the spiders folder.
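To sanity-check the result, you can load the file back into Python. This is a minimal sketch and assumes the reviews.json created by the -o flag above is in your working directory:

import json

# reviews.json is a JSON array with one object per item the spider yielded
with open("reviews.json", encoding="utf-8") as f:
    items = json.load(f)

for item in items:
    # "text" is the key yielded by ReviewSpider.parse
    print(item["text"])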
This time, let's loop through multiple pages by identifying the next page button and adding its selector to our spider. Note this is a new spider titled spider2.py.
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "quicken2"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        review = response.css('.readmoreInner p::text').getall()
        yield {"text": review}

        # the html container for the next page arrow
        NEXT_PAGE_SELECTOR = '.pagination-link.next-page ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
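Run it just like before to collect every page into one file (reviews2.json here is just an example filename):

(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken2 -o reviews2.json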
You can also create items for larger datasets to keep your data organized. Define the fields in myproject/items.py:
import scrapy

class ReviewItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    date = scrapy.Field()
Then import the item into your spider and fill in its fields:

import scrapy
from ..items import ReviewItem

class ReviewSpider(scrapy.Spider):
    name = "quicken2"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        items = ReviewItem()
        items['text'] = response.css('.readmoreInner p::text').getall()
        items['date'] = response.css('.review-date::text').getall()
        yield items

        # NEXT_PAGE_SELECTOR = '.pagination-link.next-page ::attr(href)'
        # next_page = response.css(NEXT_PAGE_SELECTOR).get()
        # if next_page:
        #     yield scrapy.Request(
        #         response.urljoin(next_page),
        #         callback=self.parse
        #     )
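If you want to clean or validate these items as they come in, the pipelines.py file that startproject generated is the usual place to do it. Here is a minimal sketch; the whitespace-stripping step is only an illustration of the kind of processing you might do:

# myproject/pipelines.py
class ReviewPipeline:
    def process_item(self, item, spider):
        # strip stray whitespace from each scraped review before it is exported
        item['text'] = [t.strip() for t in item.get('text', [])]
        return item

Enable it by adding ITEM_PIPELINES = {"myproject.pipelines.ReviewPipeline": 300} to settings.py.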
Also, you can use a shell prompt to test a few lines of your code before running the spider completely:

cd to the spiders folder
scrapy shell '(domain of website)' to open a shell in Terminal
response.css('(CSS_selector)::text').get() to test a selector
quit() to exit the shell
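For example, checking the review selector used above would look something like this (the output depends on the live page):

scrapy shell 'https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/'
>>> response.css('.readmoreInner p::text').get()
>>> response.css('.readmoreInner p::text').getall()
>>> quit()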
As you continue to web scrape, check out these resources to help you stay anonymous and avoid getting blocked:
Rotate IP addresses/proxies
scrapy-rotating-proxies : list your own proxies
https://github.com/TeamHG-Memex/scrapy-rotating-proxies
scrapy-proxy-pool : auto-generated list of proxies
https://github.com/rejoiceinhope/scrapy-proxy-pool
Rotate and spoof user agents (requests appear to be coming from different browsers)
scrapy-user-agents: auto-generated list of user-agents
https://pypi.org/project/scrapy-user-agents/
scrapy-useragents : list your own user agents
https://pypi.org/project/Scrapy-UserAgents/
Pretend to be Google Bot
http://www.google.com/bot.html
List of user-agent strings
Use headless browsers
Selenium
https://www.selenium.dev/ (Tutorial using headless browser: https://www.scrapehero.com/tutorial-web-scraping-hotel-prices-using-selenium-and-python/)
PhantomJS
Google's headless chrome
https://developers.google.com/web/updates/2017/04/headless-chrome
Reduce the crawling rate
TorRequests and Python
https://www.scrapehero.com/make-anonymous-requests-using-tor-python/
Scrapy AutoThrottle extension (see the settings sketch after this list)
https://docs.scrapy.org/en/latest/topics/autothrottle.html
New Scrapy projects obey each site's robots.txt by default (the generated settings.py sets ROBOTSTXT_OBEY = True); some sites don't have one.
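To put the last two points into practice, here is a minimal sketch of the relevant settings.py entries; the delay values are illustrative, not recommendations:

# settings.py (generated by scrapy startproject)

# obey each site's robots.txt (True by default in new projects)
ROBOTSTXT_OBEY = True

# reduce the crawling rate and let AutoThrottle adapt it to server load
DOWNLOAD_DELAY = 1                # base delay between requests, in seconds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1      # initial download delay
AUTOTHROTTLE_MAX_DELAY = 10       # maximum delay under high latency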
Follow us @ordinarycoders