Jun 08, 2020
Python · 5 min read
Scrapy is a high-level web scraping framework with use cases ranging from data mining to automated testing. Like Selenium's automation of user interactions, Scrapy can crawl and interact with webpages. However, Scrapy is considered the better choice for larger datasets, and it has a larger collection of related projects and plugins. Let's get started.
1. Create a virtual environment
Windows:
C:\Users\Owner> cd desktop
C:\Users\Owner\desktop> py -m venv scrap
C:\Users\Owner\desktop> cd scrap
C:\Users\Owner\desktop\scrap> Scripts\activate
(scrap)C:\Users\Owner\desktop\scrap>
(scrap)C:\Users\Owner\desktop\scrap>pip install scrapy
(scrap)C:\Users\Owner\desktop\scrap>scrapy startproject myproject
At this point, Scrapy will set up our project structure as follows:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            # empty until we add a spider
In this example, we will scrape Quicken Loans mortgage reviews from creditkarma.com. Open myproject in your text editor (we recommend Sublime Text) and create a new file spider1.py in the myproject/spiders folder. The following example scrapes data by selecting elements via CSS.
scrapy.Spider: the base class that every spider subclasses
name: must be a unique identifier between spiders
start_urls: the URLs to be scraped
get(): returns one element matching the CSS selector
getall(): returns all elements that match the CSS selector

import scrapy
class ReviewSpider(scrapy.Spider):
    name = "quicken"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        reviews = response.css('.readmoreInner p::text').getall()
        yield {"text": reviews}
cd into the spiders folder from your command line:

(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken
(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken -o reviews.json
This will output all scraped information into a JSON file located in the spiders folder.
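To sanity-check the result, you can load the file back into Python. This is a minimal sketch and assumes the reviews.json created by the -o flag above is in your working directory:

import json

# reviews.json is a JSON array with one object per item the spider yielded
with open("reviews.json", encoding="utf-8") as f:
    items = json.load(f)

for item in items:
    # "text" is the key yielded by ReviewSpider.parse
    print(item["text"])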
This time, let's loop through multiple pages by identifying the next page button and adding its selector to our spider. Note this is a new spider titled spider2.py.
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "quicken2"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        review = response.css('.readmoreInner p::text').getall()
        yield {"text": review}

        # the html container for the next page arrow
        NEXT_PAGE_SELECTOR = '.pagination-link.next-page ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
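Run it just like before to collect every page into one file (reviews2.json here is just an example filename):

(scrap) C:\Users\Owner\Desktop\code\scrap\myproject\myproject\spiders>scrapy crawl quicken2 -o reviews2.json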
You can also create items for larger datasets to keep your data organized. Define the fields in myproject/items.py:
import scrapy

class ReviewItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    date = scrapy.Field()
Then import the item into your spider and fill in its fields:

import scrapy
from ..items import ReviewItem

class ReviewSpider(scrapy.Spider):
    name = "quicken2"
    start_urls = [
        "https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/",
    ]

    def parse(self, response):
        items = ReviewItem()
        items['text'] = response.css('.readmoreInner p::text').getall()
        items['date'] = response.css('.review-date::text').getall()
        yield items

        # NEXT_PAGE_SELECTOR = '.pagination-link.next-page ::attr(href)'
        # next_page = response.css(NEXT_PAGE_SELECTOR).get()
        # if next_page:
        #     yield scrapy.Request(
        #         response.urljoin(next_page),
        #         callback=self.parse
        #     )
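If you want to clean or validate these items as they come in, the pipelines.py file that startproject generated is the usual place to do it. Here is a minimal sketch; the whitespace-stripping step is only an illustration of the kind of processing you might do:

# myproject/pipelines.py
class ReviewPipeline:
    def process_item(self, item, spider):
        # strip stray whitespace from each scraped review before it is exported
        item['text'] = [t.strip() for t in item.get('text', [])]
        return item

Enable it by adding ITEM_PIPELINES = {"myproject.pipelines.ReviewPipeline": 300} to settings.py.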
Also, you can use a shell prompt to test a few lines of your code before running the spider completely:

cd to the spiders folder
scrapy shell '(domain of website)' to open a shell in Terminal
response.css('(CSS_selector)::text').get() to test a selector
quit() to exit the shell
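For example, checking the review selector used above would look something like this (the output depends on the live page):

scrapy shell 'https://www.creditkarma.com/reviews/mortgage/single/id/quicken-loans-mortgage/'
>>> response.css('.readmoreInner p::text').get()
>>> response.css('.readmoreInner p::text').getall()
>>> quit()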
As you continue to web scrape, check out these resources to help you stay anonymous and avoid getting blocked:
Rotate IP addresses/proxies
scrapy-rotating-proxies : list your own proxies
https://github.com/TeamHG-Memex/scrapy-rotating-proxies
scrapy-proxy-pool : auto-generated list of proxies
https://github.com/rejoiceinhope/scrapy-proxy-pool
Rotate and spoof user agents (requests appear to be coming from different browsers)
scrapy-user-agents: auto-generated list of user-agents
https://pypi.org/project/scrapy-user-agents/
scrapy-useragents : list your own user agents
https://pypi.org/project/Scrapy-UserAgents/
Pretend to be Google Bot
http://www.google.com/bot.html
List of user-agent strings
Use headless browsers
Selenium
https://www.selenium.dev/ (Tutorial using headless browser: https://www.scrapehero.com/tutorial-web-scraping-hotel-prices-using-selenium-and-python/)
PhantomJS
Google's headless chrome
https://developers.google.com/web/updates/2017/04/headless-chrome
Reduce the crawling rate
TorRequests and Python
https://www.scrapehero.com/make-anonymous-requests-using-tor-python/
Scrapy AutoThrottle extension (see the settings sketch after this list)
https://docs.scrapy.org/en/latest/topics/autothrottle.html
New Scrapy projects obey each site's robots.txt by default (the generated settings.py sets ROBOTSTXT_OBEY = True); some sites don't have one.
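To put the last two points into practice, here is a minimal sketch of the relevant settings.py entries; the delay values are illustrative, not recommendations:

# settings.py (generated by scrapy startproject)

# obey each site's robots.txt (True by default in new projects)
ROBOTSTXT_OBEY = True

# reduce the crawling rate and let AutoThrottle adapt it to server load
DOWNLOAD_DELAY = 1                # base delay between requests, in seconds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1      # initial download delay
AUTOTHROTTLE_MAX_DELAY = 10       # maximum delay under high latency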
Follow us @ordinarycoders