How to use scrapy to feed an SQLite database


Working Scrapy Web Spider With Comments

Complete Python Script

Maybe a better idea would be to start this tutorial with the complete script and then break it apart to explain how it works? Let me know in the comments. Usually that's what I'm looking for when I search the web for examples.

# -*- coding: utf-8 -*-
import scrapy, sqlite3, re, datetime, arrow, sys, logging
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor

version = 6.0
# Define numeric fields:
numerics = ['ClassNumber', 'SeatsTaken', 'SeatsOpen', 'ClassCapacity', 'WaitListTotal', 'WaitListCapacity']

# Define keys to remove: "Components" has no value
keys2remove=['Components']

# Define path to database:
database='tuScraper.sqlite3'

# Define SQL statement to check if a class number exists.
# The question mark will be replaced by a parameter:
sqlClassCheck = "SELECT COUNT(1) FROM classes WHERE ClassNumber = (?)"

# Define SQL statement to fetch all class numbers from the database for the loop:
sqlClassFetch = "SELECT ClassNumber FROM classes"

total=0                 # Global counter incremented by each parse() call
modulo=500              # Print a status line every 500 row updates
nums = range(1000,7187) # Range of class numbers to scrape via URL
processRange = False    # Process either the defined range of numbers or the database content

# dict_factory method: transforms each row returned by cursor.execute(...).fetchall()
# from a tuple like (2792,) into a dict like {'ClassNumber': 2792, ...}
def dict_factory(cursor, row):
  d = {}
  for idx, col in enumerate(cursor.description):
      d[col[0]] = row[idx]
  return d

# class name is arbitrary
class Tu1182PipelineSpider(CrawlSpider):
  # this name is the one passed on the command line (scrapy crawl tu1182v6):
  name = 'tu1182v6'
  
  # The base URL of your site; the spider appends a class number to it for each request:
  baseUrl = 'https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/'
  
  # start_urls is the reserved attribute holding the list of URLs the spider will parse:
  start_urls = []
  
  # Create db connection
  db = sqlite3.connect(database)
  
  # Attach the dict factory method to the post-process of row fetching
  db.row_factory = dict_factory
  
  # Create a cursor to communicate with the db
  cursor = db.cursor()
  
  if processRange:
    # This has to be done the very first time to fill the database.
    for num in nums: start_urls.append(baseUrl+str(num))
  else:
    # Execute the sqlClassFetch statement to get the list of class numbers from the database:
    # Output: rows = [{'ClassNumber': 7179, ..}]
    rows = cursor.execute(sqlClassFetch).fetchall()
    
    # Set modulo to 1 to print every updated row if there are fewer than 20 numbers to process:
    if len(rows)<20: modulo=1
  
    # For each row returned, create a URL for the spider to parse:
    for row in rows: start_urls.append(baseUrl+str(row["ClassNumber"]))
  #endif
  
  # This is the spider parse method that will be run in parallel against what's in the start_urls list
  # This method is run for each page.
  # Cache and parallel options are defined in Python27\tuScraper\tuScraper\settings.py
  def parse(self, response):
    # Fetch the attributes list from the page:
    keys = response.css('div.section-content div.pull-left div::text').extract()
    
    # Skip blank pages: process the rest only if there are attributes in the page
    if len(keys) > 0:
      # Get page title = class name + section
      title = response.css('title::text').extract_first()
      
      # Cleanup keys: keep only alpha-num + convert to ascii:
      keys = [re.sub(r'\W+', '', i).encode("utf-8") for i in keys]
      
      # Remove multiple items from the list:
      keys = [e for e in keys if e not in keys2remove]
      
      # Rename "CombinedSectionCapacity" to "ClassCapacity":
      keys = ['ClassCapacity' if x=='CombinedSectionCapacity' else x for x in keys]
      
      # Fetch values list from the page:
      values = response.css('div.section-content div.pull-right div::text').extract()
      ClassNumber = values[keys.index('ClassNumber')]
      
      # If there are more keys than values (some values on the page are empty):
      if len(keys) > len(values):
        # Re-fetch values but one level above:
        prevalues = response.css('div.section-content div.pull-right div').extract()
        
        # Get indices of items containing "<div></div>" (empty value):
        indices = [i for i, x in enumerate(prevalues) if x == "<div></div>"]
        
        # For each empty value, insert a default value at the right index:
        for i in indices: values.insert(i, "TBA")
      
      # On the other hand if there are more values than keys:
      elif len(keys) < len(values):
        # Remove every <br> from the body:
        response = response.replace( body=re.sub(r"<br\s*[\/]?>", "\n", response.body) )
        
        # Re-fetch values again:
        values = response.css('div.section-content div.pull-right div::text').extract()
      #endif
      
      # After these checks, if there are still more values than keys:
      if len(keys) < len(values):
        # Print error message on screen and log it:
        print 'ERROR for CLASS %s: len(keys)=%d len(values)=%d' % (ClassNumber, len(keys), len(values))
        self.logger.error('ERROR for CLASS %s: len(keys)=%d len(values)=%d' % (ClassNumber, len(keys), len(values)))
      self.logger.debug("ClassNumber %s: %d keys + %d values = %s" % (ClassNumber, len(keys), len(values), ','.join(keys)))
      
      # Make a dictionary by zipping keys against values:
      classDict = dict(zip(keys,values))
      
      # Create a secondary dictionary classDictValues with numeric values only:
      classDictValues = { key: classDict[key] for key in numerics }
      
      # Check if current class number is already in database:
      db = sqlite3.connect(database)
      cursor = db.cursor()
      cursor.execute(sqlClassCheck, (ClassNumber,))
      
      # If no row is returned, INSERT a new class definition in the main table:
      if not cursor.fetchone()[0]:
        print 'INSERT CLASS: %s' % (ClassNumber)
        self.logger.info('INSERT CLASS: %s' % (ClassNumber))
        
        # Remove numerics from classDict:
        for unwanted_key in numerics: del classDict[unwanted_key]
        
        # Implode classDict keys with comma separator for the SQL query:
        keys2insert = ','.join(classDict.keys())
        
        # Prepare the query string; ClassNumber and title are bound as parameters
        # instead of being concatenated, so quotes or commas in the title can't break the SQL:
        query_string = "INSERT INTO classes(ClassNumber,title,{}) VALUES (?,?,%s)"
        
        # Append one placeholder per key to insert to the query string.
        query_string = query_string.format(keys2insert) % ','.join('?' * len(classDict.keys()))
        # Output will be: "INSERT INTO classes(ClassNumber,title,Status,Meets,..) VALUES (?,?,?,?,..)"
        
        # Execute the INSERT statement:
        try:
          cursor.execute(query_string, [ClassNumber, title] + classDict.values())
          db.commit()
        except sqlite3.Error, e:
          print "sqlite3.Error %s" % (e)
          self.logger.error("sqlite3.Error %s" % (e))
          # This is how I discovered new fields that are not present on every page:
          # sqlite3.Error table classes has no column named Notes
          # etc
        #endtry
      #endif
      
      # Now we do the same for the history table with the numeric values:
      keys2insert = ','.join(classDictValues.keys())
      
      # arrow converts the current datetime to a Unix timestamp that JavaScript can easily reconvert:
      query_string = "INSERT INTO history(timestamp,{}) VALUES ("+str(arrow.get(datetime.datetime.now()).timestamp)+",%s)"
      query_string = query_string.format(keys2insert) % ','.join('?' * len(classDictValues.keys()))
      
      global total  # get the global variable total to increment it; looks a lot like PHP, eh?
      total+=1
      
      # Test current number of rows processed against modulo to print status or not:
      if (total>=modulo and total%modulo == 0):
        print 'UPDATED HISTO ROWS: %d - Current class: %s' % (total, ClassNumber)
        self.logger.info('UPDATE total HISTO: %d - Current class: %s' % (total, ClassNumber))

      # Update the history table with numeric values:
      try:
        cursor.execute(query_string, classDictValues.values())
        db.commit()
      except sqlite3.Error, e:
        print "sqlite3.Error %s" % (e)
        self.logger.error("sqlite3.Error %s" % (e))
      
      # Uncomment this to log into json format:
      #yield classDict

[callout type="default" size="lg"]

Logging

The log written by self.logger is configured in Python27\tuScraper\tuScraper\settings.py and will end up under Python27\tuScraper\

You can enable a JSON log of your data if you uncomment the last line of the parse method (yield classDict).

[/callout]
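
One thing the script above does not do is create the tables: the classes and history tables must already exist before the first run. Below is a rough, untested sketch of a one-off setup script based on the columns used above. The numeric columns of history come straight from the numerics list; the text columns of classes (Status, Meets, Notes, and so on) and the exact types are my guesses, since they depend entirely on what the catalog pages expose.

# createTables.py - hypothetical one-off setup sketch, not part of the original project.
# Column names are taken from the spider above; the text columns of "classes" are examples only.
import sqlite3

db = sqlite3.connect('tuScraper.sqlite3')
cursor = db.cursor()

# Main table: one row per class, holding the title and the non-numeric attributes.
cursor.execute("""CREATE TABLE IF NOT EXISTS classes (
  ClassNumber INTEGER PRIMARY KEY,
  title TEXT,
  Status TEXT,
  Meets TEXT,
  Notes TEXT)""")

# History table: one row per scrape, holding the numeric attributes plus a timestamp.
cursor.execute("""CREATE TABLE IF NOT EXISTS history (
  timestamp INTEGER,
  ClassNumber INTEGER,
  SeatsTaken INTEGER,
  SeatsOpen INTEGER,
  ClassCapacity INTEGER,
  WaitListTotal INTEGER,
  WaitListCapacity INTEGER)""")

db.commit()
db.close()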

A word about the database inserts: according to Scrapy best practices, updating a database should be done through a pipeline. There are pipeline script examples under <Python path>\tuScraper\tuScraper but I couldn't manage to make them work within two days.

If you know someone who managed to do it, please let me know. Passing data to a pipeline has to be done via a dictionary (the yielded item). Spending extra time just to respect best practices didn't seem worth it.
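
For what it's worth, here is a minimal, untested sketch of what such an SQLite pipeline could look like. It assumes the spider yields its dictionaries (the commented-out yield classDict) and adds a timestamp field to them, which the script above does not currently do. The class name and module path are placeholders; the pipeline would also have to be enabled in settings.py, for example with ITEM_PIPELINES = {'tuScraper.pipelines.TuScraperSQLitePipeline': 300}.

# pipelines.py - a sketch only, not the tutorial's tested code.
# open_spider / close_spider / process_item are the standard Scrapy pipeline hooks.
import sqlite3

class TuScraperSQLitePipeline(object):

  def open_spider(self, spider):
    # Called once when the spider starts: open the database connection here.
    self.db = sqlite3.connect('tuScraper.sqlite3')
    self.cursor = self.db.cursor()

  def close_spider(self, spider):
    # Called once when the spider finishes: commit and close.
    self.db.commit()
    self.db.close()

  def process_item(self, item, spider):
    # Called for every item yielded by parse(); item would be the classDict dictionary.
    # 'timestamp' is assumed to have been added to the dict by the spider.
    self.cursor.execute(
      "INSERT INTO history(timestamp,ClassNumber,SeatsTaken,SeatsOpen,"
      "ClassCapacity,WaitListTotal,WaitListCapacity) VALUES (?,?,?,?,?,?,?)",
      (item.get('timestamp'), item.get('ClassNumber'), item.get('SeatsTaken'),
       item.get('SeatsOpen'), item.get('ClassCapacity'),
       item.get('WaitListTotal'), item.get('WaitListCapacity')))
    return item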


Tuning Scrapy

Once you have something working, it's time for tuning. Spider options are found in Python27\tuScraper\tuScraper\settings.py and are detailed in the Scrapy documentation, so I will just highlight the options I tuned here:

BOT_NAME = 'tuScraper'
LOG_LEVEL = 'INFO'
LOG_FILE = 'tuScraper.log'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'tuScraper (+http://www.yourdomain.com)'

# https://doc.scrapy.org/en/latest/topics/broad-crawls.html
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
REACTOR_THREADPOOL_MAXSIZE = 20

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True # Default: False
HTTPCACHE_EXPIRATION_SECS = 0 # Default: 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [] # Don't cache responses with these HTTP codes
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
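
As a side note, if you prefer to keep this tuning next to the spider code instead of in settings.py, Scrapy also honors a custom_settings dictionary defined on the spider class, which takes precedence over the project settings. A minimal sketch, reusing the same values as above:

# Sketch: per-spider setting overrides via the custom_settings class attribute.
from scrapy.spiders import CrawlSpider

class Tu1182PipelineSpider(CrawlSpider):
  name = 'tu1182v6'

  # Scrapy merges these on top of settings.py for this spider only:
  custom_settings = {
    'CONCURRENT_REQUESTS': 32,
    'HTTPCACHE_ENABLED': True,
    'LOG_LEVEL': 'INFO',
  }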

[callout type="default" size="lg"]

Cache

If enabled, cached pages will be found under Python27\tuScraper\.scrapy\<HTTPCACHE_DIR>\

I didn't notice a huge performance improvement but I like the idea of having the data cached.

[/callout]


MS-DOS Batch to Run the Spider

Optionally, it can put your computer to sleep once it's done! Place it under Python27\tuScraper\ :

@echo OFF

pushd %~dp0
path=%~dp0..;%~dp0..\Scripts;%PATH%

set sleep=n
set /P sleep=sleep computer after scraping? [y/N] 

python ..\Scripts\scrapy.exe crawl tu1182v6
sync64

if /I "%sleep%" EQU "n" goto :END

:SLEEP
rundll32.exe powrprof.dll,SetSuspendState

:END

You can also generate a JSON log of your scraped data by adding the parameter -o logname.json to the crawl command and uncommenting the last line of the Python script (yield classDict).


Wrapping up

This could not have been done in just two days without the good posts and examples from other bloggers, whom I thank dearly. None of them, however, walked through a full analysis and provided a fully working script at the same time.

This tutorial took me weeks to write so I hope it helps someone one day!