How to use scrapy to feed an SQLite database

Spread the love

Scrapy console

Fetch a Page and Send It To the Browser

First thing to do now is open the source code of an example page we need to work on. However, we won’t just do it from the browser directly. Scrapy cannot handle CSS includes and JavaScript correctly.

Thus, we need to work on the source code that Scrapy will actually see. To do so, launch this command from the command line environment:

C:\Python27> scrapy view https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/6764

This will open the page scraped by Scrapy in your browser. From there you can hit F12 to open the debug console and work on the source.

 

Fetch the Page to Scrapy Shell

From the command line, open a Scrapy shell:

C:\Python27> scrapy shell

 

Now, from the python command line, fetch the page again:

fetch("https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/6764") 

response.css('title::text').extract_first() 
# u'ITEC 470 - 101'

You may see the whole HTML code fetched in the object response.text.

Let’s explore the HTML code to find ways to access the data.

 

Explore the Structure

[collapsibles]
[collapse title=”Example 1: title” active=”true”]

Let’s explore the title of the class:

tu mobile classes ITEC470 101 title

This value, ITEC 470 - 101 can be found in 2 different places: in the head <title> attribute (it’s the title of the tab in your browser), and also later in the code under an h1 tag:

<head> 
<meta charset="utf-8"> 
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> 
  <title>ITEC 470 - 101</title> 
  ... 
  <h1 class="page-title with-back-btn">ITEC 470 - 101</h1> 
  ...

 

From there, there is 2 different ways to access the text values from the response object: via the css or xpath methods. Both these methods have methods too: extract() and extract_first().

  • css: works with CSS attributes: HTML tags and classes
    response.css('title::text').extract_first() 
    # u'ITEC 470 - 101' 
    response.css('h1.page-title::text').extract_first() 
    # u'ITEC 470 - 101' 
    response.css('.page-title::text').extract_first() 
    # u'ITEC 470 - 101' 
    response.css('.page-title::text').extract() 
    # [u'ITEC 470 - 101']

    As you can see, there is more than one way to get this particular value as a string with extract_first(). Note that extract() will return a list.

 

  • xpath:
    response.selector.xpath('//title/text()').extract_first() 
    # u'ITEC 470 - 101' 
    response.selector.xpath('//h1/text()').extract() 
    # [u'ITEC 470 - 101']

    The syntax is slightly different but the outputs are the same. Most blogs and StackExchange questions I found concentrate on xpath, though. Maybe it’s more powerful, I don’t really know. And to be honest, I couldn’t manage to properly get the tags’ text with specific classes, and also I do not like its syntax.

Since everything can be done with the css method anyway, this tutorial will be about the css method only.

 

[/collapse]
[collapse title=”Example 2: topic”]

Let’s explore the topic of the class:

tu mobile classes ITEC470 101 topic

This text can be found under the following tags:

<section itemscope itemtype="http://schema.org/Class"> 
  <div class="primary-head" style="" itemprop="name"> SPECIAL TOPICS IN INFORMATION TECHNOLOGY </div> 
  ...

Extraction:

response.css('.primary-head::text').extract_first() 
# u'\n \n \n \n \n SPECIAL TOPICS IN INFORMATION TECHNOLOGY \n \n '

As you can see, we ran into our first issue. The text extraction outputs also blanks and carriage returns. We will need to create a cleanup function later on.

 

[/collapse]
[collapse title=”Example 3: attributes”]

Let’s explore the first attributes of a class, Session and Class Number:

tu mobile classes ITEC470 101 session

tu mobile classes ITEC470 101 ClassNumber

Well, it turns out that all the class attributes on the page are coded in the same way:

<div class="section-content clearfix" style="">
  <div class="pull-left">
    <div class="strong">Session</div>
  </div>
  <div class="pull-right">
    <div>Regular Academic Session</div>
  </div>
</div>

<div class="section-content clearfix" style="">
  <div class="pull-left">
    <div class="strong">Class Number</div>
  </div>
  <div class="pull-right">
    <div>6764</div>
  </div>
</div>

All the attributes are found under a div.section-content > div.pull-left > div so we code it exactly in the same order: div.section-content div.pull-left div::text

response.css('div.section-content div.pull-left div::text').extract() 
# [u'Session', u'Class Number', u'Career', u'Units', u'Grading', u'Topic', u'Description', u'Add Consent', u'Enrollment Requirements', u'Instructor(s)', u'Topic', u'Meets', u'Dates', u'Room', u'Campus', u'Location', u'Components', u'Status', u'Seats Taken', u'Seats Open', u'Class Capacity', u'Wait List Total', u'Wait List Capacity']

 

[/collapse]
[collapse title=”Example 4: values”]

Same code slightly different for the values, found under a div.section-content > div.pull-right > div :

response.css('div.section-content div.pull-right div::text').extract() 
# [u'Regular Academic Session', u'7608', u'Undergraduate', u'3 units', u'UNDERGRADUATE GRADING', u'Network & Security Practicum', u'Studies in selected areas of information technology. May be repeated for a maximum of 6 units provided a different topic is taken. Prerequisites: 12 units of ITEC, CIS, or COSC courses.', u'Department Consent Required', u'ITEC 470 requires 12 units of ITEC, CIS and /or COSC.', u'Jeffrey Hanson', u'COSC & DIAR/CPS Initiative', u'Mo 6:00PM - 8:40PM', u'08/28/2017 - 12/19/2017', u'Y20111', u'Main Academic Campus', u'On Campus', u'Closed', u'15', u'0', u'15', u'0', u'0']

 

[/collapse]
[/collapsibles]

 

Cleaning Up the Fetched Data

As seen in Examples 3 & 4, we successfully fetched the list of values for both class attributes and values. However, a quick verification will show that sometimes the number of items is not the same in both outputs. Because of that, we cannot directly zip those outputs into a dictionary. Bellow are two examples and how to deal with them. Your mileage may vary.

[collapsibles]
[collapse title=”Problem 1: missing items” active=”true”]

If the number of items is lower in the values:

>>> keys=response.css('div.section-content div.pull-left div::text').extract() 
>>> values=response.css('div.section-content div.pull-right div::text').extract() 
>>> len(keys) 
23 
>>> len(values) 
22

 

Long story short, it turns out that one of the div in the returned values is empty. Because of that, div::text returns null, and so we miss an item in our list. The solution to this is to re-extract the values from an upper level: div.section-content > div.pull-right > div

This will return the HTML code between the div.pull-right. Then we just need to find the empty div and insert something like 0 or “TBA”. So we extract with the same code but without ::text at the end:

prevalues=response.css('div.section-content div.pull-right div').extract() 
prevalues 
# [u'<div>Regular Academic Session</div>', u'<div>6764</div>', u'<div>Undergraduate</div>',.. 

# get indices of items containing "<div></div>" in the prevalues list: 
indices = [i for i, x in enumerate(prevalues) if x == "<div></div>"] 

# for each empty indice, we insert something in the list: 
for i in indices: values.insert(i, "TBA")

 

[/collapse]
[collapse title=”Problem 2: items to remove”]

If you decide on the other hand that you need to remove an item from the keys for example, just define a list of items to remove and post-process the list after extraction. You may also need to rename some items:

# define a list of items to remove: 
keys2remove=['Components'] 

# Components is undefined in the values, Topic has a duplicate, and sometimes ClassCapacity=CombinedSectionCapacity because they divide big classes in 2 sub-sections
# remove multiple items from a list, against another one: 
keys = [e for e in keys if e not in keys2remove] 

# rename CombinedSectionCapacity to ClassCapacity: 
keys=['ClassCapacity' if x=='CombinedSectionCapacity' else x for x in keys]

That’s one of the many issues you may run into for your project. Hopefully, Python is powerful enough to provide one-liner solutions like these.

 

[/collapse]
[collapse title=”Problem 3: keep only alphanum + convert unicode to ascii”]

We do not want spaces or blank characters in key names because we’re gonna use them as column names for the inserts in SQLite database. Also we do not want unicode (the little ‘u’ before our output) so let’s convert the items to ascii as well:

# cleanup keys by removing non alphanum characters AND re-encode everything to ASCII: 
keys = [re.sub(r'\W+', '', i).encode("utf-8") for i in keys]

 

[/collapse]
[/collapsibles]