March 28, 2024

IT Cooking

Success is just one script away

How to use scrapy to feed an SQLite database


Database initialization

Install the DB browser

Definition of the database

It is clear that the only variable attributes are under Class Availability. Thus, we need at least 2 tables:

  • one main table for the class title, number, description, etc
  • one history table for the variable attributes: Status, Seats Taken, etc

The analysis of the URLs also gives us precious information:

https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1174/6891

Each class has a unique number (here 6891) that we can use as the primary key for the main classes table. The first 4-digit number (1174) encodes the term of the class, in the form 1-YY-S: a leading 1, the two-digit year, then a final digit for the semester:

  • 1: Winter minimester
  • 2: Spring
  • 3: Summer
  • 4: Fall
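The numbering scheme above can be sketched as a small helper; `decode_term` is a hypothetical name for illustration, not part of the scraper:

```python
# Hypothetical helper: decode a 4-digit term number like 1174
# into (year, semester), following the 1-YY-S pattern described above.
SEMESTERS = {1: "Winter minimester", 2: "Spring", 3: "Summer", 4: "Fall"}

def decode_term(term):
    term = int(term)
    year = 2000 + (term // 10) % 100   # "17" -> 2017
    semester = SEMESTERS[term % 10]    # last digit -> semester name
    return year, semester

print(decode_term(1174))  # (2017, 'Fall')
```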


For the column names, let’s use the attributes as they are shown, without spaces. We can now open DB Browser for SQLite Portable, create a sample database, and define our first table, classes:

DB browser for SQLite table classes

Every column will be TEXT except the primary key; we can always change the types later if needed.

The history table will be defined the same way:

DB browser for SQLite table history
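As a sketch, both tables could also be created directly with Python’s sqlite3 module. The column names below are assumptions based on the attributes mentioned in this article (Title, Description, Status, Seats Taken), not the exact schema from the screenshots:

```python
import sqlite3

# Use ":memory:" for a throwaway test; substitute your real
# tuScraper.0.sqlite3 path to create the tables for good.
db = sqlite3.connect(":memory:")
cur = db.cursor()

# Only the PK is typed as INTEGER; everything else stays TEXT.
cur.execute("""CREATE TABLE IF NOT EXISTS classes(
    ClassNumber INTEGER PRIMARY KEY,  -- e.g. 6891, unique per class
    Term        TEXT,                 -- e.g. 1174
    Title       TEXT,
    Description TEXT)""")

cur.execute("""CREATE TABLE IF NOT EXISTS history(
    ClassNumber INTEGER,              -- refers back to classes
    Date        TEXT,                 -- when the snapshot was taken
    Status      TEXT,
    SeatsTaken  TEXT)""")
db.commit()
```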


Finally, I decided to add a fields table, to keep the original names of the attributes, with spaces and capitalization:

DB browser for SQLite table fields

[callout type="info" size="lg"]

Create this table with Python

You can create this table and leave it empty: we will recreate it and insert rows from the Python console to validate the connection to the database!

[/callout]


Access SQLite Database from Python

Let’s say the database name and path is <Python path>\tuScraper\tuScraper.0.sqlite3

Now open a new Python environment command prompt with the batch file you created earlier.

Launch a scrapy shell with a class URL, any class:

scrapy shell https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/6764

You should have this log and the Python prompt:

2018-01-13 01:03:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-13 01:03:43 [scrapy.core.engine] INFO: Spider opened
2018-01-13 01:04:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/6764> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x042F7D50>
[s] item {}
[s] request <GET https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/6764>
[s] response <200 https://mytumobile.towson.edu/app/catalog/classsection/TOWSN/1182/6764>
[s] settings <scrapy.settings.Settings object at 0x042F7E50>
[s] spider <DefaultSpider 'default' at 0x4532490>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>

All good! The first thing to do is load the sqlite3 and re modules:

>>> import sqlite3, re

Next, create a connection to the database and a cursor:

db = sqlite3.connect('tuScraper/tuScraper.0.sqlite3')
cursor = db.cursor()
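If you want to confirm the connection actually points at the right file, a quick query against sqlite_master lists its tables. The snippet below is a sketch against an in-memory database, so it prints an empty list; substitute your real tuScraper path and you should see the tables you created in DB Browser:

```python
import sqlite3

# Replace ":memory:" with your real path, e.g. 'tuScraper/tuScraper.0.sqlite3'
db = sqlite3.connect(":memory:")
cursor = db.cursor()

# Sanity check: sqlite_master holds the schema, one row per table.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cursor.fetchall()]
print(tables)  # [] on a fresh in-memory database
```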

[callout type="danger" size="lg"] An error message at this point means the path to the database is wrong. Check your environment. [/callout]


Now, DROP and CREATE the fields table:

>>> cursor.execute("DROP TABLE IF EXISTS fields")
<sqlite3.Cursor object at 0x04489FA0>
>>> cursor.execute("CREATE TABLE IF NOT EXISTS fields(field TEXT PRIMARY KEY, desc TEXT)")
<sqlite3.Cursor object at 0x04489FA0>

The cursor objects echoed back are the sign that it worked!


Import the Fields into the Fields Table

As seen in the Scrapy console part, we will scrape the attributes and their values with the css method into the variables keys and values. We then remove the “Components” attribute from keys and merge keys and values into a dictionary with the zip method. This also eliminates the duplicate “Topic” attribute, since a dictionary can only hold unique keys:

keys=response.css('div.section-content div.pull-left div::text').extract()
keys.remove('Components')
values=response.css('div.section-content div.pull-right div::text').extract()
classDict = dict(zip(keys,values))
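The deduplication behavior is easy to verify on its own. The sample keys and values below are hypothetical stand-ins for what the css selectors return:

```python
# dict keys are unique, so when zip() pairs a duplicate key,
# the later value silently wins.
keys = ["Status", "Topic", "Topic", "Seats Taken"]  # hypothetical sample
values = ["Open", "A", "B", "12"]
classDict = dict(zip(keys, values))
print(classDict)  # {'Status': 'Open', 'Topic': 'B', 'Seats Taken': '12'}
```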

Finally, we loop over the dictionary keys and run an SQL INSERT statement for each one to fill the fields table:

for key in classDict:
  cursor.execute("INSERT INTO fields(field, desc) VALUES(?,?)", (re.sub(r'\W+', '', key), key))
<sqlite3.Cursor object at 0x04489FA0>
... (one cursor object echoed per inserted row)
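As an alternative to the explicit loop, cursor.executemany performs the same inserts in one call and keeps the console free of echoed cursor objects. The classDict below is hypothetical sample data standing in for the scraped attributes:

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")  # substitute your real tuScraper path
cursor = db.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS fields(field TEXT PRIMARY KEY, desc TEXT)")

classDict = {"Seats Taken": "12", "Status": "Open"}  # hypothetical sample

# One INSERT per tuple: (name stripped of non-word characters, original name)
cursor.executemany(
    "INSERT INTO fields(field, desc) VALUES(?,?)",
    [(re.sub(r'\W+', '', key), key) for key in classDict],
)
db.commit()
```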

No error? Let’s commit! Then hit CTRL+Z followed by Enter to exit:

db.commit()
db.close()
^Z
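For what it’s worth, the commit can also be expressed with the connection as a context manager, which commits on success and rolls back on an exception. Note that it does not close the connection; db.close() is still needed. A minimal sketch on an in-memory database:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # substitute your real tuScraper path
db.execute("CREATE TABLE IF NOT EXISTS fields(field TEXT PRIMARY KEY, desc TEXT)")

# The with block wraps the INSERT in a transaction and commits on exit.
with db:
    db.execute("INSERT INTO fields(field, desc) VALUES(?,?)",
               ("SeatsTaken", "Seats Taken"))

rows = db.execute("SELECT field, desc FROM fields").fetchall()
db.close()
```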

Now open the database with DB Browser and browse the data in fields:

tuscraper-database-fields-data

You see that? Bingo! You successfully created a table and inserted rows into your SQLite database from the Python command line!

Copyright IT Cooking© All rights reserved. | Production by Doctus IT LLC.