Extracting Embedded Files from SQLite Databases using Python

In this post, Alex Caithness – member of CCL Forensics’ R&D Team and Python enthusiast – examines how Python can be used to automate the extraction of embedded files from SQLite databases.

Over the last few weeks I’ve come across a number of cases, when working with SQLite databases, where I’ve found embedded files in one or more of the fields; naturally, I wanted to get at these files for review or further analysis with other tools. Generally, files “embedded” in a database will be stored in “BLOB” fields (BLOB stands for “Binary Large OBject”, which is just database parlance for arbitrary binary data), unless the file is purely text (e.g. an HTML file), in which case it could be stored in a text field.

[Screenshot: BLOB fields in the “cfurl_cache_receiver_data” table, viewed in SQLite Expert]

Unfortunately, in most database viewers, exporting these files en masse turns out to be an arduous task at best and a downright faff at worst. Take my SQLite viewer of choice, “SQLite Expert”, as an example: what is in most other respects an incredibly well-featured database viewer becomes very frustrating if you need to export multiple BLOB fields at once: right-click on the field, open the field in the internal hex/image/text editor, click save, choose an output location, click OK – and repeat for every record in the table… Not an efficient use of your time if you have more than a few records!

[Screenshot: exporting a single BLOB field by hand in SQLite Expert]

Luckily I have another powerful tool to hand in the form of Python – see my previous blog post “An analyst enthuses about Python”. In just a few lines of code I had a working solution and for the rest of this post I’d like to share with you a simplified version of the script and walk you through how it works. The script exports the contents of a given field from every record of a given table in the database, to files for examination outside of the database. We will work with the database shown in the screenshots above which is from the Safari web cache file (cache.db) in the table called “cfurl_cache_receiver_data” (snappy title). The BLOB data we’re interested in is held in the “receiver_data” field.

First of all, let’s look at the whole script – it’s only 20 lines of actual code; the rest of the script is comments (the lines starting with a # symbol, which are just annotations that Python ignores).

# Import required modules
import sys
import sqlite3

# Get database from command line
db_path = sys.argv[1]

# Pick tables/fields to work with
TABLE_NAME = "cfurl_cache_receiver_data"
FIELD_NAME = "receiver_data"

# Open database, set up cursor to read database
conn = sqlite3.connect(db_path)
cur = conn.cursor()

# Create query
query = "SELECT {0} FROM {1};".format(FIELD_NAME, TABLE_NAME)

# Execute query
cur.execute(query)

# Go through returned rows
for index, row in enumerate(cur):
    # Get the first (zero'th) column from the returned row
    data = row[0]

    # Check if the data is binary, open the file in binary or text mode
    # accordingly and write the data out.
    if isinstance(data, bytes):
        file_name = "{0}.bin".format(index)
        output = open(file_name, "wb")
        output.write(data)
    else:
        file_name = "{0}.txt".format(index)
        output = open(file_name, "wt", encoding="utf-8")
        output.write(str(data))

    # close the file
    output.close()

OK, take a deep breath and don’t panic – there really isn’t anything particularly complicated in that code! Let’s take it section by section so we can clearly see what is going on. So, to the top of the code and our “imports”:

import sys

import sqlite3

These two lines import modules into our script to gain access to additional functionality. The first module, “sys”, contains a bunch of often fairly low-level system functions, but we’re only going to use one of the most basic things it has to offer: access to arguments from the command line. The second module we import is “sqlite3” which, as you will probably have already worked out, contains all of our SQLite functionality. Both of these modules are part of Python’s standard library and come pre-packaged when you install Python – no additional work required!

db_path = sys.argv[1]

Next, this line gets the path of the database that we’re planning on working with from the command line. So assuming we’re going to point the script at a Safari cache database (cache.db), when we run the script we’ll do it something like this:

C:\blog_post\SqliteBlobDumpForBlog.py cache.db

The “sys.argv” object is a list of the arguments which were passed to Python at the command line. In Python we access items in a list by using the square brackets notation, putting the index of the item we want inside the square brackets. So in this case we’re after the “one’th” item in the list – and I say “one’th” rather than “first” because in Python (and most other programming languages) lists are “zero-indexed”, meaning that the first item is at index 0. In the case of our “sys.argv” list, the first (zero’th) item is the path of the script itself, and at index 1 we have the path of the database.
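If the square-bracket notation is new to you, here is a tiny standalone sketch (not part of the script itself) showing the same zero-indexed access on an ordinary list:

```python
# sys.argv is just a list of strings; indexing works like any other list
args = ["SqliteBlobDumpForBlog.py", "cache.db"]  # what sys.argv might hold

script_path = args[0]  # the zero'th item: the script's own path
db_path = args[1]      # index 1: the first real argument
```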

TABLE_NAME = "cfurl_cache_receiver_data"

FIELD_NAME = "receiver_data"

These two lines of code define the table and field name that we want to work with in the database (these could also be pulled from the command line using “sys.argv”, the same way we did the database path, but I wanted to keep things simple for this example).

conn = sqlite3.connect(db_path)

This line opens a connection to the database. The sqlite3.connect function here is taking a single argument which is the path of the database file. The function returns a “connection object” (we’re calling that object “conn”) which we use to manage our connection to the database including creating a “cursor” object – and that’s exactly what we do on our next line:

cur = conn.cursor()

A cursor object (we’ve called ours “cur”) is used to access data in the database. As we’ll see in a few lines’ time, we’re going to give it an SQL query to execute, and we’ll use the cursor to move through each of the rows returned in our results.

query = "SELECT {0} FROM {1};".format(FIELD_NAME, TABLE_NAME)

Here we are creating our SQL query. The query we’re running is very simple: we just want to get access to a single column for every record in a particular table. In SQL, when we want to get access to data in a table we use a “SELECT” statement. Select statements in their simplest form look like:

SELECT column_name_1, column_name_2, ... FROM table_name;

In our case we only want a single column (and table), which we defined variables for earlier in the script (TABLE_NAME and FIELD_NAME). Rather than hard-code the table and field names (which makes updating the script harder, especially if it gets expanded) we’re using a feature of Python called “string formatting”. We create a string with placeholders (the “{0}” and “{1}”) which are then replaced with the contents of our field and table variables, meaning that when we run the code the contents of our “query” variable looks like:

SELECT receiver_data FROM cfurl_cache_receiver_data;

The nice thing about working this way with format strings is that we only have to update the variables at the top of our script to make it work with other tables and fields and not search around for the point in the script where the query is built.
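You can check the placeholder behaviour in isolation – this snippet just repeats the query-building step with the same names as the script:

```python
TABLE_NAME = "cfurl_cache_receiver_data"
FIELD_NAME = "receiver_data"

# {0} is replaced by the first argument to format(), {1} by the second
query = "SELECT {0} FROM {1};".format(FIELD_NAME, TABLE_NAME)
# query is now: SELECT receiver_data FROM cfurl_cache_receiver_data;
```

One thing worth knowing: sqlite3’s usual “?” parameter substitution only works for values, not for table or column names, which is why string formatting is used to build the query here – just make sure the names you format in come from a trusted source.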

cur.execute(query)

With the query prepared, we can execute it using our cursor object. You can think of this line as putting the cursor at the top of our results, ready to step through them one by one.

for index, row in enumerate(cur):

This line of code is setting up a loop. A loop allows us to execute a piece of code multiple times; in this case the code we’ll be repeating is the code which exports the data to files. The type of loop we’re using is a “for-loop” which in Python means: “for each item in a sequence” and specifically for us “for each row encountered by the cursor” – the file exporting code will execute for each row in our results. We do have a slight addition to this thanks to the use of the “enumerate” function which allows us to keep track of where we are (the index) in the sequence; we’ll use the index as a file name for our exported files. The index will simply start at 0 and increment each time we go around the loop.
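If “enumerate” is unfamiliar, this standalone sketch shows exactly what it produces:

```python
# enumerate() pairs each item in a sequence with a running count, starting at 0
rows = ["first row", "second row", "third row"]
pairs = list(enumerate(rows))
# pairs is now [(0, 'first row'), (1, 'second row'), (2, 'third row')]
```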

    data = row[0]

The “row” that we get from the cursor is like a list (actually a “tuple”, but for this explanation that isn’t important) containing each of the fields we asked for in our SELECT query, in the order we asked for them. Just as with “sys.argv” near the top of our script, we can access the fields in “row” by index, using the square brackets. In this case we only asked for a single field, so we just take the first (zero’th) item.

    if isinstance(data, bytes):
        file_name = "{0}.bin".format(index)
        output = open(file_name, "wb")
        output.write(data)
    else:
        file_name = "{0}.txt".format(index)
        output = open(file_name, "wt", encoding="utf-8")
        output.write(str(data))

This next bit of code actually deals with the business of opening up an output file, and writing out the data. Before we get to that we have an “if” statement. An “if” statement allows the code to take different paths based on a particular condition; we also have an “else” statement to define the code that is executed if the condition is not met.

The condition in our code is:

    isinstance(data, bytes)

This is the only particularly “programmery” bit of code in the whole script. What is happening here is that we are checking whether the data we’ve got from our row is a “bytes” (binary) object, because in Python 3 we need to treat binary data differently from text data (Python 2 does things differently – and frankly, worse; this small complexity solves all sorts of other issues).
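In isolation the check behaves like this (the sample values are made up for illustration):

```python
# BLOB columns come back from sqlite3 as bytes; TEXT columns come back as str
blob_value = b"\x89PNG\r\n"    # the kind of value a BLOB field yields
text_value = "<html></html>"   # the kind of value a TEXT field yields

is_blob = isinstance(blob_value, bytes)  # True
is_text = isinstance(text_value, bytes)  # False
```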

So if we are dealing with bytes (binary) data, the next line we see is:

        file_name = "{0}.bin".format(index)

This line is simply building a file name for the output file, using string formatting as we did before (when we built the SQL query). The placeholder is being replaced with the index, so assuming all of our data is binary the file names will be:

  • 0.bin
  • 1.bin
  • 2.bin
  • 3.bin
  • etc.

        output = open(file_name, "wb")

The open function opens a file for reading, or as in our case: writing. We’re providing two arguments to the function, the first is the file name that we had just built, and the second argument is the string: “wb”; this string is the “mode string”. The mode string is a combination of characters which define how the file should be opened. The “w” lets Python know that we’re after a “write-only” file (if we wanted read-only we’d have used “r”) and the “b” tells Python that we want the file to be in “binary” mode (because we’re going to write binary data to the file).

        output.write(data)

The open function gives us a “file object” (we’ve called it “output” in our script) which we can use in the rest of our code to perform the writing of the data, which is precisely what we’re doing on this line – calling the file object’s “write” function and giving it the data from the database field to write.
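Here is the same open/write/close pattern as a self-contained round trip, using a temporary directory so the sketch doesn’t touch your working folder:

```python
import os
import tempfile

data = b"\x00\x01binary\xff"                         # some stand-in binary data
path = os.path.join(tempfile.mkdtemp(), "demo.bin")  # a throwaway file path

output = open(path, "wb")   # "w" = write-only, "b" = binary mode
output.write(data)
output.close()

read_back = open(path, "rb").read()  # "r" = read-only, still binary mode
```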

If the condition in our “if statement” was not met (if the data wasn’t binary) the code in the else block will be executed instead:

        file_name = "{0}.txt".format(index)

        output = open(file_name, "wt", encoding="utf-8")

        output.write(str(data))

It is very similar to the code relating to binary data, but with a few small changes. Firstly, when generating the file name for the output file, we’re giving it the extension “.txt” instead of “.bin”. Next, when we look at the open function, our “mode string” is a little different – using a “t” rather than a “b”, we’re telling Python that the file should be opened in text mode; we are also specifying a text encoding scheme to be used in the output file. Finally, when writing the data to the file we first ensure that the data is in the form of a string (in case the record contained a number, etc. rather than a string).

So in essence what this section of code is really saying is:

If the data is binary, create a file name with the “.bin” extension, open the output file in binary mode and write out the data,

Else, create a file name with the “.txt” extension, open the output file in text mode (with a Unicode UTF-8 encoding) and write out the textual representation of the data.

    output.close()

After we have output the data into the file we need to “close” the file. This does two things: first, it ensures that Python doesn’t hang onto the file for longer than needed; second – and probably more importantly – it makes sure that if the operating system has buffered any of the data rather than writing it straight to disk, the buffer is flushed and all of the data makes its way into the file.
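As an aside, Python’s “with” statement will close (and flush) the file for you when the indented block is left, even if an error occurs part way through – a tidier pattern worth adopting if you extend the script:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "0.bin")  # throwaway path for the demo

# equivalent to open/write/close, but the close happens automatically
with open(path, "wb") as output:
    output.write(b"example data")

check = open(path, "rb").read()  # the data was flushed when the block ended
```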

And that’s the whole script! It’s useful already in this form, but there are a number of improvements that could be made. The obvious one is to get the table and field names from the command line; you could also use a different field as the file name rather than just an incrementing number (useful if you need to relate the extracted data back to the database). Moving on from that, we could specify a proper file extension rather than using the generic “.bin” and “.txt” – and if you were feeling particularly ambitious you could even add file-signature recognition to pick the appropriate extension!
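As a taste of that last suggestion, signature recognition can start out as nothing more than comparing the first few bytes of the data against known “magic numbers”. The short table below is illustrative only – a real tool would need many more entries:

```python
# A few well-known file signatures (magic numbers) mapped to extensions.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"%PDF-", ".pdf"),
    (b"PK\x03\x04", ".zip"),
]

def guess_extension(data):
    """Return a file extension based on the data's leading bytes."""
    for magic, extension in SIGNATURES:
        if data.startswith(magic):
            return extension
    return ".bin"  # fall back to the generic extension
```

You could then swap the hard-coded “.bin” in the script for guess_extension(data).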

If you have any questions or feedback please email research@cclgroupltd.com. Alternatively, if you are interested in learning more about Python, CCL offers one- and two-day training courses – you can find out more information here.