Downloader Python3

Downloader is a small utility and framework intended to help download many files from a given site in an unobtrusive manner. It is written in Python 3 for the Linux platform and requires some Python programming skills to use.

The program is built for this use case:
One needs to download many files from a site. The files are referenced on the site in a list or table that is paginated (hundreds of pages). The actual download links might be referenced indirectly from those pages, or might even require some algorithm to construct them. The downloading program should behave in an unobtrusive manner – it should not overload the site and should be basically indistinguishable from a regular user browsing the site.

The tool is basically a framework consisting of a base program, which provides the key actions, and plugins, which are specific to a given site and which parse the site's pages and provide links for the files to be downloaded. The tool is also designed to run for a long time (weeks, months) and to resume if it has been interrupted in the meantime (it does not currently resume individual files, but rather resumes from the page it was parsing for links last time).

The program's interface is the command line.

The basic flow of the program (sketched in code after the list) is:

  1. Log into the site
  2. Download the first page
  3. Parse all relevant links on the page, plus their metadata (e.g. information related to the link – name, author, date etc.)
  4. Process the link – do any further browsing and parsing needed to get the final link and its metadata
  5. Enqueue the links and metadata for download – downloads then run in separate threads
  6. Parse the page for a link to the next page
  7. Load the next page and continue from step 3
  8. End when there are no further pages
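
A rough sketch of this loop in Python (illustrative names only – the real program organizes this into a plugin/spider class and worker threads, as described below):

# Sketch of the flow above; spider stands for the site-specific plugin object,
# queue for the download queue consumed by the download worker threads.
def crawl(spider, queue):
    spider.login()                                    # 1. log into the site
    url = spider.start_url                            # 2. start with the first page
    while url:                                        # 8. end when there is no next page
        page = spider.client.load_page(url)
        for link, metadata in spider.next_link(page):                 # 3. links + metadata
            link, metadata = spider.postprocess_link(link, metadata)  # 4. final link
            queue.put((link, metadata))               # 5. downloads run in separate threads
        url = spider.next_page_url(page)              # 6. and 7. move to the next page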

The program is used from the command line:

Usage: downloader.py plugin options directory_to_store

Options:
  -h, --help            show this help message and exit
  --proxy=PROXY         HTTP proxy to use, otherwise system wide setting will
                        be used
  --no-proxy            Do not use proxy, even if set in system
  -r, --resume          Resumes from last run - starts on last unfinished
                        page and restores queue of unfinished downloads
  -s STOP_AFTER, --stop-after=STOP_AFTER
                        Stop approximately after x minutes
  -d, --debug           Debug logging
  -l LOG, --log=LOG     Log file

Each plugin can have its own options.
Do not forget that the program has to be run with Python 3. To run it with the provided sample plugin, use:

python3 downloader.py manybooks -d -r ~/tmp/manybooks
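
Options can be combined; for example, to resume a previous run, stop after roughly two hours and write a log file (the time and paths here are only illustrative):

python3 downloader.py manybooks -r -s 120 -l ~/tmp/downloader.log ~/tmp/manybooks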

To use the downloader you need to write your own plugin for the site you are interested in. You will need some Python programming skills and a basic understanding of HTML.

There is a sample plugin for the site manybooks, so you can use it as a reference.

The general approach is:

1) Look at the site of your interest

Look at how the HTML code is structured – where the links are and what information around them you can use as metadata. The Firebug Firefox plugin is an invaluable help when inspecting HTML. Look at how pagination is done and how to get a link to the next page.

Download a page for offline testing with the plugin (an HTML-only file – there is no need for related media).
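
If you prefer a script to the browser's "Save Page As" dialog, a simple fetch like the one below is enough to grab the raw HTML for offline tests (the URL and output path are just examples, matching the test data layout used further down):

# One-off helper to save a listing page for offline plugin testing.
# The URL and target path are examples only.
import urllib.request

url = 'http://manybooks.net/language.php?code=en&s=1'
with urllib.request.urlopen(url) as response:
    html = response.read()
with open('test_data/manybooks.html', 'wb') as f:
    f.write(html)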

2) Write Plugin

In the plugins directory create a new file – use manybooks.py as a prototype.

Set the correct URLs in BASE_URL and START_URL and start with the methods next_link and next_page_url – these are the core of site parsing. Use test cases – see below – to test your code.

If the site requires a login, implement the require_login and login methods.

If you cannot get a clean download link from the initial pages, implement the method postprocess_link, where you can download and parse any intermediate page(s) and do any further processing to get the final link and metadata. If for any reason a link should not be downloaded, this method can raise the SkipLink exception.

Then implement the function save_file, which will save the final link to disk – use the client.save_file method to load and save the data. The save_file function has access to a special parameter context, which is the worker object running this function. If for any reason the function needs to wait for a while (download limits etc.), it should use context.sleeper.sleep(x) to sleep for x seconds.

Finally, you can fine-tune the downloader parameters on lines 15-30, which govern the behaviour of HTTP requests – the delay between requests and the concurrency.

If you want to control plugin behaviour from the command line, you can define its OPTIONS on line 42 and then use the options global variable in your code.

Here is the complete sample plugin with comments:

import logging
import re
import os.path
from httputils import LinksSpider
from optparse import Option


# Some basic constants for particular site
# client settings for links parsing
REPEATS=5 # how many times to retry after HTTP or connection error
MEAN_WAIT=6 # average time to wait before next request - in seconds
MAX_WAIT=18 # maximum time to wait - if the random wait is greater it is reduced to this maximum

# clients settings for files downloading
# same meaning as above, but for files downloading
REPEATS2=5
MEAN_WAIT2=1
MAX_WAIT2=2

# maximum number of file links in the queue; when it is reached, parsing of new links
# blocks until a link is downloaded and a new link can be pushed to the queue again
MAX_QUEUE_SIZE=50 #0 means unlimited
DOWN_THREADS=2 # number of threads for download tasks

#Base URL to use for relative links processing
BASE_URL = 'http://manybooks.net'
# URL to start with
START_URL='http://manybooks.net/language.php?code=en&s=1'

# Set by program - a directory where downloaded files should be stored
BASE_DIR=''

#command line options for this plugin
# available from the options variable as options.name
OPTIONS=[Option('--store_path', type='string', help='Template for path where file is stored. '+
                'following keys available {title} {author} {id}'),
        ]
options=None # will be filled with parsed options 

#Create MySpider class - to get all links from one page and also the link to the next page
class MySpider(LinksSpider):
    
    # Ensure that you are logged into the site and any cookies are set
    # self.client is the HTTPClient available to load any page needed
    def login(self):
        pg=self.client.load_page(BASE_URL)
        
    # Do we require a login?
    # page - is the parsed page as a BeautifulSoup object
    # return True if yes
    def require_login(self, page):
        return page.find('span', 'googlenav') is not None
    
    # function to parse available links on a page
    # page - is the parsed page as a BeautifulSoup object
    # must be a generator or return an iterator yielding (link, metadata) tuples
    def next_link(self, page):
        table=page.find('div', 'table')
        if table:
            cells=table.find_all('div', 'grid_5')
            for cell in cells:
                try:
                    metadata={}
                    metadata['author']=str(list(cell.strings)[-1]).strip()
                    link=cell.find('a')
                    if link: 
                        metadata['title']=str(link.string).strip()
                        link_url=link['href']
                        m=re.match(r'/titles/(.*).html', link_url)
                        if m:
                            metadata['id']=m.group(1)
                        if not link_url:
                            continue
                    else:
                        continue
                    sub=cell.find('em')
                    if sub:
                        metadata['subtitle']=str(sub.string).strip()
                    # Is generator !
                    yield BASE_URL+link_url, metadata
                except:
                    logging.exception('error parsing book data')
                    continue
                    
       
    # Parse page to find the URL of the next page
    # page - is the parsed page as a BeautifulSoup object
    # returns the url or None if there is no further page
    def next_page_url(self,page):
        try:
            pager=page.find('span', 'googlenav')
            current_page=pager.find('strong', recursive=False)
            next_url=current_page.find_next_sibling('a')
            
            if next_url and next_url.get('href'): 
                next_url=str(next_url['href'])
                # for testing can stop on page 5
                #if next_url and next_url.find('s=6')>=0: return
                logging.debug('Has next page %s' % next_url)
                return BASE_URL+next_url
        except:
            logging.exception('Error while parsing page for next page')
    
    # Postprocess links found on a page and get the final link and metadata
    # can fetch another page via self.client and do some more parsing
    # or anything else needed
    # link, metadata is the tuple returned by next_link
    # must return a (link, metadata) tuple
    # if for any reason the link should not be downloaded the method should raise SkipLink
    def postprocess_link(self, link, metadata):
        if not metadata.get( 'author'):
            metadata['author']='Unknown Author'
        if not metadata.get( 'title'):
            metadata['title']='Unknown Title'
        return link, metadata

       
# Build the target file path from the metadata - uses the --store_path template
# from the command line if given, otherwise a default one
def _meta_to_filename(base_dir, metadata, ext):
    templ="{author}/{title}/{author} - {title}[{id}]"
    if options.store_path:
        templ=options.store_path
    while templ.startswith(os.sep):
        templ=templ[1:]
    p=templ.format(**metadata) +ext
    return os.path.join(base_dir,p)

# function to save a file from a found link
# client is an HTTPClient object to load and save the file
# link is the link to the file
# metadata is a dictionary of metadata as returned by MySpider.next_link
# base_dir is the directory to save the file to
# context - gives access to the worker running this function
#           context.sleeper.sleep(x) - can be used to sleep for x seconds
def save_file(client,link, metadata, base_dir, context=None):
    filename=_meta_to_filename(base_dir, metadata, '.epub')
    data={'book':'1:epub:.epub:epub', 'tid': metadata['id']}
    client.save_file(BASE_URL+'/_scripts/send.php', filename, data,refer_url=link)
    meta=repr(metadata)
    with open(filename+'.meta', 'w') as f: 
        f.write(meta)  
    logging.debug('Saved file %s'% filename)

3) Test plugin

You will surely need a few iterations to get it working correctly. There is some support within the code for building test cases.

You can import httputils.TestSpider and use it as a mix-in class – this way you can test your MySpider class on files stored locally.

See the test code for the sample plugin below:

import unittest
from plugins.manybooks import MySpider
import httputils

class TestSpider( httputils.TestSpider, MySpider):
    pass

class Test(unittest.TestCase):

    def setUp(self):
        pass

    def tearDown(self):
        pass

    def testLinks(self):
        spider=TestSpider('../../test_data/manybooks.html')
        links=list(spider)
        for l,m in links: print(l,m)
        self.assertEqual(len(links), 20)

    def testNextPageURL(self):
        spider=TestSpider('../../test_data/manybooks.html')
        url=spider.next_page_url(spider.page)

        self.assertEqual(url, 'http://manybooks.net/language.php?code=en&s=2')

        spider=TestSpider('../../test_data/manybooks5.html')
        url=spider.next_page_url(spider.page)

        self.assertTrue(url)
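
The tests can be run with the standard unittest runner; assuming the test module above is saved as test_manybooks.py (the module name and the relative test_data paths are just examples that depend on your layout), something like:

python3 -m unittest test_manybooks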

After offline tests you can test your plugin from within the downloader. The -d option will print logging messages, so be sure to use the logging module to log all important things going on in your plugin.

All code can be downloaded and forked from GitHub:

https://github.com/izderadicka/downloader-py3

All code is provided under the GPL v3 license.
