Downloader Python3

Downloader is a small utility and framework intended to help download many files from a site in an unobtrusive manner. It is written in Python 3 for the Linux platform – using it requires some Python programming skills.

The program is built for this use case:
One needs to download many files from a site. The files are referenced on the site in a list or table that is paginated (hundreds of pages). The actual download links might be referenced indirectly from those pages, or might even require some algorithm to construct. The downloading program should behave in an unobtrusive manner – it must not overload the site and should be basically indistinguishable from a regular user browsing the site.

The tool is basically a framework: it consists of a base program, which provides the key actions, and plugins, which are specific to a given site and which parse the site's pages and provide the links to the files to be downloaded. The tool is also designed to run for a long time (weeks, months) and to resume if it has been interrupted in the meantime (individual files are not actually resumed now; rather it resumes from the page where it was parsing links last time).

The program's interface is the command line.

Basic flow of program:

  1. Log into site
  2. Download first page
  3. Parse all relevant links on the page, plus their metadata (e.g. information related to the link – name, author, date etc.)
  4. Process the link, in case any further browsing and parsing is needed to get the final link and its metadata
  5. Enqueue links and metadata for download – downloads then run in separate threads
  6. Parse the page for the link to the next page
  7. Load the next page and continue from step 2.
  8. End if there are no further pages
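As a sketch, the flow above can be condensed into a single loop (the page data, `crawl`, and other names here are invented for illustration – the real program fetches and parses HTML, and logging in is omitted):

```python
# Minimal sketch of the crawl loop: walk paginated pages, collect links,
# enqueue them for download workers, stop when there is no next page.
from queue import Queue

# Fake site: each "page" lists file links and points to the next page.
PAGES = {
    "/page1": {"links": ["/file1", "/file2"], "next": "/page2"},
    "/page2": {"links": ["/file3"], "next": None},
}

def crawl(start_url):
    download_queue = Queue()          # step 5: downloads run in worker threads
    url = start_url
    while url is not None:            # step 8: end when no further pages
        page = PAGES[url]             # steps 2/7: load the page
        for link in page["links"]:    # step 3: parse relevant links + metadata
            download_queue.put(link)
        url = page["next"]            # step 6: find the link to the next page
    return list(download_queue.queue)

print(crawl("/page1"))  # -> ['/file1', '/file2', '/file3']
```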

The program is used from the command line:

Each plugin can have its own options.
Do not forget that the program has to be run with python3. To run it with the provided sample plug-in, use:
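An illustrative invocation might look like the following – the entry-script name and the way the plugin is selected are assumptions here, so check the program's own help output for the actual options (only `-d` for debug logging is documented further below):

```shell
# Hypothetical invocation -- downloader.py and the plugin argument are
# assumptions; consult the script's help for the real interface:
python3 downloader.py --help
python3 downloader.py manybooks -d
```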

To use the downloader you need to write your own plugin for the site you're interested in. Skills needed:

There is a sample plugin for the manybooks site, so you can look at it.

The general approach is:

1) Look at the site of your interest

Look at how the HTML code is structured – where the links are and what information around them you can use as metadata. The Firebug Firefox plugin is an invaluable help when looking into the HTML. Also look at how pagination is done and how to get the link to the next page.

Download a page for offline testing with the plugin (HTML-only file – no need for the related media).

2) Write Plugin

In the plugins directory create a new file – use manybooks.py as a prototype.

Set the correct URLs in BASE_URL and START_URL, and start with the methods next_link and next_page_url – these are the core of the site parsing. Use test cases – see below – to test your code.

If the site requires a login, implement the require_login and login methods.

If you cannot get a clean download link from the initial pages, implement the method postprocess_link, in which you can download and parse any intermediate page(s) and do any further processing to get the final link and metadata. If for any reason a link should not be downloaded, this method can raise the SkipLink exception.
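A hedged sketch of such a hook – the method and exception names come from the text above, but the signature, the stub page table and the helper logic are invented for illustration (the real plugin would do HTTP requests here):

```python
class SkipLink(Exception):
    """Raised by a plugin when a link should not be downloaded."""

# Stub for intermediate pages; a real postprocess_link would fetch and
# parse them over HTTP to find the final download URL.
INTERMEDIATE = {"/detail/1": "/files/book1.epub", "/detail/2": None}

def postprocess_link(url, metadata):
    final = INTERMEDIATE.get(url)   # parse intermediate page for real link
    if final is None:
        raise SkipLink(url)         # e.g. file removed or wrong format
    metadata["final_url"] = final
    return final, metadata
```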

Then implement the function save_file, which saves the final link to disk – use the client.save_file method to load and save the data. The save_file function has access to a special parameter context, which is the worker object running the function. If for any reason the function needs to wait for a while (download limits etc.), it should use context.sleeper.sleep(x) to sleep for x seconds.

Finally, you can fine-tune the downloader parameters on lines 15-30, which govern the HTTP request behaviour – the delay between requests and the concurrency.

If you want to control the plugin's behaviour from the command line, you can define its OPTIONS on line 42 and then use the options global variable in your code.

Here is the complete sample plugin with comments:
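A minimal skeleton in the spirit of that sample – only the names BASE_URL, START_URL, next_link and next_page_url come from the text above; the URLs, markup and regular expressions are invented for illustration:

```python
import re

BASE_URL = "http://example.com"            # site root (placeholder)
START_URL = BASE_URL + "/titles?page=1"    # first listing page (placeholder)

# Invented markup patterns; a real plugin parses the actual site's HTML.
LINK_RE = re.compile(r'<a class="title" href="([^"]+)">([^<]+)</a>')
NEXT_RE = re.compile(r'<a rel="next" href="([^"]+)">')

class MySpider:
    def next_link(self, page):
        """Yield (url, metadata) for every relevant link on the page."""
        for href, title in LINK_RE.findall(page):
            yield BASE_URL + href, {"name": title}

    def next_page_url(self, page):
        """Return the absolute URL of the next page, or None on the last one."""
        m = NEXT_RE.search(page)
        return BASE_URL + m.group(1) if m else None

# Quick smoke test on a made-up page:
PAGE = '<a class="title" href="/b/1">One</a> <a rel="next" href="/titles?page=2">'
spider = MySpider()
print(list(spider.next_link(PAGE)))  # [('http://example.com/b/1', {'name': 'One'})]
print(spider.next_page_url(PAGE))    # http://example.com/titles?page=2
```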

3) Test the plugin

You will surely need a few iterations to get it working correctly. There is some support within the code for building test cases.

You can import httputils.TestSpider and use it as a mix-in class – this way you should be able to test your MySpider class on locally stored files.

See the test code for the sample plugin below:
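A hedged offline-test sketch in that spirit – TestSpider's real interface may differ, so FakeTestSpider, load_page and the canned page content below are illustrative stand-ins that make the example run on its own:

```python
import re
import unittest

class FakeTestSpider:
    """Stand-in for httputils.TestSpider: serves canned page content."""
    PAGES = {"page1.html": '<a class="title" href="/b/1">One</a>'}
    def load_page(self, name):
        return self.PAGES[name]   # the real mix-in would read a local file

class MySpider:
    """The parsing logic under test (same shape as a plugin spider)."""
    def next_link(self, page):
        for href, title in re.findall(r'href="([^"]+)">([^<]+)</a>', page):
            yield href, {"name": title}

class OfflineSpider(FakeTestSpider, MySpider):
    """Mix-in first, so page loading is local while parsing is untouched."""

class TestMySpider(unittest.TestCase):
    def test_next_link(self):
        spider = OfflineSpider()
        links = list(spider.next_link(spider.load_page("page1.html")))
        self.assertEqual(links, [("/b/1", {"name": "One"})])

if __name__ == "__main__":
    unittest.main(exit=False, argv=["offline-test"])
```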

After offline tests you can test your plugin from within the downloader. The -d option will print logging messages, so be sure to use the logging module to log all the important things going on in your plugin.

All code can be downloaded and forked from GitHub:

https://github.com/izderadicka/downloader-py3

All code is provided under GPL v3 license.


My Digital Bits And Pieces