ffscrap – Firefox Add-on

ffscrap is a Firefox add-on that makes it easy to extract data from any page in CSV format and then save it or paste it into a spreadsheet.

Many times in the past I needed to get data from an HTML page into a spreadsheet. Sometimes plain cut and paste worked fine, but often the HTML code was more complex and the result was poor. So I decided to write a Firefox extension that helps to extract data (the term ‘scrape’ is also used in this context) into a form that can be easily used in a spreadsheet or in other applications. CSV seemed to be the best choice due to its simplicity and wide support. I used the Mozilla Add-on SDK, which makes it possible to write extensions with a high-level API.

Features of the extension:

  • Define the data on a page using an easy-to-understand JSON script/recipe.
  • Once a scraping script is defined, extract data with a single click.
  • Maintain scraping scripts for individual pages.
  • Share scraping scripts easily within your community (scripts can be downloaded automatically from a given URL).

You can easily test the extension and its scraping function on this page – http://www.cnb.cz/cs/financni_trhy/devizovy_trh/kurzy_devizoveho_trhu/denni_kurz.jsp

Make sure that the Add-on Bar is visible (Ctrl+/) and right-click the ffscrap icon on the bar (in the bottom right of the screen). The “Define Scraping Script” dialogue pops up.

Enter “CNB Exchange Rate” as the Name of the script and paste this into Scraping Specification, replacing the text there:

Then click the “Apply to This Page” button. The “Scraping Result” dialogue will appear, and you can copy the CSV data to the clipboard or save it to a file.

In order to use this add-on you have to:

  1. Install it in the Firefox browser (ver. 17+) – see the Download tab.
  2. Define a scraping script (right-click the ffscrap icon – more details below).
  3. Extract data (left-click the ffscrap icon) and save it or copy it to the clipboard.
  4. Optionally, share your scraping scripts within your community.

Defining Scraping Script

On any page you can define your own scraping script: just right-click the ffscrap icon on the Add-on Bar (to make this toolbar visible use View/ToolBars/Add-on Bar or Ctrl+/). This opens the pop-up dialogue “Define Scraping Script”, where you can define a new script. For the initial definition I recommend moving this dialogue to a new tab by pressing the “To Tab” button, because the dialogue closes when you click outside of it or when the window loses focus (this is a feature of Firefox add-on panels).

In the “Define Scraping Script” dialogue enter:

Name – an arbitrary name for your script. The name must be unique: if you use the name of an existing script, that script will be overwritten by the new definition. The name cannot be changed once the script has been saved for the first time.

URL Pattern – the URL or partial URL to which this script should be applied. Keep enough of the URL to identify the page, but leave out dynamic components that vary between page views, such as query strings. A star (*) can be used as a wildcard that matches any group of characters.

Scraping Specification – the JSON script that defines the data scraping. A skeleton/sample script is shown initially, so you can start with it and modify it.
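The wildcard matching described for URL Pattern can be sketched roughly as follows. This is only an illustration, not the extension's actual code: the function names patternToRegExp and urlMatches are hypothetical, and it assumes that a pattern matches when it occurs anywhere in the page URL.

```javascript
// Hypothetical helpers, not taken from the ffscrap source.

// Turn a URL pattern into a regular expression: every "*" becomes ".*",
// all other characters are matched literally.
function patternToRegExp(pattern) {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(escaped);
}

// Assumption: the pattern may match any part of the URL (partial URL).
function urlMatches(pattern, url) {
  return patternToRegExp(pattern).test(url);
}

const page =
  "http://www.cnb.cz/cs/financni_trhy/devizovy_trh/kurzy_devizoveho_trhu/denni_kurz.jsp";
console.log(urlMatches("cnb.cz/*/denni_kurz.jsp", page));
console.log(urlMatches("cnb.cz/*/denni_kurz.jsp", "http://example.com/"));
```

Escaping the literal parts first keeps characters such as the dot in "cnb.cz" from being treated as regular expression syntax.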

When you are done with the script, click “Apply to Page” to test it (it will also be saved) or “Save Only” to just save it for later.

When a script is defined for a page, the ffscrap icon on the Add-on Bar changes, and you can extract data from the page just by left-clicking this icon.

Scraping Script Details

The script uses JSON syntax, so stick to it exactly. Unfortunately, the parser is not very specific when it reports syntax errors.

The script uses CSS selectors with jQuery extensions; for the available selectors and their usage refer to the jQuery documentation. Selectors are always evaluated in the context of their parent (row in the root context, field in the row context). To identify the right selectors you might need another Firefox add-on that lets you browse the HTML code of the page; I personally prefer Firebug.

The script identifies the root element, which is the element that contains all required data (table, ul, div …), then the repeating row elements (tr, div, li …) and finally the individual field/column elements (td, div, span, input …).

rowsRage – for the selected set of rows you can define a subset by giving a starting index (zero-based), an ending index (can be negative, -2 means two before last) and a step. rowsRage is optional; if it is not used, all rows identified by the selector are used.
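A minimal sketch of how such a start/end/step subset could be computed is shown below. The helper name selectRows is hypothetical, and treating the ending index as exclusive (as Array.prototype.slice does) is an assumption; the extension may count it differently.

```javascript
// Hypothetical helper, not taken from the ffscrap source.
// Assumption: `end` is exclusive and negative indexes count from the end,
// exactly as in Array.prototype.slice.
function selectRows(rows, start = 0, end = rows.length, step = 1) {
  if (!Number.isInteger(step) || step < 1) {
    throw new RangeError("step must be a positive integer");
  }
  // slice resolves negative indexes; filter keeps every step-th row.
  return rows.slice(start, end).filter((_, i) => i % step === 0);
}

const rows = ["header", "r1", "r2", "r3", "r4", "footer1", "footer2"];
console.log(selectRows(rows, 1, -2));    // drops the header and the last two rows
console.log(selectRows(rows, 1, -2, 2)); // additionally keeps only every second row
```

Under these assumptions a range of 1, -2 is a convenient way to skip a header row and two trailing footer rows of a table.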

You can define as many fields as needed within the fields array (they need to be separated by commas). For each field in a row, apart from its selector, you define:

name – mandatory; it is used as the column header in the CSV output.

type – string is the default type; int and float are numeric types (parsed by the JavaScript functions parseInt and parseFloat); boolean is true for ‘true’, ‘yes’, ‘y’ or ‘1’ and false for everything else.

source – determines what is taken from the selected element(s):
text – all text within the selected element (this is the default if source is not specified)
attr:xxx – value of the xxx attribute of the first selected element
input – value of the first input found within the selected element(s)
input_or_text – value of the first input found within the selected element(s); if there is none, all text within the selected element

preprocess – function applied to the text of a field before it is converted to the given type. Currently there are two functions, numberNormalize and numberNormalizeCZ, which remove all non-digit characters (numberNormalizeCZ also replaces , with .).
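The preprocess and type steps described above can be sketched together as follows. This is an illustration under stated assumptions, not the extension's code: the convert helper is hypothetical, the normalizers are assumed to keep decimal points and minus signs (otherwise float values would lose their fraction), and the boolean comparison is assumed to ignore case and surrounding whitespace.

```javascript
// Assumption: digits, decimal points and minus signs are kept; everything
// else (spaces, currency symbols, letters, commas) is dropped.
function numberNormalize(text) {
  return text.replace(/[^0-9.\-]/g, "");
}

// Czech-style numbers use a decimal comma, so it is replaced by a point first.
function numberNormalizeCZ(text) {
  return numberNormalize(text.replace(/,/g, "."));
}

// Hypothetical helper: convert preprocessed text to the field's type.
const TRUE_VALUES = ["true", "yes", "y", "1"];

function convert(text, type = "string") {
  switch (type) {
    case "int":
      return parseInt(text, 10); // radix 10 is an assumption
    case "float":
      return parseFloat(text);
    case "boolean":
      // Assumption: case and surrounding whitespace are ignored.
      return TRUE_VALUES.includes(text.trim().toLowerCase());
    default:
      return text; // "string" is the default type
  }
}

console.log(convert(numberNormalizeCZ("25,340 Kč"), "float"));
console.log(convert(numberNormalize("1,234 USD"), "int"));
console.log(convert("Yes", "boolean"));
```

Running the normalizer before the conversion matters because parseFloat stops at the first character it cannot parse, so "25,340" would otherwise be read as 25.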

Sharing Scraping Scripts

The extension can load scripts from a URL: there is a parameter “URL of remote source of scrapers” in the extension preferences. This URL should refer to a JSON file that contains an array of scraping script definitions (JSON objects). To get the full JSON object of a scraping script, use the “Copy Script” button in the “Define Scraping Script” dialogue. When this parameter is set correctly, the scraping scripts are downloaded and set up, and they are updated on each new start of the browser.

Troubleshooting

In case of problems check the Error Console (Tools/Web Developer/Error Console or Ctrl+Shift+J); the extension logs various messages there.

Install from here.

Source is available on GitHub.

Dual licensed under MPL 2.0 or GPL v3.

My Digital Bits And Pieces