
Linux Bash Script to Preview Websites

Posted January 28, 2010

I was given an interesting task today. My boss wanted me to develop a script which could basically take a screen-shot of a website (as much of the website [vertically] as is reasonable) using only Bash and any tools readily available on Linux.

Before I proceed, you will need the following software installed:

  • vncserver - used for creating an off-screen virtual display. You must have run this at least once normally in order to set a password, etc.
  • imagemagick tools - most notably "import" and "convert" command-line tools. These are used to capture the screen, and trim the borders
  • opera 10 - this is needed for the actual rendering (see below why this was chosen)
  • uuid - this is handy for generating temp file names.

Herewith I present my script:

#!/bin/sh
# Script to capture website
# Usage: site_capture.sh {site_to_capture} {filename}

if [ $# -ne 2 ]; then
    echo "Usage: $0 website filename"
    exit 1
fi

SITE=$1
DESTIN=$2

killall -9 opera 1>/dev/null 2>/dev/null
vncserver -clean -kill :4 1>/dev/null 2>/dev/null

vncserver :4 -geometry 1024x4000 1>/dev/null 2>/dev/null
opera -display :4 -activetab -nosession -notrayicon -disableinputmethods -nomail -fullscreen 1>/dev/null 2>/dev/null &
OPERA_PID=$!
sleep 2
WINDOW_ID=`xwininfo -display :4 -root -children | grep "Speed Dial - Opera" | awk '{print $1}'`
opera -display :4 -remote "openURL($SITE)" 1>/dev/null 2>/dev/null &

sleep 5
FILENAME_UUID=`uuid`
FILENAME="$FILENAME_UUID.jpg"
import -display :4 -window "$WINDOW_ID" "/tmp/$FILENAME"

kill $OPERA_PID 1>/dev/null 2>/dev/null
vncserver -clean -kill :4 1>/dev/null 2>/dev/null

convert "/tmp/$FILENAME" -trim "/var/www/screens/$DESTIN"

Now, this is going to require some explanation.

First of all, I use "1>/dev/null 2>/dev/null" instead of "&>/dev/null" mainly because I've experienced problems with the latter on earlier versions of Ubuntu, and I don't care to experiment with that right now.
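
A likely culprit, though this is an assumption rather than something verified on the original machine, is that /bin/sh on Ubuntu is dash rather than Bash, and dash does not treat "&>" as a redirection. Here is a small sketch of the redirect forms that work in any POSIX shell; "noisy" is a made-up stand-in for a chatty command such as vncserver:

```shell
#!/bin/sh
# noisy is a hypothetical stand-in for a command that writes to both streams.
noisy() {
    echo "to stdout"
    echo "to stderr" 1>&2
}

# The form used in the script: redirect each stream separately.
noisy 1>/dev/null 2>/dev/null

# Equivalent POSIX form: send stdout to /dev/null, then point stderr at stdout.
noisy >/dev/null 2>&1
```

Both calls print nothing, and either form is safe under a "#!/bin/sh" shebang.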

Secondly, we used Opera for a very specific reason - it seems to be just about the only Linux browser which actually offers useful command-line arguments. Firefox "supposedly" has loads of arguments, but in reality they all seem to be ignored. I'm not sure whether the culprit is the "firefox" wrapper script which resides in /usr/bin (and which ultimately executes the firefox binary), but again I had neither the time nor the energy to delve into this.

The first few lines are simple enough, and should make sense to anyone who's ever written a Bash script.

However, from line 13 things become interesting. First of all, we want to sanitize the environment. Since we're using Opera as the rendering engine for the website, we need to make sure we only have one instance running (you will see why later). We do this with a simple "killall". Next we need to make sure that vnc isn't running on our target "display" area - doing an outright "kill" instruction causes havoc later (lock files all over the place), but thankfully there's a clean instruction for this.

Now comes the real bit of trickery - we launch vncserver on a "display" which doesn't physically exist (thereby creating a "virtual display" as big as we care to make it). Most sites today will work at a width of 1024, but since we want to render as much vertically as we reasonably can, we opted for a whopping 4000 in height. Nothing appears on your physical screen at this point.

And now we hit Opera ... and this area caused me the most trouble. Opera will, on occasion, pop up dialog boxes which must be interacted with (not something you can do on a virtual display). Not an issue, you can prevent all of them by using "-nosession" - but this argument has a very annoying side-effect: it causes Opera to completely ignore the URL you specify.

Now the "-remote" argument would seem to be the solution, except when I initially used this, the results were very quirky - typically a selection of half a dozen or so error messages (none of which seemed pleasant). As it turns out, you have to give the first invocation of Opera a few seconds to get up and running before trying to interact with it.
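
If the fixed "sleep 2" bothers you, one alternative is to poll until Opera's window shows up. This is only a sketch: "wait_until" and "opera_window_present" are names I made up, and a window existing is merely a proxy for Opera being ready to accept "-remote" commands, so a short extra sleep may still be needed.

```shell
#!/bin/sh
# wait_until is a hypothetical helper: it runs a command up to N times,
# one second apart, and succeeds as soon as the command succeeds.
# Note: plain sh has no local variables, so "attempts" is global.
wait_until() {
    attempts=$1
    shift
    while :; do
        "$@" >/dev/null 2>&1 && return 0
        attempts=$((attempts - 1))
        [ "$attempts" -le 0 ] && return 1
        sleep 1
    done
}

# opera_window_present is also hypothetical; it reuses the xwininfo/grep
# check from the script to see whether Opera's first window exists yet.
opera_window_present() {
    xwininfo -display :4 -root -children | grep -q "Speed Dial - Opera"
}

# Possible use in place of "sleep 2":
#   wait_until 10 opera_window_present || exit 1
```

This keeps the wait as short as the machine allows while still capping it at roughly ten seconds.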

I've skipped a step here, so I'll back-track - what's that stuff on line 20 ("WINDOW_ID=...")? Well, it relates to line 26. Line 26 executes an ImageMagick tool which performs the actual screen-grab. You can typically just tell "import" to grab the entire desktop. After all, Opera is running in fullscreen, right? Well - not reliably so. I've seen it on occasion where the Ubuntu menu bar and task bar are present. The "import" tool allows you to limit the grab to only a single window, but that requires the X-Windows "window ID" of that particular window. All we have is the PID, which isn't of any help here (and if you're wondering, there doesn't seem to be any way of getting the window IDs associated with a PID - I looked).

So, we make the assumption that our window will always initially have the title "Speed Dial - Opera", so we tell xwininfo to simply give us all the children of the root window, and then we grep for the title, and pipe the result through awk so we can grab the ID number.
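
To make the grep/awk step concrete, here is a sketch that runs the same pipeline over canned text instead of a live display. The listing in "sample_listing" is invented for illustration (the IDs and geometry are made up, and real xwininfo output may differ in detail), and "extract_window_id" is a hypothetical wrapper around the pipeline:

```shell
#!/bin/sh
# sample_listing prints an invented listing in the general shape of
# "xwininfo -root -children" output.
sample_listing() {
    cat <<'EOF'
  Root window id: 0x25d (the root window) (has no name)
  Parent window id: 0x0 (none)
     2 children:
     0x1200003 "Speed Dial - Opera": ("opera" "Opera")  1024x4000+0+0  +0+0
     0x1000001 (has no name): ()  10x10+10+10  +10+10
EOF
}

# extract_window_id keeps only lines containing the given title, then prints
# the first whitespace-separated field (the window ID) of the first match.
extract_window_id() {
    grep -F -- "$1" | awk '{print $1; exit}'
}

sample_listing | extract_window_id "Speed Dial - Opera"   # prints 0x1200003
```

The "exit" in awk guards against more than one window matching the title, which would otherwise leave several IDs in WINDOW_ID.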

Once we've instructed Opera to open our URL, we need to wait a few seconds for it to load the website. This is a little dodgy, really. We actually have no real way of knowing if the page is done loading or not, so we simply guess. We grab the image, kill the instance of Opera (using the PID we stored previously to cleanly kill it), shut down the vncserver, and as a last stage we run the temporary file through a convert filter to strip off any potential "borders" we may have hanging on.
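
One thing the script does not handle is being interrupted halfway, which would leave Opera and the VNC display running. A possible refinement, not part of the original script, is to register the teardown with "trap" near the top. "cleanup" is a name I made up; OPERA_PID is assumed to be set later, as in the script:

```shell
#!/bin/sh
# cleanup is a hypothetical function holding the same teardown steps
# that the script performs at the end.
cleanup() {
    # Only try to kill Opera if we actually recorded its PID.
    [ -n "$OPERA_PID" ] && kill "$OPERA_PID" >/dev/null 2>&1
    vncserver -clean -kill :4 >/dev/null 2>&1
    return 0
}

# Run cleanup whenever the shell exits normally.
trap cleanup EXIT
# Turn common signals into a normal exit so the EXIT trap fires in dash too.
trap 'exit 1' INT TERM HUP
```

With this in place the explicit kill and vncserver lines at the end become a safety net rather than the only teardown path.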

It's a short script, but it took a fair amount of tinkering to put it all together.

I hope someone out there finds this useful.
