written on 15/01/2015
For so long I've wanted to download all of the PHDComics web comics. For those of you who don't know, PHDComics is a series of web comics by Jorge Cham that depicts the common idiosyncrasies of graduate school and a grad student's life. They are pretty hilarious. :D
Anyway, I finally sat down to write a simple script that could download all of the comics for me. Sounds simple, right? A simple loop, curl/wget, and presto, all images downloaded. Nope! It took me a good 1 hour and 15 minutes to figure stuff out and write it all down.
for i in {1..1776}
do
    # fetch the raw HTML source for comic number $i
    w3m -dump_source "http://phdcomics.com/comics/archive.php?comicid=$i" > test.txt
    # grab the line containing the comic's <img> tag
    file_line=$(grep "<td bgcolor=#FFFFFF align=center><img id=comic name=comic src=http://www.phdcomics.com/comics/archive/" test.txt)
    # strip everything before the image URL, then everything after it
    x="${file_line#*src=}"
    x="${x%% *}"
    # hand the comic number and image URL to the downloader
    python test.py "$i" "$x"
    echo -e "Received Image: $i"
    echo -e " "
done
Here, the -dump_source flag very conveniently dumps the whole static HTML source into a text file (test.txt). Now, PHDComics pages have a very simple structure, and the image is always linked after the same tag, i.e.:
"<td bgcolor=#FFFFFF align=center><img id=comic name=comic src=http://www.phdcomics.com/comics/archive/"
followed by the image name. This makes it very easy to grep for the line containing the image link. After that, all it takes is Bash's parameter expansion (no real regex needed), which is awesome:
x="${file_line#*src=}"
This removes everything up to and including "src=" from the grepped line.
x="${x%% *}"
This removes everything after the URL (i.e. everything from the first space onward). After these two steps, we are left with just the URL of the image file in the variable x.
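To see what the two expansions actually do, here's a tiny standalone demo you can paste into a shell (the image filename and trailing attribute below are made up for illustration):

# a hypothetical grepped line (the filename here is invented):
file_line='<td bgcolor=#FFFFFF align=center><img id=comic name=comic src=http://www.phdcomics.com/comics/archive/phd_example.gif align=middle>'
# drop everything up to and including "src=":
x="${file_line#*src=}"
# x is now: http://www.phdcomics.com/comics/archive/phd_example.gif align=middle>
# drop everything from the first space onward:
x="${x%% *}"
echo "$x"   # prints: http://www.phdcomics.com/comics/archive/phd_example.gif

This URL, along with the counter i, is passed to the Python program test.py: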
import urllib
import sys

count = sys.argv[1]  # the comic number, used as the output file name
url = sys.argv[2]    # the direct URL of the comic image
# download the image and save it as <count>.gif
urllib.urlretrieve(url, "%s.gif" % count)
This saves all of the images in the current directory, named sequentially by comic number.
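(Side note: urllib.urlretrieve is the Python 2 spelling; if you're on Python 3, the same function lives at urllib.request.urlretrieve.)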
An hour to write the code, 35 minutes to download all the files (1776 images), and now I'm off to convert the images to send them to my Kindle. Whew!!
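(For the curious: something like the one-liner below could do that conversion, assuming ImageMagick is installed; the output filename is just a placeholder.)

# stitch the downloaded GIFs into a single PDF, in numeric order
convert $(ls -v *.gif) phdcomics.pdf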