
Running this Python script downloads all lolcat images from icanhascheezburger.com to the current folder, starting from the oldest image. Images are collected into subfolders lolcats0, lolcats1, etc., each holding 300 images. The script can be stopped and resumed at any time. Before running it, create the files lolconfig.txt and log.txt in the same folder: lolconfig.txt must initially contain a string such as 1496/1496/0 (the format is explained in the resume notes below), and log.txt starts out empty.
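The two bookkeeping files can be created by hand, or with a few lines of Python. This is just a convenience sketch; the filenames and the 1496/1496/0 seed string are taken from the description above.

```python
# Create the two bookkeeping files the downloader expects.
# The seed string is lastKnownMaxPages/lastParsedPage/imgCountSoFar.
with open("lolconfig.txt", "w") as config:
    config.write("1496/1496/0")

with open("log.txt", "w") as log:
    pass  # log.txt must exist but starts out empty
```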

#!/usr/bin/env python
 
# Retrieve lolcat images from icanhascheezburger.com, starting from the oldest image
# Author: Rahul Anand <rahulanand.fine@gmail.com> | Homepage: http://eternalthinker.blogspot.com/

import urllib2, urllib
import re, os

open("log.txt", "w").close() # clear the current log
# config string is in the format: lastKnownMaxPages/lastParsedPage/imgCountSoFar
config = [int(item) for item in open("lolconfig.txt", "r").read().split('/')]
limitPage = config[0]
lastPage = config[1]
imgCount = config[2]

urlContent = urllib2.urlopen('http://icanhascheezburger.com/').read()
limitPageOriginal = int(re.findall(""">Next.*?page/(.*?)/">Last<""", urlContent)[0]) # Get the current max pages
lastPage += limitPageOriginal - limitPage # Alter current page to account for new pages added to the website
limitPage = str(limitPageOriginal) + '/'

# Start each page from the oldest to the latest
while lastPage >= 1:
	logString = limitPage + str(lastPage) + '/' + str(imgCount)

	config = open("lolconfig.txt", "w")
	config.write(logString) # Record the current parsing position
	config.close()

	log = open("log.txt", "a")
	log.write(logString + '\n') # Append to the log
	log.close()
	
	folderName = './lolcats' + str(imgCount/300) + '/' # Make new folder for every 300 images
	if not os.path.exists(folderName): 
		os.mkdir(folderName)
		print "Now downloading to", folderName.rsplit('/', 2)[1]
	
	url = "http://icanhascheezburger.com/page/" + str(lastPage)
	urlContent = urllib2.urlopen(url).read()
	print 'Page:', lastPage
	lastPage -= 1
	
	# Parse and download images from current page
	imgUrls = re.findall("""<div class="md"><p.*?img .*?src=["'](.*?)["']""", urlContent, re.DOTALL) # el regex
	imgUrls = imgUrls[::-1] # The bottom image is the oldest in a page. So reverse the parsed list
	for imgUrl in imgUrls:
		fileName = str(imgCount) + ".jpg"
		imgCount += 1
		try:
			print "   *", fileName
			urllib.urlretrieve(imgUrl, folderName + fileName)
		except IOError:
			print "Error retrieving image", fileName # :'(

When you stop the script, the last image may be only partially downloaded. Moreover, if you simply rerun the script, some intermediate images can be skipped. To avoid losing any images, use log.txt together with lolconfig.txt as follows.

Resuming without losing any images:
lolconfig.txt holds a string in the format maximumPages/lastDownloadedPage/currentImageCount; a starting value looks like 1495/1495/0. Note that the script updates maximumPages automatically when it connects to the site.
log.txt contains the string written before downloading each page.

Step 1: Check the folder to which images are being downloaded.
Step 2: Find the number of the last image that was successfully downloaded.
Step 3: Read the log.txt entries starting from the bottom.
Step 4: Find the first entry whose imageCount (the 3rd number) is less than or equal to the last successful image number.
Step 5: Copy-paste this entry into lolconfig.txt and save.
Step 6: Run the script again!
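Steps 3-5 above amount to a small bottom-up search over log.txt. A minimal sketch (the function name is mine; it takes the log lines and the last good image number from Step 2, and returns the line to paste into lolconfig.txt):

```python
def resume_entry(log_lines, last_good_image):
    """Scan the log entries from the bottom and return the first one whose
    image count (third field of maxPages/lastPage/imgCount) does not exceed
    the number of the last successfully downloaded image."""
    for line in reversed(log_lines):
        line = line.strip()
        if not line:
            continue
        img_count = int(line.split('/')[-1])
        if img_count <= last_good_image:
            return line
    return None
```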


The code above skips the PG-13-rated images. To download them as well, alter the last block as follows:

	# Parse and download images from current page
	pg = False
	imgUrls = re.findall("""<div class="md"><p.*?img .*?src=["'](.*?)["']""", urlContent, re.DOTALL) # el regex
	imgUrls = imgUrls[::-1] # The bottom image is the oldest in a page. So reverse the parsed list
	for imgUrl in imgUrls:
		fileName = str(imgCount) + ".jpg"
		if 'pg-13' not in imgUrl:
			imgCount += 1
			try:
				print "   *", fileName
				urllib.urlretrieve(imgUrl, folderName + fileName)
			except IOError:
				print "Error retrieving image", fileName # :'(
		else:
			pg = True
	if pg:
		# PG-13 entries carry no direct img src; grab the post links instead
		imgUrls = re.findall("""<div class="md"><p><a href=["'](.*?)["']""", urlContent, re.DOTALL)
		for imgUrl in imgUrls:
			fileName = str(imgCount) + "-pg-13.jpg"
			imgCount += 1
			try:
				print "   *", fileName, '--> PG!'
				urllib.urlretrieve(imgUrl, folderName + fileName)
			except IOError:
				print "Error retrieving image", fileName # :'(