Finds and downloads all images from any given URL.
If your download location path has spaces then put quotes around it!
Python, 88 lines
```python
# ImageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 20140223
import sys
import os
import urllib2
from os.path import basename
import urlparse
from BeautifulSoup import BeautifulSoup  # for HTML parsing

urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level):  # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if netloc[-2] + netloc[-1] != website:
        return

    global urlList
    if url in urlList:  # prevent using the same URL again
        return

    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return

    soup = BeautifulSoup(''.join(urlContent))

    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag['src']
        imgUrl = url[ : url.find(".com") + 4] + imgUrl if (imgUrl[ : 4] != "http") else imgUrl
        # download only the proper image files
        if imgUrl.lower().endswith('.jpeg') or \
           imgUrl.lower().endswith('.jpg') or \
           imgUrl.lower().endswith('.gif') or \
           imgUrl.lower().endswith('.png') or \
           imgUrl.lower().endswith('.bmp'):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                global minImageFileSize
                if len(imgData) >= minImageFileSize:
                    print " " + imgUrl
                    fileName = basename(urlparse.urlsplit(imgUrl)[2])
                    output = open(os.path.join(downloadLocationPath, fileName), 'wb')
                    output.write(imgData)
                    output.close()
            except Exception, e:
                print str(e)

    print
    print

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        if len(linkTags) > 0:
            for linkTag in linkTags:
                try:
                    linkUrl = linkTag['href']
                    downloadImages(linkUrl, level - 1)
                except Exception, e:
                    print str(e)

# MAIN
cla = sys.argv  # command line arguments
if len(cla) != 5:
    print "USAGE:"
    print "[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize"
    os._exit(1)

rootUrl = cla[1]
maxRecursionDepth = int(cla[2])
downloadLocationPath = cla[3]  # absolute path
if not os.path.isdir(downloadLocationPath):
    print downloadLocationPath + " is not an existing directory!"
    os._exit(2)
minImageFileSize = long(cla[4])  # in bytes

netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, maxRecursionDepth)
```
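The script stays on the starting website by comparing the last two labels of each URL's hostname against the root URL's. A minimal Python 3 sketch of the same check (using `urllib.parse`, the modern counterpart of `urlparse`; the URLs are hypothetical examples):

```python
# Sketch (Python 3): keep the crawl on the same website by comparing
# the last two labels of each URL's hostname, as the script above does.
from urllib.parse import urlsplit

def same_site(url, root_url):
    # e.g. "images.example.com" -> ["images", "example", "com"]
    netloc = urlsplit(url).netloc.split('.')
    root = urlsplit(root_url).netloc.split('.')
    return netloc[-2:] == root[-2:]

print(same_site("http://www.example.com/a.png", "http://example.com"))  # True
print(same_site("http://cdn.other.net/b.png", "http://example.com"))    # False
```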
And what about <img height="18" width="90" src="/static/activestyle/img/activestate.png" />, for instance?
Or what if I put several spaces between "<img" and "src", like: <img    src="activestate.png" />?
Yes, the img regex is not very solid. It would be better to use one of the many available libraries to parse the page and properly extract the <img> elements, but things would get a little more complicated...
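For illustration, a short Python 3 sketch of the parser-based approach using only the standard library's `html.parser`. Unlike the regex, it tolerates extra attributes before `src` and any amount of whitespace, which covers both cases raised above (the class name and sample HTML are made up for the example):

```python
# Sketch (Python 3): collect <img src=...> values with the stdlib HTML
# parser instead of a regex; attribute order and whitespace don't matter.
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already normalized
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.srcs.append(value)

html = '<img height="18" width="90" src="/static/a.png" /><img    src="b.png">'
parser = ImgSrcCollector()
parser.feed(html)
print(parser.srcs)  # ['/static/a.png', 'b.png']
```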
Thanks for the comments. I made a small fix.
Using regexes for HTML is always a bit tricky (at best). If you can afford the performance hit of parsing the entire document, using BeautifulSoup to get all image tags might be a better idea.
Here is the version that uses Beautiful Soup for HTML parsing:
And this is a recursive version which goes N level deep in the target website:
This is also recursive version but using Beautiful Soup library to parse HTML:
This version always stays within the same website and also user can decide min file size for the image files (if 0 then all images would be downloaded):
I added this to line 8 for input.
I'm just beginning python and programming so been trying to get as much experience reading code as possible. Thanks for the downloaders I've been looking for a image downloader.
The above script misses grabbing images from relative URLs. The change below fixes it!
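The general way to handle this is to resolve each `src` against the page URL. A minimal Python 3 sketch using the standard library's `urljoin`, which handles root-relative, page-relative, and already-absolute values uniformly (the page URL and paths here are hypothetical):

```python
# Sketch (Python 3): resolve relative image URLs against the page URL
# with urljoin, rather than slicing the URL string on ".com".
from urllib.parse import urljoin

page_url = "http://example.com/gallery/index.html"  # hypothetical page
for src in ["/static/logo.png", "thumb.jpg", "http://cdn.example.com/x.gif"]:
    print(urljoin(page_url, src))
```

`urljoin` leaves absolute URLs untouched, so the same call works for every `src` value without a special case.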
How to set download location? I using windows 7 and xbmc, so when I use .py with xbmc all image go to C:\Program Files\XBMC How can I change download location?
I have updated the script.
Hello FB36, I'm glad for the new update, but now I have a problem with the script. When I run it with XBMC, XBMC turns off, and honestly I have no idea where I write the url and downloadLocationPath. If you have time, please answer me. Best Regards
The updated script gets all parameters from the command-line:
[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize
(First open a command-line window to where the script is located.)
[python] ImageDownloader.py "http://www.yahoo.com" 2 "c:\new folder" 50000
Also make sure the download location path (directory/folder) exists before doing this.
As for XBMC, I have never used it before.
Ok, thanks a lot for your reply, I will try to see how it works. I'm a total amateur when it comes to Python, but I have to admit that it is a very powerful thing. Best Regards
I use this version. Where do I need to type C:\New folder? Please answer me.
```python
# Finds and downloads all images from any given URL.
# FB - 201009072
import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://www.yahoo.com"
urlContent = urllib2.urlopen(url).read()

# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass
```
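For reference, the non-greedy regex from this version can be tried on a small snippet in Python 3 (the HTML here is made up). It works for simple pages, but as the comments above note, it is fragile when attributes or quoting vary:

```python
# Sketch (Python 3): the non-greedy img/src regex applied to sample HTML.
import re

html = '<img src="a.jpg" alt="x"/> <p>text</p> <img  src="b.png">'
img_urls = re.findall(r'img .*?src="(.*?)"', html)
print(img_urls)  # ['a.jpg', 'b.png']
```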
I'm sorry if this is a bad question, but how do you modify the code to allow it to download images that are hosted on another website (e.g. tumblr, imgur, etc)?
Does this work for instagram users/hashtags?
If not, can anyone recommend one that does :)