Finds and downloads all images from any given URL.
Important note:
If your download location path contains spaces, put quotes around it!
# ImageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 20140223
import sys
import os
import urllib2
from os.path import basename
import urlparse
from BeautifulSoup import BeautifulSoup  # for HTML parsing

urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level):  # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if netloc[-2] + netloc[-1] != website:
        return
    global urlList
    if url in urlList:  # prevent processing the same URL twice
        return
    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return
    soup = BeautifulSoup(urlContent)
    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        # resolve relative image URLs against the page URL
        imgUrl = urlparse.urljoin(url, imgTag['src'])
        # download only the proper image files
        if imgUrl.lower().endswith(('.jpeg', '.jpg', '.gif', '.png', '.bmp')):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                if len(imgData) >= minImageFileSize:
                    print " " + imgUrl
                    fileName = basename(urlparse.urlsplit(imgUrl)[2])
                    output = open(os.path.join(downloadLocationPath, fileName), 'wb')
                    output.write(imgData)
                    output.close()
            except Exception, e:
                print str(e)
    print
    print
    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        for linkTag in linkTags:
            try:
                # resolve relative links as well before recursing
                linkUrl = urlparse.urljoin(url, linkTag['href'])
                downloadImages(linkUrl, level - 1)
            except Exception, e:
                print str(e)

# MAIN
cla = sys.argv  # command line arguments
if len(cla) != 5:
    print "USAGE:"
    print "[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize"
    os._exit(1)
rootUrl = cla[1]
maxRecursionDepth = int(cla[2])
downloadLocationPath = cla[3]  # absolute path
if not os.path.isdir(downloadLocationPath):
    print downloadLocationPath + " is not an existing directory!"
    os._exit(2)
minImageFileSize = long(cla[4])  # in bytes
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, maxRecursionDepth)
And what about <img height="18" width="90" src="/static/activestyle/img/activestate.png" />, for instance?
Or what if I put several spaces between "<img" and "src", like: <img    src="activestate.png" />?
Yes, the img regex is not too solid. It would be better to use one of the many available libraries to parse the page and properly extract <img> elements, but things would get a little more complicated...
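As an illustration of the parser-based approach, here is a minimal sketch that extracts img src attributes with Python's built-in html.parser module (Python 3; in Python 2 the same class lives in the top-level HTMLParser module). The sample tags are the ones quoted in the comments above; a real parser handles attribute order and extra spacing without any regex tuning.

```python
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag, regardless of
    attribute order or spacing inside the tag."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.srcs.append(value)

parser = ImgSrcParser()
parser.feed('<img height="18" width="90" src="/static/activestyle/img/activestate.png" />'
            '<img    src="activestate.png" />')
print(parser.srcs)
# → ['/static/activestyle/img/activestate.png', 'activestate.png']
```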
Thanks for the comments. I made a small fix.
Using regexes for HTML is always a bit tricky (at best). If you can afford the performance hit of parsing the entire document, using beautifulsoup to get all image tags might be a better idea.
Here is the version that uses Beautiful Soup for HTML parsing:
And this is a recursive version which goes N level deep in the target website:
This is also a recursive version, but it uses the Beautiful Soup library to parse the HTML:
This version always stays within the same website, and the user can also set a minimum file size for the image files (if 0, all images are downloaded):
I added this to line 8 for input.
I'm just beginning Python and programming, so I've been trying to get as much experience reading code as possible. Thanks for the downloaders; I've been looking for an image downloader.
The above script misses grabbing images from relative URLs. The change below fixes it!
:)
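The relative-URL fix comes down to resolving each src against the page it appeared on. A minimal sketch using urljoin (urlparse.urljoin in Python 2, urllib.parse.urljoin in Python 3); the page URL here is hypothetical:

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

page_url = 'http://www.example.com/gallery/index.html'  # hypothetical page

# Absolute URLs pass through unchanged:
print(urljoin(page_url, 'http://cdn.example.com/a.jpg'))
# → http://cdn.example.com/a.jpg

# Page-relative URLs resolve against the page's directory:
print(urljoin(page_url, 'thumbs/b.png'))
# → http://www.example.com/gallery/thumbs/b.png

# Root-relative URLs resolve against the site root:
print(urljoin(page_url, '/static/c.gif'))
# → http://www.example.com/static/c.gif
```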
Hello
How do I set the download location? I am using Windows 7 and XBMC, so when I run the .py with XBMC all the images go to C:\Program Files\XBMC. How can I change the download location?
Best Regards
I have updated the script.
Hello FB36, I'm glad for the new update, but now I have a problem with the script. When I run it with XBMC, XBMC turns off, and honestly I have no idea where to write the url and downloadLocationPath. If you have time, please answer me. Best Regards
The updated script gets all parameters from the command-line:
[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize
For example:
(First open a command-line window to where the script is located.)
[python] ImageDownloader.py "http://www.yahoo.com" 2 "c:\new folder" 50000
Also make sure the download location path (directory/folder) exists before doing this.
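If you would rather have the directory created automatically than fail when it is missing, a small sketch using os.makedirs (the folder name here is just an example, placed under the system temp directory for the demo):

```python
import os
import tempfile

# Hypothetical download location; create it, and any missing parent
# folders, before downloading.
download_dir = os.path.join(tempfile.gettempdir(), 'image_downloader_demo')
if not os.path.isdir(download_dir):
    os.makedirs(download_dir)
print(os.path.isdir(download_dir))  # → True
```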
As for XBMC, I have never used it before.
Ok, thanks a lot for your reply, I will try it to see how it works. I'm a total amateur when it comes to Python, but I have to admit that it is a very powerful thing. Best Regards
I use this version; where do I need to type C:\New folder? Please answer me.
# imageDownloader.py
# Finds and downloads all images from any given URL.
# FB - 201009072
import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://www.yahoo.com"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass
I'm sorry if this is a bad question, but how do you modify the code to allow it to download images that are hosted on another website (e.g. tumblr, imgur, etc)?
Does this work for instagram users/hashtags?
If not, can anyone recommend one that does :)