
Finds and downloads all images from any given URL, recursively following links within the same site to a given depth.

Important note:

If your download location path has spaces, put quotes around it on the command line!
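For example, quoting a folder name that contains a space (the URL and path below are placeholders):

python ImageDownloader.py "http://www.example.com" 2 "c:\new folder" 50000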

Python, 84 lines
# ImageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 20140223
import sys
import os
import urllib2
from os.path import basename
import urlparse
from BeautifulSoup import BeautifulSoup # for HTML parsing

urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level): # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if len(netloc) < 2 or netloc[-2] + netloc[-1] != website:
        return

    global urlList
    if url in urlList: # prevent using the same URL again
        return

    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return

    soup = BeautifulSoup(urlContent)
    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag.get('src')
        if not imgUrl: # skip img tags without a src attribute
            continue
        # resolve relative image URLs against the page URL
        imgUrl = urlparse.urljoin(url, imgUrl)
        # download only the proper image files
        if imgUrl.lower().endswith(('.jpeg', '.jpg', '.gif', '.png', '.bmp')):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                if len(imgData) >= minImageFileSize:
                    print "    " + imgUrl
                    fileName = basename(urlparse.urlsplit(imgUrl)[2])
                    output = open(os.path.join(downloadLocationPath, fileName),'wb')
                    output.write(imgData)
                    output.close()
            except Exception, e:
                print str(e)
    print
    print

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        for linkTag in linkTags:
            try:
                # resolve relative links so same-site pages are followed
                linkUrl = urlparse.urljoin(url, linkTag['href'])
                downloadImages(linkUrl, level - 1)
            except Exception, e:
                print str(e)

# MAIN
cla = sys.argv # command line arguments
if len(cla) != 5:
    print "USAGE:"
    print "[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize"
    sys.exit(1)

rootUrl = cla[1]
maxRecursionDepth = int(cla[2])
downloadLocationPath = cla[3] # absolute path
if not os.path.isdir(downloadLocationPath):
    print downloadLocationPath + " is not an existing directory!"
    sys.exit(2)

minImageFileSize = long(cla[4]) # in bytes
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, maxRecursionDepth)

18 comments

nicobo 13 years, 6 months ago

And what about <img height="18" width="90" src="/static/activestyle/img/activestate.png" /> for instance?

Or what if I put several spaces between "<img" and "src", like: <img     src="activestate.png" />?

Sébastien Volle 13 years, 6 months ago

Yes, the img regex is not too solid. It would be better to use one of the many available libraries to parse the page and properly extract <img> elements, but things would get a little more complicated...
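For instance, a minimal sketch of that approach, using the BeautifulSoup package the later versions in this thread adopt (both problem cases from the previous comment parse fine):

from BeautifulSoup import BeautifulSoup

html = '<img height="18" width="90" src="/static/activestyle/img/activestate.png" /><img     src="activestate.png" />'
soup = BeautifulSoup(html)
# attribute order and extra whitespace make no difference to the parser
print [tag['src'] for tag in soup.findAll('img')]
# prints the two src values, regardless of attribute order or spacing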

FB36 (author) 13 years, 6 months ago

Thanks for the comments. I made a small fix.

Alan Plum 13 years, 6 months ago

Using regexes for HTML is always a bit tricky (at best). If you can afford the performance hit of parsing the entire document, using beautifulsoup to get all image tags might be a better idea.

FB36 (author) 13 years, 6 months ago

Here is the version that uses Beautiful Soup for HTML parsing:

# imageDownloader.py
# Finds and downloads all images from any given URL.
# FB - 201009083
import urllib2
from os.path import basename
from urlparse import urlsplit
from BeautifulSoup import BeautifulSoup # for HTML parsing

url = "http://www.yahoo.com"
urlContent = urllib2.urlopen(url).read()
soup = BeautifulSoup(''.join(urlContent))
imgTags = soup.findAll('img') # find all image tags

# download all images
for imgTag in imgTags:
    imgUrl = imgTag['src']
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName,'wb')
        output.write(imgData)
        output.close()
    except:
        pass
FB36 (author) 13 years, 6 months ago

And this is a recursive version which goes N levels deep into the target website:

# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009083
import urllib2
import re
from os.path import basename
from urlparse import urlsplit

global urlList
urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level): # the root URL is level 0
    print url
    global urlList
    if url in urlList: # prevent using the same URL again
        return
    urlList.append(url)
    try:
        urlContent = urllib2.urlopen(url).read()
    except:
        return

    # find and download all images
    imgUrls = re.findall('<img .*?src="(.*?)"', urlContent)
    for imgUrl in imgUrls:
        try:
            imgData = urllib2.urlopen(imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            output = open(fileName,'wb')
            output.write(imgData)
            output.close()
        except:
            pass

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkUrls = re.findall('<a .*?href="(.*?)"', urlContent)
        if len(linkUrls) > 0:
            for linkUrl in linkUrls:
                downloadImages(linkUrl, level - 1)

# main
downloadImages('http://www.yahoo.com', 1)
FB36 (author) 13 years, 6 months ago

This is also a recursive version, but it uses the Beautiful Soup library to parse the HTML:

# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009083
import urllib2
from os.path import basename
from urlparse import urlsplit
from BeautifulSoup import BeautifulSoup # for HTML parsing

global urlList
urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level): # the root URL is level 0
    print url
    global urlList
    if url in urlList: # prevent using the same URL again
        return
    urlList.append(url)
    try:
        urlContent = urllib2.urlopen(url).read()
    except:
        return

    soup = BeautifulSoup(''.join(urlContent))
    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag['src']
        try:
            imgData = urllib2.urlopen(imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            output = open(fileName,'wb')
            output.write(imgData)
            output.close()
        except:
            pass

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        if len(linkTags) > 0:
            for linkTag in linkTags:
                try:
                    linkUrl = linkTag['href']
                    downloadImages(linkUrl, level - 1)
                except:
                    pass

# main
downloadImages('http://www.yahoo.com', 1)
FB36 (author) 13 years, 6 months ago

This version always stays within the same website, and the user can set a minimum file size for the image files (if 0, all images are downloaded):

# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009094
import urllib2
from os.path import basename
import urlparse
from BeautifulSoup import BeautifulSoup # for HTML parsing

global urlList
urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level, minFileSize): # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if netloc[-2] + netloc[-1] != website:
        return

    global urlList
    if url in urlList: # prevent using the same URL again
        return

    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return

    soup = BeautifulSoup(''.join(urlContent))
    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag['src']
        # download only the proper image files
        if imgUrl.lower().endswith('.jpeg') or \
            imgUrl.lower().endswith('.jpg') or \
            imgUrl.lower().endswith('.gif') or \
            imgUrl.lower().endswith('.png') or \
            imgUrl.lower().endswith('.bmp'):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                if len(imgData) >= minFileSize:
                    print "    " + imgUrl
                    fileName = basename(urlparse.urlsplit(imgUrl)[2])
                    output = open(fileName,'wb')
                    output.write(imgData)
                    output.close()
            except:
                pass
    print
    print

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        if len(linkTags) > 0:
            for linkTag in linkTags:
                try:
                    linkUrl = linkTag['href']
                    downloadImages(linkUrl, level - 1, minFileSize)
                except:
                    pass

# main
rootUrl = 'http://www.yahoo.com'
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
global website
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, 1, 50000)
closedthedoor 12 years, 8 months ago

I added this to line 8 for input.

bang = raw_input("http://")     
url = "http://"+bang

I'm just beginning Python and programming, so I've been trying to get as much experience reading code as possible. Thanks for the downloaders; I've been looking for an image downloader.

Deepak 11 years, 6 months ago

The above scripts miss grabbing images from relative URLs. The change below fixes it!

>>> img_url=url[:url.find(".com")+4]+imgUrl if (imgUrl[:4]!="http") else imgUrl
>>> imgData = urllib2.urlopen(img_url).read()
>>> fileName = basename(urlsplit(img_url)[2])

:)
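That string surgery only covers ".com" hosts, though. The standard library's urlparse.urljoin resolves relative URLs against the page URL for any domain; a minimal sketch, with placeholder URLs:

import urlparse

base = "http://www.example.org/gallery/index.html"
print urlparse.urljoin(base, "/img/logo.png")  # http://www.example.org/img/logo.png
print urlparse.urljoin(base, "thumb.jpg")      # http://www.example.org/gallery/thumb.jpg
print urlparse.urljoin(base, "http://cdn.example.net/x.png")  # absolute URLs pass through unchanged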

Goran Nikolic 10 years, 1 month ago

Hello

How do I set the download location? I'm using Windows 7 and XBMC, so when I run the .py through XBMC all images go to C:\Program Files\XBMC. How can I change the download location?

Best Regards

FB36 (author) 10 years, 1 month ago

I have updated the script.

Goran Nikolic 10 years, 1 month ago

Hello FB36, I'm glad for the new update, but now I have a problem with the script. When I run it with XBMC, XBMC turns off, and honestly I have no idea where to write the url and downloadLocationPath. If you have time, please answer me. Best Regards

FB36 (author) 10 years, 1 month ago

The updated script gets all parameters from the command-line:

[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize

For example:

(First open a command-line window to where the script is located.)

[python] ImageDownloader.py "http://www.yahoo.com" 2 "c:\new folder" 50000

Also make sure the download location path (directory/folder) exists before doing this.

As for XBMC, I have never used it before.
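If the folder might not exist yet, a short pre-step can create it first (a hypothetical addition, not part of the posted script):

import os
downloadLocationPath = r"c:\new folder"  # placeholder path
if not os.path.isdir(downloadLocationPath):
    os.makedirs(downloadLocationPath)  # creates intermediate folders too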

Goran Nikolic 10 years, 1 month ago

Ok, thanks a lot for your reply, I will try it and see how it works. I'm a total amateur when it comes to Python, but I have to admit that it is a very powerful thing. Best Regards

Goran Nikolic 10 years, 1 month ago

I use this version. Where do I need to type C:\New folder? Please answer me.

# imageDownloader.py
# Finds and downloads all images from any given URL.
# FB - 201009072
import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://www.yahoo.com"
urlContent = urllib2.urlopen(url).read()

# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('<img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName,'wb')
        output.write(imgData)
        output.close()
    except:
        pass

Dylan Sampson 9 years, 11 months ago

I'm sorry if this is a bad question, but how do you modify the code to allow it to download images that are hosted on another website (e.g. tumblr, imgur, etc)?
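For what it's worth, the image files themselves are already fetched from any host; the netloc check at the top of downloadImages only restricts which pages get crawled. Loosening that guard lets the crawler follow off-site links, at the cost of a potentially unbounded crawl. A minimal sketch (allowedHosts and sameSiteOk are hypothetical names, not in the script):

import urlparse

allowedHosts = ['tumblr', 'imgur']  # hypothetical whitelist of second-level domains

def sameSiteOk(url):
    # hypothetical helper: allow a URL if its second-level domain is whitelisted
    netloc = urlparse.urlsplit(url).netloc.split('.')
    return len(netloc) >= 2 and netloc[-2] in allowedHosts

# inside downloadImages(), the same-site guard would then become:
# if not sameSiteOk(url):
#     return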

juan 9 years, 1 month ago

Does this work for instagram users/hashtags?

If not, can anyone recommend one that does :)