
Courtesy of Yahoo Finance, it is possible to bulk-download historical price data. This script, borrowed from the pycurl retriever-multi.py example, fetches series for several tickers at a time. It uses urllib to fetch web data, so it should work with a plain vanilla Python distribution.
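
For reference, here is the kind of URL the script builds, for the S&P 500 from 1950-01-03 up to (say) 2007-03-31: a, b, c carry the start month (0-based), day, and year; d, e, f the end month (0-based), day, and year; and g=d selects daily quotes. The ichart.yahoo.com endpoint has since been retired by Yahoo, so treat this as documentation of the historical API rather than a working link:

http://ichart.yahoo.com/table.csv?s=%5EGSPC&a=0&b=3&c=1950&d=2&e=31&f=2007&g=d&ignore=.csv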

Python, 157 lines
#! /usr/bin/env python
# -*- coding: utf-8 -*-


__author__ = 'gian paolo ciceri <gp.ciceri@gmail.com>'
__version__ = '0.1'
__date__ = '20070401'
__credits__ = "queue and MT code was shamelessly stolen from pycurl example retriever-multi.py"

#
# Usage: python grabYahooDataMt.py -h 
#
#
# to select tickers and their starting dates, it uses an input file of this format:
# <ticker> <fromdate as YYYYMMDD>
# like
# ^GSPC 19500103 # S&P 500
# ^N225 19840104 # Nikkei 225

import sys, threading, Queue, datetime
import urllib
from optparse import OptionParser


# this thread asks the queue for a job and does it
class WorkerThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while 1:
            try:
                # fetch a job from the queue
                ticker, fromdate, todate = self.queue.get_nowait()
            except Queue.Empty:
                raise SystemExit
            if ticker[0] == "^": 
                tick = ticker[1:]
            else:
                tick = ticker
            filename = downloadTo + "%s_%s.csv" % (tick, todate)
            fp = open(filename, "wb")
            if options.verbose:
                print "last date asked:", todate, todate[0:4], todate[4:6], todate[6:8] 
                print "first date asked:", fromdate, fromdate[0:4], fromdate[4:6], fromdate[6:8]
            quote = dict()
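            # ichart query parameters: s = ticker symbol,
            # d/e/f = end month (0-based), day, year of the range,
            # a/b/c = start month (0-based), day, year,
            # g = "d" for daily quotes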
            quote['s'] = ticker
            quote['d'] = str(int(todate[4:6]) - 1)
            quote['e'] = str(int(todate[6:8]))
            quote['f'] = str(int(todate[0:4]))
            quote['g'] = "d" 
            quote['a'] = str(int(fromdate[4:6]) - 1)
            quote['b'] = str(int(fromdate[6:8]))
            quote['c'] = str(int(fromdate[0:4]))
            #print quote
            params = urllib.urlencode(quote)
            params += "&ignore=.csv"

            url = "http://ichart.yahoo.com/table.csv?%s" % params
            if options.verbose:
                print "fetching:", url           
            try:
                f = urllib.urlopen(url)
                fp.write(f.read())
            except:
                import traceback
                traceback.print_exc(file=sys.stderr)
                sys.stderr.flush()
            fp.close()
            if options.verbose:
                print url, "...fetched"
            else:
                sys.stdout.write(".")
                sys.stdout.flush()



if __name__ == '__main__':

    # today is
    today = datetime.datetime.now().strftime("%Y%m%d")
    
    # parse arguments
    parser = OptionParser()
    parser.add_option("-f", "--file", dest="tickerfile", action="store", default = "./tickers.txt",
    				  help="read ticker list from file, it uses ./tickers.txt as default")
    parser.add_option("-c", "--concurrent", type="int", dest="connections", default = 10, action="store",
    				  help="# of concurrent connections")
    parser.add_option("-d", "--dir", dest="downloadTo", action="store", default = "./rawdata/",
    				  help="save date to this directory, it uses ./rawdata/ as default")
    				  
    parser.add_option("-t", "--todate", dest="todate", default = today, action="store",
    				  help="most recent date needed")
    parser.add_option("-v", "--verbose",
    					  action="store_true", dest="verbose")
    parser.add_option("-q", "--quiet",
    					  action="store_false", dest="verbose")					 
    (options, args) = parser.parse_args()

    
    tickerfile = options.tickerfile
    downloadTo = options.downloadTo
    connections = options.connections
    today = options.todate
    

    # get input list
    try:
        tickers = open(tickerfile).readlines()
    except IOError:
        parser.error("ticker file %s not found" % (tickerfile,))
    
    
    # build a queue with (ticker, fromdate, todate) tuples
    queue = Queue.Queue()
    for tickerRow in tickers:
        #print tickerRow
        tickerRow = tickerRow.strip()
        # skip blank lines and comment lines
        if not tickerRow or tickerRow[0] == "#":
            continue
        # each remaining line must contain at least <ticker> <fromdate>
        tickerSplit = tickerRow.split()
        # ticker, fromdate, todate
        queue.put((tickerSplit[0], tickerSplit[1], today))

    

    
    # check args
    assert queue.queue, "no tickers given"
    numTickers = len(queue.queue)
    connections = min(connections, numTickers)
    assert 1 <= connections <= 255, "too many concurrent connections requested"

    
    if options.verbose:
        print "----- Getting", numTickers, "Tickers using", connections, "simultaneous connections -----"
    
    
    # start a bunch of threads, passing them the queue of jobs to do
    threads = []
    for dummy in range(connections):
        t = WorkerThread(queue)
        t.start()
        threads.append(t)
    
    
    # wait for all threads to finish
    for thread in threads:
        thread.join()
    sys.stdout.write("\n")
    sys.stdout.flush()
       	
    # tell something to the user before exiting
    if options.verbose:
        print "all threads are finished - goodbye."

It turns out to be easy to automate web data download, and this little sample does the job using only a plain vanilla Python distribution. The multithreaded approach (thanks to the pycurl sample) is implemented in a simple manner, with the help of a queue of jobs: one for each series to get, all performing the same task for different tickers. As a minor feature, optparse is used to read the script's command-line parameters.
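
For readers running the script under Python 3 (where print becomes a function, the Queue module is renamed queue, and urllib.urlopen moves to urllib.request.urlopen), here is a minimal sketch of the same queue-of-jobs worker pattern. The fetch URL is a placeholder, since the ichart endpoint used above has been retired by Yahoo:

import queue
import threading
import urllib.parse
import urllib.request

def fetch(url, filename):
    # download one CSV; the URL scheme depends on whatever data provider you use
    with urllib.request.urlopen(url) as response, open(filename, "wb") as fp:
        fp.write(response.read())

def worker(jobs):
    while True:
        try:
            ticker, url, filename = jobs.get_nowait()
        except queue.Empty:
            return  # queue drained: this worker is done
        try:
            fetch(url, filename)
        except Exception as exc:
            print("failed:", ticker, exc)

jobs = queue.Queue()
for ticker in ["^GSPC", "^N225"]:
    # hypothetical URL; substitute a provider that still serves CSV quotes
    url = "https://example.com/quotes.csv?s=" + urllib.parse.quote(ticker)
    jobs.put((ticker, url, "%s.csv" % ticker.lstrip("^")))

threads = [threading.Thread(target=worker, args=(jobs,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each worker exits only when the queue is drained, so a fixed pool of threads can serve any number of tickers; that is also why the recipe above caps connections at min(connections, numTickers) rather than one thread per ticker.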

7 comments

Texie Nielsen 16 years, 10 months ago

Yahoo URL. Is the URL in the script up to date? The script seems to work fine, but I'm not getting a hit on the URL. Thanks, -t

Sergio Correia 16 years, 7 months ago

There was a recent change (2-3 months maybe) in Yahoo's URLs, which is probably easy to fix.

However, what if there are more than 256 stocks? What if I want to do the S&P 500 (500 stocks), or if even 20 concurrent connections are too many? I'm still learning threading, but the answer may need some kind of locks + global vars (not sure, though).

Albert Chan 15 years, 12 months ago

How to add ticker symbol to tuples. Thanks! The script works great.

I was wondering how you can modify the script to add the corresponding ticker symbol to the beginning of each line in each downloaded CSV file. The stock analysis program I use requires the ticker symbol to be in each line of data in order for the CSV file to import properly.

Greatly appreciate your help!
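
A minimal, untested sketch of one way to do that: post-process each file after the worker writes it (for instance, right after fp.close() in run()), prefixing the ticker to every data row while leaving the header row alone. The prepend_ticker helper below is an assumption, not part of the recipe:

def prepend_ticker(filename, ticker):
    # read the downloaded CSV back in
    with open(filename) as f:
        lines = f.read().splitlines()
    with open(filename, "w") as f:
        # keep the header row unchanged
        f.write(lines[0] + "\n")
        # prefix every data row with the ticker symbol
        for line in lines[1:]:
            f.write("%s,%s\n" % (ticker, line))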

jigarv 15 years, 8 months ago

How to run this script? Please suggest.

jigarv 15 years, 8 months ago

IndexError: list index out of range. This is the error I get... please suggest.

jigarv 15 years, 8 months ago

TypeError: coercing to Unicode: need string or buffer, NoneType found. This is the error I get.

John Deere 12 years, 5 months ago

Hi,

I am trying to run this code but it gives me a syntax error (invalid syntax). Could anyone help me run this code and output S&P data in CSV?

Thanks