Popular Python recipes tagged "html" http://code.activestate.com/recipes/langs/python/tags/html/ 2015-03-07T20:22:54-08:00 ActiveState Code Recipes
Convert HTML to PDF with the PDFcrowd API (Python) 2015-03-07T20:22:54-08:00 Vasudev Ram http://code.activestate.com/recipes/users/4173351/ http://code.activestate.com/recipes/579032-convert-html-to-pdf-with-the-pdfcrowd-api/ <p style="color: grey"> Python recipe 579032 by <a href="/recipes/users/4173351/">Vasudev Ram</a> (<a href="/recipes/tags/api/">api</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/pdf/">pdf</a>, <a href="/recipes/tags/pdfcrowd/">pdfcrowd</a>). </p> <p>This recipe shows how to use Python and the PDFcrowd API to convert HTML content to PDF. The HTML input can come from a remote URL, a local HTML file, or a string containing HTML.</p>
Composing a POSTable HTTP request with multipart/form-data Content-Type to simulate a form/file upload. (Python) 2014-03-08T17:34:38-08:00 István Pásztor http://code.activestate.com/recipes/users/4189380/ http://code.activestate.com/recipes/578846-composing-a-postable-http-request-with-multipartfo/ <p style="color: grey"> Python recipe 578846 by <a href="/recipes/users/4189380/">István Pásztor</a> (<a href="/recipes/tags/field/">field</a>, <a href="/recipes/tags/file/">file</a>, <a href="/recipes/tags/form/">form</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/httpclient/">httpclient</a>, <a href="/recipes/tags/mime/">mime</a>, <a href="/recipes/tags/multipart/">multipart</a>, <a href="/recipes/tags/post/">post</a>, <a href="/recipes/tags/upload/">upload</a>, <a href="/recipes/tags/web/">web</a>). Revision 5. </p> <p>This code is useful if you are using an HTTP client and want to simulate the request a browser sends when it submits a form containing several input fields (including file upload fields).
I've used this with Python 2.x.</p>
Pretty and Stated HTMLParsers (Python) 2013-12-14T00:28:36-08:00 Ádám Szieberth http://code.activestate.com/recipes/users/4188745/ http://code.activestate.com/recipes/578787-pretty-and-stated-htmlparsers/ <p style="color: grey"> Python recipe 578787 by <a href="/recipes/users/4188745/">Ádám Szieberth</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/htmlparser/">htmlparser</a>, <a href="/recipes/tags/state/">state</a>). Revision 2. </p> <p>Extensions of html.parser.HTMLParser().</p> <p>PrettyHTMLParser() does not split data into chunks at HTML entities. StatedHTMLParser() can have many state-dependent handlers, which helps a lot when parsing HTML pages.</p>
Python HTML Stripper (Python) 2013-04-08T13:58:00-07:00 Granning Stoline http://code.activestate.com/recipes/users/4186069/ http://code.activestate.com/recipes/578511-python-html-stripper/ <p style="color: grey"> Python recipe 578511 by <a href="/recipes/users/4186069/">Granning Stoline</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/python/">python</a>, <a href="/recipes/tags/stripper/">stripper</a>). </p> <p>Python HTML Stripper</p>
Easy to use, easy to read, python based HTML generation (Python) 2013-06-21T14:47:21-07:00 Pavlos http://code.activestate.com/recipes/users/4185038/ http://code.activestate.com/recipes/578436-easy-to-use-easy-to-read-python-based-html-generat/ <p style="color: grey"> Python recipe 578436 by <a href="/recipes/users/4185038/">Pavlos</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/template/">template</a>, <a href="/recipes/tags/text/">text</a>). Revision 4. </p> <p>I was looking for a simple way to generate HTML directly in Python that requires neither learning a new template 'language' nor installing a big, complex package. The closest thing I found was James Casbon's attempt (https://gist.github.com/1461441). This is my version of the same idea.
</p> <p>(2013-04-21) Added some simplifications and support for switching off string interpolation. Added to GitHub:</p> <p><a href="https://github.com/pavlos-christoforou/web" rel="nofollow">https://github.com/pavlos-christoforou/web</a></p>
A Simple Webcrawler (Python) 2012-03-03T02:37:30-08:00 John http://code.activestate.com/recipes/users/4181142/ http://code.activestate.com/recipes/578060-a-simple-webcrawler/ <p style="color: grey"> Python recipe 578060 by <a href="/recipes/users/4181142/">John</a> (<a href="/recipes/tags/crawler/">crawler</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/page/">page</a>, <a href="/recipes/tags/parser/">parser</a>, <a href="/recipes/tags/scraping/">scraping</a>, <a href="/recipes/tags/urllib/">urllib</a>, <a href="/recipes/tags/urlopen/">urlopen</a>, <a href="/recipes/tags/web/">web</a>). </p> <p>This is my simple web crawler. It takes as input a list of seed pages (web URLs), 'scrapes' each page for all its absolute links (i.e. links that begin with http://), and adds those to a dictionary. The web crawler can take all the links found in the seed pages and then scrape those as well. You can continue scraping as deep as you like. You control how "deep you go" with the depth argument passed to the WebCrawler method start_crawling(seed_pages,depth). Think of the depth as the recursion depth (the number of web pages deep you go before returning back up the tree).</p> <p>To make this web crawler a little more interesting I added some bells and whistles. I added the ability to pass a regular expression object into the WebCrawler class constructor. The regular expression object is used to "filter" the links found during scraping.
For example, in the code below you will see:</p> <p>cnn_url_regex = re.compile('(?&lt;=[.]cnn)[.]com') # cnn_url_regex is a regular expression object</p> <p>w = WebCrawler(cnn_url_regex)</p> <p>This particular regular expression says:</p> <p>1) Find the first occurrence of the string '.com'</p> <p>2) Then, looking backwards from where '.com' was found, check that it is immediately preceded by '.cnn'</p> <p>Why do this?</p> <p>You can control where the crawler crawls. In this case I am constraining the crawler to operate on webpages within cnn.com.</p> <p>Another feature I added was the ability to parse a given page looking for specific HTML tags. I chose the &lt;h1&gt; tag as an example. Once an &lt;h1&gt; tag is found, I store all the words I find in the tag in a dictionary that gets associated with the page URL.</p> <p>Why do this?</p> <p>My thought was that if I scraped the page for text I could eventually use this data to answer search engine requests. Say I searched for 'Lebron James', and suppose that one of the pages my crawler scraped contained an article that mentions Lebron James many times. In response to the search request I could return the link to the Lebron James article.</p> <p>The web crawler is implemented in the WebCrawler class. It has 2 functions the user should call:</p> <p>1) start_crawling(seed_pages,depth)</p> <p>2) print_all_page_text() # this is only used for debug purposes</p> <p>The rest of WebCrawler's functions are internal functions that should not be called by the user (think private in C++).</p> <p>Upon construction, a WebCrawler object creates a MyHTMLParser object. The MyHTMLParser class inherits from the built-in Python class HTMLParser. I use the MyHTMLParser object when searching for the &lt;h1&gt; tag. The MyHTMLParser class creates instances of a helper class named Tag. The Tag class is used in creating a "linked list" of tags.</p> <p>So, to get started with WebCrawler, make sure to use Python 2.7.2.
Enter the code a piece at a time into IDLE in the order displayed below. This ensures that you import libs before you start using them.</p> <p>Once you have entered all the code into IDLE, you can start crawling the 'interwebs' by entering the following:</p> <p>import re</p> <p>cnn_url_regex = re.compile('(?&lt;=[.]cnn)[.]com') </p> <p>w = WebCrawler(cnn_url_regex)</p> <p>w.start_crawling(['http://www.cnn.com/2012/02/24/world/americas/haiti-pm-resigns/index.html?hpt=hp_t3'],1)</p> <p>Of course you can enter any page you want, but the regular expression object is already set up to filter on <a href="http://cnn.com" rel="nofollow">cnn.com</a>. Remember that the second parameter passed into the start_crawling function is the recursion depth.</p> <p>Happy Crawling!</p>
Safe HTML string and unicode (Python) 2012-01-10T08:14:14-08:00 Garel Alex http://code.activestate.com/recipes/users/2757636/ http://code.activestate.com/recipes/578008-safe-html-string-and-unicode/ <p style="color: grey"> Python recipe 578008 by <a href="/recipes/users/2757636/">Garel Alex</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/security/">security</a>, <a href="/recipes/tags/web/">web</a>). Revision 2. </p> <p>When you display messages on a web page, you have to sanitize input data coming from users to avoid <a href="https://en.wikipedia.org/wiki/Cross-site_scripting">XSS</a>. This small recipe uses a special string class to make sure the data stays safe all the way through.</p>
Show all the telecommuting jobs from the Python Job Board (Python) 2011-12-09T07:38:28-08:00 Victor Yang http://code.activestate.com/recipes/users/627255/ http://code.activestate.com/recipes/577979-show-all-the-telecommuting-jobs-from-the-python-jo/ <p style="color: grey"> Python recipe 577979 by <a href="/recipes/users/627255/">Victor Yang</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/network/">network</a>, <a href="/recipes/tags/screenscrape/">screenscrape</a>).
</p> <p>It runs as a cron job on a VPS (Virtual Private Server). The output HTML can be served by any web server. </p>
ActiveState recipe statistics (Python) 2011-06-02T14:52:50-07:00 Kaan Ozturk http://code.activestate.com/recipes/users/4178179/ http://code.activestate.com/recipes/577732-activestate-recipe-statistics/ <p style="color: grey"> Python recipe 577732 by <a href="/recipes/users/4178179/">Kaan Ozturk</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/regular_expressions/">regular_expressions</a>, <a href="/recipes/tags/statistics/">statistics</a>, <a href="/recipes/tags/urllib2/">urllib2</a>, <a href="/recipes/tags/web/">web</a>). Revision 2. </p> <p>Downloads the "All Recipe Authors" pages from ActiveState and uses regular expressions to parse each author's name and number of recipes on each page. Finally, it displays the recipe submission distribution (how many authors have submitted each number of recipes).</p>
webcheck: site to csv (Python) 2011-03-09T06:37:08-08:00 Jervis Whitley http://code.activestate.com/recipes/users/4169341/ http://code.activestate.com/recipes/577602-webcheck-site-to-csv/ <p style="color: grey"> Python recipe 577602 by <a href="/recipes/users/4169341/">Jervis Whitley</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/linkcheck/">linkcheck</a>, <a href="/recipes/tags/sitemap/">sitemap</a>, <a href="/recipes/tags/webcheck/">webcheck</a>). Revision 3.
</p> <p>An extension to Arthur de Jong's excellent webcheck tool (a website link checker) (<a href="http://arthurdejong.org/webcheck" rel="nofollow">http://arthurdejong.org/webcheck</a>) that will read in the resultant webcheck.dat file and create a csv formatted file.</p>
Random URL (Python) 2010-09-12T22:23:09-07:00 FB36 http://code.activestate.com/recipes/users/4172570/ http://code.activestate.com/recipes/577389-random-url/ <p style="color: grey"> Python recipe 577389 by <a href="/recipes/users/4172570/">FB36</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/url/">url</a>, <a href="/recipes/tags/urllib2/">urllib2</a>, <a href="/recipes/tags/web/">web</a>). </p> <p>Finds and displays a random webpage from the Internet. (Warning: It may take a while!)</p>
Website Text Search (Python) 2010-09-11T17:32:01-07:00 FB36 http://code.activestate.com/recipes/users/4172570/ http://code.activestate.com/recipes/577388-website-text-search/ <p style="color: grey"> Python recipe 577388 by <a href="/recipes/users/4172570/">FB36</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/url/">url</a>, <a href="/recipes/tags/urllib2/">urllib2</a>, <a href="/recipes/tags/web/">web</a>). Revision 2. </p> <p>Searches a website recursively for the given text string and prints all URLs containing it.</p>
Image Downloader (Python) 2014-02-24T03:49:51-08:00 FB36 http://code.activestate.com/recipes/users/4172570/ http://code.activestate.com/recipes/577385-image-downloader/ <p style="color: grey"> Python recipe 577385 by <a href="/recipes/users/4172570/">FB36</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/url/">url</a>, <a href="/recipes/tags/urllib2/">urllib2</a>, <a href="/recipes/tags/web/">web</a>). Revision 4.
</p> <p>Finds and downloads all images from any given URL.</p> <p>Important note:</p> <p>If your download location path contains spaces, put quotes around it!</p>
Website Mapper (Python) 2010-09-23T01:23:04-07:00 FB36 http://code.activestate.com/recipes/users/4172570/ http://code.activestate.com/recipes/577392-website-mapper/ <p style="color: grey"> Python recipe 577392 by <a href="/recipes/users/4172570/">FB36</a> (<a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/url/">url</a>, <a href="/recipes/tags/web/">web</a>). Revision 3. </p> <p>Prints the tree graph of the given URL. </p>
Userfriendly Webpage Template (Python) 2010-05-04T07:38:03-07:00 david.gaarenstroom http://code.activestate.com/recipes/users/4168848/ http://code.activestate.com/recipes/577203-userfriendly-webpage-template/ <p style="color: grey"> Python recipe 577203 by <a href="/recipes/users/4168848/">david.gaarenstroom</a> (<a href="/recipes/tags/cgi/">cgi</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/httpserver/">httpserver</a>, <a href="/recipes/tags/mvc/">mvc</a>, <a href="/recipes/tags/template/">template</a>, <a href="/recipes/tags/web/">web</a>, <a href="/recipes/tags/webdesign/">webdesign</a>, <a href="/recipes/tags/webpagetemplate/">webpagetemplate</a>). Revision 5. </p> <p>User-friendly template class targeted at web-page usage and optimized for speed and efficiency.</p> <p>Tags can be inserted into a template HTML file in a non-intrusive way, by using specially formatted comment strings. The template file can therefore be viewed in a browser, even with prototype data embedded in it, which will later be replaced by dynamic content.
Also, web designers can continue to work on the template and upload it without further modification.</p>
Convert text/enriched MIME to text/html (Python) 2009-06-09T15:08:40-07:00 Jack Trainor http://code.activestate.com/recipes/users/4076953/ http://code.activestate.com/recipes/576800-convert-textenriched-mime-to-texthtml/ <p style="color: grey"> Python recipe 576800 by <a href="/recipes/users/4076953/">Jack Trainor</a> (<a href="/recipes/tags/email/">email</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/mime/">mime</a>, <a href="/recipes/tags/text_enriched/">text_enriched</a>). </p> <p>Converts a text stream in text/enriched MIME format, read from a file or stdin, to text/html output written to a file or stdout.</p>
CommentEditor: HTML editor for online comments (Python) 2009-06-17T12:10:55-07:00 Jack Trainor http://code.activestate.com/recipes/users/4076953/ http://code.activestate.com/recipes/576814-commenteditor-html-editor-for-online-comments/ <p style="color: grey"> Python recipe 576814 by <a href="/recipes/users/4076953/">Jack Trainor</a> (<a href="/recipes/tags/editor/">editor</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/wxwidgets/">wxwidgets</a>). Revision 3. </p> <p>Edit online comments with easy addition of HTML tags for bold, italics, underlining, blockquote and anchored links.
Then check your work with the preview feature.</p> <p>Requires wxWidgets.</p>
Serve static web content from within a gzipped tarball to save space using CherryPy (Python) 2009-03-31T18:24:06-07:00 Dan McDougall http://code.activestate.com/recipes/users/4169722/ http://code.activestate.com/recipes/576706-serve-static-web-content-from-within-a-gzipped-tar/ <p style="color: grey"> Python recipe 576706 by <a href="/recipes/users/4169722/">Dan McDougall</a> (<a href="/recipes/tags/cherrypy/">cherrypy</a>, <a href="/recipes/tags/compression/">compression</a>, <a href="/recipes/tags/embedded/">embedded</a>, <a href="/recipes/tags/gzip/">gzip</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/network/">network</a>, <a href="/recipes/tags/routes/">routes</a>, <a href="/recipes/tags/web/">web</a>, <a href="/recipes/tags/web_server/">web_server</a>). </p> <p>This code lets you store all of your static website content inside a gzipped tarball while transparently serving it to HTTP clients on demand. Perfect for embedded systems where space is limited.</p>
Parsing the tags of HTML files (Python) 2009-02-21T23:52:05-08:00 nillgump nillgump http://code.activestate.com/recipes/users/4169273/ http://code.activestate.com/recipes/576658-html/ <p style="color: grey"> Python recipe 576658 by <a href="/recipes/users/4169273/">nillgump nillgump</a> (<a href="/recipes/tags/html/">html</a>). </p> <p>Parses the tags of HTML files.</p>
text-to-html (Python) 2009-03-05T23:10:50-08:00 nillgump nillgump http://code.activestate.com/recipes/users/4169273/ http://code.activestate.com/recipes/576682-text-to-html/ <p style="color: grey"> Python recipe 576682 by <a href="/recipes/users/4169273/">nillgump nillgump</a> (<a href="/recipes/tags/html/">html</a>). </p> <p>Converts the layout of plain text into the same layout in HTML. Written for Google's blog service.</p>
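<p>A minimal sketch of the idea behind the text-to-html entry, not the recipe's actual code: escape HTML special characters, keep indentation and runs of spaces by using non-breaking spaces, and turn line breaks into &lt;br&gt; tags. The function name text_to_html, the tab_width parameter and the use of Python 3's html.escape are assumptions made for illustration; the original recipe may work quite differently.</p>

```python
import html


def text_to_html(text, tab_width=4):
    """Return HTML that keeps the visual layout of plain text (illustrative sketch)."""
    html_lines = []
    for line in text.splitlines():
        # Expand tabs first, then escape &, <, > and quotes.
        escaped = html.escape(line.expandtabs(tab_width))
        # Leading spaces would collapse in a browser, so count them...
        body = escaped.lstrip(" ")
        indent = len(escaped) - len(body)
        # ...and keep interior runs of spaces by alternating with &nbsp;.
        body = body.replace("  ", " &nbsp;")
        html_lines.append("&nbsp;" * indent + body)
    # One <br> per original line break.
    return "<br>\n".join(html_lines)


if __name__ == "__main__":
    print(text_to_html("def f(x):\n    return x < 3"))
```

<p>Non-breaking spaces are used here instead of a &lt;pre&gt; block so that the result still wraps and inherits the blog's font; wrapping the escaped text in &lt;pre&gt; would be a simpler alternative.</p>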