
This recipe uses DOM (specifically, cDomlette or the minidom variant in 4Suite) to merge two files containing XBEL bookmark listings. It uses Python 2.2 generators for straightforward and efficient iteration over the XBEL DOM trees in document order. It requires Python 2.2 and 4Suite 0.12.0a2 or later.

Python, 88 lines
#!/usr/bin/env python
from __future__ import generators
from xml.dom import Node
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint


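def in_order_iterator_filter(node, filter_func):
    #Yield node and its descendants, in document order, that pass filter_func
    if filter_func(node):
        yield node
    for child in node.childNodes:
        #The recursive call has already filtered, so no need to re-test
        for cn in in_order_iterator_filter(child, filter_func):
            yield cn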


def get_elements_by_tag_name_ns(node, ns, local):
    return in_order_iterator_filter(
               node,
               lambda n: n.nodeType == Node.ELEMENT_NODE and \
                         n.namespaceURI == ns and n.localName == local
           )


def string_value(node):
    text_nodes = in_order_iterator_filter(
        node, lambda n: n.nodeType == Node.TEXT_NODE)
    return u''.join([ n.data for n in text_nodes ])


def get_title(node):
    return string_value(
        get_elements_by_tag_name_ns(node, None, 'title').next())


def merge_folders(folder_node1, folder_node2):
    #Folder element children of folder1
    folder1_folders = \
        [ n for n in folder_node1.childNodes if n.nodeName == 'folder' ]
    #Yes, the list must be copied to avoid mutate-while-iterate bugs
    for elem in folder_node2.childNodes[:]:
        #No need to copy title element
        if elem.nodeName == 'title':
            continue
        #Merge into a folder of folder1 with the same title, if there is one
        elif elem.nodeName == 'folder':
            title = get_title(elem)
            for a_folder in folder1_folders:
                if title == get_title(a_folder):
                    merge_folders(a_folder, elem)
                    break
            else:
                folder_node1.appendChild(elem)
        else:
            folder_node1.appendChild(elem)


def xbel_merge(xbel1, xbel2):
    xbel1_top_level = \
        [ n for n in xbel1.documentElement.childNodes \
            if n.nodeType == Node.ELEMENT_NODE ]
    xbel1_top_level_folders = \
        [ n for n in xbel1_top_level if n.nodeName == 'folder' ]
    xbel1_top_level_bookmarks = \
        [ n for n in xbel1_top_level if n.nodeName == 'bookmark' ]
    xbel2_top_level = \
        [ n for n in xbel2.documentElement.childNodes \
            if n.nodeType == Node.ELEMENT_NODE ]
    for elem in xbel2_top_level:
        if elem.nodeName == 'folder':
            title = get_title(elem)
            for a_folder in xbel1_top_level_folders:
                if title == get_title(a_folder):
                    merge_folders(a_folder, elem)
                    break
            else:
                xbel1.documentElement.appendChild(elem)
        elif elem.nodeName == 'bookmark':
            xbel1.documentElement.appendChild(elem)
    return xbel1


if __name__ == "__main__":
    import sys
    xbel1 = NonvalidatingReader.parseUri(sys.argv[1])
    xbel2 = NonvalidatingReader.parseUri(sys.argv[2])
    new_xbel = xbel_merge(xbel1, xbel2)
    PrettyPrint(new_xbel)

This is actually an updated version of an old script I posted ages ago:

http://mail.python.org/pipermail/xml-sig/1999-September/001441.html

This version is much faster and uses current APIs. For more info on XBEL, see:

http://pyxml.sourceforge.net/topics/xbel/

An alternate implementation could use straight Python 2.2 minidom. The main changes would be using "minidom.parse" instead of "NonvalidatingReader.parseUri" and "new_xbel.toxml()" instead of "PrettyPrint(new_xbel)".
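The minidom alternative mentioned above might look roughly like the sketch below. It is an adaptation, not the original recipe: it assumes Python 3 (so "yield from" and the built-in next() replace the Python 2.2 idioms), it uses "minidom.parseString" on two tiny inline documents rather than "minidom.parse" on files, and the sample bookmarks and example.org URLs are hypothetical placeholders. It also relies on minidom letting a node be appended into a tree parsed from another document without an explicit import step.

```python
from xml.dom import Node, minidom


def iter_filtered(node, keep):
    """Yield node and its descendants, in document order, when keep(n) is true."""
    if keep(node):
        yield node
    for child in node.childNodes:
        yield from iter_filtered(child, keep)


def string_value(node):
    """Concatenate all text node data found under node."""
    is_text = lambda n: n.nodeType == Node.TEXT_NODE
    return "".join(n.data for n in iter_filtered(node, is_text))


def get_title(node):
    """Return the text of the first title element at or below node."""
    is_title = lambda n: n.nodeType == Node.ELEMENT_NODE and n.nodeName == "title"
    return string_value(next(iter_filtered(node, is_title)))


def merge_folders(folder1, folder2):
    """Move folder2's children into folder1, merging same-titled subfolders."""
    folder1_folders = [n for n in folder1.childNodes if n.nodeName == "folder"]
    # Copy the child list: appendChild removes nodes from folder2 as we go
    for elem in list(folder2.childNodes):
        if elem.nodeName == "title":
            continue
        if elem.nodeName == "folder":
            title = get_title(elem)
            for a_folder in folder1_folders:
                if title == get_title(a_folder):
                    merge_folders(a_folder, elem)
                    break
            else:
                folder1.appendChild(elem)
        else:
            folder1.appendChild(elem)


def xbel_merge(xbel1, xbel2):
    """Merge the top-level folders and bookmarks of xbel2 into xbel1."""
    root1 = xbel1.documentElement
    folders1 = [n for n in root1.childNodes if n.nodeName == "folder"]
    top2 = [n for n in xbel2.documentElement.childNodes
            if n.nodeType == Node.ELEMENT_NODE]
    for elem in top2:
        if elem.nodeName == "folder":
            title = get_title(elem)
            for a_folder in folders1:
                if title == get_title(a_folder):
                    merge_folders(a_folder, elem)
                    break
            else:
                root1.appendChild(elem)
        elif elem.nodeName == "bookmark":
            root1.appendChild(elem)
    return xbel1


# Hypothetical sample data, only for trying the sketch out
BM1 = """<xbel><title>First</title>
<folder><title>XML</title><bookmark href="http://example.org/a"><title>Site A</title></bookmark></folder>
</xbel>"""
BM2 = """<xbel><title>Second</title>
<folder><title>XML</title><bookmark href="http://example.org/b"><title>Site B</title></bookmark></folder>
<folder><title>XSL</title></folder>
</xbel>"""

merged = xbel_merge(minidom.parseString(BM1), minidom.parseString(BM2))
print(merged.toxml())
```

With these inputs the two "XML" folders are combined into one and the "XSL" folder is appended after it, which mirrors what the 4Suite version does with same-titled folders.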

For a great introduction to generators, about which I can hardly rave enough, see:

http://www-106.ibm.com/developerworks/library/l-pycon.html

http://www-106.ibm.com/developerworks/linux/library/l-pythrd.html
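As a small illustration of why generators suit this kind of tree walk, here is a sketch assuming Python 3 and the standard library's minidom rather than the Python 2.2 and 4Suite setup of the recipe. The "visited" list and the "document_order" name are introduced only for this illustration. The generator does its work lazily, so asking for the first title touches only the nodes that come before it in document order.

```python
from xml.dom import minidom

visited = []  # records each node the generator touches, to make laziness visible


def document_order(node):
    """Yield node, then its descendants, in document order (pre-order)."""
    visited.append(node.nodeName)
    yield node
    for child in node.childNodes:
        yield from document_order(child)


doc = minidom.parseString(
    "<folder><title>XML</title><bookmark/><bookmark/></folder>"
)

# Pull nodes from the generator only until the first title element turns up
walker = document_order(doc.documentElement)
first_title = next(n for n in walker if n.nodeName == "title")

print(first_title.firstChild.data)  # XML
print(visited)                      # ['folder', 'title']
```

The two bookmark elements are never visited, because nothing asked the generator to continue past the title. Building a full list of nodes first would have walked the whole tree.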

To test this script, you can use the following 2 XBEL files:

Bookmarks of Joris Graaumans [excerpt]

 XML

    ZVON.org


    XML.ORG - The XML Industry Portal


    The XML Cover Pages - Home Page


     Software

     xmlsoftware.com

     VBXML
     Xpath Visualiser


    DOM

      JavaScript DOM level 1


      JavaScript DOM examples



    DTDs

    TEI
    TEI pizza chef

    TEI Consortium

    Xbel

      Xbel homepage


    Misc

      XMLephant: Technologies/DTDs_and_Examples



      DocBook DTD

         DocBook Character Entity Reference


         Docbook reference guide

and

Bookmarks of Joris Graaumans [excerpt]

 Dictionaries

    University of Alberta Cognitive Science Dictionary (Home Page)
    Cognitive science woordenboek


    Dictionary.com


    Cambridge International Dictionaries


    Het Van Dale Taalweb



 XML

    XML discussion lists


    The World Wide Web Consortium


    DOM

      DOM-Level-2-Core



 XSL

    XSLT benchmark


    xml.apache.org Examples


    XSL specs van W3c
    XSL working draft 1.0 van het W3c

 Extensible Stylesheet Language (XSL)

 Discussion lists

    Mulberry Technologies, Inc.: XSL-List -- Open Forum on XSL


    Mulberry Technologies, Inc.: XSL

Just call "python xbel_merge.py bm1.xbel bm2.xbel" and the results will go to stdout.

1 comment

Uche Ogbuji (author) 14 years, 7 months ago

The sample XML files got corrupted. It looks as if the Cookbook uploader can't take XML in the "notes" field. Yes, I tried preview, and it did show the tags, though with indentation removed; it was nothing like what showed up in the end. I've put the 2 XBEL files you can use for testing at

http://uche.ogbuji.net/etc/020625/bm1.xbel

and

http://uche.ogbuji.net/etc/020625/bm2.xbel

Sorry for any inconvenience.
