
This recipe uses DOM (precisely, cDomlette or the minidom variant in 4Suite) to merge two files containing XBEL bookmark listings. It uses Python 2.2 generators for straightforward and efficient iteration over the XBEL DOM trees in document order. It requires Python 2.2 and 4Suite 0.12.0a2 or a more recent version.

Python, 88 lines
#!/usr/bin/env python
from __future__ import generators
from xml.dom import Node
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint


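def in_order_iterator_filter(node, filter_func):
    #Yield node, then its descendants, in document order,
    #keeping only the nodes for which filter_func is true
    if filter_func(node):
        yield node
    for child in node.childNodes:
        #The recursive call has already applied filter_func
        for cn in in_order_iterator_filter(child, filter_func):
            yield cn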


def get_elements_by_tag_name_ns(node, ns, local):
    return in_order_iterator_filter(
               node,
               lambda n: n.nodeType == Node.ELEMENT_NODE and \
                         n.namespaceURI == ns and n.localName == local
           )


def string_value(node):
    text_nodes = in_order_iterator_filter(
        node, lambda n: n.nodeType == Node.TEXT_NODE)
    return u''.join([ n.data for n in text_nodes ])


def get_title(node):
    return string_value(
        get_elements_by_tag_name_ns(node, None, 'title').next())


def merge_folders(folder_node1, folder_node2):
    #Folder element children of folder1
    folder1_folders = \
        [ n for n in folder_node1.childNodes if n.nodeName == 'folder' ]
    #Yes, the list must be copied to avoid mutate-while-iterate bugs
    for elem in folder_node2.childNodes[:]:
        #No need to copy title element
        if elem.nodeName == 'title':
            continue
        #
        elif elem.nodeName == 'folder':
            title = get_title(elem)
            for a_folder in folder1_folders:
                if title == get_title(a_folder):
                    merge_folders(a_folder, elem)
                    break
            else:
                folder_node1.appendChild(elem)
        else:
            folder_node1.appendChild(elem)


def xbel_merge(xbel1, xbel2):
    xbel1_top_level = \
        [ n for n in xbel1.documentElement.childNodes \
            if n.nodeType == Node.ELEMENT_NODE ]
    xbel1_top_level_folders = \
        [ n for n in xbel1_top_level if n.nodeName == 'folder' ]
    xbel1_top_level_bookmarks = \
        [ n for n in xbel1_top_level if n.nodeName == 'bookmark' ]
    xbel2_top_level = \
        [ n for n in xbel2.documentElement.childNodes \
            if n.nodeType == Node.ELEMENT_NODE ]
    for elem in xbel2_top_level:
        if elem.nodeName == 'folder':
            title = get_title(elem)
            for a_folder in xbel1_top_level_folders:
                if title == get_title(a_folder):
                    merge_folders(a_folder, elem)
                    break
            else:
                xbel1.documentElement.appendChild(elem)
        elif elem.nodeName == 'bookmark':
            xbel1.documentElement.appendChild(elem)
    return xbel1


if __name__ == "__main__":
    import sys
    xbel1 = NonvalidatingReader.parseUri(sys.argv[1])
    xbel2 = NonvalidatingReader.parseUri(sys.argv[2])
    new_xbel = xbel_merge(xbel1, xbel2)
    PrettyPrint(new_xbel)

This is actually an updated version of an old script I posted ages ago:

http://mail.python.org/pipermail/xml-sig/1999-September/001441.html

This version is much faster and uses current APIs. For more info on XBEL, see:

http://pyxml.sourceforge.net/topics/xbel/

An alternate implementation could use straight Python 2.2 minidom. The main changes would be using "minidom.parse" instead of "NonvalidatingReader.parseUri" and "new_xbel.toxml()" instead of "PrettyPrint(new_xbel)".
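
As a rough illustration of that minidom route, here is a minimal sketch of the traversal helpers on top of the standard library only. It assumes a current Python 3 interpreter rather than Python 2.2 (hence "yield from" and "next()"), matches elements by tag name instead of namespace and local name, and uses a made-up sample string with a placeholder URL; the helper name iter_document_order is hypothetical and not part of the recipe.

```python
from xml.dom import Node, minidom


def iter_document_order(node, keep):
    """Yield node and its descendants, in document order, when keep(n) is true."""
    if keep(node):
        yield node
    for child in node.childNodes:
        yield from iter_document_order(child, keep)


def string_value(node):
    """Concatenate the data of every text node at or below node."""
    texts = iter_document_order(node, lambda n: n.nodeType == Node.TEXT_NODE)
    return "".join(t.data for t in texts)


def get_title(node):
    """Return the text of the first <title> element at or below node."""
    titles = iter_document_order(
        node,
        lambda n: n.nodeType == Node.ELEMENT_NODE and n.tagName == "title",
    )
    return string_value(next(titles))


# Hypothetical sample data, not one of the recipe's test files.
SAMPLE = (
    "<xbel><folder><title>XML</title>"
    "<bookmark href='http://example.org/'><title>Example</title></bookmark>"
    "</folder></xbel>"
)

doc = minidom.parseString(SAMPLE)   # use minidom.parse(path) for a file
folder = doc.documentElement.firstChild
title = get_title(folder)
serialized = doc.toxml()            # stands in for PrettyPrint(new_xbel)
print(title)
```

The generator never builds an intermediate node list, so get_title stops walking the tree as soon as the first title element turns up.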

For a great introduction to generators, about which I can hardly rave enough, see:

http://www-106.ibm.com/developerworks/library/l-pycon.html

http://www-106.ibm.com/developerworks/linux/library/l-pythrd.html

To test this script, you can use two XBEL files. Their XML markup was stripped when the recipe was posted (see the comment below for links to the intact files), so only the titles they contain survive here, without URLs or folder nesting.

First file, "Bookmarks of Joris Graaumans [excerpt]":

XML
ZVON.org
XML.ORG - The XML Industry Portal
The XML Cover Pages - Home Page
Software
xmlsoftware.com
VBXML
Xpath Visualiser
DOM
JavaScript DOM level 1
JavaScript DOM examples
DTDs
TEI
TEI pizza chef
TEI Consortium
Xbel
Xbel homepage
Misc
XMLephant: Technologies/DTDs_and_Examples
DocBook DTD
DocBook Character Entity Reference
Docbook reference guide

Second file, also titled "Bookmarks of Joris Graaumans [excerpt]":

Dictionaries
University of Alberta Cognitive Science Dictionary (Home Page)
Cognitive science woordenboek
Dictionary.com
Cambridge International Dictionaries
Het Van Dale Taalweb
XML
XML discussion lists
The World Wide Web Consortium
DOM
DOM-Level-2-Core
XSL
XSLT benchmark
xml.apache.org Examples
XSL specs van W3c
XSL working draft 1.0 van het W3c
Extensible Stylesheet Language (XSL)
Discussion lists
Mulberry Technologies, Inc.: XSL-List -- Open Forum on XSL
Mulberry Technologies, Inc.: XSL

Just call "python xbel_merge.py bm1.xbel bm2.xbel" and the results will go to stdout.
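
For a quick check of the merge rule without 4Suite installed, the sketch below applies a simplified version of it to two tiny in-memory documents using the standard library's minidom. It assumes Python 3; the sample data, the placeholder URLs, and the helper names child_elements and merge_into are all hypothetical. Unlike the recipe's xbel_merge, it treats the top level the same way as any folder, so it illustrates the idea and is not a drop-in replacement.

```python
from xml.dom import Node, minidom


def child_elements(node, name=None):
    """Element children of node, optionally restricted to one tag name."""
    return [c for c in node.childNodes
            if c.nodeType == Node.ELEMENT_NODE
            and (name is None or c.tagName == name)]


def get_title(elem):
    """Text of elem's own <title> child, or "" if it has none."""
    for t in child_elements(elem, "title"):
        return "".join(c.data for c in t.childNodes
                       if c.nodeType == Node.TEXT_NODE)
    return ""


def merge_into(target, source):
    """Move source's children into target, merging same-titled folders."""
    existing = {}
    for f in child_elements(target, "folder"):
        existing.setdefault(get_title(f), f)   # first folder with a title wins
    # child_elements returns a new list, so moving nodes while looping is safe.
    for elem in child_elements(source):
        if elem.tagName == "title":
            continue                            # target keeps its own title
        twin = existing.get(get_title(elem)) if elem.tagName == "folder" else None
        if twin is not None:
            merge_into(twin, elem)
        else:
            target.appendChild(elem)            # also detaches elem from source


# Hypothetical sample data with placeholder URLs.
BM1 = ("<xbel><folder><title>XML</title>"
       "<bookmark href='http://a.example/'><title>A</title></bookmark>"
       "</folder></xbel>")
BM2 = ("<xbel><folder><title>XML</title>"
       "<bookmark href='http://b.example/'><title>B</title></bookmark>"
       "</folder><folder><title>XSL</title></folder></xbel>")

doc1 = minidom.parseString(BM1)
doc2 = minidom.parseString(BM2)
merge_into(doc1.documentElement, doc2.documentElement)

top_folders = child_elements(doc1.documentElement, "folder")
folder_titles = [get_title(f) for f in top_folders]
xml_bookmarks = [get_title(b) for b in child_elements(top_folders[0], "bookmark")]
print(folder_titles, xml_bookmarks)
```

The two "XML" folders collapse into one that holds both bookmarks, while the "XSL" folder, which has no counterpart in the first document, is appended whole.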

1 comment

Uche Ogbuji (author) 21 years, 10 months ago

The sample XML files got corrupted. It looks as if the Cookbook uploader can't take XML in the "notes" field. Yes, I tried preview, and it did show the tags, though with indentation removed; it was nothing like what showed up in the end. I've put the 2 XBEL files you can use for testing at

http://uche.ogbuji.net/etc/020625/bm1.xbel

and

http://uche.ogbuji.net/etc/020625/bm2.xbel

Sorry for any inconvenience.