I have been working on a project called Meccano2 since October. Meccano2 is written primarily in Jython, the Java implementation of Python. The Meccano2 data model is implemented using dom4j, a Java framework for working with XML data. I have used both Jython and dom4j before and found them to be very useful tools. Now, in Meccano, I am using them both together. The combination is extraordinarily powerful and flexible. It is so good, I just have to share. Introduction to Meccano2Meccano is a specialized editor for internal use at my work. It uses XML for its native data format. (Meccano1 used a domain-specific data model.) Meccano has a custom tree display and domain-specific editors for the data. Both the tree display and the editors have been extensively customized to meet user requirements. Meccano1 was written entirely in Java. The existing Meccano1 codebase is incorporated into Meccano2 as a library. The current Java code base has 154 modules (including modules added for Meccano2). The Meccano2 application is written in primarily in Jython with a Swing GUI. It makes extensive use of Java libraries. Currently Meccano2 has about 6500 lines of Jython code in 62 modules. JythonJython is an implementation of the Python language in Java. There are many reasons why I think Python is a great language. For the purposes of this paper I will focus on just a few. Python has strong support for data structuresLists and dictionaries are built in to the language. It is very easy to create and use ad-hoc, heterogeneous data structures. Iteration and lookup for both lists and dictionaries are supported by the syntax of the language. This benefits all kinds of programs because lists, dictionaries and compound data structures are useful everywhere. This strong support for structured data really shines in data-driven programs, because it is easy to specify and use the actual data. Complex data structures can be created statically, in declarations, or dynamically, in code. Unlike XML or other text-based data formats, no user-written parser is needed - the Python runtime parses the data for you and yields a native data structure. Using the data is supported by the core language. Python has first-class functionsIn Python, you can have standalone functions (not part of a class). Functions can be put in a list and called later, so a data structure can include callback functions. And a function can return another function as a value, so you can create a function that binds some variables while leaving others unassigned. For example suppose you have a function adder that takes two arguments and returns their sum: def adder(x, y): return x + y Now you can make a function factory - given a value n, it returns a function of one argument that adds n to its argument: def makeAddN(n): return lambda x: adder(x, n) So a call makeAddN(5) returns a function that adds 5 to its argument. This is a little hard to explain but it is very powerful. Do you ever miss function pointers from C or C++? Python can do everything you can do with a function pointer and more. I will show you a real-world example of this below. In addition to Python's benefits, Jython provides easy access to Java libraries, both Java standard APIs (Swing), third party libraries (dom4j, Velocity, Jetty, etc) and our own Java code (Meccano1 codebase with additions for Meccano2) dom4jMeccano uses dom4j to create its runtime data model. dom4j is "an Open Source XML framework for Java. dom4j allows you to read, write, navigate, create and modify XML documents. dom4j is seamlessly integrated with full XPath support." dom4j is similar to W3C DOM but it is designed specifically for Java. As a result it has a simpler interface than DOM. dom4j also has an event-driven parser similar to SAX but at a higher level of abstraction. dom4j's integrated XPath engine is its killer feature. It allows many types of queries on the data to be done with one or two lines of code. For example, you can easily find, count or check for the existence of nodes matching an XPath expression. XPath expressions can look at the name, value, attributes and children of a node. For example to find all LearningPoints whose type attribute has the value "software_simulation" and which contain a child slide whose type is "exercise", you could use the expression node.selectNodes('LearningPoint[@type="software_simulation" and Slide/@type="exercise"]') Using a generic data model means the model is homogeneous in the sense that every element supports the same operations. The homogeneous data model allows tools to be applied consistently to the whole model and facilitates creating data-driven engines that operate on the data. Meccano has several examples of such engines including a module that builds tree model of the data for visual display and a word count engine. SynergyThe combination of Jython's expressiveness and flexible data structures with integrated queries against the data model using XPath allows some types of problems to be solved with remarkable ease. Example: Meccano audit engineThe Meccano audit engine takes advantage of Jython's flexible data and first-class functions, and dom4j's integrated XPath engine, to make a very flexible, table-driven course audit tool. The core of the engine is a recursive data structure consisting of XPath expressions and audit rules. The structure can be defined by these production rules: A RuleItem is either
The function which applies a set of rules is short and sweet - just 12 lines of code. It is passed a RuleItem structure and a dom4j node to apply the rules to. At each step it looks to see if it has a callable, a pair or a list, and takes the appropriate action: def walkByRule(ruleItem, data): if callable(ruleItem): # This is an actual rule, just call it on the current data ruleItem(data) elif type(ruleItem) == type( () ): # ruleItem is a tuple ( xpathexpression, childRuleItem ) # childRuleItem is applied to each node selected by the xpath expression childExpr, childRuleItem = ruleItem for child in data.selectNodes(childExpr): walkByRule(childRuleItem, child) elif type(ruleItem) == type( [] ): # A list of rules, evaluate it recursively for rule in ruleItem: walkByRule(rule, data) else: print 'Unknown rule type:', ruleItem Here is a short snippet of the actual audit rules table. At this point the current node is a Learning Point. The first part of the snippet is an xpath expression that selects only Learning Points whose type is "summary". This is followed by a list containing two rules. The first rule evaluates an XPath expression againt the current node. The particular expression checks to see if the summary point is the last point. The second rule checks for the existence of an exercise point in the same topic, which is not allowed. ('.[@type="summary"]', [ xpathRule('count(following-sibling::LearningPoint) = 0', 'Summary Learning Point must be the last Learning Point in a Topic'), limit('count(../LearningPoint[@type="software_simulation" and Slide/@type="exercise"]')', 0, 'Topic with Exercise Point must not contain a Summary Learning Point'), ] ), Note that the rules are actual function calls. How is this possible? The rule engine requires that the rules be instances of functions with one argument. Well, xpathRule is a function that returns a function of one argument. To see how this works, first look at the function that implements the actual rule. It is a pretty straightforward function that evaluates an XPath expression against a data node and logs an error if the result is not 'true': def _xpathRuleImpl(data, xpathExpr, desc): ''' data.valueOf(xpathExpr) must be 'true' ''' result = data.valueOf(xpathExpr) if not result == 'true': _logError(desc, data) To turn this function of three arguments into a function of one argument, we create a new function that binds the xpathExpr and desc args and just needs the data argument. That is exactly what xpathRule does: def xpathRule(xpathExpr, desc): return lambda data: _xpathRuleImpl(data, xpathExpr, desc) So the call to xpathRule in the audit data actually creates the function that will be called by the rule engine. The call to limit() does much the same thing. The complete table of audit rules contains over 100 rules. They are all created using the same simple building blocks. Many of the rules reuse xpathRule, limit and other generic rules. In a few cases, special-purpose functions check a single specific rule. In Meccano, the same core rule engine is used to create word counts. In this case, the rules at the leaves of the rule data are instructions for how to count the text in the corresponding node. Similar techniques are used in the part of Meccano that builds the visual representation of the dom tree for display in the GUI. I hope these examples have given you some idea of the power of Jython and dom4j when used together for data-driven programming. Detailed exampleHere is a complete DomUtil module containing walkByRule and an extensive example:
''' Module DomUtil '''
import org.dom4j as dom def walkByRule(ruleItem, data): ''' walkByRule does a specialized tree walk of a dom tree. The walk is specified by xpath expressions in a Python structure. The actual work is done by callback functions embedded in the structure. The arguments to walkByRule are the rule structure and the top-level dom node to which it should be applied. ruleItem contains the actual rules. This is a recursive data structure. Here is the production rule for the structure. Note that the parentheses and brackets are part of the Python syntax, not part of the rule syntax, and the distinction between lists and tuples is significant. RuleItem := [ RuleItem* ] | ( XPathChildExpression, RuleItem ) | callable In English: A RuleItem is either - A list of zero or more RuleItems - A 2-tuple containing an xpath expression and a RuleItem to apply to all the nodes that match that expression, or - A callable which is called with the current node as its only argument data is the top-level dom node. So the walk starts by applying the given ruleItem to the given data. Here are some examples. First some data to play with: >>> data = """ ... <root> ... <branch id="1"> ... <leaf type="text">Leaf text</leaf> ... <leaf type="number">123</leaf> ... </branch> ... <branch id="2"> ... <leaf type="text">Leaf text 2</leaf> ... </branch> ... </root> ... """ ... >>> doc = dom.DocumentHelper.parseText(data) >>> root = doc.getRootElement() The simplest use of walkByRule is to pass a callable as the rule. walkByRule(fn, node) is the same as fn(node) >>> def printName(node): ... print node.getName() ... >>> walkByRule(printName, root) root This is not very interesting by itself, but callables will be the leaf nodes of a more complex rule structure. If a rule is a tuple, the first element must be an XPath expression. It is used to select nodes from the current target. The full dom4j XPath syntax is supported. The second element is a rule to evaluate against the selected nodes. walkByRule( (xpath, rule), node) is the same as for child in node.selectNodes(xpath): walkByRule(rule, child) For example, >>> rule = ('branch', printName) >>> walkByRule(rule, root) branch branch If the rule is a list, each element of the list must be a rule. The rules are passed to walkByRule using the current target node. walkByRule( [ rule1, rule2 ], node) is the same as walkByRule(rule1, node); walkByRule(rule2, node) The component rules can themselves be callables, tuples or lists. We can apply both rules from the preceding two examples by putting them in a list: >>> rule = [ printName, ('branch', printName) ] >>> walkByRule(rule, root) root branch branch The combination of list rules and tuple rules is very powerful. Tuple rules allow you to dig down into the dom tree to get to a node of interest. List rules allow you to apply many rules to a node. Here is a more complex example: >>> def printInt(node): ... print 'Integer value:', int(node.valueOf('text()')) ... >>> def printText(node): ... print 'Text value:', str(node.valueOf('text()')) ... >>> rule = [ ... printName, ... ('branch', [ ... printName, ... ('leaf[@type="text"]', printText), ... ('leaf[@type="number"]', printInt), ... ] ), ... ] ... >>> walkByRule(rule, root) root branch Text value: Leaf text Integer value: 123 branch Text value: Leaf text 2 Notice that in all cases the value in the rules for a callable rule is a reference to the actual function - printName - not a call of the function - printName(). A callable rule must be a reference to a function of a single argument. You may find that you have several similar rules and you would like to pass more than one argument to the rule. You can do this by defining a function that is a closure - it binds the extra arguments together with the actual implementation function to make a function of one argument. For example, here is a test function that compares the node text to an argument: >>> def containsTextImpl(node, text): ... if node.getText().find(text) >= 0: ... print 'Found', text ... else: ... print 'No', text ... This function binds a text value and containsTextImpl into a function of one argument: >>> def containsText(text): ... return lambda node, text=text: containsTextImpl(node, text) ... To use this, put a _call_ to containsText in the rule: >>> rule = ('//leaf', [ containsText('Leaf'), containsText('123') ] ) >>> walkByRule(rule, root) Found Leaf No 123 No Leaf Found 123 Found Leaf No 123 ''' if callable(ruleItem): # This is an actual rule, just call it ruleItem(data) elif type(ruleItem) == type( () ): # ruleItem is a tuple ( xpathexpression, childRuleItem ) # childRuleItem is applied to each node selected by the xpath expression childExpr, childRuleItem = ruleItem for child in data.selectNodes(childExpr): walkByRule(childRuleItem, child) elif type(ruleItem) == type( [] ): # A list of rules, evaluate it recursively for rule in ruleItem: walkByRule(rule, data) else: print 'Unknown rule type:', ruleItem if __name__ == "__main__": import doctest import DomUtil doctest.testmod(DomUtil) last change 2004-10-28 09:15:44 |
A success story BlogRoll |
© 2004, Kent Johnson