SV Expertise: June 2013

I finished reading Effective XML by Elliotte Rusty Harold over the weekend. It's an old book, almost 10 years old, published in 2004. Still, it is interesting and I wanted to summarize the book for myself and anybody else who is interested.

It is divided into 4 parts: Syntax, Structure, Semantics and Implementation. In total, there are 50 suggestions for improving your XML.

Syntax

#1: Include an XML declaration.

For example, “<?xml version="1.0" encoding="utf-8" standalone="yes"?>”.

#2: Mark up with ASCII if possible.

Don’t use Chinese characters as tag names.

#3: Stay with XML 1.0.

XML 1.1 makes several inadvisable things possible, like tag names in obscure, non-alphabetic languages.

#4: Use standard entity references.

Prefer named references (e.g. &ecaron;) to numeric character references (e.g. &#x11B;). Don’t invent your own names if somebody else already has.

#5: Comment DTDs liberally.

DTDs are hard to understand. Use lots of comments so they can be understood.

#6: Name elements with camel case.

Names written in CamelCase, rather than hyphenated like camel-case, are easier to map to variable names in code.

#7: Parameterize DTDs.

You can build all kinds of flexibility into DTDs using parameters, including conditionals and changing namespaces.

#8: Modularize DTDs.

You can split DTDs into multiple files for more flexibility.

#9: Distinguish text from markup.

The title isn’t very good but what he’s saying is that, once you escape XML tags (with &lt; and &gt; or a CDATA section) and use them as text, the parser no longer sees them as markup. They are just text.
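
Here is a small sketch of that point using Python's standard xml.etree.ElementTree module. The sample document and variable names are made up for illustration; the first <b> is escaped and the second is real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <b> is escaped inside <code>, real markup inside <p>.
doc = ET.fromstring(
    "<doc><code>&lt;b&gt;bold&lt;/b&gt;</code><p><b>bold</b></p></doc>"
)

escaped = doc.find("code")
real = doc.find("p")

escaped_children = len(escaped)  # escaped tags produce no child elements
escaped_text = escaped.text      # the "tag" comes back as ordinary characters
real_children = len(real)        # real markup produces a child element

print(escaped_children, repr(escaped_text), real_children)
```

The escaped version has no child elements at all, so nothing downstream can treat that <b> as an element.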

#10: White space matters.

White space, such as newlines and indentation, will be included in a text node when it’s read in. In other cases, like DTDs, it is irrelevant.
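
A quick way to see this, sketched with Python's xml.etree.ElementTree (the sample documents are invented): the same data, indented and not indented, produces different text in the parsed tree.

```python
import xml.etree.ElementTree as ET

# Same elements, once pretty-printed and once on a single line.
pretty = ET.fromstring("<list>\n  <item>a</item>\n  <item>b</item>\n</list>")
compact = ET.fromstring("<list><item>a</item><item>b</item></list>")

pretty_text = pretty.text      # whitespace before the first <item>
pretty_tail = pretty[0].tail   # whitespace after the first </item>
compact_text = compact.text    # no text at all in the compact form

print(repr(pretty_text), repr(pretty_tail), repr(compact_text))
```

Code that compares or concatenates text therefore has to decide whether that indentation is data or noise.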

Structure

#11: Make structure explicit through markup.

Avoid mini-formats where the developer has to parse a text node or attribute value to separate it further. Use tags instead of embedding spaces or commas in text or values.
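
To make the difference concrete, here is a rough Python sketch with a made-up <point> element. In the first form the application has to know the comma convention; in the second the parser has already done the splitting.

```python
import xml.etree.ElementTree as ET

# Mini-format: the structure is hidden inside an attribute value.
mini = ET.fromstring('<point coords="3.5, 4"/>')
x_mini, y_mini = (float(part) for part in mini.get("coords").split(","))

# Explicit markup: each value has its own element.
explicit = ET.fromstring("<point><x>3.5</x><y>4</y></point>")
x = float(explicit.findtext("x"))
y = float(explicit.findtext("y"))

print((x_mini, y_mini), (x, y))
```

Both give the same numbers here, but only the second one survives a change of separator or an extra value without new string-parsing code.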

#12: Store metadata in attributes.

The content/text should be the data itself; attributes should give metadata about the data. The content/text should contain what people normally want to see with attributes hiding away less important info.

#13: Remember mixed content.

A single sentence isn’t necessarily a single text node. It might be multiple text nodes, separated by child tags enclosing certain fragments. Don’t assume that the XML is flat or expect a rigid schema.
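
As an illustration, assuming Python's xml.etree.ElementTree and an invented sentence: the text is split around the child element, and ElementTree exposes the pieces as .text and .tail.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>XML is <em>not</em> always flat.</p>")

before = p.text       # text before the <em> child
inside = p[0].text    # text inside <em>
after = p[0].tail     # text following </em>, stored on the child

# itertext() walks all the text fragments in document order.
sentence = "".join(p.itertext())

print(repr(before), repr(inside), repr(after))
print(sentence)
```

Reading only p.text would silently drop most of the sentence, which is the kind of bug this item warns about.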

#14: Allow all XML syntax.

Don’t invent an XML format or XML parser that forbids processing instructions, comments or other standard XML features.

#15: Build on top of structures, not syntax.

Don’t try to differentiate between things that the XML parser says are the same. Don’t write applications that behave differently for a named reference versus a character reference, or for an empty-element tag versus an element containing the empty string.
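
A parser makes this easy to respect because it simply doesn't report the difference. The sketch below uses Python's xml.etree.ElementTree with invented sample data; it compares a numeric character reference against the literal character, and the two spellings of an empty element.

```python
import xml.etree.ElementTree as ET

# Character references vs. the literal characters: same parsed text.
with_refs = ET.fromstring("<name>Dvo&#x159;&#xE1;k</name>")
literal = ET.fromstring("<name>Dvo\u0159\u00e1k</name>")
same_text = with_refs.text == literal.text

# Empty-element tag vs. start tag plus end tag: same parsed element.
short_form = ET.fromstring("<e/>")
long_form = ET.fromstring("<e></e>")
same_empty = (
    short_form.text is None
    and long_form.text is None
    and ET.tostring(short_form) == ET.tostring(long_form)
)

print(same_text, same_empty)
```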

#16: Prefer URLs to unparsed entities and notations.

DTDs can be used to define unparsed entities and notations but it’s better to avoid them and just stick the value right in the XML itself.

#17: Use processing instructions for process-specific content.

XML processing instructions, like “<?xml-stylesheet ?>”, are not returned by parsers as easily but have their uses, especially for data that cuts across parent-child relationships. Use them where appropriate.

#18: Include all information in the instance document.

Avoid using DTD and other XML features that modify the XML data from outside, such as default attributes. The XML should contain all the data, even if the parser doesn’t read the DTD or other linked documents.

#19: Encode binary data using quoted printable and/or Base64.

Binary data can be encoded and inserted in XML. If you need it, do that.
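
A minimal round trip in Python might look like the following. The <attachment> element and its encoding attribute are names I made up for the example, not something from the book.

```python
import base64
import xml.etree.ElementTree as ET

payload = bytes(range(256))  # arbitrary binary data, including control bytes

# Hypothetical element name and attribute, chosen for illustration.
elem = ET.Element("attachment", {"encoding": "base64"})
elem.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

# Reading it back: parse the XML, then decode the text content.
decoded = base64.b64decode(ET.fromstring(xml_text).text)

print(decoded == payload)
```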

#20: Use namespaces for modularity and flexibility.

Namespaces look like URLs but they are just IDs. Choose and use a namespace for the data. Don’t avoid it just because you don’t understand it or don’t care.

#21: Rely on namespace URIs, not prefixes.

Don’t use the “svg:” prefix without setting it to “http://www.w3.org/2000/svg”. Don’t use namespaces without defining the namespace URI.
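
The flip side is that code should match on the URI and never on the prefix. In this sketch (Python's xml.etree.ElementTree, with three invented one-line documents) the prefix differs each time but the parsed name is identical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Three spellings of the same element: different prefixes, same URI.
a = ET.fromstring('<svg:rect xmlns:svg="http://www.w3.org/2000/svg"/>')
b = ET.fromstring('<s:rect xmlns:s="http://www.w3.org/2000/svg"/>')
c = ET.fromstring('<rect xmlns="http://www.w3.org/2000/svg"/>')

# ElementTree reports names as "{uri}local", so the prefix never appears.
expected = "{%s}rect" % SVG_NS
tags = [a.tag, b.tag, c.tag]

print(tags)
```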

#22: Don’t use namespace prefixes in element content and attribute values.

It is very confusing when namespace prefixes are used anywhere other than in a tag or attribute name. Instead, require the full namespace URI to be used, maybe as an attribute that qualifies the unprefixed name. For example, instead of <element xmlns:ex="…" type="ex:year">, do <element type="year" typens="…">.

#23: Reuse XHTML for generic narrative content.

Use XHTML for content that consists of paragraphs of text instead of inventing your own schema or restricting the content to unformatted text.

#24: Choose the right schema language for the job.

You have a choice between DTDs and XML Schema. There are also RELAX NG and Schematron, or you can even write the validation in Java.

#25: Pretend there’s no such thing as the PSVI.

The PSVI (post-schema-validation infoset) is XML data annotated with its schema information, produced by advanced XML parsers. Some libraries can read the PSVI and produce in-memory objects automatically from XML data, like Hibernate does for SQL databases. The PSVI is nice in theory but not practical.

#26: Version documents, schemas and stylesheets.

Add version numbers to XML and its related documents because it will change over time. You can use dates or major/minor versions. Don’t assume that your data, schemas and stylesheets will never need revision.

#27: Mark up according to meaning.

Put XML tags around things according to what they are, not just how they are formatted. For example, italics has several different uses so be more specific with the XML tag.

Semantics

#28: Use only what you need.

XML has lots of parts: XML 1.0, well-formedness, DTDs, Namespaces, XPath, Schemas, XLinks (Simple and Extended), XPointers, XInclude, Infoset, PSVI, XML 1.1, Namespaces 1.1, SVG, MathML, RDF, OWL, CSS, XSLT, XSL-FO, XQuery and so on. Use what you need. Don’t feel that you have to understand and use it all.

#29: Always use a parser.

Don’t try to write your own XML parser using regular expressions or something. Use an off-the-shelf XML parser.
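
To show why, here is a rough comparison in Python between a naive regular expression and the standard xml.etree.ElementTree parser. The sample document is invented and deliberately contains a comment, a CDATA section and an entity reference.

```python
import re
import xml.etree.ElementTree as ET

doc = (
    "<items>"
    "<!-- <item>old</item> -->"
    "<item><![CDATA[a < b]]></item>"
    "<item>R&amp;D</item>"
    "</items>"
)

# Naive approach: matches the commented-out item and leaves
# the CDATA wrapper and the entity reference undecoded.
regex_hits = re.findall(r"<item>(.*?)</item>", doc)

# A real parser skips the comment and decodes CDATA and entities.
parsed = [item.text for item in ET.fromstring(doc).iter("item")]

print(regex_hits)
print(parsed)
```

The regular expression could be patched for each of these cases, but at that point you are writing the parser this item tells you not to write.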

#30: Layer functionality.

Feel free to process XML in whatever order works best for you. Feel free to do validation before and/or after other processing. Create a processing chain that gets you to your final result.

#31: Program to standard APIs.

Write code so it is easy to swap in a new parser.

#32: Choose SAX for computer efficiency.

SAX is an event-based, streaming parser. Efficiency isn’t usually needed so you normally don’t need SAX.
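
For a feel of the event-based style, here is a minimal sketch using Python's xml.sax module. The ItemCounter class and the sample document are hypothetical; the handler only counts elements and never builds a tree in memory.

```python
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> start tags as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per start tag; nothing is kept except the counter.
        if name == "item":
            self.count += 1


handler = ItemCounter()
xml.sax.parseString(b"<items><item/><item/><item/></items>", handler)
item_count = handler.count

print(item_count)
```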

#33: Choose DOM for standards support.

DOM is a solid standard with lots of implementations. Many developers understand it. It is weird in some places, though.
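
One example of the weirdness, sketched with xml.dom.minidom, the small DOM implementation in Python's standard library (the sample document is invented): an element's text is not a property of the element but a separate child node.

```python
from xml.dom import minidom

dom = minidom.parseString("<book><title>Effective XML</title></book>")

title = dom.getElementsByTagName("title")[0]

# The text lives in a child Text node, not on the element itself.
text_node = title.firstChild
is_text_node = text_node.nodeType == text_node.TEXT_NODE
title_text = text_node.data

print(is_text_node, title_text)
```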

#34: Read the complete DTD.

If standalone is set to “no” in the XML declaration, the DTD is required. Skipping a required DTD may result in parsing errors. Be flexible in accepting XML data by reading DTDs.

#35: Navigate with XPath.

Doing “//name” with XPath is easier and less error-prone than crawling the DOM tree using getChildNodes(). It’s hard to write getChildNodes() code that doesn’t depend on exact parent-child relationships, tag order, number of tags and other variations in XML data.
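
Here is a small Python sketch of the difference. It assumes xml.etree.ElementTree, which supports only a limited subset of XPath and needs the relative form ".//name" rather than "//name". The two sample documents and both helper functions are made up; the second document adds one extra level of nesting.

```python
import xml.etree.ElementTree as ET

doc_v1 = ET.fromstring("<people><person><name>Ann</name></person></people>")
doc_v2 = ET.fromstring(
    "<people><group><person><name>Ann</name></person></group></people>"
)


def names_by_path(root):
    # Descendant search: does not care how deep <name> is.
    return [e.text for e in root.findall(".//name")]


def names_by_crawling(root):
    # Hand-written crawl: assumes exactly people/person/name.
    return [n.text for person in root for n in person if n.tag == "name"]


print(names_by_path(doc_v1), names_by_path(doc_v2))
print(names_by_crawling(doc_v1), names_by_crawling(doc_v2))
```

The path query keeps working when the structure shifts; the crawl quietly returns nothing for the second document.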

#36: Serialize XML with XML.

Don’t convert XML into an opaque binary format for no reason. Leave XML as XML.

#37: Validate inside your program with schemas.

Validate XML data using schemas rather than just breaking/crashing. The point of validation is to detect invalid data.

Implementation

#38: Write in Unicode.

Use UTF-8. ASCII is a subset of UTF-8. If using Japanese, Chinese or similar languages, use UTF-16. Don’t use obsolete ASCII formats. Do Unicode correctly with normalization and sorting.
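
On the normalization point, a tiny Python sketch with the standard unicodedata module shows why it matters: the same visible character can be stored two ways, and the strings only compare equal after normalizing. The choice of NFC here is my assumption for the example.

```python
import unicodedata

composed = "\u00e9"     # "é" as one code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(equal_raw, equal_nfc)
```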

#39: Parameterize XSLT stylesheets.

Use xsl:variable (like a constant) and xsl:param to make it easy to change fonts, sizes and other stuff in XSLT.

#40: Avoid vendor lock-in.

Avoid tools that have binary XML formats, unclear tag names, proprietary XML parsers and proprietary APIs.

#41: Hang on to your relational database.

XML does not replace SQL databases but you can use XML with them.

#42: Document namespaces with RDDL.

Namespaces are just IDs but people still try to use them as URLs. RDDL is an XML schema for a web page that is posted at a namespace “URL” that can provide resources, natures and purposes (such as DTDs) that might be useful.

#43: Preprocess XSLT on the server side.

For speed and consistency, preprocess and cache XSLT transformations on the server side using web server plugins.

#44: Serve XML+CSS to the client.

Browser clients can style XML directly using CSS. CSS can be applied conditionally, depending on the display type.

#45: Pick the correct MIME media type.

Have your web server serve up application/xml instead of text/xml. Use more official, more accurate MIME types, like image/svg+xml, if appropriate.

#46: Tidy up your HTML.

Converting HTML to XHTML will uncover bugs which are worth fixing. Do validation and fix any bugs that are found. Really old browsers don’t support some XHTML constructs.

#47: Catalog common resources.

XML Catalogs, used with parsers, allow you to locally cache remote resources like DTDs and schemas. Instead of getting the file from a remote site, the request is redirected to the local machine.

#48: Verify documents with XML digital signatures.

You probably don’t need it but there is a standard for doing digital signatures in XML.

#49: Hide confidential data with XML encryption.

You probably don’t need it but there is a standard for doing encryption in XML.

#50: Compress if space is a problem.

XML doesn’t waste that much space but, if needed, you can compress it.
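
As a rough illustration, assuming Python's gzip module and an invented, very repetitive document: general-purpose compression handles XML's repeated tag names well, and the round trip is lossless.

```python
import gzip

# Hypothetical document with a lot of repeated markup.
xml_bytes = ("<items>" + "<item>value</item>" * 1000 + "</items>").encode("utf-8")

compressed = gzip.compress(xml_bytes)
restored = gzip.decompress(compressed)

print(len(xml_bytes), len(compressed), restored == xml_bytes)
```

Real documents are less repetitive than this one, so the savings will vary.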

Monday, June 3, 2013

Effective XML by Elliotte Rusty Harold (2004)