adding word-level markup

Rehdon, snail, and others have recently asked me about marking up the words inside an element that may itself contain markup (sometimes wrapping more than one word), so I thought I’d write it up.

So for example we might have an XML file that looked like:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<line>This is a test;</line>
<line>Only a <seg type="foo">test</seg> ok?</line>
<line>And <seg>so; is</seg> this as well.</line>
</root>

Let’s say we want to mark up each of the whitespace-separated words and, for some reason, the randomly added semi-colons, as words with a <w> element. What we can use is <xsl:analyze-string> and a regex. For example:

<!-- wrap words (and runs of semicolons) found in any text node below <line> -->
<xsl:template match="line//text()">
  <xsl:analyze-string select="." regex="(\w+|;+)">
    <xsl:matching-substring><w><xsl:value-of select="."/></w></xsl:matching-substring>
    <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

In this example we’re matching any text() node anywhere inside a <line> element. Each substring that matches the \w+ regex (or is a run of semicolons) gets wrapped in a <w> element; anything that doesn’t match is output unchanged. Because the pattern is line//text() (as opposed to line/text()) it also matches text nodes inside child elements, grandchildren, and further down, rather than only the text that sits directly inside <line>.

So assuming we have a copy-all template something like:

<xsl:template match="@*|node()" priority="-1">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

(where we basically copy any nodes and attributes unless something else matches them) then we should get the result:

<root>
  <line><w>This</w> <w>is</w> <w>a</w> <w>test</w><w>;</w></line>
  <line><w>Only</w> <w>a</w> <seg type="foo"><w>test</w></seg> <w>ok</w>?</line>
  <line><w>And</w> <seg><w>so</w><w>;</w> <w>is</w></seg> <w>this</w> <w>as</w> <w>well</w>.</line>
</root>
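
If the semicolons (or punctuation generally) should end up in their own element rather than in <w>, one option is to give each alternative its own capture group and test regex-group() inside the matching substring. What follows is only a sketch: the <punc> element name is made up for illustration, and the \p{P} class is my assumption about what counts as punctuation in your documents.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- copy-all template: copy anything not matched elsewhere -->
  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <xsl:template match="line//text()">
    <!-- group 1 = a run of word characters, group 2 = a run of punctuation.
         The regex attribute is an attribute value template, so the curly
         braces of \p{P} have to be doubled. -->
    <xsl:analyze-string select="." regex="(\w+)|(\p{{P}}+)">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- a group that took no part in the match is a zero-length string -->
          <xsl:when test="regex-group(1) != ''"><w><xsl:value-of select="."/></w></xsl:when>
          <!-- hypothetical element name, pick whatever your schema provides -->
          <xsl:otherwise><punc><xsl:value-of select="."/></punc></xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>

Unlike the (\w+|;+) version, this also wraps the "?" and "." in the sample file, since they fall under \p{P} too. Whitespace still passes through untouched as a non-matching substring.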

Of course that is only the beginning, as your documents will probably have weird special cases and punctuation that you want to handle differently. And also it would, of course, be useful to create an @xml:id attribute for each word element.
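
For the @xml:id part, here is one way it might be done in the same pass. Treat it as a sketch under assumptions: the 'w-' prefix and the $textNode variable name are just my choices, and the IDs come from generate-id(), so they are unique within a run but processor-dependent and not stable between runs.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- copy-all template: copy anything not matched elsewhere -->
  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <xsl:template match="line//text()">
    <!-- inside analyze-string the context item is a string, not a node,
         so keep hold of the text node for generate-id() -->
    <xsl:variable name="textNode" select="."/>
    <xsl:analyze-string select="." regex="(\w+|;+)">
      <xsl:matching-substring>
        <!-- position() counts matching and non-matching substrings within
             this text node, so node id plus position is unique -->
        <w xml:id="{concat('w-', generate-id($textNode), '-', position())}"><xsl:value-of select="."/></w>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>

If you need predictable IDs (w1, w2, and so on) a second pass over the result that numbers the <w> elements is probably the better route; the version above just avoids needing that extra pass.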

-James


4 Responses to “adding word-level markup”

  1. Lou says:

    Marking up a semicolon as a word is just plain evil. We went to all that trouble to give you , why not use it?

  2. James Cummings says:

    The point wasn’t necessarily to mark a semi-colon as a <w> but just to show that you could use analyze-string to wrap around not just things that match the \w+ regex but ‘some other random string’ as well. It was an arbitrary hypothetical example. But yes, indeed, you could segment punctuation-things into <punc> if you wanted or do multiple passes. In XSLT2 the things I tend to think are ‘really cool’ include for-each-group’ing, creating hierarchies in variables that you can then address later (dispensing with result tree fragment problems), result-documents, collection(), and analyze-string/regex stuff.

  3. leighman says:

    Any way to keep character entities intact when doing this?

  4. jamesc says:

    @leighman: Good question. I don’t _think_ so in that the unicode character reference will be processed just as if it is that character. The only way I can think to do it is to have a pass before which escapes those characters somehow. I’d ask on the xsl-list.
