Rehdon, snail, and others have recently asked me how to mark up the words inside an element when that element may itself contain markup (sometimes wrapping more than one word), so I thought I’d write it up.
So, for example, we might have an XML file that looks like:
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <line>This is a test</line>
  <line>Only a <seg type="foo">test</seg> ok?</line>
  <line>And <seg>so; is</seg> this as well.</line>
</root>
Let’s say we want to mark up each of the whitespace-separated words, and for some reason the randomly added semicolons, as words with a <w> element. What we can use is <xsl:analyze-string> and a regex. For example:
<xsl:template match="line//text()">
  <xsl:analyze-string regex="(\w+|;+)" select=".">
    <xsl:matching-substring>
      <w><xsl:value-of select="."/></w>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
In this example we’re matching any text() node at any depth inside a <line> element. Each stretch of that text that matches the \w+ regex (or is a run of semicolons) gets wrapped in a <w> element. Anything that doesn’t match, such as whitespace and other punctuation, is output unchanged. Because the pattern is line//text() (as opposed to line/text()) it also applies to text inside child elements, grandchild elements and further down.
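To make the difference concrete, here is a sketch of what the second line of the sample would look like if the match pattern were line/text() instead, assuming everything else is simply copied through unchanged. The word inside <seg> is left unwrapped, because that text node is a grandchild of <line> rather than a child:

```xml
<line><w>Only</w> <w>a</w> <seg type="foo">test</seg> <w>ok</w>?</line>
```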
So assuming we have a copy-all template something like:
<xsl:template match="@*|node()" priority="-1">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
(where we basically copy any nodes and attributes unless something else matches them) then we should get the result:
<root>
  <line><w>This</w> <w>is</w> <w>a</w> <w>test</w></line>
  <line><w>Only</w> <w>a</w> <seg type="foo"><w>test</w></seg> <w>ok</w>?</line>
  <line><w>And</w> <seg><w>so</w><w>;</w> <w>is</w></seg> <w>this</w> <w>as</w> <w>well</w>.</line>
</root>
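For anyone wanting to try this out, the two templates can be dropped into a stylesheet along the following lines. This is only a minimal sketch: the xsl:output settings are my own assumption rather than anything required, and since xsl:analyze-string was introduced in XSLT 2.0 you will need a 2.0 (or later) processor such as Saxon to run it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: no re-indenting, so the whitespace between words
       is passed through exactly as it was in the input -->
  <xsl:output method="xml" indent="no"/>

  <!-- Copy-all template: copies any node or attribute that no
       higher-priority template matches -->
  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Wrap each word (or run of semicolons) in any text node
       at any depth below line -->
  <xsl:template match="line//text()">
    <xsl:analyze-string select="." regex="(\w+|;+)">
      <xsl:matching-substring>
        <w><xsl:value-of select="."/></w>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The whitespace-only text nodes between <root> and the <line> elements are not inside a <line>, so they fall through to the copy-all template and are left alone.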
Of course that is only the beginning, as your documents will probably have weird special cases and punctuation that you want to handle differently. It would also be useful to create an @xml:id attribute for each word element.
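As a rough sketch of the @xml:id idea, one possibility is to combine generate-id() on the text node with position() inside the matching substring. The variable name textNode and the shape of the resulting identifiers are purely my own illustration. Note also that generate-id() values depend on the processor and are not guaranteed to be the same from one run to the next, so for stable identifiers you would want a numbering scheme of your own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy-all template, as before -->
  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="line//text()">
    <!-- Inside xsl:analyze-string the context item is a string, not a node,
         so remember the text node here in order to call generate-id() on it -->
    <xsl:variable name="textNode" select="."/>
    <xsl:analyze-string select="." regex="(\w+|;+)">
      <xsl:matching-substring>
        <!-- position() counts matching and non-matching substrings alike,
             so the numbers are unique within this text node but can skip values -->
        <w xml:id="{concat(generate-id($textNode), '-', position())}">
          <xsl:value-of select="."/>
        </w>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Since generate-id() returns a string that starts with a letter and is unique to each node, appending a hyphen and a number should give a valid and unique xml:id for every <w>.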