Date sent: Thu, 25 Feb 1999 14:32:14 -0500
Send reply to: Historical documents on-line <E-DOCS@LISTSERV.UIC.EDU>
From: Lloyd Benson <Lloyd.Benson@furman.edu>
Subject: E-DOCS: (long) sgrep: utility for finding things in marked-up docs (T. Horton, x-ELTA)
To: E-DOCS@LISTSERV.UIC.EDU
[Editor's note: this message has been cross-posted from the ELTA textual analysis list with the author's permission.]
Those of you with SGML/XML files who want to find things based on mark-up and words should know about sgrep, which stands for "structured" grep. (grep is a standard UNIX utility that finds lines that match patterns defined as regular expressions.)
[Note: this is a long message, with more and more detail and examples as you go along. Stop reading when you learn more than you want to know!]
sgrep is a free command-line utility for finding patterns in structured text, where "structure" is pretty flexible. It has good support for SGML/XML: although it doesn't include a parser, it's "markup aware" and has operators to find elements, attributes, etc.
To read about or get a copy of sgrep, see the Web page:
http://www.cs.helsinki.fi/~jjaakkol/sgrep.html
The latest alpha version has much-improved SGML/XML support, but it's got some bugs in it and isn't documented that well. But get that one and try it, and then wait for them to fix it. There are executables for many UNIX systems and for Windows. (I've had better luck with the Windows version.)
sgrep patterns match regions in a file, where a region is just a start position through an end position. When you give it a pattern to find, it returns a region-set that includes all the regions that match your pattern. So this query:
"mother"
returns a set of regions, each of which is the location/region of the string "mother" in your text. This query:
"<SP>" .. "</SP>"
returns all SP elements in a file. But what if they're nested, you ask? sgrep understands nesting, and returns nested elements as you'd expect them, in much the same way that parentheses are nested and balanced in a math expression.
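To picture the parenthesis-style nesting, here is a rough sketch in Python. It is only a toy model of what a query like "<SP>" .. "</SP>" returns, not how sgrep is implemented; the function names find_all and between are made up for illustration, and regions are assumed to be (start, end) character offsets with the end exclusive.

```python
def find_all(text, needle):
    """Return the start offset of every occurrence of needle in text."""
    starts, pos = [], text.find(needle)
    while pos != -1:
        starts.append(pos)
        pos = text.find(needle, pos + 1)
    return starts


def between(text, open_str, close_str):
    """Toy model of `open .. close`: pair each closer with the nearest
    unmatched opener before it, the way parentheses are balanced."""
    events = [(p, 0, len(open_str)) for p in find_all(text, open_str)]
    events += [(p, 1, len(close_str)) for p in find_all(text, close_str)]
    stack, regions = [], []
    for pos, kind, length in sorted(events):
        if kind == 0:
            stack.append(pos)          # remember where this opener starts
        elif stack:
            regions.append((stack.pop(), pos + length))
    return sorted(regions)


text = "<SP>a<SP>b</SP>c</SP>"
for start, end in between(text, "<SP>", "</SP>"):
    print(text[start:end])
```

The outer speech and the speech nested inside it come back as two separate regions.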
Now, what's even more exciting (can you take all this excitement? :-) ) is that sgrep lets you combine region-sets with other region-sets in all kinds of ways. There is a full-blown algebra for region sets. Any query can be combined with any other, so you can refine this to find those same speeches in Act I, Sc. 3. Or any in prose (if this is marked as an attribute of some element), etc. It's amazing.
Well, let me show examples at the very end of this, but for now I'll just say it's straight-forward (if very wordy) to create queries that find:
Elements that contain words, other elements, etc.
Elements that have a certain attribute, an attribute/value pair, or just an any-attribute/specific-value pair.
Larger contexts that surround any other kind of search. (See below.)
Many different kinds of combinations of these: regions that include other matches, all regions that do NOT include a match, regions that overlap (or don't) regions resulting from a previous query.
Above, when I said "larger contexts", I meant that when you find something, you can say "give me the DIV1 that encloses this thing I just found", or "give me the parent element of what I just found".
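The combinations listed above can be pictured as simple operations on lists of regions. The Python below is a hedged sketch of that idea only: the helper names (spans, encloses, containing, not_containing, within) are invented here, regular expressions stand in for sgrep's own matching, and a region is assumed to "contain" another when it covers the same or a larger span, which may not match sgrep's exact rule.

```python
import re


def spans(pattern, text):
    """Regions (start, end) of every non-overlapping regex match."""
    return [m.span() for m in re.finditer(pattern, text)]


def encloses(a, b):
    """True when region a covers region b (end offsets are exclusive)."""
    return a[0] <= b[0] and b[1] <= a[1]


def containing(outer, inner):
    """Outer regions that enclose at least one inner region."""
    return [a for a in outer if any(encloses(a, b) for b in inner)]


def not_containing(outer, inner):
    """Outer regions that enclose no inner region at all."""
    return [a for a in outer if not any(encloses(a, b) for b in inner)]


def within(inner, outer):
    """Inner regions that lie inside some outer region (like `in`)."""
    return [b for b in inner if any(encloses(a, b) for a in outer)]


text = ('<SP><SPEAKER>La-La</SPEAKER><L N="1">Ma ta dada wa.</L></SP>'
        '<SP><SPEAKER>Po</SPEAKER><L N="2">Hal-lo!</L></SP>')
speeches = spans(r"<SP>.*?</SP>", text)
lala = spans(r"La-La", text)
lines = spans(r"<L .*?</L>", text)

hits = containing(speeches, lala)            # the "larger context" of a match
print([text[s:e] for s, e in hits])
print([text[s:e] for s, e in within(lines, hits)])
```

Because every operation takes region lists and returns a region list, the result of one step can be fed straight into the next.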
Caveats:
So, before we go on to the longer example, I suggest you have a look if you need this and you can live with these issues. For my part, I've found the whole approach very interesting, and it's changing some of my ideas about how to build TA tools for every-day users. So I'm hoping to borrow some ideas from sgrep and use them so we all can build more friendly tools that have its power. More on this later.
I'm no sgrep expert, but if anyone has questions, email me or the list.
Tom
P.S. If you want to go look at it, read the following. The example comes after that.
Suggestion: To learn about it, I would read the first 6 pages of the paper "Using Sgrep for Querying Structured Text Files." Then I'd look at the example queries link on the Web page. Then I'd skip straight to the README file for the latest version, and scan that to find the SGML operators sections.
Warning: The Win32 binary of the latest alpha version seems to have a bug in how it handles double-quotes in queries given on the command line. It's much better to put queries (and query macros) into a file and use the -f option (or perhaps the -e option; you'll see examples of this in the README file). In particular, the examples below assume all these queries are in a file; from the command line, I'd have to precede each double-quote with a backslash. (Yuck!)
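One way to sidestep the quoting problem from a script is to write the query to a file and pass sgrep an argument list rather than a shell string. The Python below is a minimal sketch under assumptions: the file name speeches.sgrep and the helper names are hypothetical, the options (-o, -g xml, -f) are simply the ones used in this message, and sgrep has to be installed for run_query to work.

```python
import subprocess
from pathlib import Path


def write_query(path, query):
    """Save a query to a file so no shell ever sees its double-quotes."""
    Path(path).write_text(query + "\n", encoding="utf-8")
    return str(path)


def sgrep_command(query_file, input_file):
    """Argument list mirroring: sgrep -o"%r\\n" -g xml -f FILE INPUT"""
    return ["sgrep", r"-o%r\n", "-g", "xml", "-f", query_file, input_file]


def run_query(query_file, input_file):
    """Run sgrep and return its output (needs sgrep on the PATH)."""
    result = subprocess.run(sgrep_command(query_file, input_file),
                            capture_output=True, text=True, check=True)
    return result.stdout


query = '"<SP" .. "</SP>" containing ("<SPEAKER" .. "</SPEAKER>" containing "La-La")'
cmd = sgrep_command(write_query("speeches.sgrep", query), "lala.xml")
print(cmd)
```

Passing a list means the quotes inside the query never need backslashes.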
Note: if you are not from the US or the UK, you may not have heard of Teletubbies, a popular children's show for toddlers under 2 years old. You may also not have heard that a well-known conservative religious leader here in the States recently announced that he thought that one of the characters, Tinky Winky, the purple Teletubby, depicted a gay character and thus the show was not suitable for America's toddlers. This has been a great source of inspiration for America's comics, as you can imagine! I am not doing such an analysis, really! I just thought you all might be tired of all those Shakespeare or religious text examples. :-)
So a sample of our corpus might look like this:
<VB/>
<SP><SPEAKER>La-La</SPEAKER><L N="1">Ma ta dada wa.</L></SP>
<SP><SPEAKER>Po</SPEAKER><L N="2">Hal-lo!</L></SP><VE/>
<VB/>
<SP><SPEAKER>La-La</SPEAKER><L N="3">Wa ti noo-noo?</L></SP>
<SP><SPEAKER>Tinky Winky</SPEAKER><L N="4">Big hug!</L></SP><VE/>
I have a file called macros.sgrep in which I put some macros that serve as shorthand for certain queries. So first I want to find lines. My query would be:
"<L" .. "</L>"
which finds all the regions between these two strings. Results:
<L N="1">Ma ta dada wa.</L>
<L N="2">Hal-lo!</L>
<L N="3">Wa ti noo-noo?</L>
<L N="4">Big hug!</L>
To be honest, I should report I put that query into the file macros.sgrep, and then I typed:
sgrep -o"%r\n" -g xml -f macros.sgrep lala.xml
In later steps, if I were going to reuse this a lot, I'd define it as a macro in the file like this:
define(LINES,("<L" .. "</L>"))
and then I could type:
sgrep -o"%r\n" -g xml -f macros.sgrep -e LINES lala.xml
The README file shows you some very useful macros for SGML. (In particular, you'll want the ELEMENT macro which would let me say ELEMENT("L") to get all the lines. I won't use that here.)
To find speeches, my query would be:
"<SP" .. "</SP>"
To find SPEAKER elements with La-La as their content, the query would be:
"<SPEAKER" .. "</SPEAKER>" containing "La-La"
which produces:
<SPEAKER>La-La</SPEAKER>
<SPEAKER>La-La</SPEAKER>
Not too interesting, but what about getting the speeches that include this result (two regions)? Either of these two queries:
elements parenting ("<SPEAKER" .. "</SPEAKER>" containing "La-La")
or
"<SP" .. "</SP>" containing ("<SPEAKER" .. "</SPEAKER>" containing "La-La")
will do; each produces:
<SP><SPEAKER>La-La</SPEAKER><L N="1">Ma ta dada wa.</L></SP>
<SP><SPEAKER>La-La</SPEAKER><L N="3">Wa ti noo-noo?</L></SP>
Note that my markup uses milestones for verses, which are not hierarchical with the speech tags. I can find verses with this query:
"<VB/" .. "<VE/>"
And we can combine this with anything else. Say we want the verses in which Tinky Winky (our suspected corrupter of America's toddlers) speaks. The query:
"<VB/" .. "<VE/>" containing ("<SPEAKER" .. "</SPEAKER>" containing "Tinky Winky")
produces:
<VB/>
<SP><SPEAKER>La-La</SPEAKER><L N="3">Wa ti noo-noo?</L></SP>
<SP><SPEAKER>Tinky Winky</SPEAKER><L N="4">Big hug!</L></SP><VE/>
For convenience's sake, I'll create a macro in the file I'm using, macros.sgrep, so I don't have to type this query in:
define(TW-VERSES,("<VB/" .. "<VE/>" containing ("<SPEAKER" .. "</SPEAKER>" containing "Tinky Winky")))
So now I can just use TW-VERSES to find Tinky Winky's verses. But I'm interested in his word usage, so I just want all the lines in this verse. The query is:
"<L" .. "</L>" in TW-VERSES
and the result is:
<L N="3">Wa ti noo-noo?</L>
<L N="4">Big hug!</L>
And I really just want the words. The query is:
"<L" .. "</L>" in TW-VERSES extracting "<"..">"
The pattern "<"..">" finds regions set off by angle brackets, and the sgrep extracting operator removes from the left-hand set of regions all the regions on the right-hand side. So the result is:
Wa ti noo-noo?
Big hug!
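The extracting step can be pictured as cutting holes out of regions. Here is a small Python sketch of that idea, assuming regions are (start, end) offsets with the end exclusive; the function name extracting and the regular expressions are stand-ins chosen for illustration, not sgrep's implementation.

```python
import re


def extracting(regions, holes):
    """Toy model of `extracting`: cut every hole out of every region and
    return the leftover pieces as (start, end) offsets."""
    pieces = []
    for start, end in regions:
        cursor = start
        for h_start, h_end in sorted(holes):
            if h_end <= cursor or h_start >= end:
                continue               # hole is outside what is left
            if h_start > cursor:
                pieces.append((cursor, h_start))
            cursor = max(cursor, h_end)
        if cursor < end:
            pieces.append((cursor, end))
    return pieces


text = '<L N="3">Wa ti noo-noo?</L><L N="4">Big hug!</L>'
lines = [m.span() for m in re.finditer(r"<L .*?</L>", text)]
tags = [m.span() for m in re.finditer(r"<[^>]*>", text)]   # like "<"..">"
for start, end in extracting(lines, tags):
    print(text[start:end])
```

What is left after the tags are cut away is just the words of each line.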
OK, now we're getting somewhere. At this point perhaps I can feed this into something that counts words, looks for patterns, or something that will reveal the secrets about whether or not Tinky Winky is, well, you know.
Let me quickly show you how sgrep supports attributes and attribute value pairs, because several have asked about that. There's an "attribute" operator. To find all attributes N in any element, the query is:
attribute("N")
and the results are:
N="1"
N="2"
N="3"
N="4"
Note that if this were something like LANG, we might want all of them, no matter what element. But here let's say we use N in lots of places, and we just want lines, the <L> element. Our query:
"<L".."</L>" containing attribute("N")
and the result is:
<L N="1">Ma ta dada wa.</L>
<L N="2">Hal-lo!</L>
<L N="3">Wa ti noo-noo?</L>
<L N="4">Big hug!</L>
What if we just wanted the element with N=3 set? The query uses the sgrep "attvalue" operator, like this:
"<L".."</L>" containing attribute("N") containing attvalue("3")
which produces:
<L N="3">Wa ti noo-noo?</L>
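To see how the two filters narrow the result, here is a rough Python imitation. Everything in it is an assumption made for illustration: the helper names are invented, a regular expression stands in for sgrep's attribute and attvalue operators, the value region is taken to exclude the quotes, and the two "containing" steps are applied left to right as the query is written.

```python
import re

ATTR = re.compile(r'(\w+)="([^"]*)"')   # name="value" pairs


def attributes(text, name):
    """Regions of name="value" pairs with the given attribute name."""
    return [m.span() for m in ATTR.finditer(text) if m.group(1) == name]


def attvalues(text, value):
    """Regions of attribute values (without quotes) equal to value."""
    return [m.span(2) for m in ATTR.finditer(text) if m.group(2) == value]


def containing(outer, inner):
    """Outer regions that enclose at least one inner region."""
    return [a for a in outer
            if any(a[0] <= b[0] and b[1] <= a[1] for b in inner)]


text = '<L N="1">Ma ta dada wa.</L><L N="3">Wa ti noo-noo?</L>'
lines = [m.span() for m in re.finditer(r"<L .*?</L>", text)]

step1 = containing(lines, attributes(text, "N"))   # lines that have an N
step2 = containing(step1, attvalues(text, "3"))    # ...and a value of 3
print([text[s:e] for s, e in step2])
```

Note that in this reading the value 3 could belong to any attribute inside the line, not only N; with a single attribute per line, as here, the distinction does not matter.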
So you can see how you combine things in interesting ways to, say, find all elements that contain an attribute/value pair.
OK, enough! Hope this has been useful. Comments or corrections or suggestions by email to me, please.
-------------------------------------------------------------------------