
How to read between xml tags?

From: Anthony Liu <anto...@yahoo.com>
Tue, 9 Mar 2004 21:28:15 -0800 (PST)
I have a news corpus that looks like the following.  I
want to do a statistical survey of the words used in
the news report per se.  So, I must not consider those
words in the XML tags.

I know that we can use sgmllib to strip the SGML
tags.  But what I want is this:

1. The read operation must either read a full tag or
ignore the tag.

2. If the read operation reads between <P> and </P>,
then it must read the whole thing between those two
tags all at once.

How can I achieve this please?


<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<DATELINE>
March 09, 2004
</DATELINE>
<TEXT>
<P>
Here comes the first paragraph. There might be more
than one new line characters ('\n') in each paragraph.
</P>
<P>
And here is the second paragraph.
</P>
<P>
This is the third paragraph. Please note that the news
articles do not necessarily have the same number of
paragraphs.
</P>
</TEXT>
</DOC>
<DOC id="XIN19910101.0053" type="story">
<HEADLINE>
This is another news report
</HEADLINE>
<DATELINE>
......
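
One possible way to meet the two requirements above is sketched here. It is
not a definitive answer. It assumes the whole file fits in memory, that the
paragraph tags are always written exactly as <P> and </P> with no attributes,
and that paragraphs are never nested. The names iter_paragraphs and
word_counts are made up for illustration. A regular expression with re.DOTALL
matches each <P>...</P> block as a single unit, so a tag is either consumed
whole or not at all, and headline and dateline text is left out of the counts.

```python
import re
from collections import Counter

# One whole <P>...</P> block; DOTALL lets '.' match newlines too.
# Assumption: tags appear exactly as <P> and </P>, without attributes.
PARAGRAPH_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)
# Simplistic notion of a "word": runs of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def iter_paragraphs(text):
    """Yield the text of each <P>...</P> block with whitespace collapsed."""
    for match in PARAGRAPH_RE.finditer(text):
        yield " ".join(match.group(1).split())


def word_counts(text):
    """Count lowercased words that occur inside <P>...</P> blocks only."""
    counts = Counter()
    for paragraph in iter_paragraphs(text):
        counts.update(word.lower() for word in WORD_RE.findall(paragraph))
    return counts


sample = """<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<TEXT>
<P>
Here comes the first paragraph. There might be more
than one new line characters in each paragraph.
</P>
<P>
And here is the second paragraph.
</P>
</TEXT>
</DOC>
"""

paragraphs = list(iter_paragraphs(sample))
counts = word_counts(sample)
```

For a real corpus file, the text could be read with open(path).read() and
passed to word_counts. If the files are too large for that, or the markup is
less regular than assumed, an event-driven SGML/HTML parser would be a safer
choice than a regular expression.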

