I have a news corpus that looks like the following. I
want to do a statistical survey of the words used in
the news reports themselves, so I must not count the
words that appear inside the XML tags.
I know that we can use sgmllib to strip the SGML
tags. But what I want is this:
1. The read operation must either read a full tag or
ignore the tag.
2. If the read operation reads between <P> and </P>,
then it must read everything between those two
tags all at once.
How can I achieve this please?
<DOC id="XIN19910101.0052" type="story">
<HEADLINE>
This is the news headline
</HEADLINE>
<DATELINE>
March 09, 2004
</DATELINE>
<TEXT>
<P>
Here comes the first paragraph. There might be more
than one new line characters ('\n') in each paragraph.
</P>
<P>
And here is the second paragraph.
</P>
<P>
This is the third paragraph. Please note that the news
articles do not necessarily have the same number of
paragraphs.
</P>
</TEXT>
</DOC>
<DOC id="XIN19910101.0053" type="story">
<HEADLINE>
This is another news report
</HEADLINE>
<DATELINE>
......
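
To make the two requirements concrete, here is a minimal sketch of the
behaviour I am after. It uses a regular expression instead of sgmllib,
and it assumes that <P> blocks are never nested and that the tags appear
exactly as in the sample. The names iter_paragraphs and count_words are
hypothetical, made up only for this illustration.

```python
import re
from collections import Counter

# Matches one whole <P>...</P> block; DOTALL lets '.' span newlines,
# so a paragraph is always captured in one piece.
P_BLOCK = re.compile(r"<P>(.*?)</P>", re.DOTALL)
# Simplifying assumption: a "word" is a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def iter_paragraphs(corpus_text):
    """Yield the text of each <P>...</P> block as a single string."""
    for match in P_BLOCK.finditer(corpus_text):
        # Collapse newlines and runs of whitespace into single spaces.
        yield " ".join(match.group(1).split())


def count_words(corpus_text):
    """Count lower-cased words found inside <P> blocks only."""
    counts = Counter()
    for paragraph in iter_paragraphs(corpus_text):
        counts.update(word.lower() for word in WORD.findall(paragraph))
    return counts
```

With this, text outside <P>...</P> (tags, headline, dateline) is never
counted. The sketch expects the whole corpus as one string, which may
not be practical for a very large file.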