In the previous post on the POI framework I discussed reading Word documents that contain only paragraphs of text. That is not very useful in practice, because many documents also contain formatting, images, tables and so on.
For example, consider a document with images in between the text paragraphs. Every object embedded in the document, even an image, can be accessed as a paragraph, so images become part of the paragraph list too. When reading the document from code we are sometimes not interested in the images. Using the following code we get only the text contained in the paragraphs.
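import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// The path to the .doc file goes in the empty string below.
// try-with-resources closes the stream even if reading fails.
try (FileInputStream fis = new FileInputStream(""))
{
    HWPFDocument doc = new HWPFDocument(fis);
    Range range = doc.getRange();
    int numOfParas = range.numParagraphs();
    for (int i = 0; i < numOfParas; i++)
    {
        Paragraph para = range.getParagraph(i);
        // Skip paragraphs that have no text left after trimming,
        // for example paragraphs that hold only an image.
        if (!para.text().trim().isEmpty())
        {
            System.out.println(para.text());
        }
    }
}
catch (FileNotFoundException fnfe)
{
    // The file path is wrong or the file cannot be opened.
    fnfe.printStackTrace();
}
catch (IOException ioe)
{
    // The file could not be read or parsed as a Word document.
    ioe.printStackTrace();
}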
This is the same code we used to read simple Word documents without any images. We only added one condition before printing the text. If a paragraph contains only an image, it has no visible text, so its text is empty after trimming, the condition fails, and the paragraph is skipped.
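To see the trimming check on its own, here is a minimal sketch that runs without POI. The class ParagraphTextFilter and its two methods are hypothetical helpers written for this illustration, and the sample strings are only an assumption about what para.text() might return: text followed by a paragraph mark, and an image-only paragraph made of a control character such as \u0001. The point it shows is that String.trim() removes every leading and trailing character up to and including the space character (U+0020), which covers control characters as well as ordinary whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphTextFilter {

    // Hypothetical helper, not part of POI: true when something is left
    // after trim() strips leading/trailing characters <= U+0020.
    static boolean hasVisibleText(String paragraphText) {
        return !paragraphText.trim().isEmpty();
    }

    // Keeps only the paragraphs that pass the check above.
    static List<String> keepTextParagraphs(List<String> paragraphTexts) {
        List<String> kept = new ArrayList<>();
        for (String text : paragraphTexts) {
            if (hasVisibleText(text)) {
                kept.add(text.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for para.text() results.
        List<String> sample = Arrays.asList(
                "First paragraph\r",   // ordinary text paragraph
                "\u0001\r",            // assumed image-only paragraph
                "\r",                  // empty paragraph
                "Second paragraph\r"); // ordinary text paragraph

        for (String text : keepTextParagraphs(sample)) {
            System.out.println(text);
        }
    }
}
```

With this sample data only the two text paragraphs are printed. Note that String.isBlank() from Java 11 is not a drop-in replacement here, because it tests for whitespace only and would treat a control character such as \u0001 as visible text.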