Saturday, September 27, 2014

Reading documents containing Images with POI

In the previous post on the POI framework I discussed reading Word documents that contain only text paragraphs. That is not very useful on its own, because many documents also contain formatting, images, tables and so on.
For example, consider a document with images placed between text paragraphs. Every object embedded in the document, including an image, is accessed as part of a paragraph. When reading the document from code we are sometimes not interested in the images. The following code prints only the text contained in the paragraphs.

        // try-with-resources closes the input stream even if reading fails
        try (FileInputStream fis = new FileInputStream(""))   // path to the .doc file goes here
        {
            HWPFDocument doc = new HWPFDocument(fis);
            Range range = doc.getRange();
            int numOfParas = range.numParagraphs();
            for(int i = 0; i < numOfParas; i++)
            {
                Paragraph para = range.getParagraph(i);
                // skip paragraphs that have no visible text (for example, image-only paragraphs)
                if(!para.text().trim().equalsIgnoreCase(""))
                {
                    System.out.println(para.text());
                }
            }
        }
        catch(FileNotFoundException fnfe)
        {
            fnfe.printStackTrace();
        }
        catch(IOException ioe)
        {
            ioe.printStackTrace();
        }

This is the same code used to read simple Word documents without images, with one condition added before the text is printed. If a paragraph contains only an image, it has no visible text, so calling .trim().equalsIgnoreCase("") on its text returns true and the paragraph is skipped.
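
To see how that condition behaves without opening a real document, here is a minimal sketch that applies the same test to plain strings. The class name ParagraphTextFilter, its two helper methods and the sample strings are hypothetical and not part of POI. The sample data assumes that paragraph text ends with a carriage return and that an image-only paragraph holds just a control-character placeholder (shown here as \u0001); treat both as assumptions about typical HWPF output rather than guarantees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphTextFilter {

    // Same test as in the post: true when nothing is left after trimming.
    // String.trim() removes leading and trailing characters up to U+0020,
    // which includes spaces, carriage returns and other control characters.
    static boolean isBlank(String paragraphText) {
        return paragraphText.trim().equalsIgnoreCase("");
    }

    // Keeps only the paragraphs that have visible text, in trimmed form.
    static List<String> keepTextParagraphs(List<String> rawParagraphs) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawParagraphs) {
            if (!isBlank(raw)) {
                kept.add(raw.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the values returned by para.text()
        List<String> raw = Arrays.asList(
                "First paragraph\r",
                "\u0001\r",          // assumed shape of an image-only paragraph
                "   \r",             // whitespace-only paragraph
                "Second paragraph\r");

        for (String text : keepTextParagraphs(raw)) {
            System.out.println(text);
        }
    }
}
```

Printing the trimmed text instead of the raw text also drops the trailing carriage return, which can be convenient when the output is written to a file or compared against other strings.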

Reading a Word document with Apache POI

Reading files is very easy in Java, and we can read any type of file. But each file type has its own formatting and features; a Word document and an Excel sheet, for example, are structured quite differently. We can use BufferedReader or any of the other readers available in Java, but these are not practical for documents that contain a lot of formatting.
Apache provides a sophisticated framework called POI for processing documents generated by MS Office applications such as Word and Excel. It ships as ready-made jars; we only have to include them in our classpath to use the functionality they provide.
The following example shows the very basic task of reading a Word document that contains a single paragraph with a single line.

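import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// The class name is only a placeholder so that the example compiles
public class ReadWordDocument
{
    public static void main(String[] args)
    {
        // try-with-resources closes the input stream even if reading fails
        try (FileInputStream fis = new FileInputStream("C:/Practice/Reading.doc"))
        {
            HWPFDocument doc = new HWPFDocument(fis);
            Range range = doc.getRange();
            int numOfParas = range.numParagraphs();
            // print the text of every paragraph in document order
            for(int i = 0; i < numOfParas; i++)
            {
                Paragraph para = range.getParagraph(i);
                System.out.println(para.text());
            }
        }
        catch(FileNotFoundException fnfe)
        {
            fnfe.printStackTrace();
        }
        catch(IOException ioe)
        {
            ioe.printStackTrace();
        }
    }
}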

HWPFDocument is the wrapper containing all the data structures of the Word document. The variable doc refers to an HWPFDocument instance that represents the document we opened. In this example the HWPFDocument constructor takes an InputStream on the file (here a FileInputStream), not a File or a path string. The variable range, of type Range, covers all the content of the document except the headers and footers, and through it we can read that content. The numParagraphs method on range gives the total number of paragraphs in the document, and the getParagraph method returns the paragraph at the given index.
