Saturday, November 15, 2014

Reading table data from a Word document with the Apache POI framework

In the previous posts on reading Word documents I explained how to use the POI framework to read plain text from a Word document. That is not always enough, because a document may also contain tables.
The POI framework has built-in support for reading table data. We can read a table row by row and cell by cell, and because POI returns the tables, rows and cells as Java collections, iterating over them is easy. The same code works no matter how many tables the document contains.
The Java code for reading table data is short. To read a table from a Word document we first need an input stream that opens the document. The code for this looks like

FileInputStream fis = new FileInputStream("");
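A stream opened this way should also be closed once we are done with it. The sketch below shows one way to do that with try-with-resources, using only the standard library. As a small bonus it checks whether the file starts with the ZIP signature "PK", since a .docx file is a ZIP archive; this is only a cheap sanity check, not a full validation. The class and method names (DocxStreamCheck, startsWithZipSignature, looksLikeDocx) are my own and are not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of POI: opens a file and checks its first two bytes.
final class DocxStreamCheck {

    // Returns true if the stream starts with "PK", the signature of a ZIP archive.
    static boolean startsWithZipSignature(InputStream in) {
        try {
            return in.read() == 'P' && in.read() == 'K';
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the file and closes the stream automatically, even if reading fails.
    static boolean looksLikeDocx(String fileName) {
        try (InputStream in = new FileInputStream(fileName)) {
            return startsWithZipSignature(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            System.out.println(looksLikeDocx(args[0]));
        }
    }
}
```

An old binary .doc file does not start with this signature, so a check like this can explain early why XWPFDocument refuses to load a file.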

Once we have the stream, we hand it to POI. Passing the FileInputStream object to the XWPFDocument constructor creates the Java object that represents the Word document.

The following line of code does this

XWPFDocument doc = new XWPFDocument(fis);

This object supports many methods for working with the Word document. In this post I will explain how to use it to read the data in tables. Calling the getTables method on the XWPFDocument object returns all the tables of the document that we loaded through the FileInputStream. The method returns them as a List, so we need a List variable to hold them. This looks like

List<XWPFTable> tables = doc.getTables();

With the list in hand, it is easy to iterate over all the tables of the Word document. You may wonder how a simple loop can cope with tables of different sizes. It can, because we never hard-code the number of rows or columns.

Each table returns its rows as a list, and each row returns its cells as a list, so the loops adapt to whatever size the table has. We can use Java's for-each loop for this. The code to iterate over the tables looks like

for ( XWPFTable table : tables )
{
    // work with one table here
}

The code inside the for loop above can use table (the current table from the list) as the object that represents the actual table in the Word document. The table object has many methods; one of them, getRows, returns the rows of the table

for ( XWPFTableRow row : table.getRows() )
{
}

We can use row, the local variable of the for loop above, to get each cell and, in turn, the data in that cell.

for ( XWPFTableCell cell : row.getTableCells() )
{
   System.out.print(cell.getText());
}
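The loop above prints the cell texts back to back, so neighbouring cells run together on the console. One way around this, sketched below under my own assumptions, is to collect the texts of a row and join them with tabs. Plain strings stand in for the values returned by cell.getText(), and the class and method names (RowFormatter, formatRow, formatTable) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of POI: formats cell texts as tab-separated lines.
final class RowFormatter {

    // Joins the cell texts of one row with tabs, without a trailing tab.
    static String formatRow(List<String> cellTexts) {
        return String.join("\t", cellTexts);
    }

    // Formats a whole table, one line per row.
    static String formatTable(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(formatRow(row));
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // Sample data standing in for the text of each cell.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "City"),
                Arrays.asList("Ravi", "Pune"));
        System.out.println(formatTable(rows));
    }
}
```

Joining the texts instead of printing a tab after every cell also avoids the extra tab at the end of each line.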

The getText method returns the text of the cell with all formatting removed.
The following code snippet reads the table data and prints it in tabular form on the Java console.
// Open the Word document and load it into POI
FileInputStream fis = new FileInputStream("D:/POI/ex1.docx");
XWPFDocument doc = new XWPFDocument(fis);
List<XWPFTable> tables = doc.getTables();
for ( XWPFTable table : tables )
{
    for ( XWPFTableRow row : table.getRows() )
    {
        for ( XWPFTableCell cell : row.getTableCells() )
        {
            System.out.print(cell.getText());
            System.out.print("\t");   // tab between cells
        }
        System.out.println("");       // new line after each row
    }
}
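For readers who want something they can compile directly, here is the same idea as a complete class with its imports. It is a sketch under a few assumptions: the class and method names (WordTableReader, readTables) are my own, the POI OOXML library is on the classpath, and the POI version is recent enough that XWPFDocument can be closed by try-with-resources. Instead of printing inside the loops, it first collects the cell texts into nested lists, which makes the result easier to reuse.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

// Hypothetical wrapper class for the snippet above.
final class WordTableReader {

    // Returns the text of every cell, grouped as tables -> rows -> cells.
    static List<List<List<String>>> readTables(XWPFDocument doc) {
        List<List<List<String>>> tables = new ArrayList<>();
        for (XWPFTable table : doc.getTables()) {
            List<List<String>> rows = new ArrayList<>();
            for (XWPFTableRow row : table.getRows()) {
                List<String> cells = new ArrayList<>();
                for (XWPFTableCell cell : row.getTableCells()) {
                    cells.add(cell.getText());
                }
                rows.add(cells);
            }
            tables.add(rows);
        }
        return tables;
    }

    public static void main(String[] args) throws IOException {
        // Path from the post; pass another path as the first argument to override it.
        String path = args.length > 0 ? args[0] : "D:/POI/ex1.docx";
        // Assumes XWPFDocument is Closeable, as it is in recent POI releases.
        try (FileInputStream fis = new FileInputStream(path);
             XWPFDocument doc = new XWPFDocument(fis)) {
            for (List<List<String>> table : readTables(doc)) {
                for (List<String> row : table) {
                    System.out.println(String.join("\t", row));
                }
                System.out.println(); // blank line between tables
            }
        }
    }
}
```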
This looks simple, but the code has one problem: the formatting of the data is lost. A cell may also contain multiple paragraphs, and even nested tables. We will see how to get the data without losing the formatting in the next post.
