Forum Discussion

vkumar_2
New Contributor
11 years ago

Getting exception while trying to read PDF file using pdfbox dll

I searched earlier posts and found this way of reading text from a PDF file by using DLL files:

- Add IKVM.GNU.Classpath.dll, PDFBox-0.7.3.dll as references

- Copy FontBox-0.1.0-dev.dll, IKVM.Runtime.dll into TC’s bin directory



I added the above DLLs accordingly and used the code below to get the text.



var filename = "C:\\iarf-carepricer.pdf";
var doc = dotNET.org_pdfbox_pdmodel.PDDocument.load(filename);
var pdfStripper = dotNET.org_pdfbox_util.PDFTextStripper.zctor();
var str = pdfStripper.getText_2(doc);





When I run it, I get the exception below at the last line.



System.NullReferenceException: Object reference not set to an instance of an object.
   at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
   at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
   at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
   at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
   at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)





Can someone please help?



Thanks,

Vikas

 






  • HKosova
    SmartBear Alumni (Retired)

    Hi Vikas,

    The complete PDFBox .NET binaries contain about 40 DLLs. Most likely, the 4 DLLs you mentioned aren't enough and .NET Framework cannot resolve some assembly dependencies. Try this:

      1. Download the complete and latest version of PDFBox .NET from here:
        http://www.squarepdf.net/pdfbox-in-net
        and unzip the archive.

      2. Unblock the DLLs: In Windows Explorer, right-click each DLL and select Properties. On the General tab, at the bottom, if there's an Unblock button, click it, then click OK.

      3. In TestComplete, go to Tools > Current Project Properties > CLR Bridge, click Browse Files and add pdfbox-<version>.dll.

     

    It works fine for me with this PDF:
    http://smartbear.com/SmartBear/media/pdfs/TestComplete-Datasheet.pdf

    function Test()
    {
      var strFileName = "C:\\Work\\TestComplete-datasheet.pdf";
      var doc = dotNET.org_apache_pdfbox_pdmodel.PDDocument.load(strFileName);
      var pdfStripper = dotNET.org_apache_pdfbox_util.PDFTextStripper.zctor_2();
      var str = pdfStripper.getText_2(doc);
      Log.Message("See Additional Info", str);
    }
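
    A slightly more defensive variant of the function above could look like the sketch below. It assumes the same PDFBox .NET build and the same overload names (zctor_2, getText_2) as in the example. The helper name extractPdfText and its api parameter are hypothetical: api stands in for TestComplete's dotNET object so the function can also be run with a stub. Calling close() through the CLR Bridge is an assumption based on the Java PDDocument API.

    ```javascript
    // Hypothetical helper: "api" stands in for TestComplete's dotNET object,
    // so the function can also be exercised outside TestComplete with a stub.
    function extractPdfText(api, fileName) {
      var doc = api.org_apache_pdfbox_pdmodel.PDDocument.load(fileName);
      if (doc == null) {
        // Fail with a readable message instead of a later NullReferenceException.
        throw new Error("PDDocument.load returned no document for " + fileName);
      }
      try {
        var stripper = api.org_apache_pdfbox_util.PDFTextStripper.zctor_2();
        return stripper.getText_2(doc);
      } finally {
        // Assumption: close() is callable through the CLR Bridge.
        // It releases the file even if text extraction throws.
        doc.close();
      }
    }
    ```

    In TestComplete, this would presumably be called as extractPdfText(dotNET, "C:\\Work\\TestComplete-datasheet.pdf") and the result passed to Log.Message.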
  • No idea if this will work in TestComplete ....



    In a previous workplace, we used QTPro and had to extract text from PDFs. We simply opened the PDF in Adobe Reader, activated its window, forced a Select All and Copy, then read the clipboard contents into a string in our script and worked with that.



    Primitive, but it worked well.



    No good on protected PDFs, though.
    • palatha3
      Occasional Contributor

      Hi,

       

      The example/solution given here seems interesting for PDF file comparisons, but I wanted to do it with VBScript. I would appreciate it if someone could help me with this. Thanks.

      • HKosova
        SmartBear Alumni (Retired)

        Hi palatha3,

         

        Here's a VBScript example:

        Sub Test
          Dim strFileName, doc, pdfStripper, str
          strFileName = "C:\Work\TestComplete-datasheet.pdf"
          Set doc = dotNET.org_apache_pdfbox_pdmodel.PDDocument.load(strFileName)
          Set pdfStripper = dotNET.org_apache_pdfbox_util.PDFTextStripper.zctor_2
          str = pdfStripper.getText_2(doc)
          Call Log.Message("See Additional Info", str)
        End Sub

         

        Before running this example, you need to do the following:

         

        1. Download PDFBox .NET from here:
          http://www.squarepdf.net/pdfbox-in-net
          and unzip the archive.

        2. Unblock the DLLs: In Windows Explorer, right-click each DLL and select Properties. On the General tab, at the bottom, if there's an Unblock button, click it, then click OK.
          Repeat for all DLLs.

        3. In TestComplete, go to Tools > Current Project Properties > CLR Bridge, click Browse Files and add pdfbox-<version>.dll.
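
        For the PDF comparison use case mentioned above, once the text of both files has been extracted with getText_2, the two strings can be compared line by line. The sketch below is one possible way to do that, not a built-in TestComplete feature. It is written in JavaScript to match the earlier example in this thread, and the same logic can be ported to VBScript. The names normalizeLine and diffPdfText are hypothetical, and ignoring whitespace differences is an assumption about what counts as "equal".

        ```javascript
        // Hypothetical helper: collapse runs of whitespace and trim the ends,
        // so that pure layout differences do not count as changes.
        function normalizeLine(line) {
          return line.replace(/\s+/g, " ").replace(/^ | $/g, "");
        }

        // Hypothetical helper: returns one entry per line that differs
        // between the two extracted texts (line numbers are 1-based).
        function diffPdfText(expectedText, actualText) {
          // Unify line endings first, then split on a plain "\n".
          var expected = expectedText.replace(/\r\n?/g, "\n").split("\n");
          var actual = actualText.replace(/\r\n?/g, "\n").split("\n");
          var count = Math.max(expected.length, actual.length);
          var differences = [];
          for (var i = 0; i < count; i++) {
            // A line missing from the shorter text is treated as empty.
            var e = i < expected.length ? normalizeLine(expected[i]) : "";
            var a = i < actual.length ? normalizeLine(actual[i]) : "";
            if (e !== a) {
              differences.push({ line: i + 1, expected: e, actual: a });
            }
          }
          return differences;
        }
        ```

        An empty result means the two texts match after whitespace normalization. Otherwise, each entry can be written to the test log.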
  • vkumar_2
    New Contributor
    Hi Tanya,



    Thanks for the reply.



    I tried what you suggested, but it seems it can only work for applications built on .NET, or I am not doing it right.



    var doc = Sys.Process("AcroRd32").AppDomain("AcroRd32.exe").dotNET.org_pdfbox_pdmodel.PDDocument.load(filename);



    Acrobat Reader is not a .NET-based application, so maybe it is not actually exposing the expected variable type.



    Thanks,

    Vikas
  • TanyaYatskovska
    SmartBear Alumni (Retired)

    Vikas, one more suggestion. Do the following:


     


      * Place the FontBox-0.1.0-dev.dll, IKVM.GNU.Classpath.dll and IKVM.Runtime.dll files in the same folder where PDFBox-0.7.3.dll is stored.


          DO NOT copy any of these to the <TestComplete 10>\Bin folder (since you previously copied them there, remove them).


     


      * Add only PDFBox-0.7.3.dll to your project's CLR Bridge in TestComplete 10.


     


    Let us know how it works.

  • vkumar_2
    New Contributor
    Hi Tanya,



    I am still getting the same error after trying this approach (removing the DLLs from Bin and linking only PDFBox-0.7.3).



    System.NullReferenceException: Object reference not set to an instance of an object.





    It looks like I will have to figure out another way of reading PDF files.



    Thanks,

    Vikas
  • TanyaYatskovska
    SmartBear Alumni (Retired)

    Hi Vikas,


     


    A similar question is discussed here. Helen posted many troubleshooting steps to make it work. Try performing them.