Released: Aug 19, 2010
Updates in V4.0
New features in PDFlib TET 4.0:
- Performance enhancements: faster processing for many classes of documents
- Higher speed and lower memory consumption for very large documents of up to hundreds of thousands of pages
- Extract right-to-left and bidirectional text for Arabic, Hebrew, etc.
- Unicode post-processing:
  - Foldings preserve, remove or replace characters
  - Decompositions replace a character with an equivalent sequence, e.g. replace narrow or vertical Japanese characters with their standard counterparts.
  - Text can be converted to all four Unicode normalization forms, e.g. emit NFC form to meet the requirements for Web text or a database.
- Improved shadow removal, word boundary detection, and dehyphenation
- Improved superscript and subscript detection
- Workarounds for non-conforming PDF documents to enhance robustness
- Enhanced repair mode for successfully extracting text from damaged PDF documents
- More information in TET's XML output (TETML), e.g. about dehyphenation, drop caps, shadows, and superscripts/subscripts
- Improved C++ and Perl language bindings
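
To illustrate what the Unicode post-processing items above mean in practice, here is a minimal sketch that uses Python's standard unicodedata module. It does not use the TET API or its option syntax. The helper names normalize_all and fold_soft_hyphen are hypothetical and exist only for this illustration, and the soft-hyphen folding is just one assumed example of a "remove" folding.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def fold_soft_hyphen(text):
    """Hypothetical folding: remove SOFT HYPHEN (U+00AD) from the text."""
    return text.translate({0x00AD: None})


# "e" followed by COMBINING ACUTE ACCENT (U+0301)
forms = normalize_all("e\u0301")
# NFC composes the pair into the single character U+00E9;
# NFD keeps the base letter and the combining mark separate.

# Compatibility decomposition: HALFWIDTH KATAKANA LETTER KA (U+FF76)
# maps to the standard KATAKANA LETTER KA (U+30AB) under NFKC.
standard_ka = unicodedata.normalize("NFKC", "\uff76")

# Folding that removes a character: drop the soft hyphen inside a word.
folded = fold_soft_hyphen("co\u00adop")
```

NFC is the form commonly expected for Web text and database storage, which is why the release notes single it out; the compatibility forms (NFKC, NFKD) additionally replace narrow and other variant characters with their standard counterparts.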
New features in PDFlib TET PDF IFilter 4.0:
- Takes advantage of the improved TET 4.0 kernel
- Automatic language detection for improved search results (find word stems, partial matches, etc.)
- Support for SharePoint 2010