PDFlib TET - .NET/COM/Java/Application - V2.3 - Summary

by PDFlib - Product type: Component / Application / .NET Class / ActiveX DLL / DLL / JavaBean

Summary

PDFlib TET by PDFlib

Text extraction toolkit. PDFlib TET (Text Extraction Toolkit) is software for reliably extracting text information from any PDF file. It is available as a library/component and as a command-line tool. PDFlib TET makes available the text contents of a PDF as Unicode strings or structured XML, plus detailed glyph and font information. With PDFlib TET you can retrieve the corresponding Unicode values for text in a PDF document, as well as its position on the page.

In addition to low-level text retrieval, TET contains advanced content analysis algorithms for determining word boundaries and removing redundant duplicate text (such as shadows and artificial bold). Using the auxiliary pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata and hypertext.

With PDFlib TET you can:

Extract text from PDF, e.g. to store it in a database

Implement a search engine for processing PDF

Convert the text content of PDF pages to XML for processing with other tools

Process PDFs based on their contents
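As a rough illustration of the search-engine use case, the following minimal sketch builds an inverted index over text that has already been extracted. It does not call TET; the names build_index and search and the sample page texts are assumptions made for this example only.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted page numbers that contain it.

    `pages` maps a page number to the text already extracted from that page.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Sample texts standing in for the output of a text extraction step.
pages = {
    1: "Text extraction toolkit for PDF documents",
    2: "Unicode text and glyph positions",
    3: "Command-line tool for batch extraction",
}
index = build_index(pages)
print(search(index, "text extraction"))
```

The same page-to-text mapping could equally be written to a database table instead of an in-memory dictionary.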

Supported PDF Input

PDFlib TET supports all relevant flavors of PDF input:

All PDF versions up to PDF 1.7 (Acrobat 8)

All font and encoding types: base 14 fonts, TrueType, PostScript, OpenType, CID fonts

Encrypted PDF with 40- and 128-bit encryption (appropriate permission settings or password required)

Unicode

Although text in PDF is usually not encoded in Unicode, PDFlib TET will normalize the text from a PDF document to Unicode:

TET converts all text contents to Unicode. In C the text is returned in UTF-8 or UTF-16 format; all other language bindings return native Unicode strings

Ligatures and other multi-character glyphs will be decomposed into a sequence of their constituent Unicode characters

Vendor-specific Unicode assignments (Private Use Area, PUA) are identified, and mapped to characters in the common Unicode area if possible

Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character
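The normalization steps listed above can be pictured with a small sketch built on Python's standard unicodedata module. This is not TET's implementation: the function names, the PUA_MAP table, and the choice of U+FFFD as the default replacement character are assumptions for illustration, and only the Latin ligatures U+FB00 to U+FB06 are handled.

```python
import unicodedata

# Assumed default; the replacement character is meant to be configurable.
REPLACEMENT = "\uFFFD"

# Hypothetical vendor table: Private Use Area code point -> ordinary Unicode.
PUA_MAP = {"\uE000": "\u2022"}


def is_private_use(ch):
    """True for code points in the Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def is_latin_ligature(ch):
    """True for the Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ...)."""
    return 0xFB00 <= ord(ch) <= 0xFB06


def normalize_text(text, pua_map=PUA_MAP, replacement=REPLACEMENT):
    """Decompose ligatures, map known PUA characters, replace unknown ones."""
    out = []
    for ch in text:
        if is_latin_ligature(ch):
            # Compatibility decomposition splits the ligature into its letters.
            out.append(unicodedata.normalize("NFKD", ch))
        elif is_private_use(ch):
            out.append(pua_map.get(ch, replacement))
        else:
            out.append(ch)
    return "".join(out)


print(normalize_text("\uFB01nal o\uFB03ce"))
```

Restricting the decomposition to the ligature range is a deliberate choice in this sketch: applying NFKD to the whole string would also alter characters that are not ligatures.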

Full CJK Support

TET includes full support for extracting Chinese, Japanese, and Korean text. All predefined CJK CMaps (encodings) are recognized; horizontal and vertical writing modes are supported.

Content Analysis and Word Identification

TET can be used to retrieve low-level glyph information, but also includes advanced algorithms for content analysis:

Detect word boundaries to retrieve words instead of characters

Recombine the parts of hyphenated words

Remove duplicate instances of text, e.g. shadow and artificial bold text

Recombine paragraphs into reading order

Reorder text which is scattered over the page

Reconstruct lines of text
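To make three of these steps concrete, here is a minimal sketch of duplicate removal, word boundary detection, and dehyphenation on glyph data for a single horizontal line. It is a simplified stand-in rather than TET's algorithm: the Glyph record, the tolerance tol, and the gap_factor threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position
    width: float


def drop_duplicates(glyphs, tol=0.5):
    """Drop glyphs that repeat an already kept glyph at almost the same spot.

    Shadow and artificial bold effects place the same character twice
    with a small offset.
    """
    kept = []
    for g in glyphs:
        duplicate = any(
            k.char == g.char and abs(k.x - g.x) <= tol and abs(k.y - g.y) <= tol
            for k in kept
        )
        if not duplicate:
            kept.append(g)
    return kept


def group_words(glyphs, gap_factor=0.3):
    """Split one line of glyphs into words wherever the horizontal gap is wide."""
    words, current, prev = [], [], None
    for g in sorted(glyphs, key=lambda item: item.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                words.append("".join(current))
                current = []
        current.append(g.char)
        prev = g
    if current:
        words.append("".join(current))
    return words


def dehyphenate(lines):
    """Join a line ending in a hyphen with a following lowercase continuation."""
    out = []
    for line in lines:
        if out and out[-1].endswith("-") and line[:1].islower():
            out[-1] = out[-1][:-1] + line
        else:
            out.append(line)
    return out


# "PDF lib" with a shadow copy of every glyph, offset by 0.3 units.
base = [("P", 0), ("D", 5), ("F", 10), ("l", 20), ("i", 25), ("b", 30)]
glyphs = []
for char, x in base:
    glyphs.append(Glyph(char, x, 0.0, 5.0))
    glyphs.append(Glyph(char, x + 0.3, -0.3, 5.0))

print(group_words(drop_duplicates(glyphs)))
print(dehyphenate(["text extrac-", "tion toolkit"]))
```

A real implementation would also have to handle rotated text, varying font sizes, and hyphens that belong to the word, which this sketch ignores.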

Geometry

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded from or included in the text extraction, e.g. to ignore headers, footers, or margins.
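The idea of including or excluding page areas can be sketched as a rectangle test on word positions. The Box class, the filter_words function, and the sample coordinates are assumptions for this example and are not part of the TET API; coordinates follow the default PDF convention of an origin in the lower left corner, on a page assumed to be 612 x 792 units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    bottom: float
    right: float
    top: float

    def contains(self, x, y):
        """True if the point (x, y) lies inside the box or on its border."""
        return self.left <= x <= self.right and self.bottom <= y <= self.top


def filter_words(words, include=None, exclude=()):
    """Keep words inside `include` (if given) and outside every `exclude` box.

    `words` is a list of (text, x, y) tuples.
    """
    kept = []
    for text, x, y in words:
        if include is not None and not include.contains(x, y):
            continue
        if any(box.contains(x, y) for box in exclude):
            continue
        kept.append(text)
    return kept


header = Box(0, 742, 612, 792)   # top 50 units of the assumed page
footer = Box(0, 0, 612, 50)      # bottom 50 units of the assumed page

words = [
    ("Annual Report", 72, 760),  # running header
    ("Body text", 72, 400),
    ("Page 3", 300, 30),         # running footer
]
print(filter_words(words, exclude=[header, footer]))
```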

pCOS Interface for Simple Access to PDF Objects

TET includes the pCOS (PDFlib Comprehensive Object System) interface for retrieving arbitrary PDF objects. With pCOS you can retrieve PDF metadata, hypertext, or any other information outside the actual page descriptions with a simple query interface without the need for low-level programming.

Programming and Performance

TET has been developed with portability, performance, and robustness in mind. TET is thread-safe for deployment in multi-threaded server applications. The core library is written in highly optimized C code for maximum performance and minimum overhead. Additional language bindings are available for COM, C, C++, Java, and .NET.

TET Command-Line Tool and TET Library

TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer the same base functionality, but are suited to different deployment tasks. Here are some guidelines for choosing between the two TET flavors:

The TET programming library can be used for integration into your desktop or server application. Examples for using the library with all supported language bindings are included in the TET package

The TET command-line tool is suited for batch processing PDF documents. It doesn’t require any programming, but offers command-line options which can be used to integrate it into complex workflows. The TET command-line tool can be used to convert PDF page content to an XML document with Unicode text, with or without character metrics
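The following sketch shows what an XML document with Unicode text, with or without character metrics, might look like when generated from already extracted words. The element and attribute names (document, page, word, x, y, width) are hypothetical and do not reflect the actual output format of the TET command-line tool.

```python
import xml.etree.ElementTree as ET


def words_to_xml(pages, with_metrics=True):
    """Serialize extracted words to an XML string.

    `pages` maps a page number to a list of (text, x, y, width) tuples.
    Position and width attributes are written only if `with_metrics` is true.
    """
    root = ET.Element("document")
    for number, words in pages.items():
        page_el = ET.SubElement(root, "page", number=str(number))
        for text, x, y, width in words:
            word_el = ET.SubElement(page_el, "word")
            word_el.text = text
            if with_metrics:
                word_el.set("x", f"{x:g}")
                word_el.set("y", f"{y:g}")
                word_el.set("width", f"{width:g}")
    # encoding="unicode" returns a str and keeps non-ASCII text unescaped.
    return ET.tostring(root, encoding="unicode")


sample = {1: [("Zürich", 72.0, 700.0, 40.5)]}
print(words_to_xml(sample, with_metrics=False))
print(words_to_xml(sample))
```

Keeping the metrics optional mirrors the choice described above: downstream tools that only need the text can work with the smaller output.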

TET Plugin

PDFlib TET Plugin is a free plugin for extracting text from PDF documents. The TET Plugin provides easy access to the PDFlib Text Extraction Toolkit (TET). Although the TET Plugin runs as an Acrobat plugin, the underlying text extraction does not use Acrobat functions, but is based entirely on TET. The TET Plugin is provided as a technology study to demonstrate the power of PDFlib TET.

PartNumbers: PC-517723-147965 517723-147965 PC-517723-147971 517723-147971 PC-517723-147972 517723-147972 PC-517723-147966 517723-147966 PC-517723-147973 517723-147973 PC-517723-147974 517723-147974 PC-517723-147967 517723-147967 PC-517723-147975 517723-147975 PC-517723-147976 517723-147976 PC-517723-147968 517723-147968 PC-517723-147977 517723-147977 PC-517723-147978 517723-147978 PC-517723-147969 517723-147969 PC-517723-147979 517723-147979 PC-517723-147980 517723-147980 PC-517723-147970 517723-147970 PC-517723-147981 517723-147981 PC-517723-147982 517723-147982

PurchaseOptions:

PDFlib TET V2.3 Windows Desktop Systems 1 User License for Windows 2000/XP/Vista

PDFlib TET V2.3 Windows Desktop Systems 1 User License for Windows 2000/XP/Vista, price per license from 5-9 licenses

PDFlib TET V2.3 Windows Desktop Systems 1 User License for Windows 2000/XP/Vista, price per license from 10 licenses

PDFlib TET V2.3 Mac OS X Desktop Systems 1 User License for Mac OS X PPC/Intel

PDFlib TET V2.3 Mac OS X Desktop Systems 1 User License for Mac OS X PPC/Intel, price per license from 5-9 licenses

PDFlib TET V2.3 Mac OS X Desktop Systems 1 User License for Mac OS X PPC/Intel, price per license from 10 licenses

PDFlib TET V2.3 Windows Server Systems 1 Server License for Windows 2000/2003/2008

PDFlib TET V2.3 Windows Server Systems 1 Server License for Windows 2000/2003/2008, price per license from 5-9 licenses

PDFlib TET V2.3 Windows Server Systems 1 Server License for Windows 2000/2003/2008, price per license from 10 licenses

PDFlib TET V2.3 Mac OS X Server Systems 1 Server License for Mac OS X Server PPC/Intel

PDFlib TET V2.3 Mac OS X Server Systems 1 Server License for Mac OS X Server PPC/Intel, price per license from 5-9 licenses

PDFlib TET V2.3 Mac OS X Server Systems 1 Server License for Mac OS X Server PPC/Intel, price per license from 10 licenses

PDFlib TET V2.3 Linux Server Systems 1 Server License for Linux x86/IA-64/x86_64/EM64T

PDFlib TET V2.3 Linux Server Systems 1 Server License for Linux x86/IA-64/x86_64/EM64T, price per license from 5-9 licenses

PDFlib TET V2.3 Linux Server Systems 1 Server License for Linux x86/IA-64/x86_64/EM64T, price per license from 10 licenses

PDFlib TET V2.3 FreeBSD Server Systems 1 Server License for FreeBSD on x86

PDFlib TET V2.3 FreeBSD Server Systems 1 Server License for FreeBSD on x86, price per license from 5-9 licenses

PDFlib TET V2.3 FreeBSD Server Systems 1 Server License for FreeBSD on x86, price per license from 10 licenses

Resources: Read the PDFlib TET Manual, Read the PDFlib TET Datasheet, Read the PDFlib TET License Agreement, Download the PDFlib TET V2.3 Windows evaluation onto your computer (fully functional, limited input size), Download the PDFlib TET V2.3 Mac OS X evaluation onto your computer (fully functional, limited input size), Download the PDFlib TET V2.3 Linux evaluation onto your computer (fully functional, limited input size), Download the PDFlib TET V2.3 FreeBSD evaluation onto your computer (fully functional, limited input size)

Operating System for Deployment: Windows Vista, Windows XP, Windows Server 2003, Windows 2000, Linux Kernel V2.4.x, FreeBSD 5.x, FreeBSD 6.x, Mac OS X

Architecture of Product: 32Bit, 64Bit

Product Type: Component, Application

Component Type: .NET Class, ActiveX DLL, DLL, JavaBean

Compatible Containers: Microsoft Visual Studio 2005, Microsoft Visual Studio .NET 2003, Microsoft Visual Studio 6.0, Microsoft Visual Basic 2005, Microsoft Visual Basic .NET 2003, Microsoft Visual Basic 6.0, Microsoft Visual C++ 2005, Microsoft Visual C++ .NET 2003, Microsoft Visual C++ 6.0, Microsoft Visual C# 2005, Microsoft Visual C# .NET 2003, Microsoft Visual J++ 6.0, Microsoft Visual InterDev 6.0, CodeGear C++ 5.0 (formerly Borland), C++Builder 2006, C++Builder 6, Delphi 2007 for Win32, Delphi 2006 (10.0), Delphi 2005 (9.0), Delphi 8.0, Delphi 7.0, Delphi 6.0, JBuilder 2006, JBuilder X, JBuilder 9, .NET Framework 2.0, .NET Framework 1.1

Keywords: PDFlib GmbH pdf Text word words processing textbox Conversion Convert converts converting Professional Partner extract text from PDF PDFs extraction extracted
