About PDFlib TET

Text Extraction Toolkit.

PDFlib TET (Text Extraction Toolkit) reliably extracts text, images and metadata from any PDF file. It is available as a library/component and as a command-line tool. PDFlib TET makes available the text contents of a PDF as Unicode strings or structured XML, plus detailed glyph and font information. With PDFlib TET you can retrieve the corresponding Unicode values for text in a PDF document, as well as its position on the page.

In addition to low-level text retrieval, TET contains advanced content analysis algorithms for determining word boundaries and removing redundant text (such as shadows and artificial bold). Using the auxiliary pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, hypertext, etc.

With PDFlib TET you can:

Extract text from PDF, e.g. to store it in a database
Implement a search engine for processing PDF
Convert the text content of PDF pages to XML for processing with other tools
Process PDFs based on their contents
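As a rough illustration of the XML use case above, the following sketch serializes already-extracted words and their page positions to XML with Python's standard library. It does not call TET and does not reproduce TET's own XML output; the element names `Document`, `Page`, and `Word` and the `(text, x, y)` tuple layout are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def words_to_xml(pages):
    """Serialize extracted words to XML.

    `pages` is a list with one entry per page; each entry is a list of
    (text, x, y) tuples. The schema is hypothetical, not TET's format.
    """
    root = ET.Element("Document")
    for number, words in enumerate(pages, start=1):
        page = ET.SubElement(root, "Page", number=str(number))
        for text, x, y in words:
            # Store the position as attributes and the text as element content.
            word = ET.SubElement(page, "Word", x=f"{x:.2f}", y=f"{y:.2f}")
            word.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(words_to_xml([[("Hello", 72, 700), ("world", 110, 700)]]))
```

Keeping the position in attributes leaves the text itself easy to query from other XML tools, for example with an XPath such as `Page/Word`.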

Supported PDF Input
PDFlib TET supports all relevant flavors of PDF input:

PDF 1.0 up to PDF 1.7 extension level 8 and PDF 2.0, corresponding to Acrobat 1-XI
All font and encoding types: base 14 fonts, TrueType, PostScript, OpenType, CID fonts
Encrypted PDF with 40- and 128-bit encryption (appropriate permission settings or password required)

Unicode
Although text in PDF is usually not encoded in Unicode, PDFlib TET will normalize the text from a PDF document to Unicode:

TET converts all text contents to Unicode. In C the text will be returned in the UTF-8 or UTF-16 formats, and as native Unicode strings in all other language bindings
Ligatures and other multi-character glyphs will be decomposed into a sequence of their constituent Unicode characters
Vendor-specific Unicode assignments (Private Use Area, PUA) are identified, and mapped to characters in the common Unicode area if possible
Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character
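The normalization steps listed above can be approximated with Python's standard `unicodedata` module. This is a minimal sketch, not TET's implementation: it assumes NFKC compatibility normalization as a stand-in for ligature decomposition, it only flags Private Use Area characters rather than trying to map them to common Unicode characters, and the replacement character U+FFFD is a choice made for this example.

```python
import unicodedata

# Assumed replacement character for this sketch; in TET it is configurable.
REPLACEMENT = "\uFFFD"


def is_private_use(ch):
    """True if the character belongs to a Unicode Private Use Area."""
    return unicodedata.category(ch) == "Co"


def normalize_extracted(text, replacement=REPLACEMENT):
    """Decompose ligatures and replace unmapped PUA characters."""
    # NFKC turns compatibility ligatures such as U+FB01 into "fi".
    decomposed = unicodedata.normalize("NFKC", text)
    return "".join(replacement if is_private_use(ch) else ch for ch in decomposed)


if __name__ == "__main__":
    print(normalize_extracted("\ufb01nal o\ufb03ce"))
```

Note that NFKC also folds other compatibility characters (for example full-width forms), which may or may not be wanted in a given workflow.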

Full CJK Support
TET includes full support for extracting Chinese, Japanese, and Korean text. All predefined CJK CMaps (encodings) are recognized; horizontal and vertical writing modes are supported.

Content Analysis and Word Identification
TET can be used to retrieve low-level glyph information, but also includes advanced algorithms for content analysis:

Detect word boundaries to retrieve words instead of characters
Recombine the parts of hyphenated words
Remove duplicate instances of text, e.g. shadow and artificial bold text
Recombine paragraphs into reading order
Reorder text which is scattered over the page
Reconstruct lines of text
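To make three of the steps above concrete (duplicate removal, word detection, and dehyphenation), here is a simplified sketch that operates on glyph positions. It is not TET's algorithm: the `Glyph` record, the distance tolerance, and the gap threshold are all assumptions chosen for illustration, and real documents need considerably more robust heuristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points


def drop_shadow_duplicates(glyphs, tolerance=0.5):
    """Drop a glyph if the same character was already placed within `tolerance`."""
    kept = []
    for g in glyphs:
        duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= tolerance
            and abs(k.y - g.y) <= tolerance
            for k in kept
        )
        if not duplicate:
            kept.append(g)
    return kept


def group_into_words(glyphs, gap_factor=0.3):
    """Split one line of glyphs into words wherever the horizontal gap is wide."""
    words, current, prev = [], "", None
    for g in sorted(glyphs, key=lambda item: item.x):
        if prev is not None and g.x - (prev.x + prev.width) > gap_factor * prev.width:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


def join_hyphenated(lines):
    """Join lines into one string, merging words split by a trailing hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line  # naive: also merges genuine hyphens
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

For example, a line whose first glyph is drawn twice with a small offset (a typical shadow effect) yields clean words once the duplicate is dropped before grouping.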

Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded from or included in the text extraction, e.g. to ignore headers, footers, or margins.
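The idea of including or excluding page areas can be sketched as a simple bounding-box filter over already-extracted words. This example does not use TET's options; the `Box` type, the `(text, x, y)` word tuples, and the coordinates (points, origin at the lower left of a 612 x 792 page) are assumptions for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    llx: float  # lower-left x
    lly: float  # lower-left y
    urx: float  # upper-right x
    ury: float  # upper-right y

    def contains(self, x, y):
        return self.llx <= x <= self.urx and self.lly <= y <= self.ury


def filter_words(words, include=None, exclude=()):
    """Keep words inside `include` (if given) and outside every box in `exclude`."""
    kept = []
    for text, x, y in words:
        if include is not None and not include.contains(x, y):
            continue
        if any(box.contains(x, y) for box in exclude):
            continue
        kept.append(text)
    return kept


if __name__ == "__main__":
    header = Box(0, 742, 612, 792)
    footer = Box(0, 0, 612, 50)
    words = [("Header", 72, 760), ("Body", 72, 400), ("3", 300, 30)]
    print(filter_words(words, exclude=[header, footer]))
```

Excluding the header and footer boxes and including only the body box are two ways to express the same intent; which one is more convenient depends on how regular the page layout is.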

pCOS Interface for Simple Access to PDF Objects
TET includes the pCOS (PDFlib Comprehensive Object System) interface for retrieving arbitrary PDF objects. With pCOS you can retrieve PDF metadata, hypertext, or any other information outside the actual page descriptions with a simple query interface without the need for low-level programming.

Programming and Performance
TET has been developed with portability, performance, and robustness in mind. TET is thread-safe for deployment in multi-threaded server applications. The core library is written in highly optimized C code for maximum performance and minimum overhead. Additional language bindings are available for COM, C, C++, Java, and .NET.

TET Command-Line Tool and TET Library
TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer the same base functionality, but are suited to different deployment tasks. Here are some guidelines for choosing between the two TET flavors:

The TET programming library can be used for integration into your desktop or server application. Examples for using the library with all supported language bindings are included in the TET package
The TET command-line tool is suited for batch processing PDF documents. It doesn’t require any programming, but offers command-line options which can be used to integrate it into complex workflows. The TET command-line tool can be used to convert PDF page content to an XML document with Unicode text, with or without character metrics

TET Plugin
PDFlib TET Plugin is a free plugin for extracting text from PDF documents. The TET Plugin provides easy access to the PDFlib Text Extraction Toolkit (TET). Although the TET Plugin runs as an Acrobat plugin, the underlying text extraction does not use Acrobat functions, but is based entirely on TET. The TET Plugin is provided as a technology study to demonstrate the power of PDFlib TET.
