
PDFlib TET

PDFlib TET 5 - Text and Image Extraction Toolkit

What is PDFlib TET?

PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images and metadata from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed color, glyph and font information as well as the position on the page. Raster images are extracted in common image formats. TET optionally converts PDF documents to an XML-based format called TETML which contains text and metadata as well as resource information.

TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from PDF, such as metadata, interactive elements, etc.

With PDFlib TET you can:

Implement the PDF indexer for a search engine

Repurpose text and images in PDFs

Convert the contents of PDFs to other formats

Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)

Check whether an area on the page is empty or contains any text, images, or vector graphics

PDFlib TET 5 - Features

The PDFlib Text and Image Extraction Toolkit (TET) is targeted at extracting text and images from PDF documents, but can also be used to retrieve other information from PDF.

Accepted PDF Input

TET supports all relevant flavors of PDF input:

• All PDF versions up to Acrobat DC, including ISO 32000-1 and -2

• Protected PDFs which do not require a password for opening the document

• Damaged PDF documents will be repaired

All Writing Systems of the World

TET processes PDF documents in all writing systems of the world and implements special processing required for some scripts:

• Latin, Greek and Cyrillic scripts including dehyphenation

• Arabic and Hebrew including logical reordering of right-to-left and bidirectional text; normalization of Arabic presentation forms

• Simplified and Traditional Chinese, Japanese, and Korean regardless of encoding; horizontal and vertical text

• Indian scripts (without glyph reordering)

• All other languages and scripts supported in Unicode

Unicode

Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:

• TET converts all text contents to Unicode. In C and other languages without native Unicode support, the text is returned in UTF-8 or UTF-16 format; in Unicode-capable programming languages it is returned as native strings.

• Ligatures and other multi-character glyphs are decomposed into a sequence of the corresponding Unicode characters.

• Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.

• TET implements various workarounds for problems with specific document creation packages, such as InDesign and TeX documents or PDFs generated on mainframe systems.
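The ligature and replacement-character behavior described above can be illustrated with a small standalone Python sketch. It does not use the TET API; it relies only on the standard `unicodedata` module. Treating Private Use Area code points as "unmapped glyphs" and using U+FFFD as the replacement character are assumptions made for this illustration.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # assumed replacement character for this sketch


def to_clean_unicode(text, replacement=REPLACEMENT):
    """Decompose ligatures and replace code points that have no standard meaning."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Co":
            # Private Use Area: stands in for a glyph without a Unicode mapping
            out.append(replacement)
        else:
            # NFKC expands compatibility ligatures, e.g. U+FB01 becomes "fi"
            out.append(unicodedata.normalize("NFKC", ch))
    return "".join(out)


# "fi" and "ffi" ligatures plus one private-use code point
print(to_clean_unicode("\ufb01nal o\ufb03ce \ue000"))
```

A real extractor decides from font and encoding data which glyphs are unmappable; the category check here is only a stand-in for that decision.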

Content Analysis and Word Detection

TET includes patented content analysis algorithms:

• Determine word boundaries which are required to retrieve proper words

• Recombine the parts of hyphenated words (dehyphenation)

• Remove duplicate instances of text, e.g. shadow and artificially bolded text

• Recombine paragraphs in reading order

• Correctly order text which is scattered over the page
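Two of these steps, dehyphenation and removal of shadow duplicates, can be sketched in plain Python. This is a deliberately naive illustration of the ideas, not TET's algorithms: the function names, the `(char, x, y)` glyph tuples, and the one-unit tolerance are all assumptions.

```python
def dehyphenate(lines):
    """Join lines into one string, merging words split by a trailing hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Naive rule: a trailing hyphen always marks a split word
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def drop_shadow_duplicates(glyphs, tolerance=1.0):
    """Drop glyphs that repeat an already kept glyph at nearly the same position."""
    kept = []
    for ch, x, y in glyphs:
        duplicate = any(
            ch == kc and abs(x - kx) <= tolerance and abs(y - ky) <= tolerance
            for kc, kx, ky in kept
        )
        if not duplicate:
            kept.append((ch, x, y))
    return kept


print(dehyphenate(["Text extrac-", "tion from PDF docu-", "ments"]))
# prints: Text extraction from PDF documents
```

The naive hyphen rule would also merge genuinely hyphenated compounds that happen to break at a line end, which is one reason production-grade dehyphenation needs more context than this sketch uses.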

Page Layout and Table Detection

The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified.

Geometry

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
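To make the idea of including or excluding page areas concrete, here is a minimal Python sketch that filters words by their bounding boxes. It is independent of the TET API: the `Word` class, the coordinate values, and the page size are hypothetical, and coordinates are assumed to have their origin in the lower left corner of the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    llx: float  # lower left x
    lly: float  # lower left y
    urx: float  # upper right x
    ury: float  # upper right y


def inside(word, area):
    """True if the word's box lies completely within area = (llx, lly, urx, ury)."""
    llx, lly, urx, ury = area
    return word.llx >= llx and word.lly >= lly and word.urx <= urx and word.ury <= ury


def words_in_area(words, include_area):
    """Return the text of all words that lie inside the given area."""
    return [w.text for w in words if inside(w, include_area)]


words = [
    Word("Header", 50, 800, 120, 812),
    Word("Body", 50, 400, 90, 412),
    Word("7", 290, 20, 300, 32),  # page number in the footer
]
body_area = (0, 50, 595, 792)  # assumed A4-sized page without a 50-unit header and footer band
print(words_in_area(words, body_area))  # prints: ['Body']
```

The same predicate answers the question of whether an area is empty of text: an area is empty when `words_in_area` returns an empty list for it.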

Text Color

TET analyzes color information in the PDF page description and returns precise color information for each glyph. This can be used, for example, to identify headings or other highlighted text.
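As a rough illustration of how per-glyph color could be used to find headings, the following Python sketch treats the most frequent color as body text and reports runs of glyphs in any other color. The `(char, rgb)` tuple format is an assumption for this example and not TET's data model.

```python
from collections import Counter
from itertools import groupby


def highlighted_runs(glyphs):
    """glyphs: list of (char, rgb) tuples in reading order.

    Returns the text of each consecutive run whose color differs
    from the most frequent (assumed body text) color.
    """
    if not glyphs:
        return []
    body_color, _ = Counter(color for _, color in glyphs).most_common(1)[0]
    runs = []
    for color, group in groupby(glyphs, key=lambda g: g[1]):
        if color != body_color:
            runs.append("".join(ch for ch, _ in group))
    return runs


blue, black = (0, 0, 1), (0, 0, 0)
glyphs = [(c, blue) for c in "Intro"] + [(c, black) for c in " This is body text."]
print(highlighted_runs(glyphs))  # prints: ['Intro']
```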

Image Extraction

Images on PDF pages can be extracted as TIFF, JPEG, JPEG 2000 or JBIG2 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images are combined to larger images to facilitate repurposing. Image fidelity is guaranteed since no downsampling or color conversion occurs. This ensures the highest possible image quality.

PDF Analysis

The TET library includes the pCOS interface for querying details about a PDF document, such as document info and XMP metadata, font lists, page size, and many more.

Configuration Options for Problematic PDFs

TET contains special handling and workarounds for various kinds of PDF where the text cannot be extracted correctly with other products. In addition, it includes various configuration features to improve processing of problem documents:

• Unicode mapping can be customized via user-supplied tables for mapping character codes or glyph names to Unicode.

• PDFlib FontReporter is an auxiliary tool for analyzing fonts, encodings, and glyphs in PDF. It works as a plugin for Adobe Acrobat. This plugin is freely available for OS X/macOS and Windows.

• Embedded fonts are analyzed to find additional hints for Unicode mapping. External font files or system fonts are used to improve text extraction results if a font is not embedded.

Unicode Postprocessing

TET supports various Unicode postprocessing steps which can be used to improve the extracted text:

• Foldings preserve, remove or replace characters, e.g. remove punctuation or characters from irrelevant scripts.

• Decompositions replace a character with an equivalent sequence of one or more other characters, e.g. replace narrow, wide or vertical Japanese characters or Latin superscript variants with their respective standard counterparts.

• Text can be converted to all four Unicode normalization forms, e.g. emit NFC form to meet the requirements for Web text or a database.
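The three kinds of postprocessing can be approximated with Python's standard `unicodedata` module. This sketch shows the concepts only and does not reflect how TET is configured; in particular, using NFKC as a stand-in for TET's decompositions is an assumption.

```python
import unicodedata


def fold_punctuation(text):
    """Folding: remove every character in a Unicode punctuation category."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def decompose_compat(text):
    """Decomposition: map wide, narrow and superscript variants to standard characters."""
    return unicodedata.normalize("NFKC", text)


def to_nfc(text):
    """Normalization: emit the composed NFC form."""
    return unicodedata.normalize("NFC", text)


print(fold_punctuation("Hello, world!"))              # prints: Hello world
print(decompose_compat("\uff21\uff22\uff23 x\u00b2"))  # fullwidth ABC and superscript 2, prints: ABC x2
print(to_nfc("e\u0301") == "\u00e9")                   # prints: True
```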

Document Domains

PDF documents may contain text in other places than the page contents. While most applications will deal with the page contents only, in many situations other document domains may be relevant as well. TET extracts the text from all of the following document domains:

• page contents

• predefined and custom document info entries

• XMP metadata on document and image level

• bookmarks

• file attachments and PDF portfolios can be processed recursively

• form fields

• comments (annotations)

• general PDF properties can be queried, such as page count, conformance to standards like PDF/A or PDF/X, etc.

XMP Metadata

TET supports XMP metadata in several ways:

• Using the integrated pCOS interface, XMP metadata for the document, individual pages, images, or other parts of the document can be extracted programmatically.

• TETML output contains XMP document and image metadata if present in the PDF.

• Images extracted in the TIFF or JPEG formats contain image metadata if present in the PDF.

TETML represents PDF Contents as XML

TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata.

TETML also includes interactive elements such as form fields, annotations, bookmarks etc. It can even be used to analyze JavaScript or color space details, ICC profiles or output intents.

TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.

The following fragment shows TETML output with glyph details:

[TETML fragment not preserved: the text "PDFlib" followed by one glyph entry for each of P, D, F, l, i, b.]
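Because TETML is plain XML, it can also be read with general-purpose XML libraries in addition to XSLT. The Python sketch below parses a simplified stand-in for such a fragment with the standard `xml.etree.ElementTree` module. The element names `Page`, `Word`, `Text` and `Glyph` and the absence of namespaces and attributes are assumptions for this illustration; the actual structure is defined by the TETML schema shipped with TET.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a TETML fragment (not the real schema)
SAMPLE = """
<Page>
  <Word>
    <Text>PDFlib</Text>
    <Glyph>P</Glyph>
    <Glyph>D</Glyph>
    <Glyph>F</Glyph>
    <Glyph>l</Glyph>
    <Glyph>i</Glyph>
    <Glyph>b</Glyph>
  </Word>
</Page>
"""


def words_with_glyphs(xml_text):
    """Return (word text, list of glyph strings) for every Word element."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for word in root.iter("Word"):
        text = word.findtext("Text", default="")
        glyphs = [g.text or "" for g in word.findall("Glyph")]
        result.append((text, glyphs))
    return result


print(words_with_glyphs(SAMPLE))
# prints: [('PDFlib', ['P', 'D', 'F', 'l', 'i', 'b'])]
```

Real TETML output would need namespace-aware lookups and would carry font and position details on the glyph entries, which this sketch omits.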

TET Connectors

TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:

• TET connector for the Lucene Search Engine

• TET connector for the Solr Search Server

• TET connector for the TIKA toolkit

• TET connector for Oracle Text

• TET connector for MediaWiki

• TET PDF IFilter for Microsoft products is available as a separate product. It extracts text and metadata from PDF documents and makes them available to search and retrieval software on Windows.

TET Cookbook

The TET Cookbook is a collection of programming examples which demonstrate the use of TET for various text and image extraction tasks. Several Cookbook samples show how to combine the TET and PDFlib+PDI products in order to enhance PDF documents, e.g. add bookmarks or links based on the text on the page.
