About GroupDocs.Parser for .NET

Extract raw and formatted text from a variety of file formats.

GroupDocs.Parser for .NET is a text, metadata and image extraction API for business applications developed with C#, ASP.NET, and other .NET technologies. It extracts raw, formatted and structured text, as well as metadata, from files in the supported formats. Applications can also use it to parse password-protected documents in popular formats, such as Microsoft Word documents, Excel spreadsheets, PowerPoint presentations, OneNote files, PDF files and ZIP archives.

Supported File Formats

Text Extraction

  • Text: DOC, DOCX, DOT, DOTM, DOTX, DOCM, RTF, ODT, OTT, TXT, MD, WordprocessingML (XML)
  • Spreadsheets: XLS, XLSX, CSV, XLSM, XLSB, ODS, SpreadsheetML (XML), XLT, XLTX, XLTM, OTS, XLA, XLAM, TSV
  • Presentations: PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP
  • OneNote: ONE
  • Email: MSG, EML, EMLX, PST, OST, MS Exchange Server, POP, IMAP
  • Electronic Publishing: EPUB, FB2
  • Portable Document: PDF, PDF Portfolio, Encrypted PDF
  • DOM-Based: XML, HTML, XHTML, MHTML
  • Compression & Packaging: ZIP, CHM
  • Database: ADO.NET

Encoding Detection

  • BOM: UTF32 LE, UTF32 BE, UTF16 LE, UTF16 BE, UTF8, and UTF7
  • Content: UTF32 LE, UTF32 BE, UTF16 LE, UTF16 BE, UTF8, and ANSI
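BOM-based detection as listed above can be illustrated with a short sketch. It uses only the .NET base class library and is not GroupDocs.Parser's own implementation; the `BomSniffer` class and its `DetectBom` method are hypothetical names introduced for this example. It shows why the order of checks matters: the UTF32 LE mark begins with the same two bytes as the UTF16 LE mark.

```csharp
using System;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
public static class BomSniffer
{
    // Returns the encoding name implied by a leading byte order mark,
    // or null when the data does not start with a recognized mark.
    public static string DetectBom(byte[] data)
    {
        // UTF32 must be tested before UTF16: the UTF32 LE mark (FF FE 00 00)
        // starts with the UTF16 LE mark (FF FE).
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF32 LE";
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF32 BE";
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF8";
        if (StartsWith(data, 0xFF, 0xFE)) return "UTF16 LE";
        if (StartsWith(data, 0xFE, 0xFF)) return "UTF16 BE";

        // UTF7 marks are 2B 2F 76 followed by one of 38, 39, 2B or 2F.
        if (StartsWith(data, 0x2B, 0x2F, 0x76) && data.Length >= 4 &&
            (data[3] == 0x38 || data[3] == 0x39 || data[3] == 0x2B || data[3] == 0x2F))
        {
            return "UTF7";
        }

        return null;
    }

    // True when data begins with exactly the given bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

For example, `BomSniffer.DetectBom(new byte[] { 0xEF, 0xBB, 0xBF, 0x41 })` returns "UTF8", and input without a mark returns null. That null case is where content-based detection would take over. Note that a UTF16 LE file whose first character is U+0000 is indistinguishable from UTF32 LE by its mark alone, so this sketch reports it as UTF32 LE.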

Metadata Extraction

  • Text: DOC, DOCX, DOT, DOTX, DOTM, OTT, ODT
  • Spreadsheets: XLS, XLSX, XLT, XLTX, XLTM, XLA, XLAM, OTS, ODS
  • Presentations: PPT, PPTX, POT, POTX, POTM, PPSM, PPTM, OTP, ODP
  • Email: MSG, EML, EMLX
  • Electronic Publishing: EPUB, FB2
  • Other: PDF

Text and Metadata Extraction

  • Template: DOTX, POTX
  • Macro-Enabled Template: DOTM, POTM, PPSM, PPTM
  • OpenDocument Template: OTT

Image Extraction

  • Text: DOC, DOCX, DOCM, RTF, DOT, DOTM, DOTX, ODT
  • Spreadsheets: XLS, XLSX, XLSM, XLSB, ODS, XLT, XLTM, XLTX
  • Presentations: PPT, PPTX, PPTM, ODP, POT, POTM, POTX, PPS, PPSX, PPSM
  • Portable Document: PDF
  • Ebook: CHM, EPUB, FB2
  • Markup: HTML

GroupDocs.Parser for .NET Features

  • Count word occurrences in a single file or across multiple files.
  • Extract text and metadata from Excel worksheets and presentation templates.
  • Extract text content from a file or stream without installing a document reader.
  • Get formatted text from a document using fast or standard text extraction mode.
  • Detect the media type of password-protected XML documents and extract text from them.
  • Programmatically get formatted text from emails and their attachments.
  • Extract text from single or multiple pages of OneNote documents.
  • Extract text from a simple PDF file or a PDF portfolio document.
  • Extract data from PDF forms and obtain formatted tables from a PDF or Word document.
  • Get formatted text from PowerPoint presentations or extract text from a specific slide.
  • Gather raw or formatted text from the cells, rows, and columns of an Excel spreadsheet.
  • Extract raw or HTML-formatted text from Word documents.
  • The HTML formatter supports formatting of paragraphs, hyperlinks, fonts, headings, lists and tables.
  • Extract a single sentence or the whole text from EPUB, CHM, Markdown and FB2 files.
  • Extract the table of contents from EPUB and CHM documents.
  • Extract text with its content structure intact, and extract highlighted text from documents.
  • Obtain text areas from documents for analysis and extract metadata from supported document formats.
  • Obtain all or selected images from supported formats and rotate extracted image(s).
  • Extract text from files within ZIP archives and OST containers, and from database containers.
  • Get data from email containers (Exchange Web Server, POP3, IMAP).
  • Search documents for simple text, whole words and regular expressions.
  • Prepare a document template, extract data from a document, and analyze data fields and tables.
  • Search and extract highlighted expressions in documents.
  • Get text with the plain text formatter (Simple & ASCII) or with the Markdown formatter.
  • The Markdown formatter supports formatting of fonts, hyperlinks, headings, lists and tables.
  • Perform custom formatting with edges, angles, and intersections to format plain text.
  • Move table layout and detect tables in a rectangular area by column separators.
  • Extract text from shapes, WordArt objects and text boxes within Microsoft Office file formats.
  • Extract images to files and save them in JPG, PNG, GIF, BMP or WEBP format.