Requirements for Source Documents

Document Requirements and Limitations

The XBRL Tagger can tag MS Word, InDesign, XHTML and PDF documents, subject to the following requirements and limitations:

  • It is not possible to tag any value of a table that is included as an image in a document.

  • InDesign documents (IDD) can be tagged by exporting to ePub (File → Export to ePub). The document must use fonts that are available for web browsing (XHTML).

  • Scanned (PDF) reports can't be tagged because the XBRL Tagger does not include an OCR module.

  • For ePub files, special fonts must be provided in the XBRL Tagger's fonts folder (see below).

  • For PDFs, hidden text as well as some font-specific settings might lead to issues (see more information below).

Differences Between Source Formats

 

                           | Word                 | EPub                             | PDF (pdf2htmlEx)         | HTML
A4 Layout                  | Optional             | Enforced                         | Enforced                 | N/A
WYSIWYG                    | N/A                  | Light                            | Full                     | N/A
Tags-Saving                | In file/external     | External                         | External                 | In file/external
Chapter detection          | Styles Outline Level | Pages                            | Document Bookmarks/Pages | Headers
Font handling              | Integrated           | Font must be supplied externally | Integrated               | Integrated
Table detection            | Auto                 | Auto                             | Manual                   | Auto
Smart anchors              | Yes                  | Yes                              | Yes                      | No
XHTML formatting preserved | Partly*              | Full                             | Full                     | Full
Multiple tags per value    | Yes                  | No                               | No                       | No

*This depends on the styles and formats applied to paragraphs; see the MS Word requirements and limitations.

 

How to Prepare a PDF File

PDF Requirements and Limitations

The XBRL Tagger can tag PDF documents, subject to the following requirements and limitations:

  • It is not possible to tag any value of a table that is included as an image in a document.

  • Make sure that the fonts used contain the correct glyphs (this also applies to Word fonts when converting to PDF); otherwise the conversion could produce wrong characters.

  • Scanned (PDF) reports can't be tagged because the XBRL Tagger does not include an OCR module.

  • For PDFs, hidden text as well as some font-specific settings might lead to issues (see more information below).

  • Always use the same software to create different versions of the PDF; otherwise restoring the mapping might be an issue.

    • For example, if you create a PDF from Word and tag it, you could run into issues if you later change the PDF with another tool such as Adobe Acrobat Pro.

    • If you need to stitch multiple documents together, use the Merge iXBRL functionality in the Tagger after converting all parts to XHTML instead.

  • Don't embed external PDFs into InDesign documents.

Recommendations on Tagging of PDF Documents

PDF is a very universal format for creating documents. Converting it to XHTML can be a challenge, especially if the PDF document that is used as a source has issues itself.

Here are some recommendations to create the best-possible conversion outcome:

  • Keep in mind that PDF to HTML conversion is similar to actual printing, but on a very special virtual device. Like printing on a physical printer, this process can have font and color issues.

  • Make all fonts embedded.

  • Never add tables as pictures, including when converting from Word.

  • Do not use Type 3 fonts; they are not supported under any circumstances.

  • For CID fonts, make sure they include correct character mapping definitions.

  • Do not include hidden text in PDF documents, or remove it with Adobe Acrobat Redact.

  • Do not add stamps or signs as PDF comments.

  • Use RGB color space.

  • Do not use special ICC color profiles.

  • Create a PDF document that is compliant with the PDF/A-1a standard and that does not contain text that cannot be mapped to Unicode or that is inconsistent with the information for the rendered glyphs.

  • Major layout changes (styles, one-column to two-columns) can have a serious impact on the mapping restoration. Bear that in mind when planning.

Keep in mind that the tagging of PDFs requires an extra step.
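Some of the recommendations above can be spot-checked before a file is opened in the Tagger. The following is a rough sketch in Python and is not part of the XBRL Tagger: it scans the raw bytes of a PDF for name tokens that hint at Type 3 fonts, CMYK color spaces or ICC profiles. The function name scan_pdf_bytes and the marker list are assumptions made for illustration. Because font and color dictionaries are often stored in compressed object streams, an empty result does not prove that the file is clean; a preflight tool remains the reliable check.

```python
import re

# Raw PDF name tokens that hint at issues from the list above.
# Heuristic only: compressed object streams can hide these tokens.
MARKERS = {
    "Type 3 font": re.compile(rb"/Subtype\s*/Type3\b"),
    "CMYK color space": re.compile(rb"/DeviceCMYK\b"),
    "ICC color profile": re.compile(rb"/ICCBased\b"),
}


def scan_pdf_bytes(data: bytes) -> list[str]:
    """Return the labels of all markers found in the raw PDF bytes."""
    return [label for label, pattern in MARKERS.items() if pattern.search(data)]


# Synthetic fragments standing in for real file content.
type3_fragment = b"<< /Type /Font /Subtype /Type3 >>"
clean_fragment = b"<< /Type /Font /Subtype /TrueType >> /DeviceRGB"

print(scan_pdf_bytes(type3_fragment))  # ['Type 3 font']
print(scan_pdf_bytes(clean_fragment))  # []
```

For a real file, the bytes could be read with pathlib.Path("report.pdf").read_bytes(), where report.pdf is a placeholder name.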

What Are Hidden Facts?

When converting and tagging a PDF report with a special font face in the XBRL Tagger, some facts (tags) might become hidden. The reason is that the Inline XBRL Specification does not allow individually formatted numbers to be tagged. For example, when the font requires special spacing between single characters, which is achieved with HTML tags like <span>, the number is no longer taggable. In the screenshot below, the number 24,540 is not taggable. In order to preserve the spacing and formatting of the PDF in the XHTML report, the XBRL Tagger moves the tag to an unformatted hidden section of the document and includes a link to the original visible number.

However, hiding facts is an official mechanism of the Inline XBRL specification, as well as being allowed by ESMA in the ESEF Reporting Manual, page 34:

From AMANA's point of view, untaggable items, like the number in the example above, are not eligible for transformation and can be hidden. The XBRL International standard setter working group is aware of the issue and will probably publish an updated Inline XBRL specification, which will make those numbers taggable in the future.
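To make the markup difference concrete, the following minimal Python sketch contrasts a table cell that holds a number as plain text with one where every digit sits in its own <span>. It only illustrates the situation described in this section; it is not how the XBRL Tagger decides which facts to hide, and the function name and both sample fragments are assumptions.

```python
import xml.etree.ElementTree as ET


def is_plain_number_cell(xhtml_fragment: str) -> bool:
    """Return True if the element holds its text directly, without child elements."""
    element = ET.fromstring(xhtml_fragment)
    has_children = len(element) > 0
    has_text = bool((element.text or "").strip())
    return has_text and not has_children


# Hypothetical markup: the same number as plain text and with per-character spans.
plain = "<td>24,540</td>"
spaced = (
    "<td><span>2</span><span>4</span>,"
    "<span>5</span><span>4</span><span>0</span></td>"
)

print(is_plain_number_cell(plain))   # True
print(is_plain_number_cell(spaced))  # False
```

A number in the second form is the kind of candidate that ends up in the hidden section.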

How to Avoid Hidden Facts

There are multiple ways to avoid or reduce hidden facts in iXBRL reports:

  • Tag Microsoft Word files instead of PDFs

  • Do not use special non-web fonts in PDF reports that provide a special spacing between characters.

  • Set the XBRL Tagger CMaps option to "Ignore" when opening a PDF file (however, this might lead to less visually appealing reports).

  • Use the latest XBRL Tagger version, which includes some new options to reduce/avoid hidden facts.

  • All numbers that are tagged need to have the OpenType setting “Default Figure Style” to avoid hidden facts. This setting only affects the digits in the report.
    To apply this setting, you can manually choose “Default Figure Style” in the number columns, or you can apply the setting in the Paragraph Style under “OpenType features”. Use 0 kerning in the tagged cells for best results.

  • Other problems that can occur when you convert the PDF to XHTML:

    • Text opacity - If text in the document has an opacity setting, the opacity is reset to the default of 100% after the conversion to XHTML. The opacity is preserved if you create outlines of the text.

    • Text behind - If you have text hidden behind something in InDesign, the text will be visible when you convert to XHTML.

Remove Hidden Text From PDF Files Using Adobe Redact

If the document contains hidden elements, some of them can be removed using Adobe Redact. Hidden text becomes visible in the converted XHTML document, so it must be removed before processing the PDF document with the Tagger.

Load the file into Adobe Acrobat Pro and click on the tools button.

 

Go to Protect & Standardize and click on Redact.

Click on Sanitize Document.

In the window that opens, click on Click here.

You then get a selection of all hidden elements. Clear all checkboxes except the one for Hidden Text and click on Remove.

Further Information About PDF Conversion

The PDF converter has the following limitations:

  1. CID (Identity-H) fonts embedded in the source document.

    1. In this case, the converted document can contain unreadable (weird-looking) text. To resolve this, it is recommended to save the source document in PDF/X format with the Adobe Acrobat DC "Print Production" tool.

  2. If the converted document has a wrong color palette, apply the fix from item 1.

  3. The converter does not support PDF hidden text layers.

    1. If the document contains hidden text layers, remove them with the Adobe Acrobat DC "Redact" tool.

  4. The converter has fine-tuning options that help to resolve the issues:

    1. If the converted document does not look good, change the option "PDF unicode CMaps handling" to "Auto" and the option "Use autohint on fonts without hint" to "Use AutoHint".

If the conversion still doesn't meet the expectations or some tables cannot be tagged properly, the source file might need corrections. 

The following cases are known:

  1. The converted PDF looks good, but the imported table is unreadable.

  2. The converted PDF contains unreadable fragments.

  3. The PDF document has not been converted at all in the Tagger.

  4. The converted PDF shows wrong colors, visual artifacts or extra text fragments or pages.

For cases 1-3 there are two methods to repair the document in Adobe:

  1. Export the PDF to PostScript and create a new file from it in Adobe Acrobat Distiller DC.

  2. Convert the PDF to the PDF/A-1 standard with Adobe Preflight in the "PDF standards" tool.

For case 4 use "Sanitize Document" in the "Redact" tool and convert the document to PDF/X for correct colors.

If the document after all processing still has artifacts, the "fallback mode" option in the Tagger can be used.

 

How to Prepare a PDF File Using InDesign (Best Practices)

General Information

In this section you will get information on how to prepare your InDesign documents for tagging so you will avoid issues in your ESEF report.

When you create your ESEF report, a PDF is converted to XHTML. Converting PDF to XHTML is complicated and it’s important that you have the right settings in your InDesign documents to avoid issues.

Most of the issues are related to OpenType features in the documents’ fonts. In InDesign you can use different variants (Glyphs) of the characters. If you use two or more glyphs of a character with the same Unicode in your report the Tagger cannot distinguish between them. Only one of the glyph variants will be used in the report. This could make the report look different in the Tagger compared to the PDF.

Ligatures and different Figure Styles are known to create issues. For example, you cannot mix different Figure Styles in the report. The Tagger will then choose one Figure Style for the report, and this may cause the XHTML not to be generated correctly. The recommendation is to set the report to Default Figure Style to avoid issues.
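The underlying problem, several glyphs sharing one Unicode code point, can be pictured with a small Python sketch. The glyph IDs and the input format (pairs of glyph ID and code point) are invented for illustration; the sketch does not read fonts or InDesign files and does not reflect the Tagger's internal logic.

```python
from collections import defaultdict


def ambiguous_code_points(used_glyphs):
    """Map each code point that is drawn with more than one glyph to its glyph IDs.

    used_glyphs: iterable of (glyph_id, code_point) pairs.
    """
    by_code_point = defaultdict(set)
    for glyph_id, code_point in used_glyphs:
        by_code_point[code_point].add(glyph_id)
    return {
        code_point: sorted(glyph_ids)
        for code_point, glyph_ids in by_code_point.items()
        if len(glyph_ids) > 1
    }


# Hypothetical report: digit "1" is set once in the default figure style
# (glyph 17) and once in a tabular lining style (glyph 245).
used = [(17, 0x31), (245, 0x31), (18, 0x32)]
print(ambiguous_code_points(used))  # {49: [17, 245]}
```

Any code point reported here is one where only a single glyph variant can survive the conversion.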

The fonts you are using must be Unicode compatible. Most fonts today are compatible, but older or custom fonts may not be.

Do not use Variable fonts in your report. You need to use Static fonts to avoid conversion issues.

Variable fonts allow many different variations of a typeface to be incorporated into a single file.

Static fonts have a separate font file for every width, weight, or style.
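If it is unclear whether a font file is variable or static, the file itself can be inspected: OpenType variable fonts carry an 'fvar' (font variations) table. The Python sketch below reads the table directory of a single-font TTF/OTF file and looks for that tag. The function names and the synthetic demo data are assumptions for illustration; TrueType collections (.ttc) and WOFF files are not handled.

```python
import struct


def sfnt_table_tags(font_bytes: bytes) -> list[str]:
    """Return the table tags listed in the table directory of a TTF/OTF file."""
    # Header: uint32 sfntVersion, uint16 numTables, then 6 more bytes (12 in total).
    _version, num_tables = struct.unpack(">IH", font_bytes[:6])
    tags = []
    for index in range(num_tables):
        start = 12 + 16 * index  # each table record is 16 bytes long
        tags.append(font_bytes[start:start + 4].decode("latin-1"))
    return tags


def is_variable_font(font_bytes: bytes) -> bool:
    """A font that contains an 'fvar' table is a variable font."""
    return "fvar" in sfnt_table_tags(font_bytes)


def build_demo_font(tags: list[str]) -> bytes:
    """Build a synthetic table directory (no real table data) for the demo."""
    header = struct.pack(">IHHHH", 0x00010000, len(tags), 0, 0, 0)
    records = b"".join(
        tag.encode("latin-1") + struct.pack(">III", 0, 0, 0) for tag in tags
    )
    return header + records


static_font = build_demo_font(["cmap", "glyf", "head"])
variable_font = build_demo_font(["cmap", "fvar", "glyf", "gvar"])
print(is_variable_font(static_font))    # False
print(is_variable_font(variable_font))  # True
```

For a real font, pass the file content, for example pathlib.Path("MyFont.ttf").read_bytes(), where MyFont.ttf is a placeholder name.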

Example of character with same Unicode but different Glyph ID (GID)

InDesign settings to avoid issues in the ESEF report

1.  Turn off OpenType features

Ensure that OpenType features are turned off in your Paragraph Styles.

The recommendation is to use Default Figure Style in the report.

If you want to change the Figure Style (e.g. Tabular Lining) you need to be sure to change the setting in the whole report to avoid issues in the conversion to XHTML. If you use different Figure Styles in your report the Tagger will choose one of them and this may cause the XHTML not to be generated correctly.

2.  Turn off Ligatures

Ensure that Ligatures (e.g. “fi” and “ff”) are turned off in your Paragraph Styles. Ligatures are enabled by default in InDesign and must be turned off to avoid issues.

Ligature: A glyph that combines the shapes of certain sequences of characters into a new form that makes for a more harmonious reading experience.
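One way a leftover ligature can surface after conversion is as a single Unicode ligature code point (U+FB00 to U+FB06, such as U+FB01 for "fi") in the text. The Python sketch below searches a string for these code points. It is an assumption that the text has already been extracted from the converted XHTML, and the check cannot see ligatures that exist only as glyph substitutions inside the font, so it is a partial check at best.

```python
import unicodedata


def find_ligature_characters(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, expansion) for each U+FB00..U+FB06 ligature."""
    hits = []
    for index, char in enumerate(text):
        if "\ufb00" <= char <= "\ufb06":
            # NFKD splits the ligature into its component letters.
            hits.append((index, char, unicodedata.normalize("NFKD", char)))
    return hits


# "profit" written with the single-character ligature U+FB01 instead of "f" + "i".
sample = "pro\ufb01t"
for index, char, expansion in find_ligature_characters(sample):
    print(index, f"U+{ord(char):04X}", expansion)  # 3 U+FB01 fi
```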

3.  Avoid “White Space”

Avoid using “Insert White Space” in your report. “White spaces” can result in issues with spacing between words and letters in the XHTML. “Nonbreaking Space” is commonly used by designers but often creates issues. The recommendation is to use “Nonbreaking Space (Fixed Width)” instead. This space is more likely to work. You can create a “Keyboard Shortcut” in InDesign if you use it commonly.
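To see which special space characters actually ended up in a piece of converted text, the characters can be listed by Unicode category. The following Python sketch reports every space separator (category Zs) other than the ordinary space. The function name is hypothetical, and which code point a given InDesign white space option produces is deliberately not assumed here.

```python
import unicodedata


def find_special_spaces(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each non-standard space."""
    hits = []
    for index, char in enumerate(text):
        if char != " " and unicodedata.category(char) == "Zs":
            hits.append((index, f"U+{ord(char):04X}", unicodedata.name(char)))
    return hits


# Sample text with a no-break space and a thin space next to a regular space.
sample = "Net\u00a0profit\u2009 2023"
for hit in find_special_spaces(sample):
    print(hit)
# (3, 'U+00A0', 'NO-BREAK SPACE')
# (10, 'U+2009', 'THIN SPACE')
```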

4.  Avoid OpenType Alternatives

Do not use the OpenType alternatives: Superscript/Superior, Superscript/Inferior, Numerator, Denominator. Use InDesign Superscript/Subscript instead.

5.  Issues with “Small Caps”

Depending on the font you are using there could be issues if you use “Small Caps”. The recommendation is to avoid it. If you need to use “Small Caps”, pay attention to how the text looks after the conversion to XHTML.

6.  Issues with “Section Marker”

If you insert a “Section Marker” in InDesign, it’s important that you do not use “All Caps” in the “Section Marker”. If you use “All Caps”, font issues can arise in the Tagger and some lowercase characters may be replaced with uppercase characters in the report. If you want uppercase characters in the “Section Marker”, you can type uppercase characters in “Numbering & Section Options/Section Marker”.

7.  Tab Issue

This issue is less common and depends on the font you are using. If you use tabs with dots as “Leader” make sure you do not have a “Space” in the “Leader”. In some fonts this can become an issue in the conversion to XHTML and all dots in your document will also have a space next to the dot.

8.  Do Not Use Private-use Characters

Private-use character: A character whose use is defined by private users and companies rather than defined by a standard such as Unicode, and which therefore has no universally accepted meaning.

“Private-use characters” are rarely used in InDesign. If you use a “Private-use character”, it will not be displayed correctly in XHTML. To get information about characters, open the Glyphs panel (Type > Glyphs). See example below.

Private-use character                                

Unicode character
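Outside of InDesign, private-use characters can also be found programmatically, because Unicode assigns them the general category "Co". The Python sketch below lists them in a string; the function name is hypothetical, and the text is assumed to have been copied or extracted from the report beforehand.

```python
import unicodedata


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for each private-use character in the text."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Co"  # Co = "Other, private use"
    ]


# U+E000 is the first code point of the Private Use Area.
print(find_private_use("A\ue000B"))  # [(1, 'U+E000')]
print(find_private_use("ABC"))       # []
```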

9.  Substituted Glyphs

In InDesign you can highlight the “Substituted Glyphs” that may create issues in the ESEF report. Not all of the highlighted glyphs will create issues.

In the below example Ligatures, Contextual alternatives and Tabular Lining are highlighted and can create issues after conversion to XHTML.

Note: Hyphen-minus is always highlighted as “Substituted Glyphs” in InDesign.

 

10.  Elements (text frames) in correct reading order

It's important that the reading order of the different elements (text frames) in InDesign is arranged correctly on the Text block pages/spreads. If the reading order in the layer is incorrect, it may be difficult to select the content in the AMANA XBRL Tagger. The reading order in the ESEF report can also be incorrect if the elements are arranged in the wrong order.
The reading order is defined per page/spread in InDesign. The elements (text frames) within a layer in InDesign should be arranged to correspond with the reading order as it appears on the page/spread.
The screenshot below shows an example of how the elements (text frames) should be arranged in a particular reading order and how the elements should be organised in the layer to match that same sequence. Note that what comes first in the reading order is placed at the bottom of the Layers panel. If the elements need to be rearranged, you can drag and drop them into the correct (reading) order in the Layers panel.

Additional Information

1.  Text Effects

If you have applied effects (e.g. opacity, multiply) to text in the InDesign document, the effects are reset to their defaults after the conversion to XHTML. If you want to apply effects, you need to create outlines of the text.

2.  Text Behind

If you have text hidden behind an object/image in InDesign, the text will become visible when you convert to XHTML.

EPub Font Folder

If special fonts are used in ePub files, the TTF files have to be added to the Tagger font folder:

How to Prepare a Word File

MS Word Requirements and Limitations

The XBRL Tagger can tag MS Word documents, subject to the following requirements and limitations:

  • It is not possible to tag any value of a table that is included as an image in a document.

  • For MS Word documents it is required to use styles (heading 1, heading 2, etc.) to structure the documents.

    • The chapter headings are used by the Tagger to allow easy navigation through the document.

    • All tables that have to be tagged must be normal Word tables (no embedded Excel or similar).

    • To change the outline level of styles, right click on the paragraph and select Paragraph and then select Outline level. For more information look at our FAQ #304 and FAQ #305.

  • Shapes and images anchored in front of text or behind text are placed at the anchor position. This might lead to a different layout when converting to XHTML.

  • Images and shapes inserted as embedded Office objects (e.g. diagrams from PowerPoint or Excel) can't be converted to XHTML. Those images must be converted to pure images e.g. by taking a screenshot and inserting it.

  • Two-column text layout is not yet supported for MS Word to XHTML conversion.

You can also check out the FAQ for the HTML Converter, where many questions on Word documents are answered.
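A few of the Word requirements above can be checked on the file itself, since a .docx file is a ZIP archive that contains word/document.xml. The Python sketch below counts paragraphs that use a heading style, regular Word tables, and embedded OLE objects. It is a rough aid and not part of the XBRL Tagger. It assumes that heading style IDs start with "Heading", which holds for default English templates but not necessarily for localized or custom ones, and it only detects objects stored as w:object elements.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def inspect_document_xml(xml_data) -> dict:
    """Count heading paragraphs, Word tables and embedded objects in document.xml."""
    root = ET.fromstring(xml_data)
    headings = sum(
        1
        for style in root.iter(f"{{{W_NS}}}pStyle")
        # Assumption: heading style IDs start with "Heading" (default English template).
        if style.get(f"{{{W_NS}}}val", "").startswith("Heading")
    )
    tables = sum(1 for _ in root.iter(f"{{{W_NS}}}tbl"))
    embedded = sum(1 for _ in root.iter(f"{{{W_NS}}}object"))
    return {"headings": headings, "tables": tables, "embedded_objects": embedded}


def inspect_docx(path: str) -> dict:
    """Read word/document.xml from a .docx archive and inspect it."""
    with zipfile.ZipFile(path) as archive:
        return inspect_document_xml(archive.read("word/document.xml"))


# Synthetic document body: one heading, one table, one embedded object.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Notes</w:t></w:r></w:p>'
    '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>24,540</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
    '<w:p><w:r><w:object/></w:r></w:p>'
    '</w:body></w:document>'
)
print(inspect_document_xml(SAMPLE))
# {'headings': 1, 'tables': 1, 'embedded_objects': 1}
```

A report with zero headings or with embedded objects inside the tables to be tagged is worth reviewing before conversion.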

How to Create a Compatible PDF From Word

The most reliable way to create an iXBRL-compatible PDF from Word is to use the PDF export functionality from Adobe. For that, you need Adobe Acrobat installed on your computer. Then use the following settings: