LayoutLMv3 Research Paper


A detailed explanation of the LayoutLMv3 paper

Agenda:

  1. Introduction
  2. LayoutLMv3 Architecture
  3. Pre-Training Objectives
  4. Conclusion & Future Work

Let’s get started.

Introduction

The paper works with two inputs, popularly known as the text modality and the image modality, and pre-trains them in a unified way. Text is handled at the segment level (words are grouped into segments that share layout information), while images are handled as patches.

Architecture

[Figure: the LayoutLMv3 architecture]
  1. Text Embeddings:

Each text embedding is the sum of a word embedding and position embeddings.

Process:

a) Document images are pre-processed with an OCR toolkit to obtain the textual content and the corresponding 2D position information.

b) The word embeddings are initialized with the word embedding matrix of the pre-trained RoBERTa model.

c) This yields a sequence of text tokens T1, T2, T3, T4, ..., some of which are masked during pre-training. A [CLS] token is added at the beginning and a [SEP] token at the end of each text sequence.

The position embeddings include:

a) 1D position embeddings: the 1D position is the index of a token within the text sequence.

b) 2D layout position embeddings: the 2D layout position is the bounding-box coordinates of the text segment. LayoutLMv3 uses segment-level layout positions, so words in the same segment share one bounding box. (A code sketch combining these embeddings follows.)
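To make this concrete, here is a minimal PyTorch sketch of how the word, 1D position, and 2D layout embeddings could be summed. The module name, default dimensions, and per-coordinate embedding tables are my own assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn

class TextEmbeddings(nn.Module):
    """Sketch of LayoutLMv3-style text embeddings: word + 1D position + 2D layout position."""

    def __init__(self, vocab_size=50265, hidden=768, max_len=512, max_coord=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)  # initialized from RoBERTa in the paper
        self.pos_1d = nn.Embedding(max_len, hidden)       # index of the token in the sequence
        # 2D layout position: one table for x-coordinates, one for y-coordinates (an assumption)
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq); bboxes: (batch, seq, 4) as (x0, y0, x1, y1) in [0, max_coord)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        emb = self.word_emb(token_ids) + self.pos_1d(positions)
        emb = emb + self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
        emb = emb + self.x_emb(bboxes[..., 2]) + self.y_emb(bboxes[..., 3])
        return emb
```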

2. Image Embeddings

Process:

  1. Specifically, the document image is resized to H × W and denoted I ∈ ℝ^(C×H×W), where C, H, and W are the channel size, height, and width of the image, respectively.
  2. The image is then split into a sequence of uniform P × P patches, which are linearly projected to D dimensions and flattened into a sequence of vectors.
  3. Document images are thus represented with linear projection features of image patches before being fed into the multimodal Transformer.
  4. Learnable 1D position embeddings are then added to each patch, since the authors did not observe improvements from using 2D position embeddings in their preliminary experiments. (A sketch of these steps follows this list.)
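Here is a minimal PyTorch sketch of steps 1-4. Using a Conv2d with kernel size and stride P is a common way to implement patchify-plus-linear-projection; the names and default sizes are my assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    """Sketch: split an image into P x P patches, linearly project, add 1D position embeddings."""

    def __init__(self, img_size=224, patch=16, channels=3, hidden=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # Conv2d with kernel = stride = P performs patchify + linear projection in one step
        self.proj = nn.Conv2d(channels, hidden, kernel_size=patch, stride=patch)
        self.pos_1d = nn.Parameter(torch.zeros(1, self.num_patches, hidden))  # learnable 1D positions

    def forward(self, images):
        # images: (batch, C, H, W), already resized to img_size x img_size
        x = self.proj(images)             # (batch, hidden, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, hidden)
        return x + self.pos_1d
```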

Pre-Training Objectives

The full pre-training objective of LayoutLMv3 is defined as L = L_MLM + L_MIM + L_WPA.

This is reconstructive pre-training: MLM learns to reconstruct the masked word tokens of the text modality, and MIM symmetrically reconstructs the masked patch tokens of the image modality.

a) Masked Language Modelling (MLM):

MLM masks 30% of the text tokens with a span masking strategy, drawing span lengths from a Poisson distribution (λ = 3). The task is to maximize the log-likelihood of the correct masked text tokens, conditioned on the contextual representations of the corrupted sequences of text and image tokens.
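The 30% ratio and Poisson-distributed span lengths come from the paper; the sampling loop below is a toy illustration of my own for how such a masker might look:

```python
import numpy as np

def span_mask(seq_len, mask_ratio=0.3, lam=3, rng=np.random.default_rng(0)):
    """Toy span masking: mask ~mask_ratio of positions in spans with lengths ~ Poisson(lam)."""
    masked = np.zeros(seq_len, dtype=bool)
    budget = int(seq_len * mask_ratio)
    while masked.sum() < budget:
        length = max(1, rng.poisson(lam))      # span length drawn from Poisson(lam)
        start = rng.integers(0, seq_len)       # random span start
        masked[start:start + length] = True
    return masked

print(span_mask(20))  # boolean mask over a 20-token sequence
```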

b) Masked Image Modelling (MIM):

MIM masks about 40% of the image tokens with a blockwise masking strategy. The labels of the image tokens come from an image tokenizer, which transforms dense image pixels into discrete tokens according to a visual vocabulary.

The image tokenizer is initialized from a pre-trained image tokenizer in DiT, a self-supervised pre-trained document image Transformer model.
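As a rough illustration of blockwise masking over the patch grid (only the ~40% ratio and the blockwise idea come from the paper; the block-size range and loop are my assumptions):

```python
import numpy as np

def blockwise_mask(grid_h, grid_w, mask_ratio=0.4, rng=np.random.default_rng(0)):
    """Toy blockwise masking: mask random rectangles of patches until ~mask_ratio is covered."""
    masked = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(grid_h * grid_w * mask_ratio)
    while masked.sum() < target:
        bh, bw = rng.integers(2, 6, size=2)              # random block size (assumed range)
        top = rng.integers(0, grid_h - bh + 1)
        left = rng.integers(0, grid_w - bw + 1)
        masked[top:top + bh, left:left + bw] = True
    return masked

mask = blockwise_mask(14, 14)  # 14 x 14 patch grid for a 224 px image with 16 px patches
# The MIM loss is a cross-entropy between the model's predictions at masked positions
# and the discrete token ids produced by the pre-trained DiT image tokenizer.
```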

c) Word-Patch Alignment(WPA)

Earlier models had no explicit alignment learning between the text and image modalities; WPA was introduced to learn this cross-modal alignment.

The WPA objective is to predict whether the corresponding image patches of a text word are masked: an unmasked text token is labeled "aligned" if its image patches are all unmasked, and "unaligned" otherwise. Masked text tokens are excluded when computing the WPA loss, to prevent the model from learning a correspondence between masked text words and masked image patches.

A two-layer MLP head takes the contextual text and image representations as input and outputs the binary aligned/unaligned labels, trained with a binary cross-entropy loss.
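A toy sketch of how the aligned/unaligned labels could be constructed (the helper name, the token-to-patch mapping, and the -100 ignore-index convention are my assumptions):

```python
import torch

def wpa_labels(text_masked, patch_masked, token_to_patches):
    """Toy WPA labels: an unmasked text token is 'aligned' (1) iff none of its patches is masked.

    text_masked: (seq,) bool - True where the text token was masked by MLM
    patch_masked: (num_patches,) bool - True where the patch was masked by MIM
    token_to_patches: list of patch-index lists, one per text token (from OCR bounding boxes)
    """
    labels = torch.full((len(token_to_patches),), -100)  # -100 = excluded from the loss
    for i, patches in enumerate(token_to_patches):
        if text_masked[i]:
            continue                                     # masked text tokens are excluded
        labels[i] = 0 if any(patch_masked[p] for p in patches) else 1
    return labels
```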

Conclusion & Future Work

  1. LayoutLMv3 does not rely on a pre-trained CNN or Faster R-CNN backbone to extract visual features.
  2. Uses unified text and image masking pre-training objectives: masked language modeling, masked image modeling, and word-patch alignment.
  3. Future work: explore few-shot and zero-shot learning capabilities to facilitate more real-world business scenarios in the Document AI industry.

Thanks for reading!

Do follow for more content like this!

