GNN-based Visually Rich Document Processing

Date and Time

July 26, 2020, 19:00-22:15 (UTC+8)


Zhibo Yang is currently an Algorithm Expert with the OCR Group, DAMO, Alibaba. Email: [email protected]

Qi Zheng is currently an Algorithm Expert with the OCR Group, DAMO, Alibaba. Email: [email protected]

Detailed Outline

19:00 – 20:30 Hierarchical text detection: component and relation study (By Zhibo Yang)

20:30 – 20:45 Short break

20:45 – 22:15 Graph convolution for multi-modal information extraction (By Qi Zheng)

Brief Description

Visually rich documents (VRDs) are ubiquitous in daily business and life. In a broader sense, a VRD is a document with a rich visual layout, such as aligned text, structured forms, and hierarchical outlines. Examples include purchase receipts, insurance policy documents, and tax invoices.

The analysis of VRDs aims to serialize a 2D document image into a structured, machine-readable format. With the development of deep learning, printed character recognition is no longer the bottleneck. As the demand for high-accuracy recognition grows, more attention is being paid to details that are easily overlooked. For example, it is hard to merge two characters separated by large spacing, or to extract an entity that is split across a line break.

The components of a VRD can take various forms and are often not in adjacent regions, which leads to poor performance from CNNs. It is widely accepted that graph neural networks (GNNs) have innate advantages in learning relations between non-adjacent and even multi-modal components. In this tutorial, we will introduce the concept of GNNs as well as their applications in text detection, document structure analysis and key-value (KV) extraction.

The main challenges of text detection lie in arbitrary orientations, dense text and characters with large spacing. We will introduce a novel bottom-up text detection method based on CNNs and GNNs. We found that CNNs are good at predicting relations between adjacent components, while GNNs are more effective when relations are global.
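To make the bottom-up idea concrete, the grouping step can be sketched as follows. This is only an illustration under stated assumptions, not the presenters' implementation: it assumes that some model (a CNN or a GNN) has already produced a link score for each pair of text components, and it merges linked components into text instances with a plain union-find. The function name `group_components`, the score dictionary and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def group_components(
    num_components: int,
    link_scores: Dict[Tuple[int, int], float],
    threshold: float = 0.5,
) -> List[List[int]]:
    """Merge components whose predicted link score reaches the threshold."""
    parent = list(range(num_components))

    def find(i: int) -> int:
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), score in link_scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each connected group of components.
    groups: Dict[int, List[int]] = {}
    for i in range(num_components):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical link scores: components 0-1-2 belong to one widely spaced
# text instance, components 3-4 to another; the 2-3 link is rejected.
scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
instances = group_components(5, scores)
print(instances)
```

Because grouping depends only on the predicted links, widely spaced characters end up in the same instance as long as the relation model scores their link above the threshold, regardless of the pixel distance between them.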

The semantic structure of a document is determined not only by its text but also by visual features such as layout, tabular structure and font size. We represent a VRD as a graph of elements processed by a GNN, where each element comprises a geometric embedding and semantic information. From this graph, we extract headings, chapters, key-value (KV) pairs and the structure of forms.
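The graph representation described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the system presented in the tutorial: node features concatenate normalized bounding-box coordinates with a semantic vector (random numbers stand in for real text embeddings), edges connect each element to its k nearest neighbors by box center, and the propagation rule is the standard graph convolution with symmetric normalization. All function names, the page size and the feature dimensions are hypothetical.

```python
import numpy as np


def build_node_features(boxes: np.ndarray, text_emb: np.ndarray,
                        page_w: float, page_h: float) -> np.ndarray:
    """Concatenate normalized box geometry with a semantic vector per element."""
    scale = np.array([page_w, page_h, page_w, page_h], dtype=float)
    geometry = boxes / scale  # (x0, y0, x1, y1) scaled to [0, 1]
    return np.concatenate([geometry, text_emb], axis=1)


def knn_adjacency(boxes: np.ndarray, k: int) -> np.ndarray:
    """Symmetric adjacency linking each element to its k nearest box centers."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # an element is not its own neighbor
    n = len(boxes)
    nearest = np.argsort(dist, axis=1)[:, :min(k, n - 1)]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), nearest.shape[1])
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected


def gcn_layer(adj: np.ndarray, feats: np.ndarray,
              weight: np.ndarray) -> np.ndarray:
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)


# Four hypothetical text elements on a 600 x 800 page: two near the top,
# two near the bottom. Boxes are (x0, y0, x1, y1) in pixels.
boxes = np.array([[50, 40, 200, 60], [220, 40, 400, 60],
                  [50, 700, 200, 720], [220, 700, 400, 720]], dtype=float)
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))    # stand-in for real semantic embeddings
feats = build_node_features(boxes, text_emb, page_w=600, page_h=800)
adj = knn_adjacency(boxes, k=2)
weight = rng.normal(size=(12, 16))    # untrained weights, for shape only
hidden = gcn_layer(adj, feats, weight)
print(hidden.shape)
```

After one layer, each element's representation mixes its own geometry and semantics with those of its neighbors. Stacking layers lets information travel between elements that are far apart on the page, and a classifier over nodes or node pairs could then label headings or link keys to values.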

By applying the techniques above, our OCR system has achieved state-of-the-art performance and gained new features such as the reading order of text segments, extraction of discrete keywords, and recognition of complex forms.