Text recognition is a field that has been researched and applied for many years. Text recognition
process is performed through the following main steps: The input image page will go through the
preprocessing step, then the page analysis step, the output of the page analysis will be the input
of the recognition step, and finally post-processing. The result of a recognition system depends
on two main steps: page analysis and recognition. At this point, the problem of recognition on
printed text has been resolved almost completely (ABBYY's FineReader 12.0 commercial product
can recognize printed text in various languages, recognition software of Vietnamese words in
VnDOCR 4.0 of the Hanoi Information Technology Institute can recognize with accuracy over
98%). However, in the world as well as in Vietnam, the page analysis problem remains a major
challenge for researchers. Until now, page analysis is still receiving the attention of many
researchers. Every two years in the world there is an international page analysis contest to
promote the development of page analysis algorithms. These were the motivations for the
dissertation to try researching so that they can propose effective solutions to the page analysis
problem.
In recent years, there are many page analysis algorithms have been developed, especially are
hybrid-oriented approached development algorithms. The proposed algorithms show different
strengths and weaknesses, but in general most of them still suffer from two basic errors: an error
separating a correct text area into smaller that leads to mislead or miss the information of text
lines or paragraph (over-segmentation), the aggregation error of text areas in text columns or
paragraphs together (under-segmentation). Therefore, the objective of the dissertation is to
study and develop page analysis algorithms that simultaneously reduce both types of errors:
over-segmentation, under-segmentation. The issues in page analysis are very broad so the
dissertation limits the scale of the study within the scope of text image pages written in Latin
language which particularly is English and focuses on the analysis of the text areas. The
dissertation has not proposed the problem of detecting and analyzing the structure of table
spaces, detecting image areas and analyzing logical structures. With the objectives of the
dissertation have achieved the following results:
1. Propose a solution that speeds up the algorithm for detecting background images.
2. Proposed adaptive parameterization method reduces the effect of size and font type on
the results of page analysis.
3. Proposed a new solution for the problem of detecting and using separator objects in page
analysis algorithms.
4. Proposes a new solution that separates text areas into paragraphs based on context
analysis
26 trang |
Chia sẻ: thientruc20 | Lượt xem: 338 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Tóm tắt Luận án Document geometric layout analysis based on adaptive threshold, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
MINISTRY OF EDUCATION AND
TRAINING
VIETNAM ACADEMY OF SCIENCE
AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
.............***.............
HA DAI TON
DOCUMENT GEOMETRIC LAYOUT ANALYSIS BASED
ON ADAPTIVE THRESHOLD
Major: Mathematics for Informatics
Code: 62 46 01 10
SUMMARY OF PhD THESIS IN MATHEMATICS
Hanoi - 2018
The work was completed at: Graduate university of Science and
Technology – Vietnam Academy of Science and Technology
Supervisor: Prof. Dr Nguyen Duc Dung
Review 1: ...
Review 2: ...
Review 3: ....
The thesis will be protected on the PhD thesis defense, meeting at the Graduate
university of Science and Technology – Vietnam Academy of Science and
Technology on ... hour ..., date ... month ... 201 ... .
The dissertation can be found at:
- Library of the Graduate university of Science and Technology
- National Library of Vietnam
INTRODUCTION
Text recognition is a field that has been researched and applied for many years. Text recognition
process is performed through the following main steps: The input image page will go through the
preprocessing step, then the page analysis step, the output of the page analysis will be the input
of the recognition step, and finally post-processing. The result of a recognition system depends
on two main steps: page analysis and recognition. At this point, the problem of recognition on
printed text has been resolved almost completely (ABBYY's FineReader 12.0 commercial product
can recognize printed text in various languages, recognition software of Vietnamese words in
VnDOCR 4.0 of the Hanoi Information Technology Institute can recognize with accuracy over
98%). However, in the world as well as in Vietnam, the page analysis problem remains a major
challenge for researchers. Until now, page analysis is still receiving the attention of many
researchers. Every two years in the world there is an international page analysis contest to
promote the development of page analysis algorithms. These were the motivations for the
dissertation to try researching so that they can propose effective solutions to the page analysis
problem.
In recent years, there are many page analysis algorithms have been developed, especially are
hybrid-oriented approached development algorithms. The proposed algorithms show different
strengths and weaknesses, but in general most of them still suffer from two basic errors: an error
separating a correct text area into smaller that leads to mislead or miss the information of text
lines or paragraph (over-segmentation), the aggregation error of text areas in text columns or
paragraphs together (under-segmentation). Therefore, the objective of the dissertation is to
study and develop page analysis algorithms that simultaneously reduce both types of errors:
over-segmentation, under-segmentation. The issues in page analysis are very broad so the
dissertation limits the scale of the study within the scope of text image pages written in Latin
language which particularly is English and focuses on the analysis of the text areas. The
dissertation has not proposed the problem of detecting and analyzing the structure of table
spaces, detecting image areas and analyzing logical structures. With the objectives of the
dissertation have achieved the following results:
1. Propose a solution that speeds up the algorithm for detecting background images.
2. Proposed adaptive parameterization method reduces the effect of size and font type on
the results of page analysis.
3. Proposed a new solution for the problem of detecting and using separator objects in page
analysis algorithms.
4. Proposes a new solution that separates text areas into paragraphs based on context
analysis.
CHAPTER 1. OVERVIEW OF DOCUMENT LAYOUT ANALYSIS
In this chapter, I present an overview of the text recognition system, the page analysis
problem, the typical page analysis algorithms, the most basic errors of page analysis algorithms.
This leads to the research objectives and results of this dissertation.
1.1. The main elements of the text recognition system
Basically, a text recognition system is usually done through the basic steps described in
Figure 1. Information is in the form of text such as books, newspapers, magazines, etc. after
scanning process, it will show us the result in the image file. These image files will be the input of
an recognition system, the output of the recognition system are text files that can be easily edited
and archived, such as files of * .doc, * .docx, * .excel, * .pdf, etc. The dissertation focuses on
studying the the page analysis steps, in which the focus is the analysis of the geometric structure
of the layout.
Figure 1: Illustration of basic processing steps of text recognition system
1.1.1. Pre-processing
The task of pre-processing a layout is usually binary, defines the components of connected
image, filters noise, and aligns the gradient. The output of the pre-processing step will be the
input of the page analysis process. As a result, the pre-processing results will also have significant
effects on the results of the page analysis.
1.1.2. Document layout analysis
Document layout analysis is one of the major components of text recognition systems
(OCR - System). Besides, it is also widely used in other fields of computing such as document
digitization, automatic data entry, computer vision, etc. The task of page analysis includes
automatically detecting image areas on a document layout (physical structure) and categorize
them into different data regions such as text area, image, table, header, footer, etc. (logical
structure). Page analysis results are used as an input to the recognition and automatic data entry
of document imaging processing systems.
1.1.3. Recognition of optical characters
This is the most important stage, this stage determines the accuracy of the recognition
system. There are many different classification methods applied to word recognition systems,
such as: matching method, direct approach method, grammar method, graph method, neural
network, statistic method, and support vector machine.
1.1.4. Post-processing
This is the final stage of the recognition process. Maybe post-processing is a step to joint
the recognized characters into words, sentences, and paragraphs to reconstitute text while
detecting false recognized errors by checking spelling based on structure and semantics of words,
sentences or paragraphs of text. The discovery of errors, mistakes in recognition at this stage
significantly contributed to improving the quality of recognition.
Document layout Pre-processing Analysis of the geometric
structure
Text file Post-processing Recognize
Analysis of the logical
structure
1.2. The typical algorithms for analyzing page’s geometric structure
Over the decades of development so far, there are a lot of page analysis algorithms have
been published. Based on the order of algorithms’ execution, document layout analyzing
algorithms can be divided into three different directions of approach: top-down, bottom-up and
Hybrid methods.
1.2.1. Top-down direction of approach
Typical top-down algorithms such as XY Cut, WhiteSpace, etc. These approach algorithms
perform page analysis by dividing the document layout into horizontal or vertical directions
under spaces in the page. These spaces are usually along the boundary of the column or border of
paragraphs. The strength of these algorithms is their low computational complexity, which
results in good analysis on rectangular pages, ie, layouts where the image areas can be
surrounded by rectangle does not cross. However, they cannot process pages which are non-
rectangular image areas.
1.2.2. Bottom-up direction of approach
Typical bottom-up algorithms such as Smearing, Docstrum, Voronoi, etc. These approach
algorithms start with small areas of the image (pixels or characters) and in turn group the small
areas of the same type together to form the image area. The strength of this approach is that
algorithms can well process image pages with any structure (rectangle or non-rectangle). The
weakness of bottom-up algorithms is that memory is slow, because small areas are grouped
together based on distance parameters, which are typically estimated on the entire image page.
So these algorithms are often too sensitive to parameter values and over-segmentation of
textured image areas, especially font areas with differences in font size and style.
1.2.3. Hybrid direction of approach
From the above analysis, the advantage of the bottom-up direction of approach is the
disadvantage of the Top-down direction of approach and vice versa. Thus, in recent years there
have been many algorithms developed in the hybrid between top-down and bottom-up, one of
the typical algorithms such as RAST, Tab-Stop, PAL, etc. Algorithms developed in this direction
are often based on analytic objects such as clear space of rectangles, tab stops, etc. to infer the
structure of text columns. From there, the image areas are determined by the bottom-up method.
The results show that hybrid algorithms have overcome some of the limitations of top-down and
bottom-up algorithms, which can be implemented on any document layouts with any structure
and less restrictions on distance parameters. However, defining analytic objects is a difficult
problem for many reasons, such as having too closely spaced letters, the text area is aligned, left
and right are not aligned or the distance between connected components is too large, etc. This
has led to the fact that existing algorithms often suffer from forgotten errors or misidentification
of analytical paths leading to error analysis.
1.3. Methods and data sets that evaluate the document layout analysis algorithms
1.3.1. Measure
Evaluating analysis algorithms for document layout is always a complex issue as it
depends on data sets, ground-truths, and evaluation methods. The issue of evaluating the quality
of page analysis algorithms has received a lot of attention. In this dissertation, three measures are
used: F-Measure, PSET-Measure and PRImA-Measure for all experimental assessments. PRImA-
Measure has been successfully used at international page analysis events in 2009, 2011, 2013,
2015 and 2017.
1.3.2. Data
In this dissertation, I used three data sets of UW-III, a PRImA data set and a UNLV data set
for experimental assessment and comparison of document layout analysis algorithms. The UW-III
has 1600 images, PRImA has 305 images, and UNLV has 2000 images. These data sets have a
ground-truth at the paragraph level and text level, represented by non-intersecting polygons. The
layouts are scanned at 300 DPI resolution and have been re-adjusted the tilt. It contains a variety
of layouts on layout styles, which reflect many of the challenges of page analysis. The structure of
the layout contains a blend from simple to complex, consists of pictures with text around the
layouts, with a large change in font size. Therefore, these are very good data sets to perform
comparative analysis of page analysis algorithms.
1.4. Conclusion of chapter
This chapter presents an overview of the field of text recognition, in which page analysis is
an important step. So far the problem of page analysis is still a problem that many domestic and
foreign research interest. There are many recommended page analysis algorithms, especially at
international page analysis competitions (ICDAR). However, the algorithms still suffer from two
basic errors: over-segmentation and under-segmentation. Therefore, the dissertation will focus
on the solutions for the problem of document layout analysis.
There are three main approaches for the problem of document layout analysis: top-down,
bottom-up and hybrid. In particular, the hybrid approach has been thriving in recent times as it
overcomes the disadvantages of both top-down and bottom-up approaches. For that reason, the
dissertation will focus more on hybrid algorithms, particularly the techniques for detecting and
using analytical objects of hybrid algorithms. The next chapter of the dissertation presents a
quick layout background detection technique, this technique will be used as a module in the
algorithm proposed in Chapter 3.
CHAPTER 2. QUICK ALGORITHM TO DECTECT THE BACKGROUND OF DOCUMENT LAYOUT
This chapter presents the advantages and disadvantages of a direction of approach based
on the background of layout background in document layout analysis, WhiteSpace page analysis
algorithms, fast layout background detection algorithms, and finally experimental results.
2.1. Advantages and disadvantages of the direction of approach based on the
background of layout background in document layout analysis
On the intuitive aspect, in many cases, the background layout can be detected more easily,
and at the same time based on the layout background can easily separate the page layout into
different areas. So early on, there were a lot of page analysis algorithms based on the layout
background developed, typical example such as X-Y Cut, WhiteSpace-Analysis, WhiteSpace-Cuts,
and etc. and recently there are also many algorithms based on the layout developed, for example,
Fraunhofer (winning at IC-DAR2009), Jouve (winning at ICDAR2011), PAL (winning at
ICDAR2013), etc. The direction of approach based on layout background is not only used in page
analysis, but also widely used in the problem of table detection, table structure analysis, and
logical structure analysis.
The above examples show that the direction of approach based on layout background has
many advantages. There are many different algorithms developed for layout background
detection, such as X-Y Cuts, WhiteSpace-Analysis, WhiteSpace-Cuts (hereinafter referred to as
WhiteSpace), etc. In which, WhiteSpace is known as a well-known geometric algorithm for layout
background detection, algorithms are included in the OCROpus open code-source so it is widely
used as a basic step to develop algorithm. However, the WhiteSpace algorithm has a very limited
execution time which is quite slow, as shown in Figure 2. Thus, acceleration of the WhiteSpace
algorithm has many real meanings.
2.2. Layout background detection algorithms (WhiteSpace) for the problem of page
analysis
Figure 2. Illustration of average execution time of each algorithm.
2.2.1. Definition
The largest white space in a layout is defined as the largest rectangle located in the
envelope of the layout and does not have any characters, as shown in Figure 3.
Figure 3. Blue rectangle represents the largest white space found.
2.2.2. The algorithm for finding the largest white space
The algorithm for finding the largest white space (hereinafter referred to as
MaxWhitespace) can be applied to objects that are points or rectangles. The key idea of the
algorithm is the branch and bound method and the Quicksort algorithm. Figure. 5 a) and 4
illustrate the fake code of algorithm and the step of dividing the rectangle into sub rectangles.
In the repository of this dissertation, the input of the algorithm is a set of rectangles (the
envelope of characters), the bound rectangle (envelope of whole layout) and the quality function
(rectangle), return to area of each rectangle, see Figure 4.a). The algorithm defines a state
consisting of a rectangle r, a set of obstacles rectangles (envelope of characters) that reside in the
rectangle r and the area of the rectangle r (q = quality (r)). State statei is defined as greater than
state statej if quality (ri)> quality (rj). The queue priority is used to store the state.
Each algorithm loop will derive state = (q, r, obstacles) as the beginning of the priority
queue, which is the state in which the rectangle r has the largest area. If no rectangular obstacles
are contained in r then r is the largest rectangular white area found and the algorithm terminates.
In contrast, the algorithm will select one of the rectangle obstacles to make pivot, the best choice
is as close to the center of the rectangle as possible, see Figure 4.b). We know that the largest
white space will not contain any rectangular obstacles so it will not contain the pivot either.
Therefore, there are four possibilities which may happens for the largest white space: is the left
and the right of the pivot, see Figure 4.c), or the top and bottom of the pivot, see Figure 4.d). Next,
the algorithm will identify the rectangle obstacles intersected with each of these sub rectangles,
with four sub rectangles r0, r1, r2, r3 generated from the rectangle r, see Figure 5 and calculate the
upper bound of the largest possible white space in each newly sub created rectangle, the upper
bound mainly selected is the area of each sub rectangle. The sub rectangle along with the
obstacles in it and the upper bound corresponding to it are pushed into the priority queue and
the above steps are repeated until the state appears with a rectangular r which does not contain
any obstacles. This rectangle is the overview solution of the problem to find the largest white
space.
Figure 4: Describes the step divided layout into four sub-regions of algorithm to find the largest white space, (a)
envelope and rectangles, (b) findable pivots, (c, d) left/right and above/below sub-regions.
def find_whitespace(bound,rectangles):
queue.enqueue(quality(bound),bound,rectangles)
while not queue.is_erapty():
(q,r,obstacles) = queue.dequeue_max0
if obstacles==[]:
return r
pivot = pick(obstacles)
r0 = (pivot.xl,r.yG,r.xl,r.yl)
rl = (r.x0,r.y0,pivot.x0,r.yl)
r2 = (r.x0,pivot.yl,r.xl,r.yl)
r3 = (r.x0,r.y0,r.xl,pivot.y0)
subrectangles = [r0,rl,r2,r3]
for sub_r in subrectangles:
sub_q = quality(sub_r)
sub_obstacles =
[list of u in obstacles if not
overlapslu,sub_r)]
queue.enqueue(sub_q,sub_r,sub_obStacies}
Figure 5: Illustrates the fake code of algorithm to find the largest white space.
2.2.3. Layout background detection algorithm
To detect the layout background, algorithm is proposed as a module of the WhiteSpace
algorithm applying the MaxWhitespace algorithm to find m-Whitespace (with m - Whitespace of
about 300 is sufficient to well describe the layout background), the following background
detection algorithm is called WhiteSpaceDetection. Diagram of the algorithm is shown in Figure 5
b).
2.3. Acceleration of layout background detection algorithm
To find the white space which cover the layout background, white space detection
algorithm recursively divides the layout into sub areas so that the sub area does not contain any
characters. When each repeat algorithm will divide each sub area of the layout into four different
sub-regions, See Figure 6. This process will form a quadrilateral tree, so if the loop is large then
the number of regions that need to be considered will be very large. Therefore, the execution
time of the algorithm is very slow. Therefore, in order to accelerate the layout background
detection algorithm, it is necessary to minimize the number of subspaces which need to be
considered, by limiting the arising of unnecessary sub branch of the quadrilateral tree.
Figure 6 shows that the ZG region (the grandparents region) is divided into four sub
regions: ZPT sub-region, ZPB sub-region, ZPL left sub-region, and ZPR right sub-region. Continuing to
divide the ZPT region, the sub-region must be ZCTR in the ZPR region, so when considering the ZPR
region, also consider the ZCTR region, or the ZCTR region to be reconsidered. The example
illustrated in Figure 6 shows that the sub-region on the ZC