Thesis Summary: Document geometric layout analysis based on adaptive threshold



MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY

HA DAI TON

DOCUMENT GEOMETRIC LAYOUT ANALYSIS BASED ON ADAPTIVE THRESHOLD

Major: Mathematics for Informatics
Code: 62 46 01 10

SUMMARY OF PhD THESIS IN MATHEMATICS

Hanoi - 2018

The work was completed at: Graduate University of Science and Technology, Vietnam Academy of Science and Technology
Supervisor: Prof. Dr Nguyen Duc Dung
Review 1: ...
Review 2: ...
Review 3: ...
The thesis will be defended before the PhD thesis defense committee, meeting at the Graduate University of Science and Technology, Vietnam Academy of Science and Technology, on ... hour ..., date ... month ... 201 ... .
The dissertation can be found at:
- Library of the Graduate University of Science and Technology
- National Library of Vietnam

INTRODUCTION

Text recognition has been researched and applied for many years. A text recognition process runs through the following main steps: the input page image goes through pre-processing and then page analysis; the output of page analysis is the input of the recognition step; finally, post-processing is applied. The result of a recognition system depends on two main steps: page analysis and recognition. At this point, the problem of recognizing printed text has been almost completely resolved (the commercial product ABBYY FineReader 12.0 can recognize printed text in various languages, and the Vietnamese recognition software VnDOCR 4.0 of the Hanoi Information Technology Institute achieves an accuracy of over 98%). However, in Vietnam as well as worldwide, page analysis remains a major challenge and still receives the attention of many researchers. An international page analysis competition is held every two years to promote the development of page analysis algorithms.
These were the motivations for the dissertation to look for effective solutions to the page analysis problem. In recent years, many page analysis algorithms have been developed, especially hybrid algorithms. The proposed algorithms show different strengths and weaknesses, but most of them still suffer from two basic errors: splitting a correct text region into smaller regions, which distorts or loses the information of text lines or paragraphs (over-segmentation), and merging text regions from different text columns or paragraphs (under-segmentation). Therefore, the objective of the dissertation is to study and develop page analysis algorithms that simultaneously reduce both types of errors: over-segmentation and under-segmentation.

Page analysis is a very broad problem, so the dissertation limits its scope to page images of text written in Latin script, specifically English, and focuses on the analysis of text regions. The dissertation does not address detecting and analyzing the structure of tables, detecting image regions, or analyzing logical structure. With these objectives, the dissertation has achieved the following results:
1. Proposed a solution that speeds up the algorithm for detecting the background of page images.
2. Proposed an adaptive parameter method that reduces the effect of font size and font type on the results of page analysis.
3. Proposed a new solution for detecting and using separator objects in page analysis algorithms.
4. Proposed a new solution that separates text regions into paragraphs based on context analysis.

CHAPTER 1. OVERVIEW OF DOCUMENT LAYOUT ANALYSIS

In this chapter, I present an overview of the text recognition system, the page analysis problem, typical page analysis algorithms, and the most basic errors of page analysis algorithms.
This leads to the research objectives and results of this dissertation.

1.1. The main elements of the text recognition system

A text recognition system usually runs through the basic steps described in Figure 1. Information in text form, such as books, newspapers and magazines, is scanned into image files. These image files are the input of a recognition system; its output is text files that can be easily edited and archived, such as *.doc, *.docx, *.excel, *.pdf files. The dissertation focuses on the page analysis step, in particular on the analysis of the geometric structure of the layout.

Figure 1: Illustration of the basic processing steps of a text recognition system

1.1.1. Pre-processing

Pre-processing usually consists of binarizing the page image, determining its connected components, filtering noise, and correcting skew. The output of pre-processing is the input of page analysis, so pre-processing results have a significant effect on page analysis results.

1.1.2. Document layout analysis

Document layout analysis is one of the major components of text recognition (OCR) systems. It is also widely used in other fields of computing such as document digitization, automatic data entry, computer vision, etc. The task of page analysis is to automatically detect the regions of a document page (physical structure) and to classify them into region types such as text, image, table, header, footer, etc. (logical structure). Page analysis results are used as input to recognition and automatic data entry in document image processing systems.

1.1.3. Optical character recognition

This is the most important stage; it determines the accuracy of the recognition system.
Many different classification methods are applied in character recognition systems, such as matching methods, direct approach methods, grammar methods, graph methods, neural networks, statistical methods, and support vector machines.

1.1.4. Post-processing

This is the final stage of the recognition process. Post-processing typically joins the recognized characters into words, sentences, and paragraphs to reconstruct the text, while detecting recognition errors by checking spelling based on the structure and semantics of words, sentences or paragraphs. Detecting recognition errors at this stage contributes significantly to the quality of recognition.

[Figure 1 block labels: Document layout, Pre-processing, Analysis of the geometric structure, Analysis of the logical structure, Recognition, Post-processing, Text file]

1.2. Typical algorithms for analyzing the geometric structure of a page

Over decades of development, many page analysis algorithms have been published. Based on their order of execution, document layout analysis algorithms can be divided into three approaches: top-down, bottom-up and hybrid.

1.2.1. Top-down approach

Typical top-down algorithms include XY Cut, WhiteSpace, etc. These algorithms analyze a page by splitting it horizontally or vertically along the white spaces in the page, which usually run along column boundaries or paragraph borders. Their strengths are low computational complexity and good results on rectangular layouts, i.e., layouts whose regions can be enclosed by non-intersecting rectangles. However, they cannot process pages with non-rectangular regions.

1.2.2. Bottom-up approach

Typical bottom-up algorithms include Smearing, Docstrum, Voronoi, etc.
These algorithms start from small elements of the image (pixels or characters) and successively group elements of the same type into regions. Their strength is that they can process pages of any structure (rectangular or non-rectangular). Their weaknesses are memory use and speed. In addition, because small elements are grouped using distance parameters that are typically estimated over the entire page, these algorithms are often too sensitive to parameter values and over-segment text regions, especially regions that differ in font size and style.

1.2.3. Hybrid approach

From the above analysis, the advantages of the bottom-up approach are the disadvantages of the top-down approach and vice versa. Thus, in recent years many algorithms have been developed as hybrids of top-down and bottom-up processing; typical examples are RAST, Tab-Stop, PAL, etc. These algorithms are often based on analysis objects such as white-space rectangles, tab stops, etc. to infer the structure of text columns. The regions are then determined by a bottom-up method. The results show that hybrid algorithms overcome some of the limitations of top-down and bottom-up algorithms: they can be applied to layouts of any structure and depend less on distance parameters. However, detecting analysis objects is difficult for many reasons, such as characters that are spaced too closely, text that is justified or not aligned on the left and right, or connected components that are too far apart. As a result, existing algorithms often miss or misidentify analysis objects, which leads to analysis errors.

1.3. Evaluation measures and data sets for document layout analysis algorithms

1.3.1. Measure

Evaluating document layout analysis algorithms is always a complex issue, as it depends on data sets, ground truth, and evaluation methods. The issue has received a lot of attention. In this dissertation, three measures are used for all experimental assessments: F-Measure, PSET-Measure and PRImA-Measure. PRImA-Measure has been used successfully at the international page analysis competitions in 2009, 2011, 2013, 2015 and 2017.

1.3.2. Data

In this dissertation, I use three data sets, UW-III, PRImA and UNLV, for experimental assessment and comparison of document layout analysis algorithms. UW-III has 1600 images, PRImA has 305 images, and UNLV has 2000 images. These data sets have ground truth at the paragraph level and text level, represented by non-intersecting polygons. The pages are scanned at 300 DPI and have been skew-corrected. They cover a variety of layout styles, which reflect many of the challenges of page analysis. The layouts range from simple to complex, include pictures surrounded by text, and vary widely in font size. Therefore, these are very good data sets for comparing page analysis algorithms.

1.4. Conclusion of chapter

This chapter presents an overview of the field of text recognition, in which page analysis is an important step. Page analysis is still a problem that interests many researchers in Vietnam and abroad. Many page analysis algorithms have been proposed, especially at the international page analysis competitions (ICDAR). However, these algorithms still suffer from two basic errors: over-segmentation and under-segmentation. Therefore, the dissertation focuses on solutions to the document layout analysis problem. There are three main approaches to this problem: top-down, bottom-up and hybrid.
In particular, the hybrid approach has thrived in recent times because it overcomes the disadvantages of both the top-down and bottom-up approaches. For that reason, the dissertation focuses on hybrid algorithms, particularly on the techniques for detecting and using analysis objects in hybrid algorithms. The next chapter presents a fast layout background detection technique, which is used as a module in the algorithm proposed in Chapter 3.

CHAPTER 2. QUICK ALGORITHM TO DETECT THE BACKGROUND OF DOCUMENT LAYOUT

This chapter presents the advantages and disadvantages of the layout-background-based approach in document layout analysis, the WhiteSpace page analysis algorithm, the fast layout background detection algorithm, and finally experimental results.

2.1. Advantages and disadvantages of the layout-background-based approach in document layout analysis

Intuitively, in many cases the layout background is easier to detect, and the background can readily be used to separate the page into different regions. For this reason many early page analysis algorithms were based on the layout background, for example X-Y Cut, WhiteSpace-Analysis and WhiteSpace-Cuts, and so are many recent ones, for example Fraunhofer (winner at ICDAR2009), Jouve (winner at ICDAR2011) and PAL (winner at ICDAR2013). The layout-background-based approach is used not only in page analysis but also widely in table detection, table structure analysis, and logical structure analysis. These examples show that the layout-background-based approach has many advantages.
Many different algorithms have been developed for layout background detection, such as X-Y Cuts, WhiteSpace-Analysis, WhiteSpace-Cuts (hereinafter referred to as WhiteSpace), etc. Among them, WhiteSpace is a well-known geometric algorithm for layout background detection; it is included in the open-source OCRopus system, so it is widely used as a basic step when developing new algorithms. However, the WhiteSpace algorithm has a major limitation: its execution is quite slow, as shown in Figure 2. Accelerating the WhiteSpace algorithm is therefore of real practical value.

2.2. Layout background detection algorithm (WhiteSpace) for the problem of page analysis

Figure 2. Illustration of the average execution time of each algorithm.

2.2.1. Definition

The largest white space in a layout is defined as the largest rectangle that lies inside the bounding box of the layout and contains no characters, as shown in Figure 3.

Figure 3. The blue rectangle represents the largest white space found.

2.2.2. The algorithm for finding the largest white space

The algorithm for finding the largest white space (hereinafter referred to as MaxWhitespace) can be applied to objects that are points or rectangles. The key ideas of the algorithm are the branch-and-bound method and pivot-based subdivision as in the Quicksort algorithm. Figure 5 a) and Figure 4 illustrate the pseudocode of the algorithm and the step of dividing a rectangle into sub-rectangles. In this dissertation, the input of the algorithm is a set of rectangles (the bounding boxes of characters), the bound rectangle (the bounding box of the whole layout) and a quality function quality(rectangle) that returns the area of a rectangle, see Figure 4.a). The algorithm defines a state consisting of a rectangle r, the set of obstacle rectangles (bounding boxes of characters) that lie in r, and the area of r (q = quality(r)). A state state_i is defined as greater than a state state_j if quality(r_i) > quality(r_j). A priority queue is used to store the states.
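The state ordering described above can be sketched with Python's standard heapq module. This is only an illustrative sketch: the State class, the area helper and the sample rectangles are assumptions made for the example and do not come from the dissertation. Because heapq is a min-heap, the area is stored negated so that the state with the largest rectangle is dequeued first.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class State:
    # Sort key: negated area, because heapq always pops the smallest item.
    neg_quality: float
    # Rectangle (x0, y0, x1, y1); excluded from comparisons.
    rect: tuple = field(compare=False)
    # Obstacle rectangles lying in rect; excluded from comparisons.
    obstacles: list = field(compare=False)


def area(rect):
    """Area of a rectangle given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return max(x1 - x0, 0) * max(y1 - y0, 0)


def make_state(rect, obstacles):
    """Build a state whose priority is the area of its rectangle."""
    return State(-area(rect), rect, obstacles)


# Hypothetical rectangles with areas 16, 60 and 21.
queue = []
for rect in [(0, 0, 4, 4), (0, 0, 10, 6), (2, 2, 5, 9)]:
    heapq.heappush(queue, make_state(rect, []))

# The head of the queue is the state with the largest area.
largest = heapq.heappop(queue)
print(largest.rect, -largest.neg_quality)
```

Excluding the rectangle and the obstacle list from the comparison keeps the ordering tied to the quality value alone, which matches the definition of "greater" given above.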
In each iteration, the algorithm dequeues the state (q, r, obstacles) at the head of the priority queue, which is the state whose rectangle r has the largest area. If r contains no obstacles, then r is the largest white rectangle and the algorithm terminates. Otherwise, the algorithm selects one of the obstacles as the pivot; the best choice is an obstacle as close to the center of r as possible, see Figure 4.b). The largest white space contains no obstacles, so it does not contain the pivot either. Therefore, there are four possibilities for the largest white space: it lies to the left or to the right of the pivot, see Figure 4.c), or above or below the pivot, see Figure 4.d). Next, the algorithm generates the four sub-rectangles r0, r1, r2, r3 from r (see Figure 5), determines the obstacles that intersect each of them, and computes an upper bound on the largest possible white space in each new sub-rectangle; the upper bound chosen is simply the area of the sub-rectangle. Each sub-rectangle, together with its obstacles and its upper bound, is pushed into the priority queue, and the steps above are repeated until a state appears whose rectangle r contains no obstacles. This rectangle is the globally optimal solution of the largest white space problem.

Figure 4: The step that divides the layout into four sub-regions in the algorithm for finding the largest white space: (a) bounding box and rectangles, (b) candidate pivots, (c, d) left/right and above/below sub-regions.
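A single branching step can be sketched as follows. The names Rect, pick_central and split and the sample coordinates are hypothetical and serve only as an illustration; the pivot rule follows the "closest to the center" heuristic mentioned above, and the area of each sub-rectangle is taken as its upper bound.

```python
from collections import namedtuple

Rect = namedtuple("Rect", "x0 y0 x1 y1")


def center(r):
    """Center point of rectangle r."""
    return ((r.x0 + r.x1) / 2, (r.y0 + r.y1) / 2)


def area(r):
    """Area of r; used as the upper bound for white space inside r."""
    return max(r.x1 - r.x0, 0) * max(r.y1 - r.y0, 0)


def pick_central(obstacles, r):
    """Return the obstacle whose center is closest to the center of r."""
    cx, cy = center(r)

    def dist2(u):
        ux, uy = center(u)
        return (ux - cx) ** 2 + (uy - cy) ** 2

    return min(obstacles, key=dist2)


def split(r, pivot):
    """Sub-rectangles of r on the four sides of the pivot, clipped to r."""
    candidates = [
        Rect(r.x0, r.y0, min(pivot.x0, r.x1), r.y1),  # x below pivot.x0
        Rect(max(pivot.x1, r.x0), r.y0, r.x1, r.y1),  # x above pivot.x1
        Rect(r.x0, r.y0, r.x1, min(pivot.y0, r.y1)),  # y below pivot.y0
        Rect(r.x0, max(pivot.y1, r.y0), r.x1, r.y1),  # y above pivot.y1
    ]
    # Drop empty sub-rectangles (pivot touching or crossing a border of r).
    return [s for s in candidates if s.x1 > s.x0 and s.y1 > s.y0]


# Hypothetical page bounding box with three character boxes.
bound = Rect(0, 0, 100, 60)
obstacles = [Rect(5, 5, 15, 15), Rect(45, 25, 55, 35), Rect(80, 40, 95, 55)]

pivot = pick_central(obstacles, bound)
parts = split(bound, pivot)
bounds = [area(s) for s in parts]
print(pivot, bounds)
```

In this example the central box Rect(45, 25, 55, 35) becomes the pivot, and the four sub-rectangles have upper bounds 2700, 2700, 2500 and 2500. Choosing a central pivot tends to produce sub-rectangles of similar size, which keeps the search tree balanced.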
    import heapq
    from collections import namedtuple

    Rect = namedtuple("Rect", "x0 y0 x1 y1")

    def quality(r):  # area of r, the upper bound for white space inside r
        return max(r.x1 - r.x0, 0) * max(r.y1 - r.y0, 0)

    def overlaps(u, r):  # True if u and r share interior area
        return u.x0 < r.x1 and r.x0 < u.x1 and u.y0 < r.y1 and r.y0 < u.y1

    def pick(obstacles):  # assumed helper: any obstacle is a valid pivot
        return obstacles[0]

    def find_whitespace(bound, rectangles):
        # heapq is a min-heap: store -quality so the largest area comes first
        queue = [(-quality(bound), bound, rectangles)]
        while queue:
            q, r, obstacles = heapq.heappop(queue)
            if not obstacles:
                return r  # r contains no obstacle: largest white space
            pivot = pick(obstacles)
            r0 = Rect(pivot.x1, r.y0, r.x1, r.y1)  # x beyond pivot.x1
            r1 = Rect(r.x0, r.y0, pivot.x0, r.y1)  # x before pivot.x0
            r2 = Rect(r.x0, pivot.y1, r.x1, r.y1)  # y beyond pivot.y1
            r3 = Rect(r.x0, r.y0, r.x1, pivot.y0)  # y before pivot.y0
            for sub_r in (r0, r1, r2, r3):
                sub_q = quality(sub_r)
                # keep only the obstacles that intersect the sub-rectangle
                sub_obstacles = [u for u in obstacles if overlaps(u, sub_r)]
                heapq.heappush(queue, (-sub_q, sub_r, sub_obstacles))

Figure 5: Pseudocode of the algorithm for finding the largest white space.

2.2.3. Layout background detection algorithm

To detect the layout background, the WhiteSpace algorithm uses a module that applies the MaxWhitespace algorithm to find m white spaces (about m = 300 white spaces are sufficient to describe the layout background well); this background detection algorithm is hereinafter called WhiteSpaceDetection. The diagram of the algorithm is shown in Figure 5 b).

2.3. Acceleration of the layout background detection algorithm

To find the white spaces that cover the layout background, the white space detection algorithm recursively divides the layout into sub-regions until a sub-region contains no characters. In each iteration the algorithm divides a sub-region of the layout into four sub-regions, see Figure 6. This process forms a four-way tree, so if the number of iterations is large, the number of regions to be considered becomes very large, and the algorithm runs very slowly. Therefore, to accelerate the layout background detection algorithm, the number of sub-regions to be considered must be minimized by limiting the growth of unnecessary branches of the tree. Figure 6 shows that the region ZG (the grandparent region) is divided into four sub-regions: the top sub-region ZPT, the bottom sub-region ZPB, the left sub-region ZPL, and the right sub-region ZPR.
Continuing to divide the ZPT region yields a sub-region ZCTR that lies inside the ZPR region, so when the ZPR region is considered, the ZCTR region is considered again. The example illustrated in Figure 6 shows that the sub-region on the ZC