KAUNAS UNIVERSITY OF
SIGNALS THEORY PAPER WORK
WORD RECOGNITION AND ITS METHODS
Amir Salha Lecturer: Marius Gudauskis
Table of Contents
I. Abstract
II. Introduction to word recognition
III. Visual word recognition
i. Theoretical Approach
Bayes theorem
Interactive activation (IA) model
Lexical competition
Lexical decision
Masked priming
Neighbourhood density
Open bigrams
Reaction time (RT) distribution
Word-frequency effect
ii. Practical Approach
Step 1: Detect Candidate Text Regions Using MSER
Step 2: Remove Non-Text Regions Based on Basic Geometric Properties
Step 3: Remove Non-Text Regions Based on Stroke Width Variation
Step 4: Merge Text Regions for Final Detection Result
Step 5: Recognize Detected Text Using OCR
Text Recognition Using the OCR Function
Challenges Obtaining Accurate Results
Image Pre-processing Techniques to Improve Results
ROI-based Processing to Improve Results
IV. Speech-recognition systems and its Example
The Development Workflow
Acquiring Speech
Analyzing the Acquired Speech
Developing a Speech-Detection Algorithm
Developing the Acoustic Model
Selecting a Classification Algorithm
Building the User Interface
V. Conclusion
VI. References
I. Abstract:
This paper explains what word recognition is and which methods are used to
approach it. It introduces visual and speech word recognition and their
different approaches, with examples and real-life computation.
II. Introduction to word recognition:
Word recognition is a computational process that converts visual content
(pictures, video) or speech (sound, live voice) into a text document. It can
be applied in several ways, either by scanning text, as OCR tools do, or by
capturing live images. It can also take the form of speech recognition, the
area of computational linguistics that develops methods and technologies
enabling computers to recognize spoken language and translate it into text.
III. Visual word recognition:
Visual word recognition is the branch of computer science that reads text
from various sources and translates the pictures, recordings or live images
into a form that the computer can manipulate (for instance, into ASCII codes).
i. Theoretical Approach:
Bayes theorem:
A mathematical procedure for updating probabilities or beliefs in the light
of new evidence. In the case of word recognition, the probability of a word
given the input, or evidence, is as follows:
Figure 1: Mathematical equation used
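The update described above can be illustrated with the standard form of Bayes theorem, P(W | I) = P(I | W) P(W) / P(I), where W is a candidate word and I is the input. The following is a minimal Python sketch, not the model used in the paper. The function name `posterior` and all the prior and likelihood values are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(word | input) for each candidate word using Bayes theorem."""
    # Unnormalised posterior: P(input | word) * P(word)
    joint = {w: likelihoods[w] * priors[w] for w in priors}
    # P(input), obtained by summing over all candidate words
    evidence = sum(joint.values())
    return {w: p / evidence for w, p in joint.items()}


# Hypothetical values (assumptions for illustration only).
priors = {"WORD": 0.6, "WORK": 0.3, "WORM": 0.1}       # P(word), e.g. from word frequency
likelihoods = {"WORD": 0.2, "WORK": 0.7, "WORM": 0.1}  # P(input | word)

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # candidate with the highest posterior probability
print(best, round(post[best], 3))
```

Here the less frequent candidate wins because the evidence favours it strongly enough to outweigh its lower prior.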
Some models are expressed as artificial neural networks. These models are
intended to capture general properties of neurons, or neuronal populations.
Interactive activation (IA) model:
Words are represented as nodes in a network, and these nodes are linked
by inhibitory connections.
Figure 2: The
top panel illustrates a simplified interactive activation model.
Lexical competition:
Neighbouring words compete with each other for recognition because of the
inhibitory connections between word nodes.
Lexical decision:
Participants are required to decide whether a string of letters is a word or
not (a nonword).
Masked priming:
A variation on the lexical decision task in which the target is preceded by a
briefly displayed prime, which can be a word or a nonword. The prime is
generally displayed in lower case and the target in upper case to limit
physical overlap. Masked priming is most
commonly used to address questions concerning the representation of
Neighbourhood density:
A measure of how similar a word is to other words. A typical measure is the
number of other words that can be formed by changing a single letter in a
word. This means that only words of the same length can be neighbours.
A more flexible measure expresses similarity in terms of the number of 'edits'
(insertions, deletions, and substitutions), so WORD and WORDS would now be
considered to be neighbours.
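Both measures can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the small `lexicon` set are hypothetical, and a real study would use a full word list.

```python
def substitution_neighbours(word, lexicon):
    """Words of the same length that differ from `word` in exactly one letter."""
    return sorted(
        w for w in lexicon
        if len(w) == len(word) and sum(a != b for a, b in zip(w, word)) == 1
    )


def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if letters match)
            ))
        prev = curr
    return prev[-1]


def edit_neighbours(word, lexicon):
    """Words that are exactly one edit away from `word`, whatever their length."""
    return sorted(w for w in lexicon if edit_distance(w, word) == 1)


# Hypothetical mini-lexicon, for illustration only.
lexicon = {"WORD", "WORDS", "WARD", "CORD", "WORK", "LORD", "SWORD", "WOOD", "BIRD"}

print(substitution_neighbours("WORD", lexicon))
print(edit_neighbours("WORD", lexicon))
```

With the edit-based measure, WORDS and SWORD join the neighbourhood of WORD, which the single-letter substitution measure excludes because they differ in length.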
Open bigrams:
A proposal that the order of letters in a word is coded in terms of a set of
ordered letter pairs, which may be non-adjacent. WORD may be coded as WO, WR,
WD, OR, OD, and RD.
Figure 3: Three
different representations of letter order.
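The open-bigram coding of WORD given above can be reproduced with a short Python sketch. The function name `open_bigrams` and the comparison with the transposed string WROD are illustrative assumptions, not part of a specific published model.

```python
from itertools import combinations


def open_bigrams(word):
    """All ordered letter pairs of `word`, whether the letters are adjacent or not."""
    return ["".join(pair) for pair in combinations(word, 2)]


bigrams = open_bigrams("WORD")
print(bigrams)

# Illustrative comparison: a string with two letters transposed still shares
# most of its open bigrams with the original word.
shared = set(open_bigrams("WORD")) & set(open_bigrams("WROD"))
print(sorted(shared))
```

Because the pairs need not be adjacent, transposing two letters changes only one bigram, which is one reason this kind of coding tolerates small changes in letter order.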
Reaction time (RT) distribution:
Factors such as word frequency shift the mean of the distribution and, more
often than not, the shape of the distribution as well.
Word-frequency effect:
The most significant influence on how easily a word can be identified is its
frequency of occurrence in the language.
Words that appear often in the language are recognised more rapidly than
low-frequency words.
The speed and ease with which words can be recognised is an approximately
logarithmic function of word frequency.
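A logarithmic relationship means that each tenfold increase in frequency speeds recognition by roughly the same amount. The sketch below illustrates this with the form RT = a - b * log10(f). The intercept and slope are hypothetical numbers chosen only to show the shape of the function. They are not measured values.

```python
import math

# Hypothetical coefficients (assumptions, not experimental results).
INTERCEPT_MS = 700.0  # predicted RT for a word occurring once per million
SLOPE_MS = 40.0       # speed-up per tenfold increase in frequency


def predicted_rt(frequency_per_million):
    """Illustrative reaction time in milliseconds for a given word frequency."""
    return INTERCEPT_MS - SLOPE_MS * math.log10(frequency_per_million)


# The gain from 1 to 10 occurrences per million equals the gain from 100 to 1000.
gain_low = predicted_rt(1) - predicted_rt(10)
gain_high = predicted_rt(100) - predicted_rt(1000)
print(gain_low, gain_high)
```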
ii. Practical Approach:
Automatically detect and recognize text in natural images:
The technique used here demonstrates how to detect discrete text in an image
that contains text. It also covers known situations in which the position of
the text is known beforehand.
The automated text-detection algorithm first detects a large set of candidate
text regions and then progressively removes the candidates that are unlikely
to contain text.
Step 1: Detect Candidate Text Regions Using MSER;
Figure 4: MSER regions
Step 2: Remove Non-Text Regions Based on Basic Geometric Properties;
Geometric properties are used to distinguish text regions from non-text regions.
Figure 5: after removing non-text regions
based on basic geometric properties
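The geometric filtering step can be sketched as follows. This is a simplified illustration in Python, not the implementation used to produce the figures. The region dictionaries, the two properties used (aspect ratio and extent) and all threshold values are assumptions made for the example.

```python
def keep_text_like(regions, max_aspect_ratio=3.0, min_extent=0.2, max_extent=0.9):
    """Keep candidate regions whose simple geometric properties resemble text.

    Each region is a dict with its bounding-box `width` and `height` and the
    number of pixels it covers (`area`). Thresholds are illustrative assumptions.
    """
    kept = []
    for region in regions:
        aspect_ratio = region["width"] / region["height"]
        # Extent: fraction of the bounding box actually covered by the region.
        extent = region["area"] / (region["width"] * region["height"])
        if aspect_ratio <= max_aspect_ratio and min_extent <= extent <= max_extent:
            kept.append(region)
    return kept


# Hypothetical candidate regions.
regions = [
    {"name": "letter", "width": 20, "height": 30, "area": 300},
    {"name": "long line", "width": 200, "height": 5, "area": 900},
    {"name": "solid block", "width": 40, "height": 40, "area": 1600},
]

kept_names = [r["name"] for r in keep_text_like(regions)]
print(kept_names)
```

The long line is rejected for its extreme aspect ratio and the solid block for filling its whole bounding box, while the letter-like region is kept.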
Step 3: Remove Non-Text Regions Based on Stroke Width Variation;
Figure 6: after removing non-text regions based
on stroke width variation
Step 4: Merge Text Regions for Final Detection Result;
Step 5: Recognize Detected Text Using OCR;
Recognition of text using optical character recognition (OCR):
Optical character recognition is a branch of computer science. It is a
technique that reads text from different clear text sources and translates
the images into a form that the computer can manipulate.
Text Recognition Using the OCR Function:
OCR has various applications, such as image search, document analysis, and
robot navigation.
The OCR function returns:
the recognized text,
the recognition confidence,
the location of the text in the original image.
Note: This information can be used to identify the location of misclassified
text within the image. The confidence values make it possible to identify an
error before any further processing takes place.
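The use of confidence values to locate misclassified text can be sketched as below. The data structure is a hypothetical stand-in for OCR output (text, confidence and bounding box per word). It does not reproduce the return format of any particular OCR library, and the threshold is an assumption.

```python
def low_confidence_words(words, threshold=0.5):
    """Return the recognized words whose confidence falls below `threshold`."""
    return [w for w in words if w["confidence"] < threshold]


# Hypothetical OCR output: text, confidence in [0, 1], box as (x, y, width, height).
ocr_words = [
    {"text": "EXIT", "confidence": 0.93, "box": (12, 40, 80, 24)},
    {"text": "ST0P", "confidence": 0.31, "box": (110, 42, 76, 24)},
    {"text": "AHEAD", "confidence": 0.88, "box": (200, 41, 98, 24)},
]

suspects = low_confidence_words(ocr_words)
for word in suspects:
    # The bounding box tells us where in the image the doubtful text is located.
    print(word["text"], word["box"])
```

Flagging such words early lets later stages re-process or discard the corresponding image regions before the error propagates.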
Challenges Obtaining Accurate Results:
Accurate OCR results depend on the accuracy, stability and uniformity of the
text. In other words, if the text is static, stable and laid out like a
document, performance is considerably more accurate than with non-uniform or
unclear text, for which additional initial processing steps must be taken
into account.
These images show how the initial processing steps changed the image into a
more stable, uniform one, allowing the characters to be recognized clearly.
The 'TextLayout' parameter helps improve the results when the output indicates
that no text is recognized in the image because of irregularities in the
background. Such irregularities cause the OCR to fail to find text margins
and elements in the text, which leads to recognition failure.
Note: If the OCR keeps failing after the text layout has been set, the initial
pre-processing steps should be checked to find the cause of the failure, in
particular the initial binarization step, which can be improved to obtain
better text segmentation.
Image Pre-processing Techniques to Improve Results:
Step 1: pre-processing using morphological reconstruction;
This cleans the image by removing the noise and residue already present in
it, eliminating artifacts and producing a cleaner image for OCR.
Step 2: "Locate Text" method;
Locating the text in the original image helps recognize the characters
needed, especially if they share the same parameters, while ignoring all
unnecessary text. This method is used when noise in the image still causes
unclarity.
Ignoring irrelevant text using the locate text method
ROI-based Processing to Improve Results:
ROI-based processing identifies the specific regions in the image that OCR
should process, either by selecting the required regions manually or by
automating the process.
Automated detection finds and recognizes text in natural images by using
vision "BlobAnalysis".
Connected regions within the keypad image that are not likely to contain any
text can be removed using vision "BlobAnalysis": regions whose area is smaller
than the assumed minimum area are removed.
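The idea of removing small connected regions can be sketched without any toolbox. The Python example below is a simplified stand-in for blob analysis, not the vision "BlobAnalysis" object itself. It labels 4-connected regions in a binary mask with a flood fill and keeps only those with at least `min_area` pixels. The mask and the area threshold are hypothetical.

```python
from collections import deque


def remove_small_regions(mask, min_area):
    """Return a copy of the binary `mask` without connected regions below `min_area`."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Flood-fill one 4-connected region starting at (r, c).
                region, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                # Keep the region only if it is large enough to plausibly hold text.
                if len(region) >= min_area:
                    for y, x in region:
                        cleaned[y][x] = 1
    return cleaned


# Hypothetical binary image: one 2x2 region and two isolated pixels.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
cleaned = remove_small_regions(mask, min_area=3)
for row in cleaned:
    print(row)
```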
IV. Speech-recognition systems and its Example:
Speech recognition is a method for detecting words in a speaker's speech with
high accuracy and precision. It uses filtering systems and adaptation methods
to remove all types of noise in order to obtain a clear voice signal, leading
to better recognition.
Speech Recognition Systems:
Isolated: requires a brief pause between spoken words.
Continuous: pauses are not necessary.
Developing a speech-recognition algorithm is a complex task requiring detailed
knowledge of signal processing and statistical modelling.
A speech recognition system works in two stages:
The first is the teaching stage, known as training mode, in which the system
is taught words so that it has reference data for detecting the words said in
the speech.
The second is the testing stage: once the system has sufficient word data and
a trusted reference dictionary, it is tested to check how it reacts to
real-life input and which problems appear.
The Development Workflow:
The development workflow consists of three steps:
Speech acquisition
Speech analysis
User interface development
Acquiring Speech:
Training stage: During the training stage, a microphone is used to input the
spoken words. Each digit in the dictionary is repeated several times so that
it can be stored in the database, and the system is tested in an offline
analysis.
Testing stage: Speech is continuously streamed into the environment for online
processing. Speech samples are continuously buffered and the incoming speech
is processed frame by frame.
We use Data Acquisition Toolbox™
to set up continuous acquisition of the speech signal and simultaneously
extract frames of data for processing.
Analysing the Acquired Speech:
Develop a word-detection algorithm that
separates each word from noise.
Derive a model that provides a representation
of each word at the training stage.
Select an appropriate classification algorithm
for the testing stage.
Developing a Speech-Detection Algorithm:
- The algorithm is developed by processing the initially recorded speech frame
by frame within a loop. It detects isolated digits over a period of time that
depends on the frame, using zero-crossing counts and signal energy for each
speech frame.
Signal energy works well for detecting voiced signals, while
zero-crossing counts work well for detecting unvoiced signals.
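The combination of signal energy and zero-crossing counts can be sketched in plain Python as follows. This is a minimal illustration, not the algorithm developed for the system. The frame length, both thresholds and the synthetic samples are assumptions chosen so that the behaviour is easy to follow.

```python
def frame_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def detect_speech_frames(signal, frame_len, energy_threshold, zc_threshold):
    """Process the signal frame by frame; True marks a frame that contains speech."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # High energy suggests voiced speech; many zero crossings suggest unvoiced speech.
        flags.append(frame_energy(frame) > energy_threshold
                     or zero_crossings(frame) > zc_threshold)
    return flags


# Synthetic example: silence, a strong slow oscillation, a weak fast oscillation.
silence = [0.0] * 8
voiced = [0.5, 0.6, 0.5, 0.4, -0.5, -0.6, -0.5, -0.4]
unvoiced = [0.05, -0.05] * 4
signal = silence + voiced + unvoiced

flags = detect_speech_frames(signal, frame_len=8, energy_threshold=1.0, zc_threshold=4)
print(flags)
```

Neither measure alone flags both speech frames in this example: the weak, rapidly alternating frame has low energy, and the strong, slowly varying frame has few zero crossings.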
Developing the Acoustic Model:
The acoustic model depends on the speech characteristics that allow the
system to distinguish the different words in the database that was built,
known as the dictionary.
The frequency characteristics of the human vocal tract are analysed by
examining the power spectral density (PSD) estimates of various spoken digits.
Figure 1b. Yule Walker PSD
estimate of three different utterances of the word “TWO.”
Figure 14a. Yule Walker PSD
estimate of three different utterances of the word “ONE.”
- Mel Frequency Cepstral Coefficients (MFCCs) are obtained by measuring the
energy of overlapping frequency bins of a spectrum on the Mel frequency scale.
All the feature vectors for a specific digit are combined and used to estimate
a multidimensional probability density function (PDF) of the vectors for that
digit. Repeating this process for each digit, we obtain the acoustic model for
each digit.
- In the testing
stage, we extract the MFCC vectors from the test speech and use a probabilistic
measure to determine the source digit with maximum likelihood.
Figure 15: Distribution of the first dimension of
MFCC feature vectors for the digit one.
- The model must provide a good fit to this distribution, which otherwise
looks arbitrary. A Gaussian mixture model (GMM) is used for this purpose.
Figure 16: Overlay of estimated
Gaussian components (red) and overall Gaussian mixture model (green) to the
Definition: A Gaussian mixture model (GMM) density is parameterized by the
mixture weights, mean vectors, and covariance matrices from all component
densities.
- An iterative expectation-maximization (EM) algorithm is used to obtain a
maximum likelihood estimate of the parameters of a GMM for a set of MFCC
feature vectors.
- The Statistics and Machine Learning Toolbox distribution function is used
to estimate the GMM parameters.
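The expectation-maximization procedure can be illustrated with a deliberately simplified sketch. The Python code below fits a two-component GMM to one-dimensional data, whereas the real MFCC feature vectors are multidimensional and the paper uses a toolbox function. The function names, the data values and the initial parameters are all assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, init_vars, init_weights, iterations=50):
    """Estimate 1-D GMM parameters (weights, means, variances) with EM."""
    means, variances, weights = list(init_means), list(init_vars), list(init_weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate the parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the variance strictly positive
    return weights, means, variances


# Hypothetical one-dimensional feature values forming two clusters.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
weights, means, variances = fit_gmm_1d(
    data, init_means=[-1.0, 1.0], init_vars=[1.0, 1.0], init_weights=[0.5, 0.5]
)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Each iteration alternates between assigning soft responsibilities (E-step) and updating the parameters (M-step), and the estimated means move towards the centres of the two clusters.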
Selecting a Classification Algorithm:
The GMMs estimated for each digit form the dictionary used during the testing
stage.
From the test speech, the MFCC feature vectors are again extracted from each
frame of the detected word.
The digit model with the maximum a posteriori probability for the set of test
feature vectors is then found.
Given the digit model and some test feature vectors, the log-likelihood value
is computed using the posterior function in Statistics and Machine Learning
Toolbox.
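The classification rule can be sketched as follows. This Python example is a simplified illustration, not the toolbox-based implementation. It assumes one-dimensional features, equally likely digits (so that the maximum a posteriori choice reduces to the maximum log-likelihood) and a hypothetical dictionary of GMM parameters given as (weight, mean, variance) tuples.

```python
import math


def gmm_log_likelihood(features, model):
    """Sum over all frames of log p(x | model) for a 1-D Gaussian mixture model."""
    total = 0.0
    for x in features:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in model
        )
        total += math.log(density)
    return total


def classify(features, dictionary):
    """Return the digit whose model gives the highest log-likelihood."""
    return max(dictionary, key=lambda digit: gmm_log_likelihood(features, dictionary[digit]))


# Hypothetical dictionary: one GMM per digit, components as (weight, mean, variance).
dictionary = {
    "one": [(0.5, -2.0, 0.5), (0.5, 3.0, 0.5)],
    "two": [(0.7, 0.0, 0.5), (0.3, 6.0, 0.5)],
}

# Hypothetical feature values extracted from the frames of one detected word.
test_features = [-1.8, 2.9, 3.2, -2.1]
digit = classify(test_features, dictionary)
print(digit)
```

Summing log-densities over frames treats the frames as independent, which is a common simplifying assumption in this kind of classifier.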
Building the User Interface:
After developing the digit recognition system in an offline environment with
pre-recorded speech, we build an interface that displays the time-domain plot
of each detected word as well as the classified digit.
Figure 17: Interface to final application.
V. Conclusion:
- Visual word recognition is a technology that enables you to convert
different types of documents, such as scanned paper documents, PDF files,
images captured by a digital camera or live images, into editable and
searchable data using different methods and software (e.g. OCR, MICR and
others), where each program uses a different computational approach capable of
analysing scanned words and turning them into a real document.
The visual word recognition process requires many stages, especially if the
background of the text is non-uniform, not static, or unclear. In that case
pre-processing steps are needed to remove all the unclarity and to specify the
exact margins and regions of the text.
- Speech-recognition software works by analysing speech and converting it to
words. Building a speech-recognition system takes two stages: training the
system by providing the required database (dictionary) so that it can
recognize the frames and digits or words, and then testing the system in live
processing to check whether it functions appropriately. Finally, an interface
is built that operates over time with respect to the frames of words.
- A major challenge facing speech recognition technology is that an effective
voice user interface requires strong error resistance and the capacity to
effectively demonstrate the capabilities of the design.
VI. References:
1 MATLAB support main website www.mathworks.com/company/newsletters/articles/
2 Chen, Huizhong, et al. “Robust Text Detection in
Natural Images with Edge-Enhanced Maximally Stable Extremal Regions.”
Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE,
3 Gonzalez, Alvaro, et al. “Text location in complex
images.” Pattern Recognition (ICPR), 2012 21st International Conference
on. IEEE, 2012.
4 Li, Yao, and Huchuan Lu. “Scene text detection via
stroke width.” Pattern Recognition (ICPR), 2012 21st International
Conference on. IEEE, 2012.
5 Neumann, Lukas, and Jiri Matas. “Real-time scene text
localization and recognition.” Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on. IEEE, 2012.
6 Ray Smith. Hybrid Page Layout Analysis via
Tab-Stop Detection. Proceedings of the 10th international conference on
document analysis and recognition. 2009.
7 Morton J. The interaction of information in word
recognition. Psychol. Rev. 1969;76:165–178.
8 Davis C.J. The spatial coding model of visual word
identification. Psychol. Rev. 2010;117:713–758.