IMAGE TEXT TO SPEECH CONVERSION USING OPTICAL CHARACTER RECOGNITION TECHNIQUE IN RASPBERRY PI

1. INTRODUCTION

Of the 7.4 billion people on our planet, 285 million are visually impaired: 39 million are completely blind, i.e. have no vision at all, and 246 million have mild or severe visual impairment (WHO, 2011). It has been predicted that by the year 2020, these numbers will rise to 75 million blind and 200 million people with visual impairment [5]. Reading is of prime importance in daily life, since text is present everywhere: newspapers, commercial products, sign-boards, digital screens etc. Visually impaired people therefore face many difficulties. Our device assists the visually impaired by reading the text out to them.

There have been numerous advances in this area to help the visually impaired read without much difficulty. The existing technologies use an approach similar to the one described in this paper, but they have certain drawbacks. Firstly, the input images in previous works have no complex background, i.e. the test inputs are printed on a plain white sheet. It is easy to convert such images to text without pre-processing, but such an approach will not be useful in a real-time system [1][2][3]. Secondly, methods that segment the image into characters for recognition read the text out as individual letters rather than complete words, which gives an undesirable audio output.

For our project, we wanted the device to detect text against any complex background and read it efficiently. Inspired by the methodology used by apps such as “CamScanner”, we assumed that in any complex background the text will most likely be enclosed in a box, e.g. billboards, screens etc. When we detect a region enclosed by four points, we assume that it is the required region containing the text and extract it by warping and cropping. The new image then undergoes edge detection, and a boundary is drawn over the letters to give them more definition. Finally, the image is processed by the OCR and TTS stages to give the audio output.

2. MOTIVATION

Our device is designed to give people with mild or moderate visual impairment the ability to listen to text. It can also act as a learning aid for people with dyslexia or other learning disabilities that involve difficulty in reading or interpreting words and letters. We wish to enable these people to be independent and self-reliant, as they will no longer need assistance to understand printed text. With ready access to printed information, they need not feel at a disadvantage. We hope that a system of this kind can make a real difference to their daily lives.

3. PROBLEM STATEMENT

Visually impaired people use the braille system for reading and writing. The reader feels the arrangement of raised dots that conveys the information, which is difficult and time consuming. Keeping this in mind, we have designed our system so that reading any book becomes easier for blind people: converting text to audio is much faster and easier than reading braille.

4. OBJECTIVES

The objectives of our project are:

  1. To extract information (text), convert it into digital form and then recite it.
  2. To be an effective medium for communication.

5.  LITERATURE REVIEW

Visual impairment or vision loss is defined as a decreased ability to see clearly that cannot be corrected with glasses. Blindness is the term used for complete vision loss. The common causes of vision loss are uncorrected refractive errors, cataracts and glaucoma. People with visual impairment face a number of difficulties in normal daily activities like walking, driving and reading. [6]

 Braille

Braille is a reading and writing system used by people who have a visual impairment. Braille is written on embossed paper. The braille characters are small rectangular blocks called cells that contain bumps called raised dots. The visually impaired person feels the arrangement of the raised dots, which conveys the information. [7]

Although braille readers, keyboards and monitors exist, they are not accessible to rural communities, and braille material is not easily and abundantly available. [8]

Raspberry Pi

The Raspberry Pi is a small, low-cost single-board computer which can be used with a monitor, keyboard and mouse to become an efficient, full-fledged computer [9]. We chose the Raspberry Pi micro-computer for our project because it is an easily available, low-cost device. The RPi uses software that is either free or open source, which also makes it cost-effective. It uses an SD card for storage, and its small size gives us the advantage of portability. [10]

As part of the software development, the OpenCV (Open Source Computer Vision) libraries are used for image processing. Each function and data structure was designed with the image processing coder in mind. [11]

Existing systems & their limitations

One of the biggest advantages of barcode readers is portability. Hence, they can be used by the visually impaired to identify different products. An extensive database is created which contains all the information about each product. The user simply scans the bar code, and the product details are presented through e-braille readers. The disadvantage of this approach is that the user might not be able to point the bar code reader in the correct direction. [2]

Another approach is optical enhancement, such as an optical zooming device that enlarges the braille character. However, not all visually impaired people know braille. [4] Some methods aim at converting text to speech using a scanner, speakers and a computer. This approach is efficient only with simple scanned documents; it cannot extract text from an image with a complex background. [4]

Fig. 1: Block diagram of the proposed system

6.2.  HARDWARE SPECIFICATIONS

Raspberry Pi

The Raspberry Pi is built around a system on a chip (SoC), a device that integrates several important functions on a single chip. The Raspberry Pi 3 uses the Broadcom BCM2837 SoC multimedia processor. Its CPU is a quad-core ARM Cortex-A53 running at 1.2 GHz. It has 1 GB of LPDDR2 RAM (900 MHz) as internal memory, and external memory can be extended to 64 GB. The two main new features of the Raspberry Pi 3 are 802.11n wireless internet connectivity and Bluetooth 4.1 Classic. It has 40 GPIO pins. The Raspberry Pi camera is 5 MP and has a resolution of 2592×1944. The Raspberry Pi has a 3.5 mm audio port, so earphones or a speaker can easily be connected to it to hear the audio.


Fig. 2: Schematic diagram of the Raspberry Pi

Camera Module 

The Raspberry Pi camera module is a 25 mm square board with a 5 MP sensor, much smaller than the Raspberry Pi computer, to which it connects by a flat flex cable (FFC, 1 mm pitch, 15 conductor, type B).


Fig. 3: Raspberry Pi Camera Module

The Raspberry Pi camera module offers a new capability for optical instrumentation. Its key specifications are as follows:

  1080p video recording to SD flash memory cards
  Simultaneous output of 1080p live video via HDMI while recording
  Sensor type: OmniVision OV5647 colour CMOS QSXGA (5 megapixel)
  Sensor size: 3.67 x 2.74 mm
  Pixel count: 2592 x 1944
  Pixel size: 1.4 x 1.4 µm
  Lens: f = 3.6 mm, f/2.9
  Angle of view: 54 x 41 degrees
  Field of view: 2.0 x 1.33 m at 2 m
  Full-frame SLR lens equivalent: 35 mm
  Fixed focus: 1 m to infinity
  Removable lens
  Adapters for M12, C-mount, Canon EF and Nikon F Mount lens interchange
  In-camera image mirroring

  6.3. SOFTWARE SPECIFICATIONS

Raspbian is a free operating system, based on Debian, optimized for the Raspberry Pi hardware. Raspbian Jessie is the version used as the RPi’s main operating system in our project. Our code is written in Python (version 2.7.13) and calls functions from OpenCV. OpenCV, which stands for Open Source Computer Vision, is a library of functions used for real-time applications such as image processing [14]. OpenCV currently supports a wide variety of programming languages, including C++, Python and Java, and is available on different platforms including Windows, Linux, OS X, Android and iOS. The version used for our project is opencv-3.0.0. OpenCV’s application areas include facial recognition systems, gesture recognition, human–computer interaction (HCI), mobile robotics, motion understanding, object identification, segmentation and recognition, motion tracking, augmented reality and many more.

For the OCR and TTS operations we install the Tesseract OCR and Festival software. Tesseract is an open-source Optical Character Recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or (for programmers) through an API to extract typed, handwritten or printed text from images. It supports a wide variety of languages. The package is generally called ‘tesseract’ or ‘tesseract-ocr’.

Festival TTS was developed by the Centre for Speech Technology Research, UK. It is open-source software that offers a framework for building efficient speech synthesis systems. It is multi-lingual (it supports British English, American English and Spanish). As Festival is available through the Raspberry Pi’s package manager, it is easy to install.

Image Processing 

Books and papers contain letters. Our aim is to extract these letters, convert them into digital form and then recite them. Image processing is used to obtain the letters. Image processing is a set of functions applied to an image to deduce some information from it. The input is an image, while the output can be an image or a set of parameters obtained from the image. Once the image is loaded, we convert it into a gray scale image, in which every pixel is an intensity value within a specific range. This range is used to separate the letters from the background: after thresholding, the image has only black or white content, and the white will mostly be the spacing between words or blank space.
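The gray scale and black-or-white steps described above can be illustrated with a minimal sketch in plain Python. It is not the project's code: in the project these steps would normally be done with OpenCV functions. The function names, the luminance weights (0.299, 0.587, 0.114) and the cutoff of 128 are assumptions chosen for illustration.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) pixels to rows of 0..255 gray levels."""
    # Assumed weights: the commonly used luminance weights for R, G and B.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def threshold(gray_image, cutoff=128):
    """Map every gray level to 255 (white) or 0 (black)."""
    # Assumed cutoff of 128; a real page may need a different or adaptive value.
    return [[255 if value >= cutoff else 0 for value in row] for row in gray_image]


# A tiny 2x3 "image": dark ink pixels on a light page.
image = [
    [(250, 250, 250), (20, 20, 20), (240, 240, 240)],
    [(30, 30, 30), (245, 245, 245), (10, 10, 10)],
]
gray = to_gray(image)
binary = threshold(gray)
```

After thresholding, the light page pixels become white (255) and the ink pixels become black (0), which is the form the later stages work on.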

 Feature Extraction 

In this stage we gather the essential features of the image, called feature maps. One such method is to detect the edges in the image, as they will contain the required text. For this we can use various edge detection techniques such as Sobel, Kirsch, Canny and Prewitt. The Kirsch detector is the most accurate at finding edges along the four directional axes: horizontal, vertical, right diagonal and left diagonal. It uses the eight-point neighborhood of each pixel.
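To make the eight-point neighborhood concrete, here is a small sketch of the Kirsch operator in plain Python. It assumes the standard Kirsch compass masks (three weights of 5 and five weights of -3, rotated in 45 degree steps) and takes the largest of the eight responses as the edge strength. The function names are hypothetical and border pixels are simply left at zero.

```python
# Clockwise offsets (row, col) of the eight neighbours, starting at top-left.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
# Kirsch weights for the "north" mask; rotating this list by one position
# rotates the mask by 45 degrees, giving the eight compass masks.
BASE_WEIGHTS = [5, 5, 5, -3, -3, -3, -3, -3]


def kirsch_response(gray, row, col):
    """Return the strongest of the eight directional responses at one pixel."""
    values = [gray[row + dr][col + dc] for dr, dc in NEIGHBOURS]
    best = 0
    for shift in range(8):
        total = sum(BASE_WEIGHTS[(i - shift) % 8] * values[i] for i in range(8))
        best = max(best, total)
    return best


def kirsch_edges(gray):
    """Apply the Kirsch operator to every interior pixel of a gray image."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            edges[row][col] = kirsch_response(gray, row, col)
    return edges


# A vertical step from dark (0) to bright (100) gives a strong response.
step = [[0, 0, 100, 100] for _ in range(3)]
step_edges = kirsch_edges(step)
```

A uniform region gives a response of zero because the weights of each mask sum to zero, so only intensity changes such as letter boundaries stand out.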

Optical Character Recognition 

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used as a form of data entry from some sort of original paper data source, whether documents, sales receipts, mail, or any number of printed records. It is crucial to the computerization of printed texts so that they can be electronically searched, stored more compactly, displayed online and used in machine processes such as machine translation, text-to-speech and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Tesseract 

Tesseract is a free-software optical character recognition engine for various operating systems. It is considered one of the most accurate free OCR engines currently available, and it runs on Linux, Windows and Mac OS.

An image containing text is given as input to the Tesseract engine, which is a command-line tool. The tesseract command takes two arguments: the first is the name of the image file that contains the text, and the second is the name of the output text file in which the extracted text is stored. Tesseract adds the .txt extension to the output file itself, so there is no need to specify the extension in the second argument. After processing is completed, the extracted text is present in the .txt file. On simple images, with or without color (gray scale), Tesseract can give results with 100% accuracy. On complex images, Tesseract gives more accurate results when the images are in gray scale mode than when they are in color. Although Tesseract is a command-line tool, it is open source and available in the form of a dynamic link library, so it can easily be made available in graphics mode.
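The two-argument invocation described above can be wrapped in Python roughly as follows. This is a sketch under assumptions: the helper names and the file names in the usage note are hypothetical, and it presumes the tesseract executable is installed and on the PATH.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output base name>."""
    # The output name is given without ".txt"; Tesseract appends it itself.
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.check_call(tesseract_command(image_path, output_base))
    with open(output_base + ".txt") as handle:
        return handle.read()
```

For example, run_ocr("page.png", "page") would run the command and read the result back from page.txt.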

7. METHODOLOGY

Fig. 4: Basic methodology of the project

Image acquisition: In this step, the inbuilt camera captures images of the text. The quality of the captured image depends on the camera used. We are using the Raspberry Pi’s camera, which is a 5MP camera with a resolution of 2592×1944.

Image pre-processing: This step consists of color to gray scale conversion, edge detection, noise removal, warping and cropping, and thresholding. The image is converted to gray scale because many OpenCV functions require a gray scale image as input. Noise removal is done using a bilateral filter. Canny edge detection is performed on the gray scale image for better detection of the contours. The image is then warped and cropped according to the contours, which lets us detect and extract only the region that contains text and removes the unwanted background. Finally, thresholding is applied so that the image looks like a scanned document, which allows the OCR to convert the image to text efficiently.

Fig. 5: Image pre-processing
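The pre-processing chain can be sketched with OpenCV's Python bindings as shown here. This is an illustrative sketch, not the project's actual code: the function names, the filter and Canny parameters, the 0.02 polygon tolerance and the adaptive threshold settings are all assumptions, and the largest four-sided contour is assumed to be the box that encloses the text.

```python
def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left."""
    by_sum = sorted(points, key=lambda p: p[0] + p[1])
    by_diff = sorted(points, key=lambda p: p[1] - p[0])
    # Smallest x+y is top-left, largest is bottom-right;
    # smallest y-x is top-right, largest is bottom-left.
    return [by_sum[0], by_diff[0], by_sum[-1], by_diff[-1]]


def preprocess(image_path, output_path):
    """Gray scale -> bilateral filter -> Canny -> warp/crop -> threshold."""
    import cv2          # imported here so order_corners works without OpenCV
    import numpy as np

    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    smooth = cv2.bilateralFilter(gray, 11, 17, 17)   # noise removal (assumed values)
    edges = cv2.Canny(smooth, 30, 200)               # assumed thresholds

    # findContours returns 3 values in OpenCV 3 and 2 in OpenCV 4;
    # the contour list is second from the end in both.
    contours = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]
    corners = None                                   # no box found: keep full frame
    for contour in sorted(contours, key=cv2.contourArea, reverse=True):
        perimeter = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
        if len(approx) == 4:                         # largest four-sided region
            corners = order_corners([tuple(p[0]) for p in approx])
            break

    if corners is not None:
        tl, tr, br, bl = [np.array(c, dtype="float32") for c in corners]
        width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
        height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
        target = np.array([[0, 0], [width - 1, 0],
                           [width - 1, height - 1], [0, height - 1]], dtype="float32")
        matrix = cv2.getPerspectiveTransform(np.array([tl, tr, br, bl]), target)
        gray = cv2.warpPerspective(gray, matrix, (width, height))

    # Make the result look like a scanned document.
    scanned = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 10)
    cv2.imwrite(output_path, scanned)
    return output_path
```

Ordering the four corners before computing the perspective transform matters because the transform maps each source corner to a specific corner of the output rectangle; a wrong order would flip or twist the cropped text.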

Image to text conversion: The above diagram (Fig. 5) shows the flow of text-to-speech conversion. The first block comprises the image pre-processing modules and the OCR. It converts the pre-processed image, which is in .png form, to a .txt file. We are using the Tesseract OCR.

Text to speech conversion: The second block is the voice processing module. It converts the .txt file to an audio output. Here, the text is converted to speech using a speech synthesizer called Festival TTS. The Raspberry Pi has an on-board audio jack; the on-board audio is generated by a PWM output.
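One way to hand the .txt file to Festival from Python is through its command-line text-to-speech mode. The sketch assumes the festival executable is installed and uses its --tts option; the helper names are hypothetical.

```python
import subprocess


def festival_command(text_path):
    """Return the argument list for: festival --tts <text file>."""
    return ["festival", "--tts", text_path]


def speak_file(text_path):
    """Ask Festival to read the text file aloud through the audio output."""
    subprocess.check_call(festival_command(text_path))
```

Calling speak_file("page.txt") would then read out the text that the OCR stage stored in page.txt.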

7.1.   Algorithm

Step 1: Start with initial values.

Step 2: Import the subprocess module and initialize the GPIO pins.

Step 3: If the button is pressed:

i. Capture an image with the camera.

ii. Perform Tesseract OCR.

iii. Apply thresholding and save the output into a text file.

iv. Run Festival to convert the text to speech.

Step 4: Repeat step 3.
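The button-driven loop can be sketched as below. This is a hedged illustration rather than the project's code: the four stage functions are passed in as parameters (hypothetical names), the stages are run in the order given in the methodology section (pre-processing before OCR), and the button is assumed to be wired between a GPIO pin and ground with the internal pull-up enabled.

```python
import time


def process_once(capture, preprocess, ocr, speak):
    """Run one capture -> pre-process -> OCR -> speech cycle and return the text."""
    image_path = capture()
    clean_path = preprocess(image_path)
    text = ocr(clean_path)
    speak(text)
    return text


def run_forever(button_pin, capture, preprocess, ocr, speak):
    """Wait for the push button and run one cycle per press (steps 3 and 4)."""
    import RPi.GPIO as GPIO   # only available on the Raspberry Pi itself

    GPIO.setmode(GPIO.BCM)
    # Assumed wiring: button between the pin and ground, internal pull-up on.
    GPIO.setup(button_pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)
    try:
        while True:
            if GPIO.input(button_pin) == GPIO.LOW:   # pressed
                process_once(capture, preprocess, ocr, speak)
            time.sleep(0.05)                         # simple polling delay
    finally:
        GPIO.cleanup()
```

Passing the stages in as functions keeps the loop independent of the camera, OCR and speech tools, so the cycle can be tried on a desktop machine with stand-in functions before it is run on the Raspberry Pi.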

 7.2.  Flowchart

CODE

YOUTUBE LINK

10. CONCLUSION

With this system, visually impaired people can easily listen to whatever text they want to read. With the help of translation tools, the user can convert the text to a desired language and then, using Google's speech tool, convert the translated text into voice. This helps them to be independent. The system also costs less than other implementations. The text-to-speech device can convert a text image into sound with high performance and a readability tolerance of less than 2%, with an average processing time of less than three minutes for an A4 page. The device is portable, does not require an internet connection, and can be used independently. This method can also make the editing of books or web pages easier.


