FACE AND VOICE RECOGNIZING INTELLECTUAL COMPANION

  1. INTRODUCTION
1.1 Overview

The human ability to interact with other people is based on the ability of recognition. This innate ability to effortlessly identify and recognize objects, even when they are distorted or modified, has prompted research on how the human brain processes such images. The skill is remarkably reliable despite changes in viewing conditions, emotional expression, ageing, added artefacts, or even circumstances that permit seeing only a fraction of the face. Furthermore, humans are able to recognize thousands of individuals over a lifetime. Understanding this human mechanism, in addition to its cognitive aspects, would help in building a system for the automatic identification of faces by a machine. However, face recognition is still an area of active research, since no completely successful approach or model has yet been proposed to solve the face recognition problem. Automated face recognition is nevertheless a very popular field nowadays, with a multitude of commercial and law-enforcement applications. For example, a security system could grab an image of a person and establish the identity of the individual by matching the image against the ones stored in the system database.

Similarly, voice recognition is the process of understanding the words spoken by humans. It is important to understand the nature of voice signals before we analyze them. Many different aspects of voice contribute to its complexity, including emotion, accent, language and noise.

Hence it becomes difficult to robustly define a set of rules to analyze voice signals. Humans, however, are very good at understanding voice even though it has so many variations; we seem to do it with relative ease. If we want our machines to do the same, we need to help them understand voice the same way we do.

Researchers work on various aspects and applications of voice, such as understanding spoken words, identifying the speaker, recognizing emotions and identifying accents.

1.2 Image Processing

Some of the many algorithms used in image processing include convolution (on which many others are based), FFT, DCT, thinning (or skeletonisation), edge detection and contrast enhancement. These are usually implemented in software, but may also use special-purpose hardware for speed. Image processing contrasts with computer graphics, which is usually more concerned with the generation of artificial images, and with visualization, which attempts to understand (real-world) data by displaying it as an artificial image (e.g. a graph). Image processing is used in image recognition and computer vision. Silicon Graphics manufactures workstations that are often used for image processing. A few programming languages are designed specifically for image processing, e.g. CELIP and VPL, while general-purpose languages such as C++ are also widely used.
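As an illustration of the operations just mentioned, the following is a minimal sketch of convolution and edge detection, assuming OpenCV and NumPy are installed and that an image file named sample.jpg exists (the filename is purely illustrative):

    import cv2
    import numpy as np

    image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)

    # Convolution with a simple 3x3 averaging (blur) kernel.
    kernel = np.ones((3, 3), np.float32) / 9.0
    blurred = cv2.filter2D(image, -1, kernel)

    # Edge detection with the Canny operator.
    edges = cv2.Canny(blurred, 50, 150)

    cv2.imwrite("edges.jpg", edges)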

1.3 Face Tracking

Face detection and tracking are important in video content analysis, since the most important objects in most videos are human beings. Research on face tracking and animation techniques has advanced due to its wide range of applications in security, entertainment, industry, gaming, psychological facial-expression analysis and human-computer interaction. Recent advances in video processing and compression have made face-to-face communication practical in real-world applications. However, higher bandwidth is still in high demand due to increasingly intensive communication. Model-based low-bit-rate transmission with high-quality video offers great potential to mitigate the problem raised by limited communication resources. Even so, after a decade of effort, robust and realistic real-time face tracking and generation still pose a big challenge. The difficulty lies in a number of issues, including real-time face feature tracking under a variety of imaging conditions (lighting variation, pose change, self-occlusion and deformation of multiple non-rigid features) and real-time realistic face modeling using a very limited number of feature parameters. Traditionally, head motion is modeled as a 3D rigid motion with local skin deformation, but such linear motion-tracking methods cannot represent rapid head motion and dramatic expression changes accurately. Conversely, the appearance-driven approach requires a significant amount of training data to enumerate all possible appearances of the features. The model-based approach assumes that knowledge of a specific object is available, while its requirement of frontal facial views and constant illumination limits its application. All of the above tracking methods have shown limitations for accurate face feature tracking under complex imaging conditions.

1.4 Speech Processing

Speech recognition systems take audio signals as input and recognize the words being spoken. The speech signals are captured using a microphone, and the system tries to understand the words being captured. It is important to understand the nature of speech signals before we analyze them. These signals happen to be complex mixtures of various signals; many different aspects of speech contribute to this complexity, including emotion, accent, language and noise. Google's voice recognition API takes voice data as input and applies mathematical operations such as sampling and the fast Fourier transform to obtain the corresponding text. For text-to-voice conversion Google also provides a library called gTTS. The conversion from text to voice is considerably simpler than from voice to text.
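As a small example, the following sketch shows text-to-voice conversion with gTTS, assuming the package is installed; the text and output filename are illustrative only:

    from gtts import gTTS

    speech = gTTS(text="Welcome to FAVRIC", lang="en")
    speech.save("welcome.mp3")  # saves the synthesized speech as an MP3 file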

1.5 Problem Definition

Nowadays, automatic personal identification for access control has become popular, using biometric data instead of cards, passwords or patterns. Most biometric data have to be collected using special hardware such as a fingerprint scanner, palm-print scanner or DNA analyzer, and the target subjects have to touch the required hardware during data collection. The advantage of this system is that face recognition does not require contact with any hardware: the face is detected automatically using a face detection technique, and the entire face recognition process is completed without touching any device. It only requires selecting the person in the frame to be recognized. Face detection is the first step of the face recognition system, and the performance of the entire system is influenced by the reliability of the face detection. By using face detection, only the facial part of an image is identified, regardless of the background. In this system, the Viola-Jones face detection method is used. Viola-Jones rescales the detector instead of the input image and runs the detector many times through the image, each time with a different size. Viola-Jones devised a scale-invariant detector that requires the same number of calculations whatever the size. This detector is constructed using a so-called integral image and some simple rectangular features reminiscent of Haar wavelets. Face recognition commonly includes feature extraction, feature reduction and recognition or classification.

The other feature of this system is voice recognition. Voice recognition is the process of understanding the words spoken by humans. The voice signals are captured using a microphone and the system tries to understand the words being captured. Voice recognition is used extensively in human-computer interaction, smartphones, voice transcription, biometric systems, security and so on.

It is important to understand the nature of voice signals before we analyze them. Many different aspects of voice contribute to its complexity, including emotion, accent, language and noise.

Hence it becomes difficult to robustly define a set of rules to analyze voice signals. Humans, however, are very good at understanding voice even though it has so many variations; we seem to do it with relative ease. If we want our machines to do the same, we need to help them understand voice the same way we do.

Researchers work on various aspects and applications of voice, such as understanding spoken words, identifying the speaker, recognizing emotions and identifying accents.

1.6 Objectives

a) To detect human faces in the camera frame

b) To recognize whether the person is in the system database

c) To edit the information of the recognized person

d) To control home appliances such as fans and lights using voice commands

e) To build a prototype of a low-cost face and voice recognizing system

f) To learn about image and voice feature extraction

g) To build a system as a first step towards AI

2.  LITERATURE REVIEW

2.1 Face detection and recognition

This project is based on the Python programming language and makes use of the OpenCV library, originally developed by Intel and later maintained by Willow Garage and Itseez. OpenCV (Open Source Computer Vision) is a library of programming functions for real-time computer vision, and it provides pre-defined functions for face detection.

Face detection is the process of detecting a face in a frame, whereas face recognition is the process of giving a name to that face, i.e. identifying the person. A Haar cascade is used to build an overall classifier that performs with high accuracy. This is relevant because it helps us circumvent the process of building a single-step classifier with the same accuracy [1].

Building one such robust single-step classifier is a computationally intensive process. Consider an example where we have to detect an object, say a tennis ball. In order to build a detector, we need a system that can learn what a tennis ball looks like, so that it can infer whether or not a given image contains one. We need to train this system using many images of tennis balls, as well as many images that do not contain tennis balls. This helps the system learn to differentiate between objects [2].

If we build a very accurate model, it will be complex and we won't be able to run it in real time; if it's too simple, it might not be accurate. This trade-off between speed and accuracy is frequently encountered in machine learning. The Viola-Jones method overcomes this problem by building a set of simple classifiers, which are then cascaded to form a unified classifier that is robust and accurate. When a face is detected using the cascade, its features are extracted and the system is trained. To recognize a face, the same process is repeated, but instead of saving the detected image it is compared with the saved data. On a match the face is recognized; otherwise the face with the most similar features is returned [3].

In order to compute Haar features, we have to compute the summations and differences of many sub-regions in the image. We need to compute these summations and differences at multiple scales, which makes it a very computationally intensive process. In order to build a real time system, we use integral images. Consider the following figure:

If we want to compute the sum over the rectangle ABCD in this image, we don't need to go through each pixel in that rectangular area. Let's say OP indicates the area (sum of pixel values) of the rectangle formed by the top-left corner O of the image and a point P at the diagonally opposite corner. To calculate the area of the rectangle ABCD, we can use the following formula:

Area of the rectangle ABCD = OC – (OB + OD – OA)

Figure 1. Pixel representation [3]

What's so special about this formula? Notice that we didn't have to iterate through anything or recalculate any rectangle areas: all the values on the right-hand side of the equation are already available, because they were computed during earlier cycles, so we can use them directly to compute the area of this rectangle.
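A minimal NumPy sketch of this integral-image lookup, with illustrative variable names, is shown below; it verifies that the four-term formula above matches a direct pixel-by-pixel sum:

    import numpy as np

    image = np.random.randint(0, 256, size=(240, 320))

    # Integral image: entry (y, x) holds the sum of all pixels above and
    # to the left of (y, x), inclusive.
    integral = image.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, top, left, bottom, right):
        """Sum over the rectangle [top..bottom] x [left..right],
        computed from four lookups: OC - OB - OD + OA."""
        total = ii[bottom, right]                  # OC
        if top > 0:
            total -= ii[top - 1, right]            # OB
        if left > 0:
            total -= ii[bottom, left - 1]          # OD
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]         # OA
        return total

    # The result matches the direct pixel-by-pixel sum.
    assert rect_sum(integral, 10, 20, 50, 80) == image[10:51, 20:81].sum()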

2.2 Voice Recognition

When audio files are recorded using a microphone, the actual audio signals are sampled and the digitized versions are stored. Real audio signals are continuous-valued waves, which means we cannot store them as they are; we need to sample the signal at a certain frequency and convert it into a discrete numerical form. Most commonly, speech signals are sampled at 44,100 Hz, which means that each second of the speech signal is broken down into 44,100 parts and the value at each of these timestamps is stored in the output file. In other words, we save the value of the audio signal every 1/44,100 seconds, and we say that the sampling frequency of the audio signal is 44,100 Hz.

By choosing a high sampling frequency, the audio signal will appear continuous when humans listen to it. [4]

The following figure shows the first 50 samples of an input signal having amplitude 1.

Figure 2. Amplitude and Time representation of audio signal [4]
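A minimal sketch of reading such a sampled signal, assuming SciPy is installed and that a file named input.wav exists (the filename is illustrative), is:

    from scipy.io import wavfile

    sampling_freq, signal = wavfile.read("input.wav")
    print("Sampling frequency:", sampling_freq, "Hz")   # e.g. 44100
    print("First 50 samples:", signal[:50])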

3.  TECHNICAL SPECIFICATION

        3.1 Software Specification

The backbone of this project is the software used. Python IDLE was mainly used for programming the server side, Python's tkinter for the required app design, and MySQL for database storage. Similarly, the Arduino IDE was used to program the NodeMCU. All of this software is easily available and easy to use.

        3.2 Hardware

All of the development was carried out with a personal computer, a web camera and a microphone. The components were easy to find, but for the image processing it was necessary to have at least a minimum hardware capacity.

The technical specifications of the used hardware are:

  • Computer model: Acer Aspire E5
  • CPU: Intel quad core i5
  • RAM: 6 GB
  • Graphics: 2 GB – NVIDIA GEFORCE
  • Camera: Acer Crystal Eye webcam (Integrated)

a) Microphone

This is a device that takes the human voice as input and passes it to the system. In this project it plays a vital role, since the user gives commands through it. Because it takes an analog input, a considerable amount of noise is also introduced, which may cause problems during voice detection.

b) Web Cam

This is a peripheral device of the computer, used to give visual input to the computer or any other system. In this project it is used to detect faces and provide input to the system. In the future it could possibly be used to write documents into the system simply by scanning them.

c)  Node MCU

This is a module that works as a combination of an Arduino and a Wi-Fi module. It is programmed in the C language and can be programmed through the Arduino IDE. It is mainly used for IoT-based systems.

Figure 3: Node MCU (upload.wikimedia.org/nodemcu.png)

3.3 Other Requirement

Noise, a common problem inherent to electronic devices, affects pixel values as well as the voice signal and causes them to fluctuate within a certain range. The color images we processed contained non-uniformly distributed noise. According to its specification, the color camera we used had an S/N ratio of 48 dB. The frame grabber had on average 1.5 bad bits out of 8, with the least significant bit reflecting more noise than data. These fluctuations are not always visible to the human eye, but we have to take them into account in the context of the considered dichromatic reflection model: in theory noise introduces thickness of the dichromatic plane, and in practice it introduces thickness of the dichromatic surface.

4.  METHODOLOGY

     4.1 Related work and previous research

The recognition process is not a simple task. Before comparing faces with the ones in the database, the algorithm needs to select an object inside the image to reduce the area of computation and then prepare it for later manipulation.

This process can be summarized in three steps: the first step is to get the input image, the second is to find a face in the image, and the third is to let the algorithm try to recognize who this person might be.

Input Image → Face Detection → Face Recognition

Figure 4: Face Detection and Recognition Steps [3]

Similarly, voice recognition is a complicated task, and its complexity increases in the presence of noise. Before extracting commands and performing the necessary actions, the speech must be normalized and the noise reduced; only then is the further processing carried out. This process can also be summarized in three steps: the first step is to get the user's voice as input, the second step is to convert the speech to text, and the third is to let the program extract the command from the text.

Input Voice → Voice-to-Text Conversion → Extract Command

Figure 5: Voice Recognition Steps [1]

This project is based on detection of the human face and voice using a web cam and a microphone. The web cam detects the human face based on the program on the server, which then shows the information related to the individual stored in the database; this information can be updated and displayed. In the case of voice recognition, the microphone acts as the input terminal for the client: the client gives commands by voice and the server performs the appropriate actions for those commands.

To make the desktop app, the Python tkinter library has been used. This library provides many widgets such as Button, Checkbutton, Listbox, Label, Message, etc.
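A minimal tkinter sketch of this style of interface is shown below; the window title, labels and button text are illustrative and not the actual FAVRIC layout:

    import tkinter as tk

    root = tk.Tk()
    root.title("FAVRIC")

    tk.Label(root, text="Username").pack()   # simple label widget
    entry = tk.Entry(root)                   # text entry for user input
    entry.pack()
    tk.Button(root, text="Login", command=root.destroy).pack()

    root.mainloop()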

By using the face detection and recognition method, the FAVRIC app was designed.

Similarly, the voice recognition method was used for command purposes such as home automation. For home automation a NodeMCU was used. The Python program recognizes the user's voice command using the Google voice recognition API and sets or resets the home appliance variable in Firebase, Google's online database. The NodeMCU is connected to the nearest Wi-Fi network and reads the state of the home appliances from Firebase. Since this project is just a prototype, LEDs and a small motor were used instead of a light and a fan.

The flowchart of the system is shown in following figure:

Figure 6: Flow Chart of the System

4.2 Real-time imaging and picture processing

      4.2.1 Face detection

Face detection is the process that determines the locations and sizes of the human faces in arbitrary (digital) images. It detects facial features and ignores anything else. This is essentially a segmentation problem and in practical systems, most of the effort goes into solving this task.

4.2.2 Face recognition

Face recognition is the process of identifying or verifying people from a digital image or a video frame from a video source. One way to do this is to compare facial features extracted from the image with those in a facial database. The basis of the template-matching strategy is to extract whole facial regions (matrices of pixels) and compare these with the stored images of known individuals; once again, Haar cascades can be used to find the closest match. However, there are far more sophisticated methods of template matching for face recognition, which involve extensive pre-processing and transformation of the extracted grey-level intensity values. For example, Principal Component Analysis, sometimes known as the Eigenfaces approach, projects the grey-level images into a lower-dimensional space before template matching.
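As an illustration of the Eigenfaces idea (not necessarily the exact method used in this project), the following minimal sketch uses OpenCV's EigenFaceRecognizer, assuming the opencv-contrib-python package is installed; the random arrays stand in for cropped, equal-sized grayscale face images:

    import cv2
    import numpy as np

    # Placeholder training data: equal-sized grayscale face crops and labels.
    faces = [np.random.randint(0, 256, (100, 100), dtype=np.uint8) for _ in range(4)]
    labels = np.array([0, 0, 1, 1])

    recognizer = cv2.face.EigenFaceRecognizer_create()
    recognizer.train(faces, labels)

    # Predict the label of a probe face and report a distance-like confidence.
    label, confidence = recognizer.predict(faces[0])
    print("Predicted label:", label, "confidence:", confidence)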

4.3 Voice Recognition

The voice recognition method was used for command purposes such as home automation, with a NodeMCU as the controller. The Python program recognizes the user's voice command using the Google voice recognition API and sets or resets the home appliance variable in Firebase, Google's online database. The NodeMCU is connected to the nearest Wi-Fi network and reads the state of the home appliances from Firebase. Since this project is just a prototype, LEDs and a small motor were used instead of a light and a fan. The working of this part is shown in the following phases:

Microphone → (voice) → Google voice-to-text API → (text) → NodeMCU

Figure 7: Voice Recognition Process
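A minimal sketch of this voice-command path is shown below, assuming the SpeechRecognition and requests packages are installed; the Firebase URL and the "fan" key are placeholders for the project's actual database layout:

    import requests
    import speech_recognition as sr

    FIREBASE_URL = "https://example-project.firebaseio.com"  # placeholder URL

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)          # record until silence

    try:
        text = recognizer.recognize_google(audio).lower()  # Google voice-to-text
    except (sr.UnknownValueError, sr.RequestError):
        text = ""

    # Set or reset the appliance variable that the NodeMCU polls from Firebase.
    if "fan on" in text:
        requests.put(FIREBASE_URL + "/fan.json", json=1)
    elif "fan off" in text:
        requests.put(FIREBASE_URL + "/fan.json", json=0)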

4.4 Main Program

       4.4.1 Launching Program

When the user clicks the app icon, the launching program with the logo runs on the server or user computer. It is coded in Python, so to run it directly the file extension was changed to '.pyw'. It automatically runs the program and opens the login/sign-up form by default; but if the previous user has not logged out yet, it opens the canvas frame instead. The logo remains on the screen for 5 seconds before the further operations follow. This timing can be changed as required; it is kept so that the background processes can finish and the server can then focus on this app. The logo frame is shown below:

                                  Figure 8: Logo Frame

        4.4.2 Login / Sign up

This form automatically opens after the logo, but it does not show up if the previous user has not logged out. It provides two options: log in and sign up. A registered user can log in with their credentials, and a new user can log in after registration. After login, the main program opens for face detection and recognition.

In the login frame, the user is asked to enter his/her username and password. The username of each user is uniquely registered. To keep the password secret, it is hidden with asterisk '*' signs. The login frame is shown below:

Figure 9: Login Frame

If a user is new and wants to register, he/she should click the register button at the top, which opens the register frame. It contains three entry fields: username, password and re-enter password. The user chooses what to fill into these fields, but to keep usernames unique, an error pops up if the username is already taken. It also checks whether the passwords in both fields match.

Figure 10: Sign up frame

4.4.3 Canvas

After login the system launches the main program, FAVRIC. It has three modes of operation: normal mode, detection mode and standby mode. In normal mode, it simply opens the camera and shows user information. In detection mode it detects faces in the frame. In standby mode the camera is off but the program keeps running. There are also other widgets for tasks such as face recognition, saving and editing data, closing the program and logging out.

Figure 11: Main Canvas

For the face detection the following processes are followed:

  • Open Camera (web cam)
  • Get continuous frame in the image canvas
  • Read the haar cascade file
  • Apply the file on the frame to detect faces
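A minimal sketch of the detection steps above, assuming OpenCV is installed and using the frontal-face cascade shipped with it, is:

    import cv2

    # Read the Haar cascade file shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    camera = cv2.VideoCapture(0)          # open the web cam
    while True:
        ok, frame = camera.read()         # get continuous frames
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Apply the cascade on the frame to detect faces.
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:        # mark each detected face
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("FAVRIC detection", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    camera.release()
    cv2.destroyAllWindows()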

After a face is detected, different samples of the corresponding person's face are saved as a dataset and a '.yml' file is created as a classifier.
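The following sketch shows how such samples could be turned into a '.yml' classifier file using OpenCV's LBPH recognizer, assuming the opencv-contrib-python package is installed; the placeholder arrays stand in for the saved face samples:

    import cv2
    import numpy as np

    # In the real application these would be the cropped grayscale face
    # samples saved for each registered user, with one integer label per user.
    samples = [np.random.randint(0, 256, (200, 200), dtype=np.uint8) for _ in range(4)]
    labels = np.array([1, 1, 2, 2])

    recognizer = cv2.face.LBPHFaceRecognizer_create()
    recognizer.train(samples, labels)
    recognizer.write("trainer.yml")       # the classifier file mentioned above

    # Later, a detected face can be compared against the trained model.
    recognizer.read("trainer.yml")
    label, confidence = recognizer.predict(samples[0])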

5. MODEL VERIFICATION

It was important to have the same conditions for all the testers to prove how the program responded and to get conclusive results. Once their faces were added into the database, the testers became database subjects.

Likewise, once the voice input was recorded using the microphone, the audio data became the processing subject.

5.1 Test situation

The test subjects were sitting on a chair about one meter from the camera and microphone, which were always positioned at the same angle. All tests were done in a well-illuminated room. The lighting was regular and white, coming mainly from the ceiling and from the front. At the moment of saving the images into the database, it was important that the subjects stayed almost motionless, except for the head, which had to be turned slightly to the side. The application detected the person and saved the images into the database.

At the moment of recording the voice, it was important that the subjects spoke each word clearly. The environment was not noiseless, but the noise was not at a level to be concerned about.

5.2 Test Procedure

The testers were able to freely interact with the computer and inspect the different parts of the program. The tests were always conducted under the same conditions. The name of the subject was asked to be written into the appropriate name box in order to add the subject's face to the database. In case of problems with the interface, the test was repeated from the beginning.

Likewise, when the microphone detects the subject's voice it starts recording and stops when it no longer detects any voice. It then converts the audio data into text using the Google Speech Recognition API. After the audio data is converted into text, it is passed through the command extraction program to extract the command. If the text does not contain any command, or contains an invalid command, the test was repeated from the beginning.
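A minimal sketch of such a command-extraction step is shown below; the command table is illustrative, not the project's actual command list:

    # Map recognized phrases to (appliance, state) actions.
    COMMANDS = {
        "light on": ("light", 1),
        "light off": ("light", 0),
        "fan on": ("fan", 1),
        "fan off": ("fan", 0),
    }

    def extract_command(text):
        """Return (appliance, state) if the recognized text contains a known
        command, otherwise None so the test can be repeated."""
        text = text.lower()
        for phrase, action in COMMANDS.items():
            if phrase in text:
                return action
        return None

    print(extract_command("please turn the fan on"))   # ('fan', 1)
    print(extract_command("hello there"))              # None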

6. LIMITATIONS

After the completion of the project, faces and voices were successfully recognized and the appliances were successfully operated using voice commands.

Although FAVRIC achieved a high detection rate (80% across all test cases in this project), the missed face and command detections can be attributed to the following:

  1. Illumination condition
  2. Pose orientation
  3. Facial Expression
  4. Faulty pronunciation
  5. Noisy environment

7. APPLICATIONS

This system has applications in many areas. Some of these areas are mentioned below:

  • It is applicable for home automation requirements
  • It is applicable for those who are unable to type or do not know how the system works
  • It is mainly applicable for physically disabled people, such as the blind or those unable to use their hands
  • It can be used in research centers where there is no time to type every search
  • It can be used in classrooms for automatic teaching
  • It can be used for AI purposes too

8. RESULT AND CONCLUSION

To improve the recognition performance, there are many things that can be improved here, some of them fairly easy to implement. For example, you could add color processing, edge detection, etc. You can usually improve the face recognition accuracy by using more input images per person (at least 30 to 50), particularly by taking photos from different angles and lighting conditions. If you can't take more photos, there are several simple techniques you can use to obtain more training data. You could create mirror copies of your facial images, so that you have twice as many training images and the model won't have a bias towards the left or right.

You could translate, resize or rotate your facial images slightly to produce many alternative images for training, so that the model is less sensitive to exact conditions. You could also add image noise to obtain more training images, which improves the tolerance to noise.

Many things can be implemented to improve the accuracy of the voice commands too. Since the system uses the Google API, there is little we can do to make the audio-to-text conversion itself more accurate, because that part is already done by Google. We can increase the accuracy rate by using a better microphone and by working in a less noisy place. To control the appliances more accurately, the command extraction program can be improved further, for example by using a switch-like dispatch of commands.

To increase the accuracy to a nominal rate, human intervention is also required, because this system can show results only up to a certain level.

So it is important to do a lot of experimentation if you want better results, and if you still can't get good results after trying many things, you will perhaps need more complicated face recognition and voice recognition algorithms than Haar cascades and the Google API respectively.

9. FUTURE SCOPE

Any system that is designed must be feasible and should be robust so that it can be used in the future. Our system satisfies these requirements. The future scopes of the system are:

  • This system can be extended to perform more complex tasks
  • The decision algorithm can be improved to make better decisions and can be used in complex systems
  • It can be improved to perform tasks related to machine learning
  • It can be extended for Artificial Intelligence
  • It can be used for behavior analysis studies using object tracking
  • It can be used for security access purposes

NOTE:

This project was done by:

ER.AMIT KUMAR YADAV   : https://www.linkedin.com/in/heyamitkumar/              

ER.AMIT PRAKASH GHIMIRE         

ER.MANDIP RAI                                   

ER.SUSHANT BHUJEL
