Friday, December 6, 2019

Voice Recognition

Contents:
Introduction
The Inherent Problem
How to Compare Recordings
Dependence of System's Accuracy
Algorithm Instructions
Source Code
Software Requirements
Hardware Requirements
References

Introduction

The project Attendance through Voice Recognition is a tool that helps an organization or academic institute take attendance of its employees or students, as well as its faculty members. It also records the date and time at which each member is present. This project allows an organization or academic institute to overcome the problem of proxy attendance to a great extent. Many organizations face this problem: an employee may have his attendance marked by someone else, and the organization may not detect it because there is no process of verification, and it is difficult to recognize the face or voice of every person. The same situation exists in academic institutes. A faculty member who is late or absent may have his attendance marked by a colleague, which is a common scenario in a government institute. Faculty members can also use this software to detect proxies among students.

The project describes the process behind implementing a voice recognition algorithm in MATLAB. The algorithm uses the Discrete Fourier Transform to compare the frequency spectra of two voices. Chebyshev's Inequality is then used to determine (with reasonable certainty) whether two voices came from the same person or not. If the two voices match, a "present" is marked in the attendance register, i.e. in a database, and the date and time of attendance are stored along with it.

Speech is a natural mode of communication for people. We learn all the relevant skills during early childhood, without instruction, and we continue to rely on speech communication throughout our lives. It comes so naturally to us that we don't realize how complex a phenomenon speech is.
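The attendance-marking step described above (a "present" entry stored together with its date and time) can be sketched as follows. This is a minimal illustration in Python rather than the project's MATLAB code, and the function name, the dict-based register, and the timestamp format are all hypothetical stand-ins for the real database:

```python
from datetime import datetime

def mark_attendance(register, name, now=None):
    """Record a 'present' entry for `name` along with the date and time.

    `register` is a dict mapping each member's name to a list of
    timestamp strings -- a stand-in for the real attendance database.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    register.setdefault(name, []).append(stamp)
    return stamp

register = {}
stamp = mark_attendance(register, "alice", datetime(2019, 12, 6, 9, 30, 0))
# register is now {"alice": ["2019-12-06 09:30:00"]}
```

A real deployment would write to a persistent database table instead of an in-memory dict, but the record kept per match (member, date, time) is the same.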
The human vocal tract and articulators are biological organs with nonlinear properties, whose operation is not just under conscious control but also affected by factors ranging from gender to upbringing to emotional state. As a result, vocalizations can vary widely in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can be further distorted by background noise and echoes, as well as electrical characteristics (if telephones or other electronic equipment are used). All these sources of variability make speech recognition, even more than speech generation, a very complex problem.

A human can easily recognize a familiar voice; however, getting a computer to distinguish a particular voice among others is a more difficult task. Several problems arise immediately when trying to write a voice recognition algorithm. The majority of these difficulties are due to the fact that it is almost impossible to say a word exactly the same way on two different occasions. Some factors that continuously change in human speech are how fast the word is spoken, which parts of the word are emphasized, and so on. Furthermore, even if a word could be said the same way on different occasions, we would still be left with another major dilemma: in order to analyze two sound files in the time domain, the recordings would have to be aligned just right so that both recordings begin at precisely the same moment.

The Frequency Domain

Given the difficulties mentioned above, one thing becomes very evident: any attempt to analyze sounds in the time domain will be extremely impractical. Instead, this led us to analyze the frequency spectrum of a voice, which remains predominantly unchanged as speech is slightly varied. We then used the Discrete Fourier Transform to convert all recordings into the frequency domain before any comparisons were made.
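The reason the frequency domain sidesteps the alignment problem can be shown with a toy example. The sketch below (in Python with NumPy, standing in for the project's MATLAB code) computes DFT magnitude spectra and demonstrates that a circularly shifted copy of a signal has an identical magnitude spectrum; a real delay in a finite recording is not exactly circular, but the principle is the same:

```python
import numpy as np

def magnitude_spectrum(signal):
    """Return the magnitude of the DFT of `signal` -- the representation
    the comparison works on, since it ignores where the word starts."""
    return np.abs(np.fft.rfft(signal))

# A toy "recording" and the same recording delayed (circularly shifted):
rng = np.random.default_rng(0)
word = rng.standard_normal(1024)
delayed = np.roll(word, 200)

# The magnitude spectra agree even though the time-domain samples do not.
print(np.allclose(magnitude_spectrum(word), magnitude_spectrum(delayed)))
# prints: True
```

A time shift only multiplies each DFT coefficient by a unit-magnitude phase factor, so taking absolute values discards exactly the alignment information that made time-domain comparison impractical.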
Working in the frequency domain eliminates the need to exactly align audio tracks before making a comparison.

Finding a Norm

Due to the nature of human speech, all data pertaining to frequencies above 600 Hz can freely be discarded. It then follows that our files can be regarded as vectors in 600-dimensional Euclidean space. After normalizing our vectors, we took the norm of the difference of two frequency spectra as a way of comparing them. After comparing and contrasting the use of the L1, L2, and L-infinity norms, we concluded that the L2 norm most accurately measured how close the frequency spectra of two different vectors were. At this point, all that remained was to decide exactly how small the norm of the difference of two frequency spectra had to be in order to determine that both recordings originated from the same person.

Chebyshev's Inequality

Recall that Chebyshev's Inequality states, in particular, that at least 3/4 of all measurements from the same population fall within 2 standard deviations of the mean. Hence, in response to the problem posed at the end of the previous paragraph, we formed the following solution: by requiring that the norm of the difference fall within 2 standard deviations of the average for that speaker's voice, we are ensured that at least 3/4 of the time, the algorithm will recognize a voice correctly.

Dependence of System's Accuracy

The accuracy of voice recognition depends on many factors. A system's accuracy depends on the conditions under which it is evaluated: under sufficiently narrow conditions almost any system can attain human-like accuracy, but it is much harder to achieve good accuracy under general conditions. The conditions of evaluation, and hence the accuracy of any system, can vary along the following dimensions:

Vocabulary size and confusability.
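The normalize-then-compare step and the Chebyshev-style acceptance rule described above can be sketched in a few lines. Again this is a Python illustration of the method (the project itself is in MATLAB), and the function names are hypothetical:

```python
import numpy as np

def normalize(spectrum):
    """Scale a frequency spectrum to unit length (unit L2 norm)."""
    return spectrum / np.linalg.norm(spectrum)

def spectral_distance(a, b):
    """L2 norm of the difference of two normalized spectra."""
    return np.linalg.norm(normalize(a) - normalize(b))

def same_speaker(distance, mean_dist, std_dist):
    """Chebyshev-style decision: accept the match when the distance falls
    within 2 standard deviations of the speaker's average distance, so
    that at least 3/4 of genuine attempts are accepted."""
    return abs(distance - mean_dist) <= 2 * std_dist
```

Normalizing first makes the comparison insensitive to recording volume; `mean_dist` and `std_dist` would be estimated from the speaker's enrollment recordings.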
As a general rule, it is easy to discriminate among a small set of words, but error rates naturally increase as the vocabulary size grows. For example, the 10 digits zero to nine can be recognized essentially perfectly, but vocabulary sizes of 200, 5000, or 100000 may have error rates of 3%, 7%, or 45%. On the other hand, even a small vocabulary can be hard to recognize if it contains confusable words. For example, the 26 letters of the English alphabet (treated as 26 words) are very difficult to discriminate because they contain so many confusable words (most notoriously, the E-set: B, C, D, E, G, P, T, V, Z); an 8% error rate is considered good for this vocabulary.

Speaker dependence vs. independence. By definition, a speaker-dependent system is intended for use by a single speaker, while a speaker-independent system is intended for use by any speaker. Speaker independence is difficult to achieve because a system's parameters become tuned to the speaker(s) it was trained on, and these parameters tend to be highly speaker-specific.

Isolated, discontinuous, or continuous speech. Isolated speech means single words; discontinuous speech means full sentences in which words are artificially separated by silence; and continuous speech means naturally spoken sentences. Isolated and discontinuous speech recognition is relatively easy because word boundaries are detectable and the words tend to be cleanly pronounced.

Task and language constraints. Even with a fixed vocabulary, performance will vary with the nature of the constraints on the word sequences allowed during recognition. Some constraints may be task-dependent (for example, an airline-querying application may dismiss the hypothesis "The apple is red"); other constraints may be semantic (rejecting "The apple is angry") or syntactic (rejecting "Red is apple the"). Constraints are often represented by a grammar, which ideally filters out unreasonable sentences so that the speech recognizer evaluates only plausible sentences.
Grammars are usually rated by their perplexity, a number that indicates the grammar's average branching factor (i.e., the number of words that can follow any given word). The difficulty of a task is more reliably measured by its perplexity than by its vocabulary size.

Read vs. spontaneous speech. Systems can be evaluated on speech that is either read from prepared scripts or uttered spontaneously. Spontaneous speech is vastly more difficult, because it tends to be peppered with disfluencies like "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter; moreover, the vocabulary is essentially unlimited, so the system must be able to deal intelligently with unknown words (e.g., detecting and flagging their presence, and adding them to the vocabulary, which may require some interaction with the user).

Adverse conditions. A system's performance can also be degraded by a range of adverse conditions. These include environmental noise (e.g., noise in a car or a factory); acoustical distortions (e.g., echoes, room acoustics); different microphones (e.g., close-speaking, omnidirectional, or telephone); limited frequency bandwidth (in telephone transmission); and altered speaking manner (shouting, whining, speaking quickly, etc.).

Algorithm Instructions

The project contains a folder named Matlab Files, which holds 10 audio recordings of each person whose voice is to be recognized; every person should record 10 samples of their voice saying their name. That folder also contains two m-files, project.m and voice.m. project.m is the voice recognition program that accomplishes the goals of the project. The script file project.m can be run from the command window in MATLAB. Please ensure that the MATLAB working directory is set to the directory that contains project.m and the audio recordings. Once project.m is run in MATLAB, it will ask you to enter the name that must be recognized.
Then type in the name to be recognized; the name typed must have its recorded voice in the audio folder. After that, the program will inform you that you have 2 seconds to record yourself saying the name.
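Putting the pieces together, the overall flow of the recognition step (enroll 10 recordings per speaker, derive the Chebyshev threshold from their pairwise distances, then test a fresh 2-second sample) can be sketched as below. This is a Python approximation of what project.m does, under stated assumptions: the 600-bin cutoff follows the report, while the signals, function names, and threshold details are illustrative only:

```python
import numpy as np

def spectrum(signal, dims=600):
    """Unit-normalized magnitude spectrum, truncated to the first `dims`
    bins (the low-frequency band the project keeps)."""
    s = np.abs(np.fft.rfft(signal))[:dims]
    return s / np.linalg.norm(s)

def enroll(recordings):
    """From a speaker's enrollment recordings, compute each spectrum plus
    the mean and standard deviation of the pairwise L2 distances, which
    set the Chebyshev-style acceptance threshold (mean + 2 * std)."""
    specs = [spectrum(r) for r in recordings]
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(specs) for b in specs[i + 1:]]
    return specs, np.mean(dists), np.std(dists)

def recognize(sample, specs, mean_d, std_d):
    """Accept (mark present) when the sample's distance to the closest
    enrolled spectrum falls within mean + 2 * std."""
    d = min(np.linalg.norm(spectrum(sample) - s) for s in specs)
    return d <= mean_d + 2 * std_d
```

On acceptance, project.m records the "present" entry with its date and time; on rejection, no attendance is marked.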
