Avoiding overstating the strength of forensic evidence: Shrunk likelihood ratios / Bayes factors
Morrison, G.S., Poh, N. (2017 submitted).
- Matlab code
- When strength of forensic evidence is quantified using sample data and statistical models, a concern may be raised as to whether the output of a model overestimates the strength of evidence. This is particularly the case when the amount of sample data is small, and hence sampling variability is high. This concern is related to concern about precision. This paper describes, explores, and tests three procedures which shrink the value of the likelihood ratio or Bayes factor toward the neutral value of one. The procedures are: (1) a Bayesian procedure with uninformative priors, (2) use of empirical lower and upper bounds (ELUB), and (3) a novel form of regularized logistic regression. As a benchmark, they are compared with linear discriminant analysis, and in some instances with non-regularized logistic regression. The behaviours of the procedures are explored using Monte Carlo simulated data, and tested on real data from comparisons of voice recordings, face images, and glass fragments.
Forensic speech science.
Morrison, G.S., Enzinger, E., Zhang, C. (2017). In I. Freckelton, & H. Selby (Eds.), Expert Evidence (Ch. 99). Sydney, Australia: Thomson Reuters.
- A revised, updated, and expanded edition of Morrison (2010) “Forensic voice comparison”. It introduces forensic speech science in a relatively non-technical way, assuming a reader who has no prior knowledge of the subject. As with the previous edition, the revised edition provides an introduction to forensic voice comparison and to speaker recognition by laypeople (e.g., earwitnesses). Compared to the previous edition, the revised edition has a heavier focus on automatic approaches to forensic voice comparison. The revised edition also includes coverage of other areas of forensic speech science, particularly disputed utterance analysis.
Empirical test of the performance of an acoustic-phonetic approach to forensic voice comparison under conditions similar to those of a real case
Enzinger, E., Morrison, G.S. (2017). Forensic Science International, 277, 30–40.
- In a 2012 case in New South Wales, Australia, the identity of a speaker on several audio recordings was in question. Forensic voice comparison testimony was presented based on an auditory-acoustic-phonetic-spectrographic analysis. No empirical demonstration of the validity and reliability of the analytical methodology was presented. Unlike the admissibility standards in some other jurisdictions (e.g., US Federal Rule of Evidence 702 and the Daubert criteria, or England & Wales Criminal Practice Directions 19A), Australia’s Unified Evidence Acts do not require demonstration of the validity and reliability of analytical methods and their implementation before testimony based upon them is presented in court. The present paper reports on empirical tests of the performance of an acoustic-phonetic-statistical forensic voice comparison system which exploited the same features as were the focus of the auditory-acoustic-phonetic-spectrographic analysis in the case, i.e., second-formant (F2) trajectories in /o/ tokens and mean fundamental frequency (f0). The tests were conducted under conditions similar to those in the case. The performance of the acoustic-phonetic-statistical system was very poor compared to that of an automatic system.
Comments on National Commission on Forensic Science (NCFS) Views on Statistical Statements in Forensic Testimony
What should a forensic practitioner’s likelihood ratio be? II
Morrison, G.S. (2017). Science & Justice, X, xx.
- In the debate as to whether forensic practitioners should assess and report the precision of the strength of evidence statements that they report to the courts, I remain unconvinced by proponents of the position that only a subjectivist concept of probability is legitimate. I consider this position counterproductive for the goal of having forensic practitioners implement, and courts not only accept but demand, logically correct and scientifically valid evaluation of forensic evidence. In considering what would be the best approach for evaluating strength of evidence, I suggest that the desiderata be (1) to maximise empirically demonstrable performance; (2) to maximise objectivity in the sense of maximising transparency and replicability, and minimising the potential for cognitive bias; and (3) to constrain and make overt the forensic practitioner’s subjective-judgement based decisions so that the appropriateness of those decisions can be debated before the judge in an admissibility hearing and/or before the trier of fact at trial. All approaches require the forensic practitioner to use subjective judgement, but constraining subjective judgement to decisions relating to selection of hypotheses, properties to measure, training and test data to use, and statistical modelling procedures to use (decisions which are remote from the output stage of the analysis) will substantially reduce the potential for cognitive bias. Adopting procedures based on relevant data, quantitative measurements, and statistical models, and directly reporting the output of the statistical models will also maximise transparency and replicability. A procedure which calculates a Bayes factor on the basis of relevant sample data and reference priors is no less objective than a frequentist calculation of a likelihood ratio on the same data. In general, a Bayes factor calculated using uninformative or reference priors will be closer to a value of 1 than a frequentist best estimate likelihood ratio.
The bound closest to 1 based on a frequentist best estimate likelihood ratio and an assessment of its precision will also, by definition, be closer to a value of 1 than the frequentist best estimate likelihood ratio. From a practical perspective, both procedures shrink the strength of evidence value towards the neutral value of 1. A single-value Bayes factor or likelihood ratio may be easier for the courts to handle than a distribution. I therefore propose as a potential practical solution, the use of procedures which account for imprecision by shrinking the calculated Bayes factor or likelihood ratio towards 1, the choice of the particular procedure being based on empirical demonstration of performance.
Assessing the admissibility of a new generation of forensic voice comparison testimony
Morrison, G.S., Thompson, W.C. (2017). Columbia Science and Technology Law Review, 18, 326–434.
- preprint: https://ssrn.com/abstract=2883767
- preprint: https://www.newton.ac.uk/files/preprints/ni16053.pdf
- This article provides a primer on forensic voice comparison (aka forensic speaker recognition), a branch of forensic science in which the forensic practitioner analyzes a voice recording in order to provide an expert opinion that will help the trier-of-fact determine the identity of the speaker. The article begins with an explanation of ways in which human speech varies within and between speakers. It then discusses different technical approaches that forensic practitioners have used to compare voice recordings, and frameworks of reasoning that practitioners have used for evaluating the evidence and reporting its strength. It then discusses procedures for empirical validation of the performance of forensic voice comparison systems. It also discusses the potential influence of contextual bias and ways to reduce this. Building on this scientific foundation, the article then offers analysis, commentary, and recommendations on how courts evaluate the admissibility of forensic voice comparison testimony under the Daubert and Frye standards. It reviews past rulings such as U.S. v. Angleton, 269 F.Supp 2nd 892 (S.D. Tex. 2003) that found expert testimony based on the spectrographic approach inadmissible under Daubert. The article also offers a detailed analysis of the evidence presented in the recent Daubert hearing in U.S. v. Ahmed, et al. 2015 EDNY 12-CR-661, which included testimony based on the newer automatic approach. The scientific testimony proffered in Ahmed is used to illustrate the issues courts are likely to face when considering the admissibility of forensic voice comparison testimony in the future. The article concludes with a discussion of how proponents of forensic voice comparison testimony might meet a reasonably rigorous application of the Daubert standard and thereby ensure that such testimony is sufficiently trustworthy to be used in court.
Forensic voice comparison
Zhang, C., Morrison, G.S. (2017). In: Sybesma, R., Behr, W., Gu, Y., Handel, Z., Huang, C.-T. J., Myers, J. (Eds.), Encyclopedia of Chinese Language and Linguistics (pp. 256–260). Leiden: Brill.
A comment on the PCAST report: Skip the “match”/“non-match” stage
Morrison, G.S., Kaye, D.H., Balding, D.J., Taylor, D., Dawid, P., Aitken, C.G.G., Gittelson, S., Zadora, G., Robertson, B., Willis, S.M., Pope, S., Neil, M., Martire, K.A., Hepler, A., Gill, R.D., Jamieson, A., de Zoete, J., Ostrum, R.B., Caliebe, A. (2016/2017). Forensic Science International, 272, e7–e9.
- pre-submission version: http://forensic-evaluation.net/PCAST2016/
- This letter comments on the report “Forensic science in criminal courts: Ensuring scientific validity of feature-comparison methods” recently released by the President’s Council of Advisors on Science and Technology (PCAST). The report advocates a procedure for evaluation of forensic evidence that is a two-stage procedure in which the first stage is “match”/“non-match” and the second stage is empirical assessment of sensitivity (correct acceptance) and false alarm (false acceptance) rates. Almost always, quantitative data from feature-comparison methods are continuously-valued and have within-source variability. We explain why a two-stage procedure is not appropriate for this type of data, and recommend use of statistical procedures which are appropriate.
Score based procedures for the calculation of forensic likelihood ratios - Scores should take account of both similarity and typicality
Morrison, G.S., Enzinger, E. (2017). Science & Justice, X, xx.
see also: http://geoff-morrison.net/#ICFIS2014
- This article is open access at the publisher’s website
- Matlab code
- Score based procedures for the calculation of forensic likelihood ratios are popular across different branches of forensic science. They have two stages, first a function or model which takes measured features from known-source and questioned-source pairs as input and calculates scores as output, then a subsequent model which converts scores to likelihood ratios. We demonstrate that scores which are purely measures of similarity are not appropriate for calculating forensically interpretable likelihood ratios. In addition to taking account of similarity between the questioned-origin specimen and the known-origin sample, scores must also take account of the typicality of the questioned-origin specimen with respect to a sample of the relevant population specified by the defence hypothesis. We use Monte Carlo simulations to compare the output of three score based procedures with reference likelihood ratio values calculated directly from the fully specified Monte Carlo distributions. The three types of scores compared are: 1. non-anchored similarity-only scores; 2. non-anchored similarity and typicality scores; and 3. known-source anchored same-origin scores and questioned-source anchored different-origin scores. We also make a comparison with the performance of a procedure using a dichotomous “match”/“non-match” similarity score, and compare the performance of 1 and 2 on real data.
Use of relevant data, quantitative measurements, and statistical models to calculate a likelihood ratio for a Chinese forensic voice comparison case involving two sisters
Zhang, C., Morrison, G.S., Enzinger, E. (2016). Forensic Science International, 267, 115–124.
- Currently, the standard approach to forensic voice comparison in China is the aural-spectrographic approach. Internationally, this approach has been the subject of much criticism. The present paper describes what we believe is the first forensic voice comparison analysis presented to a court in China in which a numeric likelihood ratio was calculated using relevant data, quantitative measurements, and statistical models, and in which the validity and reliability of the analytical procedures were empirically tested under conditions reflecting those of the case under investigation. The hypotheses addressed were whether the female speaker on a recording of a mobile telephone conversation was a particular individual, or whether it was that individual’s younger sister. Known speaker recordings of both these individuals were recorded using the same mobile telephone as had been used to record the questioned-speaker recording, and customised software was written to perform the acoustic and statistical analyses.
Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) - Introduction
Morrison, G.S., Enzinger, E. (2016). Speech Communication, 85, 119–126.
- This article should be open access. If for any reason you can’t access it at the SPECOM site
- There is increasing pressure on forensic laboratories to validate the performance of forensic analysis systems before they are used to assess strength of evidence for presentation in court. Different forensic voice comparison systems may use different approaches, and even among systems using the same general approach there can be substantial differences in operational details. From case to case, the relevant population, speaking styles, and recording conditions can be highly variable, but it is common to have relatively poor recording conditions and mismatches in speaking style and recording conditions between the known- and questioned-speaker recordings. In order to validate a system intended for use in casework, a forensic laboratory needs to evaluate the degree of validity and reliability of the system under forensically realistic conditions. The present paper is an introduction to a Virtual Special Issue consisting of papers reporting on the results of testing forensic voice comparison systems under conditions reflecting those of an actual forensic voice comparison case. A set of training and test data representative of the relevant population and reflecting the conditions of this particular case has been released, and operational and research laboratories are invited to use these data to train and test their systems. The present paper includes the rules for the evaluation and a description of the evaluation metrics and graphics to be used. The name of the evaluation is: forensic_eval_01
Reply to Hicks et alii (2017) Reply to Morrison et alii (2016) Refining the relevant population in forensic voice comparison - A response to Hicks et alii (2015) The importance of distinguishing information from evidence/observations when formulating propositions
Morrison, G.S., Enzinger, E., Zhang, C. (2017).
- The present letter to the editor is one in a series of publications discussing the formulation of hypotheses (propositions) for the evaluation of strength of forensic evidence. In particular, the discussion focusses on the issue of what information may be used to define the relevant population specified as part of the different-speaker hypothesis in forensic voice comparison. The previous publications in the series are: Hicks et al. (2015); Morrison et al. (2016); Hicks et al. (2017). The latter letter to the editor mostly resolves the apparent disagreement between the two groups of authors. We briefly discuss one outstanding point of apparent disagreement, and attempt to correct a misinterpretation of our earlier remarks. We believe that at this point there is no actual disagreement, and that both groups of authors are calling for greater collaboration in order to reduce the likelihood of future misunderstandings.
Refining the relevant population in forensic voice comparison - A response to Hicks et alii (2015) The importance of distinguishing information from evidence/observations when formulating propositions
Morrison, G.S., Enzinger, E., Zhang, C. (2016). Science & Justice, 56, 492–497.
- Hicks et al. (2015) propose that forensic speech scientists not use the accent of the speaker of questioned identity to refine the relevant population. This proposal is based on a lack of understanding of the realities of forensic voice comparison. If it were implemented, it would make data-based forensic voice comparison analysis within the likelihood ratio framework virtually impossible. We argue that it would also lead forensic speech scientists to present invalid unreliable strength of evidence statements, and not allow them to conduct the tests that would make them aware of this problem.
Special issue on measuring and reporting the precision of forensic likelihood ratios: Introduction to the debate
Morrison, G.S. (2016). Science & Justice, 56, 371–373.
- The present paper introduces the Science & Justice virtual special issue on measuring and reporting the precision of forensic likelihood ratios: whether this should be done, and if so how. The focus is on precision (aka reliability) as opposed to accuracy (aka validity). The topic is controversial and different authors are expected to express a range of nuanced opinions. The present paper frames the debate, explaining the underlying problem and referencing classes of solutions proposed in the existing literature. The special issue will consist of a number of position papers, responses to those position papers, and replies to the responses.
What should a forensic practitioner’s likelihood ratio be?
Position Paper in the Science & Justice Virtual Special Issue on measuring and reporting the precision of forensic likelihood ratios
Morrison, G.S., Enzinger, E. (2016). Science & Justice, 56, 374–379.
- Matlab code: combine_imprecise_priors_LR
- We argue that forensic practitioners should empirically assess and report the precision of their likelihood ratios. Once the practitioner has specified the prosecution and defence hypotheses they have adopted, including the relevant population they have adopted, and has specified the type of measurements they will make, their task is to empirically calculate an estimate of a likelihood ratio which has a true but unknown value. We explicitly reject the competing philosophical position that the forensic practitioner’s likelihood ratio should be based on subjective personal probabilities. Estimates of true but unknown values are based on samples and are subject to sampling uncertainty, and it is standard practice to report the degree of precision of such estimates. We discuss the dangers of not reporting precision to the courts, and the problems with an alternative approach which instead reports a verbal expression corresponding to a pre-specified range of likelihood ratio values. Reporting precision as an interval requires an arbitrary choice of coverage, e.g., a 95% or a 99% credible interval. We outline a normative framework which a trier of fact could use to make non-arbitrary use of the results of forensic practitioners’ empirical calculations of likelihood ratios and their precision.
INTERPOL survey of the use of speaker identification by law enforcement agencies
Morrison, G.S., Sahito, F.H., Jardine, G., Djokic, D., Clavet, S., Berghs, S., Goemans Dorny, C. (2016). Forensic Science International, 263, 92–100.
- A survey was conducted of the use of speaker identification by law enforcement agencies around the world. A questionnaire was circulated to law enforcement agencies in the 190 member countries of INTERPOL. 91 responses were received from 69 countries. 44 respondents reported that they had speaker identification capabilities in house or via external laboratories. Half of these came from Europe. 28 respondents reported that they had databases of audio recordings of speakers. The clearest pattern in the responses was that of diversity. A variety of different approaches to speaker identification were used: The human-supervised-automatic approach was the most popular in North America, the auditory-acoustic-phonetic approach was the most popular in Europe, and the spectrographic/auditory-spectrographic approach was the most popular in Africa, Asia, the Middle East, and South and Central America. Globally, and in Europe, the most popular framework for reporting conclusions was identification/exclusion/inconclusive. In Europe, the second most popular framework was the use of verbal likelihood ratio scales.
Response to forensic science questions posed by President’s Council of Advisors on Science and Technology
Statement regarding the UK Parliamentary Office of Science and Technology 2015 briefing on Forensic Linguistics
Morrison, G.S. (2015-10-21)
A demonstration of the application of the new paradigm for the evaluation of forensic evidence under conditions reflecting those of a real forensic-voice-comparison case.
Enzinger, E., Morrison, G.S., Ochoa, F. (2015/2016). Science & Justice, 56, 42–57.
- Audio examples
- The new paradigm for the evaluation of the strength of forensic evidence includes: the use of the likelihood-ratio framework; the use of relevant data, quantitative measurements, and statistical models; empirical testing of validity and reliability under conditions reflecting those of the case under investigation; and transparency as to decisions made and procedures employed. The present paper illustrates the use of the new paradigm to evaluate strength of evidence under conditions reflecting those of a real forensic-voice-comparison case. The offender recording was from a landline telephone system, had background office noise, and was saved in a compressed format. The suspect recording included substantial reverberation and ventilation system noise, and was saved in a different compressed format. The present paper includes descriptions of the selection of the relevant hypotheses, sampling of data from the relevant population, simulation of suspect and offender recording conditions, and acoustic measurement and statistical-modelling procedures. The present paper also explores the use of different techniques to compensate for the mismatch in recording conditions. It also examines how system performance would have differed had the suspect recording been of better quality.
Mismatched distances from speakers to telephone in a forensic-voice-comparison case
Enzinger, E., Morrison, G.S. (2015). Speech Communication, 70, 28–41.
- In a forensic-voice-comparison case, one speaker (A) was standing a short distance away from another speaker (B) who was talking on a mobile telephone. Later, speaker A moved closer to the telephone. Shortly thereafter, there was a section of speech where the identity of the speaker was in question, the prosecution claiming that it was speaker A and the defense claiming it was speaker B. All material for training a forensic-voice-comparison system could be extracted from this single recording, but there was a near-far mismatch: Training data for speaker A were mostly far, training data for speaker B were near, and the disputed speech was near. Based on the conditions of this case we demonstrate a methodology for handling forensic casework using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. A procedure is described for addressing the degree of validity and reliability of a forensic-voice-comparison system under such conditions. Using a set of development speakers we investigate the effect of mismatched distances to the microphone and demonstrate and assess three methods for compensation.
Calculation of forensic likelihood ratios: Use of Monte Carlo simulations to compare the output of score-based approaches with true likelihood-ratio values
Morrison, G.S. (2015). Research Report.
Stable URL: http://geoff-morrison.net/#ICFIS2014
- Also available at: http://arxiv.org/abs/1612.08165
- Matlab code
- A group of approaches for calculating forensic likelihood ratios first calculates scores which quantify the degree of difference or the degree of similarity between pairs of samples, then converts those scores to likelihood ratios. In order for a score-based approach to produce a forensically interpretable likelihood ratio, however, in addition to accounting for the similarity of the questioned sample with respect to the known sample, it must also account for the typicality of the questioned sample with respect to the relevant population. The present paper explores a number of score-based approaches using different types of scores and different procedures for converting scores to likelihood ratios. Monte Carlo simulations are used to compare the output of these approaches to true likelihood-ratio values calculated on the basis of the distribution specified for a simulated population. The inadequacy of approaches based on similarity-only or difference-only scores is illustrated, and the relative performance of different approaches which take account of both similarity and typicality is assessed.
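The point about similarity and typicality can be illustrated with a small worked example. The sketch below is not the code from the research report; it assumes a hypothetical univariate population in which source means are normally distributed around 0 (standard deviation `sigma_b`) and each observation scatters normally around its source mean (standard deviation `sigma_w`), with one known-source and one questioned-source measurement. All names and parameter values are illustrative assumptions. Under those assumptions the likelihood ratio can be calculated directly from the specified distributions, and two pairs with the same difference between them give different values depending on how typical they are of the population.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def true_lr(x_q, x_k, sigma_b=1.0, sigma_w=0.5):
    """Likelihood ratio for a questioned value x_q and a known value x_k.

    Assumed model (illustrative only): source means ~ N(0, sigma_b^2),
    observations ~ N(source mean, sigma_w^2).
    """
    total_var = sigma_b ** 2 + sigma_w ** 2
    rho = sigma_b ** 2 / total_var  # correlation of two values from the same source
    # Same-origin hypothesis: density of x_q given x_k from the same source.
    cond_mean = rho * x_k
    cond_var = total_var * (1.0 - rho ** 2)
    numerator = normal_pdf(x_q, cond_mean, cond_var)
    # Different-origin hypothesis: density of x_q in the relevant population.
    denominator = normal_pdf(x_q, 0.0, total_var)
    return numerator / denominator


# Both pairs differ by 0.1, so a difference-only score would treat them identically.
lr_typical = true_lr(x_q=0.1, x_k=0.0)   # values near the population mean
lr_atypical = true_lr(x_q=3.1, x_k=3.0)  # values far out on the tail

print(f"typical pair:  LR = {lr_typical:.2f}")
print(f"atypical pair: LR = {lr_atypical:.2f}")
```

Under these assumptions the atypical pair gives the larger likelihood ratio, because the same degree of similarity is much less likely to arise by chance far from the centre of the population. A score based only on the difference between the two values cannot capture this.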
Critique by Dr Geoffrey Stewart Morrison of a forensic voice comparison report submitted by Mr Edward J Primeau in relation to a section of audio recording which is alleged to be a recording of the voice of Dr Marlo Raynolds
Likelihood ratio calculation for a disputed-utterance analysis with limited available data.
Morrison, G.S., Lindh, J., Curran, J.M. (2014). Speech Communication, 58, 81–90.
- Matlab and R code
- We present a disputed-utterance analysis using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. The acoustic data were taken from an actual forensic case in which the amount of data available to train the statistical models was small and the data point from the disputed word was far out on the tail of one of the modelled distributions. A procedure based on single multivariate Gaussian models for each hypothesis led to an unrealistically high likelihood ratio value with extremely poor reliability, but a procedure based on Hotelling’s T² statistic and a procedure based on calculating a posterior predictive density produced more acceptable results. The Hotelling’s T² procedure attempts to take account of the sampling uncertainty of the mean vectors and covariance matrices due to the small number of tokens used to train the models, and the posterior-predictive-density analysis integrates out the values of the mean vectors and covariance matrices as nuisance parameters. Data scarcity is common in forensic speech science and we argue that it is important not to accept extremely large calculated likelihood ratios at face value, but to consider whether such values can be supported given the size of the available data and modelling constraints.
Forensic strength of evidence statements should preferably be likelihood ratios calculated using relevant data, quantitative measurements, and statistical models - a response to Lennard (2013) Fingerprint identification: How far have we come?
Morrison, G.S., Stoel, R.D. (2014). Australian Journal of Forensic Sciences, 46, 282–292.
- Lennard (2013) [Fingerprint identification: how far have we come? Aus J Forensic Sci. doi:10.1080/00450618.2012.752037] proposes that the numeric output of statistical models should not be presented in court (except ‘if necessary’/‘if required’). Instead, he argues in favour of an ‘expert opinion’ which may be informed by a statistical model but which is not itself the output of a statistical model. We argue that his proposed procedure lacks the transparency, the ease of testing of validity and reliability, and the relative robustness to cognitive bias that are the strengths of a likelihood-ratio approach based on relevant data, quantitative measurements, and statistical models, and that the latter is therefore preferable.
Distinguishing between forensic science and forensic pseudoscience: Testing of validity and reliability, and approaches to forensic voice comparison.
Morrison, G.S. (2014). Science & Justice, 54, 245–256.
- In this paper it is argued that one should not attempt to directly assess whether a forensic analysis technique is scientifically acceptable. Rather one should first specify what one considers to be appropriate principles governing acceptable practice, then consider any particular approach in light of those principles. This paper focuses on one principle: the validity and reliability of an approach should be empirically tested under conditions reflecting those of the case under investigation using test data drawn from the relevant population. Versions of this principle have been key elements in several reports on forensic science, including forensic voice comparison, published over the last four-and-a-half decades. The aural-spectrographic approach to forensic voice comparison (also known as “voiceprint” or “voicegram” examination) and the currently widely practiced auditory-acoustic-phonetic approach are considered in light of this principle (these two approaches do not appear to be mutually exclusive). Approaches based on data, quantitative measurements, and statistical models are also considered in light of this principle.
Forensic audio analysis - Review: 2010–2013.
Grigoras, C., Smith, J. M., Morrison, G.S., Enzinger, E. (2013). In: NicDaéid, N. (Ed.), Proceedings of the 17th International Forensic Science Managers’ Symposium, Lyon (pp. 612–637). Lyon, France: Interpol.
Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison - female voices.
Zhang, C., Morrison, G.S., Enzinger, E., Ochoa, F. (2013). Speech Communication, 55, 796–813.
- In forensic-voice-comparison casework a common scenario is that the suspect’s voice is recorded directly using a microphone in an interview room but the offender’s voice is recorded via a telephone system. Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants, and the second formant is often assumed to be relatively robust to telephone-transmission effects. This study assesses the effects of telephone transmission on the performance of formant-trajectory-based forensic-voice-comparison systems. The effectiveness of both human-supervised and fully-automatic formant tracking is investigated. Human-supervised formant tracking is generally considered to be more accurate and reliable but requires a substantial investment of human labor. Measurements were made of the formant trajectories of /iau/ tokens in a database of recordings of 60 female speakers of Chinese using one human-supervised and five fully-automatic formant trackers. Measurements were made under high-quality, landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions. High-quality recordings were treated as suspect samples and telephone-transmitted recordings as offender samples. Discrete cosine transforms (DCT) were fitted to the formant trajectories and likelihood ratios were calculated on the basis of the DCT coefficients. For each telephone-transmission condition the formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. 
The systems based on human-supervised formant measurement always outperformed the systems based on fully-automatic formant measurement; however, in conditions involving mobile telephones neither the former nor the latter type of system provided meaningful improvement over the baseline system, and even in the other conditions the high cost in skilled labor for human-supervised formant-trajectory measurement is probably not warranted given the relatively good performance that can be obtained using other less-costly procedures.
Reliability of human-supervised formant-trajectory measurement for forensic voice comparison.
Zhang, C., Morrison, G.S., Ochoa, F., Enzinger, E. (2013). Journal of the Acoustical Society of America, 133, EL54–EL60.
- Acoustic-phonetic approaches to forensic voice comparison often include human-supervised measurement of vowel formants, but the reliability of such measurements is a matter of concern. This study assesses the within- and between-supervisor variability of three sets of formant-trajectory measurements made by each of four human supervisors. It also assesses the validity and reliability of forensic-voice-comparison systems based on these measurements. Each supervisor’s formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient system, and performance was assessed relative to the baseline system. Substantial improvements in validity were found for all supervisors’ systems, but some supervisors’ systems were more reliable than others.
Tutorial on logistic-regression calibration and fusion: Converting a score to a likelihood ratio.
Morrison, G.S. (2013). Australian Journal of Forensic Sciences, 45, 173–197.
- typesetting errata
- Logistic-regression calibration and fusion are potential steps in the calculation of forensic likelihood ratios. The present paper provides a tutorial on logistic-regression calibration and fusion at a practical conceptual level with minimal mathematical complexity. A score is log-likelihood-ratio-like in that it indicates the degree of similarity of a pair of samples while taking into consideration their typicality with respect to a model of the relevant population. A higher-valued score provides more support for the same-origin hypothesis over the different-origin hypothesis than does a lower-valued score; however, the absolute values of scores are not interpretable as log likelihood ratios. Logistic-regression calibration is a procedure for converting scores to log likelihood ratios, and logistic-regression fusion is a procedure for converting parallel sets of scores from multiple forensic-comparison systems to log likelihood ratios. Logistic-regression calibration and fusion were developed for automatic speaker recognition and are popular in forensic voice comparison. They can also be applied in other branches of forensic science; a fingerprint/fingermark example is provided.
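- Illustrative sketch (not taken from the paper): the score-to-log-likelihood-ratio conversion described above can be pictured as fitting an intercept and slope by logistic regression. The function names `fit_calibration` and `calibrate`, the equal class weighting (used so that the fitted log odds can be read as a log likelihood ratio), and the simulated scores are all assumptions made for this example.

```python
import numpy as np

def fit_calibration(same_scores, diff_scores, n_iter=50):
    """Fit ln(LR) = a + b * score by class-weighted logistic regression."""
    same = np.asarray(same_scores, dtype=float)
    diff = np.asarray(diff_scores, dtype=float)
    s = np.concatenate([same, diff])
    y = np.concatenate([np.ones(len(same)), np.zeros(len(diff))])
    # Give each class a total weight of 0.5 so the implied prior odds are 1;
    # the fitted log odds can then be read as a natural-log likelihood ratio.
    w = np.where(y == 1, 0.5 / len(same), 0.5 / len(diff))
    X = np.column_stack([np.ones_like(s), s])
    beta = np.zeros(2)
    for _ in range(n_iter):  # Newton-Raphson updates
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess + 1e-12 * np.eye(2), grad)
    return beta[0], beta[1]

def calibrate(scores, a, b):
    """Convert uncalibrated scores to natural-log likelihood ratios."""
    return a + b * np.asarray(scores, dtype=float)

# Simulated training scores (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
train_same = rng.normal(1.0, 1.0, 2000)   # same-origin comparison scores
train_diff = rng.normal(-1.0, 1.0, 2000)  # different-origin comparison scores
a, b = fit_calibration(train_same, train_diff)
log10_lr = calibrate([0.5], a, b) / np.log(10)  # base-10 log LR for a new score
```

Fusion, under the same assumptions, would extend the single score column in `X` to one column per system.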
Vowel inherent spectral change in forensic voice comparison.
Morrison, G.S. (2013). In G.S. Morrison & P.F. Assmann (Eds.) Vowel inherent spectral change (pp. 263–283). Heidelberg, Germany: Springer-Verlag.
- The onset + offset model of vowel inherent spectral change has been found to be effective for vowel-phoneme identification, and not to be outperformed by more sophisticated parametric-curve models. This suggests that if only simple cues such as initial and final formant values are necessary for signaling phoneme identity, then speakers may have considerable freedom in the exact path taken between the initial and final formant values. If the constraints on formant trajectories are relatively lax with respect to vowel-phoneme identity, then with respect to speaker identity there may be considerable information contained in the details of formant trajectories. Differences in physiology and idiosyncrasies in the use of motor commands may mean that different individuals produce different formant trajectories between the beginning and end of the same vowel phoneme. If within-speaker variability is substantially smaller than between-speaker variability then formant trajectories may be effective features for forensic voice comparison. This chapter reviews a number of forensic-voice-comparison studies which have used different procedures to extract information from formant trajectories. It concludes that information extracted from formant trajectories can lead to a high degree of validity in forensic voice comparison (at least under controlled conditions), and that a whole-trajectory approach based on parametric curves outperforms an onset + offset model.
The importance of using between-session test data in evaluating the performance of forensic-voice-comparison systems.
Enzinger, E., Morrison, G.S. (2012). Proceedings of the 14th Australasian International Conference on Speech Science and Technology, Sydney (pp. 137–140). Australasian Speech Science and Technology Association.
- In this paper we report on a study which demonstrates the importance of using non-contemporaneous test data in evaluating the validity and reliability of forensic-voice-comparison systems. We test four different systems: one MFCC GMM-UBM, one vowel formant-trajectory based, one nasal spectra based, and the fusion of the three systems. Each system is tested on the same set of test recordings, including same-speaker and different-speaker pairs. In one condition, the same-speaker pairs are from contemporaneous (within-session) recordings and in the other they are from non-contemporaneous (between-session) recordings. Within-session testing always overestimated the performance of the systems compared to between-session testing.
Human-supervised and fully-automatic formant-trajectory measurement for forensic voice comparison - Female voices.
Zhang, C., Morrison, G.S., Enzinger, E., Ochoa, F. (2012). Laboratory Report. Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, Australia.
Stable URL: http://geoff-morrison.net/#_2012LabRepFormants
- Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants. Such methods typically depend on human-supervised formant measurement, which is often assumed to be relatively reliable and relatively robust to telephone-transmission-channel effects, but which requires substantial investment of human labor. Fully-automatic formant trackers require minimal human labor but are usually not considered reliable. This study assesses the effect of variability within three sets of formant-trajectory measurements made by four human supervisors on the validity and reliability of forensic-voice-comparison systems in a high-quality v high-quality recording condition. Measurements were made of the formant trajectories of /iau/ tokens in a database of recordings of 60 female speakers of Chinese. The study also assesses the validity of forensic-voice-comparison systems including a human-supervised and five fully-automatic formant trackers under landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions, each of these matched with the same condition and mismatched with the high-quality condition. In each case the formant-trajectory systems were fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. The human-supervised systems always outperformed the fully-automatic formant-tracker systems, but in some conditions the improvement was marginal and the cost of human-supervised formant-trajectory measurement probably not warranted.
Response to Draft Australian Standard: DR AS 5388.3 Forensic analysis - Part 3 - Interpretation
Morrison, G.S., Evett, I.W., Willis, S.M., Champod, C., Grigoras, C., Lindh, J., Fenton, N., Hepler, A., Berger, C.E.H., Buckleton, J.S., Thompson, W.C., González-Rodríguez, J., Neumann, C., Curran, J.M., Zhang, C., Aitken, C.G.G., Ramos, D., Lucena-Molina, J.J., Jackson, G., Meuwly, D., Robertson, B., Vignaux, G.A. (2012).
Stable URL: http://geoff-morrison.net/#_2012DraftStandResp
Stable URL: http://forensic-evaluation.net/australian-standards/#Morrison_et_al_2012
Database selection for forensic voice comparison.
Morrison, G.S., Ochoa, F., & Thiruvaran, T. (2012). Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, Singapore, 62–77.
- Defining the relevant population to sample is an important issue in data-based implementation of the likelihood-ratio framework for forensic voice comparison. We present a logical argument that because an investigator or prosecutor only submits suspect and offender recordings for forensic analysis if they sound sufficiently similar to each other, the appropriate defense hypothesis for the forensic scientist to adopt will usually be that the suspect is not the speaker on the offender recording but is a member of a population of speakers who sound sufficiently similar that an investigator or prosecutor would submit recordings of these speakers for forensic analysis. We propose a procedure for selecting background, development, and test databases using a panel of human listeners, and empirically test an automatic procedure inspired by the above. Although the automatic procedure is not entirely consistent with the logical arguments and human-listener procedure, it serves as a proof of concept for the importance of database selection. A forensic-voice-comparison system using the automatic database-selection procedure outperformed systems with random database selection.
Voice source features for forensic voice comparison - an evaluation of the Glottex® software package.
Enzinger, E., Zhang, C., & Morrison, G.S. (2012). Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, Singapore, 78–85.
- Errata & Addenda
- GLOTTEX is a software package which extracts information about voice source properties, including estimates of properties related to physical structures of the vocal folds. It has been proposed that the output of GLOTTEX can be used as part of a forensic-voice-comparison system. We test this using manually labeled segments from a database of voice recordings of 60 female Chinese speakers. Performance was assessed relative to a baseline MFCC GMM-UBM system. GMM-UBM systems based on features extracted by GLOTTEX were combined with the baseline system using logistic-regression fusion. System performance was assessed in three channel conditions: high-quality v high-quality, mobile-to-landline v mobile-to-landline, and mobile-to-landline v high-quality. Substantial improvements over the baseline system were not observed.
What did Bain really say? A preliminary forensic analysis of the disputed utterance based on data, acoustic analysis, statistical models, calculation of likelihood ratios, and testing of validity.
Morrison, G.S., & Hoy, M.C. (2012). Proceedings of the 46th Audio Engineering Society (AES) Conference on Audio Forensics: Recording, Recovery, Analysis, and Interpretation, Denver, CO, 203–207.
- This paper presents a preliminary analysis of the disputed utterance in Bain v R [2009] NZSC 16. A likelihood ratio is calculated as a strength-of-evidence statement with respect to the question: What is the probability of getting the acoustic properties of the disputed utterance if Bain had said “I shot the prick” versus if he had said “I can’t breathe”? In particular, an acoustic and statistical analysis is conducted on the first segment of the second word to estimate the probability of getting the acoustics of this segment if it were a postalveolar fricative versus if it were a palatal fricative. The validity of the system is tested and ways to improve the analysis are discussed.
Protocol for the collection of databases of recordings for forensic-voice-comparison research and practice.
Morrison, G.S., Rose, P., & Zhang, C. (2012). Australian Journal of Forensic Sciences, 44, 155–167.
- A protocol for the collection of databases of audio recordings for forensic-voice-comparison research and practice is described. The protocol fulfills the following requirements: (1) The database contains at least two non-contemporaneous recordings of each speaker. (2) The database contains recordings of each speaker using different speaking styles which are typical of speaking styles found in casework, and which are elicited as natural speech. (3) The database is usable for research and casework involving recording- and transmission-channel mismatch. The protocol includes three speaking tasks, (1) an informal telephone conversation, (2) an information exchange task over the telephone, and (3) a pseudo-police-style interview. Technical issues are also discussed.
The likelihood-ratio framework and forensic evidence in court: A response to R v T.
Morrison, G.S. (2012). International Journal of Evidence and Proof, 16, 1–29.
- Erratum: Table 1, fourth line of numbers should read: “1000–10 000”, not “100–10 000”
- In R v T the Court concluded that the likelihood-ratio framework should not be used for the evaluation of evidence except ‘where there is a firm statistical base’. The present paper argues that the Court’s opinion is based on misunderstandings of statistics and of the likelihood-ratio framework for the evaluation of evidence. The likelihood-ratio framework is a logical framework and not itself dependent on the use of objective measurements, databases, and statistical models. The ruling is analysed from the perspective of the new paradigm for forensic-comparison science: the use of the likelihood-ratio framework for the evaluation of evidence; a strong preference for the use of objective measurements, databases representative of the relevant population, and statistical models; and empirical testing of the validity and reliability of the forensic-comparison system under conditions reflecting those of the case at trial.
Forensic voice comparison using Chinese /iau/.
Zhang, C., Morrison, G.S., & Thiruvaran, T. (2011). Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, 2280–2283.
- An acoustic-phonetic forensic-voice-comparison system extracted information from the formant trajectories of tokens of Standard Chinese /iau/. When this information was added to a generic automatic forensic-voice-comparison system, which did not itself exploit acoustic-phonetic information, there was a substantial improvement in system validity but a decline in system reliability.
Humans versus machine: Forensic voice comparison on a small database of Swedish voice recordings.
Lindh, J., & Morrison, G.S. (2011). Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, 1254–1257.
- A procedure for comparing the performance of humans and machines on speaker recognition and on forensic voice comparison is proposed and demonstrated. The procedure is consistent with the new paradigm for forensic-comparison science (use of the likelihood-ratio framework and testing of the validity and reliability of the results). The use of the procedure is demonstrated using a small database of Swedish voice recordings.
Measuring the validity and reliability of forensic likelihood-ratio systems.
Morrison, G.S. (2011). Science & Justice, 51, 91–98.
- Matlab Code: CI_calcs 2011-03-30.
- Throughout 2015 and 2016 this was ranked as the most cited paper published in Science & Justice within the previous 5 years.
- There has been a great deal of concern recently about validity and reliability in forensic science. This paper reviews for a broad target audience metrics of validity and reliability (accuracy and precision) which have been applied in forensic voice comparison and which are potentially applicable in other branches of forensic science. The metric of validity is the log likelihood-ratio cost (Cllr), and the metric of reliability is an empirical estimate of credible intervals. A revised procedure for the calculation of credible intervals is introduced.
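- Illustrative sketch (not the Matlab code listed above): the log-likelihood-ratio cost named in the abstract is commonly defined as the mean of two penalty terms, one averaged over same-origin comparisons and one over different-origin comparisons. The function name `cllr` and the example values are assumptions made for this example; inputs are likelihood ratios, not log likelihood ratios.

```python
import numpy as np

def cllr(lr_same, lr_diff):
    """Log-likelihood-ratio cost from same-origin and different-origin LRs."""
    lr_same = np.asarray(lr_same, dtype=float)
    lr_diff = np.asarray(lr_diff, dtype=float)
    # Same-origin comparisons are penalised when the LR is small.
    penalty_same = np.mean(np.log2(1.0 + 1.0 / lr_same))
    # Different-origin comparisons are penalised when the LR is large.
    penalty_diff = np.mean(np.log2(1.0 + lr_diff))
    return 0.5 * (penalty_same + penalty_diff)

# Hypothetical likelihood ratios from a test set.
example = cllr(lr_same=[50.0, 8.0, 0.7], lr_diff=[0.02, 0.3, 1.5])
```

A system that always outputs a likelihood ratio of 1 gives a cost of exactly 1, so values below 1 indicate that the system provides useful information, and lower values are better.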
A comparison of procedures for the calculation of forensic likelihood ratios from acoustic-phonetic data: Multivariate kernel density (MVKD) versus Gaussian mixture model - universal background model (GMM-UBM).
Morrison, G.S. (2011). Speech Communication, 53, 242–256.
- Matlab Code: MVKD_v_GMM-UBM 2010-02-19
- Two procedures for the calculation of forensic likelihood ratios were tested on the same set of acoustic-phonetic data. One procedure was a multivariate kernel density procedure (MVKD) which is common in acoustic-phonetic forensic voice comparison, and the other was a Gaussian mixture model - universal background model (GMM-UBM) which is common in automatic forensic voice comparison. The data were coefficient values from discrete cosine transforms fitted to second-formant trajectories of /aI/, /eI/, /ou/, /au/, and /OI/ tokens produced by 27 male speakers of Australian English. Scores were calculated separately for each phoneme and then fused using logistic regression. The performance of the fused GMM-UBM system was much better than that of the fused MVKD system, both in terms of accuracy (as measured using the log-likelihood-ratio cost, Cllr) and precision (as measured using an empirical estimate of the 95% credible interval for the likelihood ratios from the different-speaker comparisons).
An issue in the calculation of logistic-regression calibration and fusion weights for forensic voice comparison.
Morrison, G.S., Thiruvaran, T., & Epps, J. (2010). Proceedings of the 13th Australasian International Conference on Speech Science and Technology, Melbourne, 74–77.
- Logistic regression is a popular procedure for calibration and fusion of likelihood ratios in forensic voice comparison and automatic speaker recognition. The availability of multiple recordings of each speaker in the database used for calculation of calibration/fusion weights allows for different procedures for calculating those weights. Two procedures are compared, one using pooled data and the other using mean values from each speaker-comparison pair. The procedures are tested using an acoustic-phonetic and an automatic forensic-voice-comparison system. The mean procedure has a tendency to result in better accuracy, but the pooled procedure always results in better precision of the likelihood-ratio output.
Forensic voice comparison.
Morrison, G.S. (2010). In I. Freckelton, & H. Selby (Eds.), Expert Evidence (Ch. 99). Sydney, Australia: Thomson Reuters.
- As part of the Expert Evidence series the 100-page Forensic Voice Comparison chapter is aimed first at lawyers, judges, police officers, and potential jury members; however, it is hoped that this chapter will also be of interest to forensic scientists, phoneticians / speech scientists, speech-processing engineers, and students of all these disciplines. It introduces forensic voice comparison in a relatively non-technical way, assuming a reader who has no prior knowledge of the subject. The focus is on the understanding of concepts and the provision of basic knowledge.
- “Morrison has a very nice writing style and I think he has phrased some of the fundamental matters in a way that is more clearly put than I have ever seen. I think he has done a masterly job.”
- Dr John S. Buckleton, Principal Scientist, ESR Forensics, Auckland, New Zealand
- “It is very informative and at the same time easy to read - a rare combination. It’s a great book.”
- Dr Michael Jessen, Senior Scientist, Department of Speaker and Audio Analysis, Federal Criminal Police Office, Wiesbaden, Germany
Estimating the precision of the likelihood-ratio output of a forensic-voice-comparison system.
Morrison, G.S., Thiruvaran, T., & Epps, J. (2010). Proceedings of Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, 63–70.
- Matlab Code: CI_calcs 2011-03-30.
- The issues of validity and reliability are important in forensic science. Within the likelihood-ratio framework for the evaluation of forensic evidence, the log-likelihood-ratio cost (Cllr) has been applied as an appropriate metric for evaluating the accuracy of the output of a forensic-voice-comparison system, but there has been little research on developing a quantitative metric of precision. The present paper describes two procedures for estimating the precision of the output of a forensic-comparison system, a non-parametric estimate and a parametric estimate of its 95% credible interval. The procedures are applied to estimate the precision of a basic automatic forensic-voice-comparison system presented with different amounts of questioned-speaker data. The importance of considering precision is discussed.
An empirical estimate of the precision of likelihood ratios from a forensic-voice-comparison system.
Morrison, G.S., Zhang, C., & Rose, P. (2011). Forensic Science International, 208, 59–65.
- An acoustic-phonetic forensic-voice-comparison system was constructed using the time-averaged formant values of tokens of 61 male Chinese speakers’ /i/, /e/, and /a/ monophthongs as input. Likelihood ratios were calculated using a multivariate kernel density formula. A separate set of likelihood ratios was calculated for each vowel phoneme, and these were then fused and calibrated using linear logistic regression. The system was tested via cross-validation. The validity and reliability of the results were assessed using the log-likelihood-ratio-cost function (Cllr, a measure of accuracy) and an empirical estimate of the credible interval for the likelihood ratios from different-speaker comparisons (a measure of precision). The credible interval was calculated on the basis of two independent pairs of samples for each different-speaker comparison pair.
Comparación forense de la voz y el cambio de paradigma.
Morrison, G.S. (2011). CSIC/UIMP Posgrado Oficial en Estudios Fónicos Cuadernos de Trabajo, 1, 1–38. [Translation by Curiá C. of: Morrison, G.S. (2009). Forensic voice comparison and the paradigm shift. Science & Justice, 49, 298–308.]
- We are in the midst of a paradigm shift in the sciences related to forensic voice comparison. The new paradigm can be characterised as a quantitative implementation of the likelihood-ratio framework together with quantitative evaluation of the validity and reliability of results. During the 1990s this new paradigm was widely adopted for the comparison of DNA profiles, and it has gradually spread to other branches of forensic science, including forensic voice comparison. The present paper first describes the new paradigm and then sets out the history of its adoption in forensic voice comparison over the last decade. The paradigm shift is still an incomplete process, and those working within it still represent a minority within the forensic-voice-comparison community.
Forensic voice comparison and the paradigm shift.
Morrison, G.S. (2009). Science & Justice, 49, 298–308.
- We are in the midst of a paradigm shift in the forensic comparison sciences. The new paradigm can be characterised as quantitative data-based implementation of the likelihood-ratio framework with quantitative evaluation of the reliability of results. The new paradigm was widely adopted for DNA profile comparison in the 1990s, and is gradually spreading to other branches of forensic science, including forensic voice comparison. The present paper first describes the new paradigm, then describes the history of its adoption for forensic voice comparison over approximately the last decade. The paradigm shift is incomplete and those working in the new paradigm still represent a minority within the forensic-voice-comparison community.
Comments on Coulthard & Johnson’s (2007) portrayal of the likelihood-ratio framework.
Morrison, G.S. (2009). Australian Journal of Forensic Sciences, 41, 155–161.
- In their recent introduction to forensic linguistics, Coulthard & Johnson (2007) include a portrayal of the likelihood-ratio framework for the evaluation of forensic comparison evidence (pp. 203–207). This portrayal includes a number of inaccuracies. The present letter attempts to correct these inaccuracies.
A response to the UK position statement on forensic speaker comparison.
Rose, P., & Morrison, G.S. (2009). International Journal of Speech, Language and the Law, 16, 139–163.
Likelihood-ratio-based forensic speaker comparison using parametric representations of vowel formant trajectories.
Morrison, G.S. (2009). Journal of the Acoustical Society of America, 125, 2387–2397.
- Non-contemporaneous speech samples from 27 male speakers of Australian English were compared in a forensic likelihood-ratio framework. Parametric curves (polynomials and discrete cosine transforms) were fitted to the formant trajectories of the diphthongs /aI/, /eI/, /oU/, /aU/, and /OI/. The estimated coefficient values from the parametric curves were used as input to a generative multivariate-kernel-density formula for calculating likelihood ratios expressing the probability of obtaining the observed difference between two speech samples under the hypothesis that the samples were produced by the same speaker versus under the hypothesis that they were produced by different speakers. Cross-validated likelihood-ratio results from systems based on different parametric curves were calibrated and evaluated using the log-likelihood-ratio cost function (Cllr). The cross-validated likelihood ratios from the best-performing system for each vowel phoneme were fused using logistic regression. The resulting fused system had a very low error rate, thus meeting one of the requirements for admissibility in court.
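- Illustrative sketch (not the paper's implementation): fitting a low-order discrete cosine transform to a formant trajectory can be pictured as a least-squares fit of a few cosine basis functions to the sampled formant values. The function names `dct_basis` and `fit_dct`, the DCT-II-style basis, the number of coefficients, and the example trajectory are all assumptions made for this example.

```python
import numpy as np

def dct_basis(n_points, n_coefs):
    """Cosine basis sampled at n_points equally spaced, mid-interval times."""
    t = (np.arange(n_points) + 0.5) / n_points  # normalised time in (0, 1)
    k = np.arange(n_coefs)                      # 0 = constant term
    return np.cos(np.pi * np.outer(t, k))       # shape (n_points, n_coefs)

def fit_dct(trajectory, n_coefs=4):
    """Least-squares DCT coefficients and the fitted (smoothed) trajectory."""
    y = np.asarray(trajectory, dtype=float)
    basis = dct_basis(len(y), n_coefs)
    coefs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coefs, basis @ coefs

# Hypothetical second-formant track in Hz, one value per analysis frame.
f2_track = [1250, 1300, 1390, 1510, 1650, 1790, 1900, 1980, 2030, 2050]
coefs, smoothed = fit_dct(f2_track, n_coefs=3)
```

The few coefficients in `coefs` summarise the whole trajectory, which is what makes them usable as input features for a likelihood-ratio model.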
Automatic-type calibration of traditionally derived likelihood ratios: Forensic analysis of Australian English /o/ formant trajectories.
Morrison, G.S., & Kinoshita, Y. (2008). Proceedings of Interspeech 2008 (pp. 1501–1504). International Speech Communication Association.
- A traditional-style phonetic-acoustic forensic-speaker-recognition analysis was conducted on Australian English /o/ recordings. Different parametric curves were fitted to the formant trajectories of the vowel tokens, and cross-validated likelihood ratios were calculated using a single-stage generative multivariate kernel density formula. The outputs of different systems were compared using Cllr, a metric developed for automatic speaker recognition, and the cross-validated likelihood ratios were calibrated using a procedure developed for automatic speaker recognition. Calibration ameliorated some likelihood-ratio results which had offered strong support for a contrary-to-fact hypothesis.
Forensic speaker recognition of Chinese /i/ and /y/ using likelihood ratios.
Zhang, C., Morrison, G.S., & Rose, P. (2008). Proceedings of Interspeech 2008 (pp. 1937–1940). International Speech Communication Association.
- A likelihood-ratio-based forensic speaker discrimination was conducted using the mean formant frequencies of Standard Chinese /i/ and /y/ tokens produced by 64 male speakers. The speech data were relatively forensically realistic in that they were relatively extemporaneous, were recorded over the telephone, and were from three non-contemporaneous recording sessions. A multivariate-kernel-density formula was used to calculate cross-validated likelihood ratios comparing all possible same-speaker and different-speaker combinations across sessions. Results were comparable with those previously obtained with laboratory speech in other languages. In general, greater strength of evidence was obtained for recording sessions separated by one week than for recording sessions separated by one month.
Forensic voice comparison using likelihood ratios based on polynomial curves fitted to the formant trajectories of Australian English /aI/.
Morrison, G.S. (2008). International Journal of Speech, Language and the Law, 15, 249–266.
- Incorrect versions of Figures 3 and 4 were printed in the paper version. These have been corrected in the online version.
- Earlier studies have indicated that information regarding speaker identity can be extracted from the dynamic spectral properties of diphthongs. Some studies have conducted likelihood-ratio analyses based on simple models of the dynamic formant properties of diphthongs (e.g., dual-target model), and others have used more sophisticated polynomial curve fitting models but have not conducted likelihood-ratio analyses. The present study examines the strength of evidence which can be produced by a likelihood-ratio analysis based on the coefficients of polynomial curves fitted to the formant trajectories of Australian English /aI/ tokens. A cubic polynomial model offers a substantial improvement over the dual-target model.