Computational linguistics can be used to uncover mysteries in text that are not always obvious on visual inspection. For example, computer analysis of writing style can show who might be the true author of a text in cases of disputed authorship or suspected plagiarism. The theoretical background to authorship attribution is presented in a step-by-step manner, and comprehensive reviews of the field are given in two specialist areas: the writings of William Shakespeare and his contemporaries, and the various writing styles seen in religious texts. The final chapter looks at the progress computers have made in the decipherment of lost languages. This book is written for students and researchers of general linguistics, computational and corpus linguistics, and computer forensics. It will inspire future researchers to study these topics for themselves, and gives sufficient detail on the methods and resources to get them started.
Preface ix
Chapter 1 Author identification 1 (58)
1 Introduction 1 (4)
2 Feature selection 5 (6)
2.1 Evaluation of feature sets for 8 (3)
authorship attribution
3 Inter-textual distances 11 (19)
3.1 Manhattan distance and Euclidean 12 (2)
distance
3.2 Labbe and Labbe's measure 14 (1)
3.3 Chi-squared distance 15 (1)
3.4 The cosine similarity measure 16 (2)
3.5 Kullback-Leibler Divergence (KLD) 18 (1)
3.6 Burrows' Delta 18 (5)
3.7 Evaluation of feature-based measures 23 (3)
for inter-textual distance
3.8 Inter-textual distance by semantic 26 (2)
similarity
3.9 Stemmatology as a measure of 28 (2)
inter-textual distance
4 Clustering techniques 30 (17)
4.1 Introduction to factor analysis 31 (4)
4.2 Matrix algebra 35 (3)
4.3 Use of matrix algebra for PCA 38 (6)
4.4 PCA case studies 44 (1)
4.5 Correspondence analysis 45 (2)
5 Comparisons of classifiers 47 (3)
6 Other tasks related to authorship 50 (8)
6.1 Stylochronometry 50 (3)
6.2 Affect dictionaries and psychological 53 (5)
profiling
6.3 Evaluation of author profiling 58 (1)
7 Conclusion 58 (1)
Chapter 2 Plagiarism and spam filtering 59 (40)
1 Introduction 59 (3)
2 Plagiarism detection software 62 (24)
2.1 Collusion and plagiarism, external and 63 (1)
intrinsic
2.2 Preprocessing of corpora and feature 63 (1)
extraction
2.3 Sequence comparison and exact match 64 (1)
2.4 Source-suspicious document similarity 65 (1)
measures
2.5 Fingerprinting 66 (1)
2.6 Language models 67 (1)
2.7 Natural language processing 68 (2)
2.8 Intrinsic plagiarism detection 70 (3)
2.9 Plagiarism of program code 73 (1)
2.10 Distance between translated and 74 (2)
original text
2.11 Direction of plagiarism 76 (2)
2.12 The search engine-based approach used 78 (3)
at PAN-13
2.13 Case study 1: Hidden influences from 81 (2)
printed sources in the Gaelic tales of
Duncan and Neil MacDonald
2.14 Case study 2: General George Pickett 83 (1)
and related writings
2.15 Evaluation methods 84 (1)
2.16 Conclusion 85 (1)
3 Spam filters 86 (12)
3.1 Content-based techniques 87 (1)
3.2 Building a labeled corpus for training 87 (1)
3.3 Exact matching techniques 88 (1)
3.4 Rule-based methods 89 (1)
3.5 Machine learning 90 (2)
3.6 Unsupervised machine learning approaches 92 (1)
3.7 Other spam-filtering problems 93 (1)
3.8 Evaluation of spam filters 94 (1)
3.9 Non-linguistic techniques 94 (3)
3.10 Conclusion 97 (1)
4 Recommendations for further reading 98 (1)
Chapter 3 Computer studies of Shakespearean 99 (50)
authorship
1 Introduction 99 (2)
2 Shakespeare, Wilkins and "Pericles" 101(7)
2.1 Correspondence analysis for "Pericles" 105(3)
and related texts
3 Shakespeare, Fletcher and "The Two Noble 108(2)
Kinsmen"
4 "King John" 110(1)
5 "The Raigne of King Edward III" 111(7)
5.1 Neural networks in stylometry 111(2)
5.2 Cusum charts in stylometry 113(3)
5.3 Burrows' Zeta and Iota 116(2)
6 Hand D in "Sir Thomas More" 118(14)
6.1 Elliott, Valenza and the Earl of Oxford 118(3)
6.2 Elliott and Valenza: Hand D 121(1)
6.3 Bayesian approach to questions of 122(5)
Shakespearian authorship
6.4 Bayesian analysis of Shakespeare's 127(4)
second person pronouns
6.5 Vocabulary differences, LDA and the 130
authorship of Hand D
6.6 Hand D: Conclusions 131(1)
7 The three parts of "Henry VI" 132(1)
8 "Timon of Athens" 132(1)
9 "The Puritan" and "A Yorkshire Tragedy" 133(1)
10 "Arden of Faversham" 134(2)
11 Estimation of the extent of Shakespeare's 136(5)
vocabulary and the authorship of the "Taylor"
poem
12 The chronology of Shakespeare 141(6)
13 Conclusion 147(2)
Chapter 4 Stylometric analysis of religious 149(58)
texts
1 Introduction 149(41)
1.1 Overview of the New Testament by 151(2)
correspondence analysis
1.2 Q 153(16)
1.3 Luke and Acts 169(2)
1.4 Recent approaches to New Testament 171(4)
stylometry
1.5 The Pauline Epistles 175(13)
1.6 Hebrews 188(1)
1.7 The Signs Gospel 188(2)
2 Stylometric analysis of the Book of Mormon 190(8)
3 Stylometric studies of the Qur'an 198(8)
4 Conclusion 206(1)
Chapter 5 Computers and decipherment 207(52)
1 Introduction 207(17)
1.1 Differences between cryptography and 208(1)
decipherment
1.2 Cryptological techniques for automatic 209(3)
language recognition
1.3 Dictionary approaches to language 212(1)
recognition
1.4 Sinkov's test 212(1)
1.5 Index of coincidence 213(1)
1.6 The log-likelihood ratio 214(1)
1.7 The chi-squared test statistic 215(1)
1.8 Entropy of language 215(3)
1.9 Zipf's Law and Heaps' Law coefficients 218(1)
1.10 Modal token length 219(1)
1.11 Autocorrelation analysis 220(1)
1.12 Vowel identification 221(3)
2 Rongorongo 224(19)
2.1 History of Rongorongo 224(2)
2.2 Characteristics of Rongorongo 226(1)
2.3 Obstacles to decipherment 227(1)
2.4 Encoding of Rongorongo symbols 227(1)
2.5 The "Mamari" lunar calendar 228(1)
2.6 Basic statistics of the Rongorongo 228(1)
corpus
2.7 Alignment of the Rongorongo corpus 229(2)
2.8 A concordance for Rongorongo 231(2)
2.9 Collocations and collostructions 233(1)
2.10 Classification by genre 234(3)
2.11 Vocabulary richness 237(4)
2.12 Pozdniakov's approach to matching 241(2)
frequency curves
3 The Indus Valley texts 243(9)
3.1 Why decipherment of the Indus texts is 243(1)
difficult
3.2 Are the Indus texts writing? 244(4)
3.3 Other evidence for the Indus Script 248(1)
being writing
3.4 Determining the order of the Markov 248(1)
model
3.5 Missing symbols 249(1)
3.6 Text segmentation and the 249(2)
log-likelihood measure
3.7 Network analysis of the Indus Signs 251(1)
4 Linear A 252(3)
5 The Phaistos disk 255(1)
6 Iron Age Pictish symbols 256(1)
7 Mayan glyphs 256(1)
8 Conclusion 257(2)
References 259(22)
Index 281