Text miningText mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al.
Textual criticismTextual criticism is a branch of textual scholarship, philology, and literary criticism that is concerned with the identification of textual variants, or different versions, of either manuscripts (mss) or of printed books. Such texts may range in dates from the earliest writing in cuneiform, impressed on clay, for example, to multiple unpublished versions of a 21st-century author's work. Historically, scribes who were paid to copy documents may have been literate, but many were simply copyists, mimicking the shapes of letters without necessarily understanding what they meant.
Content analysisContent analysis is the study of documents and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner. One of the key advantages of using content analysis to analyse social phenomena is their non-invasive nature, in contrast to simulating social experiences or collecting survey answers. Practices and philosophies of content analysis vary between academic disciplines.
Document processingDocument processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or a document to obtain a , but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor.
Qualitative researchQualitative research is a type of research that aims to gather and analyse non-numerical (descriptive) data in order to gain an understanding of individuals' social reality, including understanding their attitudes, beliefs, and motivation. This type of research typically involves in-depth interviews, focus groups, or observations in order to collect data that is rich in detail and context. Qualitative research is often used to explore complex phenomena or to gain insight into people's experiences and perspectives on a particular topic.
Historical criticismHistorical criticism, also known as the historical-critical method or higher criticism, is a branch of criticism that investigates the origins of ancient texts in order to understand "the world behind the text". While often discussed in terms of Jewish and Christian writings from ancient times, historical criticism has also been applied to other religious and secular writings from various parts of the world and periods of history.
Speech recognitionSpeech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Masoretic TextThe Masoretic Text (MT or M; Nūssāḥ Hammāsōrā, lit. 'Text of the Tradition') is the authoritative Hebrew and Aramaic text of the 24 books of the Hebrew Bible (Tanakh) in Rabbinic Judaism. The Masoretic Text defines the Jewish canon and its precise letter-text, with its vocalization and accentuation known as the mas'sora. Referring to the Masoretic Text, masorah specifically means the diacritic markings of the text of the Hebrew scriptures and the concise marginal notes in manuscripts (and later printings) of the Tanakh which note textual details, usually about the precise spelling of words.
Automatic summarizationAutomatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data. Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.
Optical character recognitionOptical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of s of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast).
Document classificationDocument classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science.
Text corpusIn linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated. Annotated, they have been used in corpus linguistics for statistical hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents which is being searched.
Computational complexityIn computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time (generally measured by the number of needed elementary operations) and memory storage requirements. The complexity of a problem is the complexity of the best algorithms that allow solving the problem. The study of the complexity of explicitly given algorithms is called analysis of algorithms, while the study of the complexity of problems is called computational complexity theory.
Handwriting recognitionHandwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition) or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available.
DiplomaticsDiplomatics (in American English, and in most anglophone countries), or diplomatic (in British English), is a scholarly discipline centred on the critical analysis of documents: especially, historical documents. It focuses on the conventions, protocols and formulae that have been used by document creators, and uses these to increase understanding of the processes of document creation, of information transmission, and of the relationships between the facts which the documents purport to record and reality.
Language modelA language model is a probabilistic model of a natural language that can generate probabilities of a series of words, based on text corpora in one or multiple languages it was trained on. Large language models, as their most advanced form, are a combination of feedforward neural networks and transformers. They have superseded recurrent neural network-based models, which had previously superseded the pure statistical models, such as word n-gram language model.
Heritage tourismCultural heritage tourism (or just heritage tourism) is a branch of tourism oriented towards the cultural heritage of the location where tourism is occurring. The National Trust for Historic Preservation in the United States defines heritage tourism as "traveling to experience the places, artifacts and activities that authentically represent the stories and people of the past", and "heritage tourism can include cultural, historic and natural resources".
Complexity classIn computational complexity theory, a complexity class is a set of computational problems "of related resource-based complexity". The two most commonly analyzed resources are time and memory. In general, a complexity class is defined in terms of a type of computational problem, a model of computation, and a bounded resource like time or memory. In particular, most complexity classes consist of decision problems that are solvable with a Turing machine, and are differentiated by their time or space (memory) requirements.
Post-productionPost-production is part of the process of filmmaking, video production, audio production, and photography. Post-production includes all stages of production occurring after principal photography or recording individual program segments. The first part of the post-production process is the traditional non-linear (analog) film editing at the outset of post-production has mostly been replaced by digital or video editing software that operates as a non-linear editing (NLE) system.
Pattern recognitionPattern recognition is the automated recognition of patterns and regularities in data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent pattern. PR has applications in statistical data analysis, signal processing, , information retrieval, bioinformatics, data compression, computer graphics and machine learning.