juhta@removethisida.liu.se), Laboratory for Intelligent Information Systems (IISLAB)
Categorization of electronic documents is discussed and the results of a preliminary study of techniques that people use for categorizing electronic mail letters are presented. The results, mainly descriptive, are used as a basis for a discussion of the feasibility of an adaptive approach to text categorization and understanding. The adaptability is supposed to come from a design of a system that uses knowledge about documents in general, rather than any specific type of document, to categorize documents. One implication of this approach is a text categorization system that can easily be adapted to different types of documents and that performs the categorization automatically.
Categorization is a problem that cognitive psychologists have dealt with for many years [2][8]. In the process of categorization of electronic documents, categories are typically used as a means of organizing and getting an overview of the information in a collection of several documents. Folders in electronic mail (e-mail) and topics in Usenet News are a couple of concrete examples of categories in computer-mediated communication. Furthermore, the information within a document is normally organized by some kind of structure, either physical (layout) or logical (context), or (at best) both.
Text categorization is in this paper defined as an information retrieval task in which one or more category labels are assigned to a document (cf. [7]).
Automatic categorization of electronic documents is dependent on the structure of the documents: documents can be partitioned into categories based on, e.g., differences in their layouts (cf. [12], p. 115). Keywords or phrases are often very useful for categorizing electronic documents, but only to a certain degree (cf. [4]). The physical appearance of a document is not enough for a good categorization of the document: documents in one domain, for example the domain "tax forms", may have a well-known and predefined structure, while documents in other domains, for example e-mail letters, may have less structure and, as is typical for e-mail, be more "anarchic" by nature.
Computer-mediated letter writing is neither a typically written form of communication, nor entirely like a spoken dialogue. It is a form of communication which potentially allows for a great deal of freedom, and which encourages the use of informal, unplanned language by the characteristics of the medium. -- K. Severinson Eklundh [10]
E-mail letters have properties that resemble spoken language, although the letters may seem simple and compact. In a study by K. Severinson Eklundh [10] of the COM system (a computerized conference system developed by the Swedish National Defense Research Institute, FOA), some of the properties that have been found in letters are often associated with spoken language (cf. [10], p. 143): the medium is interactive (cf. [14]), the letters are highly context-bound (they refer to other letters), there are frequent occurrences of other linguistic strategies usually associated with spoken language (e.g., ellipsis), the letters often contain only one topic at a time (cf. [14]), and very few of the letters contain an explicit greeting or initiation (or termination) part. In other words, people using electronic letters and spoken dialogue use similar linguistic strategies [10], especially the oral strategies or "those (aspects) by which maximal background information and connective tissue are made explicit" (see [15], p. 3).
There are, of course, factors that make electronic letters different from spoken language, e.g., the fact that every letter may be stored (often in folders) for later retrieval. According to Malone et al. [5], the users' categorization of letters in folders by subject indicates that templates would be natural to introduce in e-mail. Although templates would give e-mail letters more inherent structure, they would also give less freedom to the e-mail user. Furthermore, e-mail communication is much like writing a note and the letters tend to be short and informative, according to many users [14]. Therefore, if automatic text categorization could be done on the less well-structured e-mail letters, the e-mail users would be very pleased.
E-mail is clearly an interesting means of communication [14][10] and is therefore used as a basis for the preliminary investigation of text categorization techniques in this paper.
There are two general and basic principles for creating categories: cognitive economy and perceived world structure [8]. The principle of cognitive economy means that the function of categories is to provide maximum information with the least cognitive effort. The principle of perceived world structure means that the perceived world is not an unstructured set of arbitrary or unpredictable attributes.
The attributes that an individual will perceive, and thus use for categorization, are determined by the needs of the individual. These needs change over time and with the physical and social environment. In other words, a system for automatic text categorization should in some way "know" both the type of text and the type of user.
The maximum information with least cognitive effort is achieved if categories map the perceived world structure as closely as possible [8].
Coding by category is fundamental to mental life because it greatly reduces the demands on perceptual processes, storage space, and reasoning processes, all of which are known to be limited. -- E. E. Smith [11]
Psychologists agree that similarity plays a central role in placing different items into a single category. Furthermore, people want to maximize within-category similarity while minimizing between-category similarity [11].
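The principle of maximizing within-category similarity while minimizing between-category similarity can be illustrated with a minimal sketch. The word-overlap (Jaccard) measure, the function names, and the toy letters below are assumptions made for illustration only; they are not the measure or the material used in the study.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; an assumed stand-in for human judgment."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def mean(values):
    """Arithmetic mean; 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def within_similarity(categories):
    """Average similarity over all document pairs that share a category."""
    return mean(
        jaccard(a, b)
        for docs in categories.values()
        for a, b in combinations(docs, 2)
    )


def between_similarity(categories):
    """Average similarity over all document pairs from two different categories."""
    return mean(
        jaccard(a, b)
        for docs_x, docs_y in combinations(categories.values(), 2)
        for a, b in product(docs_x, docs_y)
    )


# Hypothetical toy letters, not taken from the study.
categories = {
    "seminars": ["seminar on logic today", "seminar on databases today"],
    "lunch": ["lunch menu for friday", "lunch menu for monday"],
}
within = within_similarity(categories)
between = between_similarity(categories)
print(within > between)
```

Under this assumed measure, a grouping in which the within-category average clearly exceeds the between-category average corresponds to the kind of categories the principle describes.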
Searching and retrieving information can be treated as a process in which query information is mapped onto the categories (the structure of the filing system) in order to retrieve a specific document [2]. Relatively unstructured documents, as in the case of e-mail letters, and unstructured queries might require some transformation of the query information.
A preliminary study, consisting of two experiments with two different sets of stimuli (documents), has been conducted. The goal of the study was to better understand how people categorize information, especially e-mail letters, and also how people use the categories for searching information. This was all done with the hypothesis that better, and more effective, automatic text categorization systems may be designed by examining the empirical data from studying people and their categorization techniques.
A central question is: Given a set of e-mail letters with different degrees and types of similarity, how does a person group them into a finite number of categories? What features (syntactic and semantic) are used for determining the membership in a category?
The experiments are based on the techniques utilized in the experiments conducted by Cañas, Safayeni, and Conrath [2].
Two different sets of objects (documents) were used in the study: Swedish proverbs and e-mail letters. Proverbs were selected as one set of documents, to be used for comparison with the experiments conducted by Cañas, Safayeni, and Conrath [2]. The significant characteristic of proverbs is that they do not have any commonly known classification scheme, as opposed to e-mail letters.
To limit the time for the experiments in the study, the number of documents in each set was set to 46; this number was determined after a pilot experiment. The proverbs were taken randomly from a book of proverbs [13], while the e-mail letters were taken from the internal distribution list (idaint) used at the department. The set of e-mail letters used in the study consisted of approximately nine days' worth of (consecutive) letters.
Five subjects took part in the study. The length of the study (the time for the experiment with proverbs varied from 60 to 120 minutes each, and for the experiment with e-mail letters from 30 to 60 minutes each) and its complexity prevented the use of a larger sample. The subjects were drawn from among graduate students working at the Department of Computer and Information Science, Linköping University, all subscribers to the distribution list and with Swedish as their mother tongue.
The documents (proverbs and the e-mail letters, respectively) were divided into two sets of 18 and 28 for categorization in rounds 1 and 2 of each experiment in the study. The documents were given to the subjects each on sequentially numbered cards. Extra copies of each of the documents were available, together with blank cards for labeling the categories.
The subjects were asked to provide a long and a short description for each category that they created or modified.
After each round the subjects were asked to retrieve appropriate documents for a given set of queries. The queries were generated by the experimenter in different ways, based on the documents, and with a controlled set of responses (cf. [2]): based on a rewording of a single document (which generated a single response); based on a situation where a document was an appropriate response (which generated a single response); based on classifications from books (for proverbs) and from the pilot study (for e-mail letters), which generated multiple responses; based on a document not among the 46 (which generated no response); and based on information in a document other than its meaning, e.g., a specific word in it (which generated one or multiple responses).
After the retrieval part the subjects were asked to compare documents within some categories (selected by the experimenter) and also documents between categories. The comparison of two documents was made based on the "typicality" of the documents. The most and least representative documents of a category were compared (for a within-category measurement), and also the most representative documents of two categories (for a between-category measurement; see section 2.1 on p. 2). The typicality (or semantic distance, cf. [2]) was measured on a scale from 0 to 10 for each document. This type of measurement has been shown to be very successful [2].
The data that was collected during the study includes the time to complete the categorization, the category descriptions, contents, and the most and least representative documents of each category (together with the semantic distances). When responding to the queries, the responses (the numbers of the retrieved documents) were recorded and also the categories that were searched and how the subjects reasoned. Furthermore, the subjects were asked to record the contents of categories before they were divided or joined during categorization.
Finally, some statistics from the subjects' e-mail tools were collected. This was done with the intention of getting a connection to the author's forthcoming report [14].
The analysis of the data is still in its early stages. The results of this preliminary study will be used for further investigation of any discernible patterns of categorization that might surface in the analysis of the data collected.
One subject in the proverbs experiment and two subjects in the e-mail letters experiment created a hierarchical structure, but none deeper than two levels. The average number of categories was 9.00 for proverbs and 7.40 for e-mail letters (with standard deviation, SD, of 1.58 and 2.07), with ranges from 7 to 11 categories and from 5 to 10 categories, respectively. The average number of documents per category was 5.60 for proverbs (SD of 3.97) and 6.41 for e-mail letters (SD of 5.52). Four subjects used extra copies in the categorization of proverbs, while only one subject used extra copies in the categorization of e-mail letters; no one copied a document more than once, i.e., put one document in more than two categories.
The average typicality (semantic distance) within categories for the documents and the distances between categories have not yet been analysed. In a very preliminary analysis, though, there seems to be a shorter semantic distance (on the scale from 0 to 10) within categories than between categories.
The retrieval of documents was measured as the ratio of the number of correctly retrieved documents to the number of pre-defined correct responses (cf. [2]). For proverbs, the average retrieval performance was better for rewording-type queries than for the other types of queries (see table 1).
For e-mail letters the differences of retrieval performance between the queries were less clear (table 2).
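The retrieval measure described above, the ratio of correctly retrieved documents to pre-defined correct responses, can be sketched as follows. The function name and the card numbers in the example are hypothetical, and returning None for queries with no pre-defined response is an assumption; the text does not state how such queries were scored.

```python
def retrieval_performance(retrieved, correct):
    """Ratio of correctly retrieved documents to pre-defined correct responses.

    Both arguments are collections of document (card) numbers.
    """
    correct = set(correct)
    if not correct:
        # Assumption: the ratio is left undefined for queries that have
        # no pre-defined correct response.
        return None
    return len(set(retrieved) & correct) / len(correct)


# Hypothetical card numbers, not data from the study.
score = retrieval_performance(retrieved=[3, 17, 21], correct=[3, 21, 40, 44])
print(score)  # 0.5: two of the four pre-defined responses were retrieved
```

Note that documents retrieved in addition to the correct ones do not lower this ratio; the measure only captures how many of the expected responses were found.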
The statistics collected from the subjects' e-mail tools are summarized in table 3.

Table 1: Retrieval performance for proverbs

Table 2: Retrieval performance for e-mail letters

Table 3: E-mail statistics collected
The mind is a pattern making system. The mind creates patterns out of the environment and then recognizes and uses such patterns. This is the basis of its effectiveness. -- Edward de Bono, "Lateral Thinking" (1970)
A preliminary study of people's categorization of different types of e-mail letters, compared to proverbs, has been performed as a step towards learning more about how to automate text categorization in e-mail and what to look for in, e.g., the (both physical and logical) structure of e-mail letters. The analysis of the data from the study is still in its early stages. One apparent distinction between the two experiments is that the subjects relied more on their memories when retrieving e-mail letters than when retrieving proverbs; in other words, the subjects did not use the structure as much in the e-mail part as in the proverbs part of the study.
One of the goals in the author's current research is to find a means for automatically measuring something similar to the typicality (see section 3 on p. 2) of an electronic document (e-mail letter), so that the typicality can be used for measuring dissimilarity between a pair of documents or between a document and a summary of category members. The typicality of a letter has to be defined---a combination of information retrieval and simple natural language processing techniques may be feasible to use to extract important information from letters [1][6].
A future scenario with automatic categorization supplied in an e-mail tool might look like this: the user initializes the tool with some basic categories (folders) and some typical messages in each category. The typicality of each arriving e-mail letter is then measured and compared to each of the categories, and the letter is put in the category that contains the least dissimilar letters. When the letters in a category reach too wide a range of typicality, i.e., when there is too large a semantic distance between the least and the most typical letter in a category, the category is split into two.
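The scenario above can be made concrete with a minimal sketch. Since the typicality of a letter is still to be defined, the sketch substitutes an assumed measure: the Jaccard distance between the word sets of two letters. The class and function names, the splitting rule (the most and least typical letters act as seeds), the threshold value, and the toy letters are all hypothetical.

```python
from dataclasses import dataclass, field


def dissimilarity(a, b):
    """Jaccard distance between word sets: 0 = same words, 1 = disjoint (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1.0 - len(wa & wb) / len(union) if union else 0.0


def typicality_scores(letters):
    """Mean dissimilarity of each letter to the other members (lower = more typical)."""
    n = len(letters)
    return [
        sum(dissimilarity(a, b) for j, b in enumerate(letters) if j != i) / (n - 1)
        for i, a in enumerate(letters)
    ]


@dataclass
class Category:
    name: str
    letters: list = field(default_factory=list)

    def distance_to(self, letter):
        """Mean dissimilarity between a letter and the members of the category."""
        if not self.letters:
            return 1.0  # an empty category is treated as maximally distant
        return sum(dissimilarity(letter, m) for m in self.letters) / len(self.letters)

    def spread(self):
        """Gap between the least and the most typical member."""
        if len(self.letters) < 2:
            return 0.0
        scores = typicality_scores(self.letters)
        return max(scores) - min(scores)


def split(category):
    """Split a category in two, seeded by its most and least typical letters."""
    scores = typicality_scores(category.letters)
    most = category.letters[scores.index(min(scores))]
    least = category.letters[scores.index(max(scores))]
    near, far = [], []
    for letter in category.letters:
        closer_to_most = dissimilarity(letter, most) <= dissimilarity(letter, least)
        (near if closer_to_most else far).append(letter)
    return [Category(category.name + "/a", near), Category(category.name + "/b", far)]


def file_letter(categories, letter, max_spread=0.5):
    """File a letter in the least dissimilar category; split that category if too wide."""
    target = min(categories, key=lambda c: c.distance_to(letter))
    target.letters.append(letter)
    if target.spread() > max_spread:
        categories.remove(target)
        categories.extend(split(target))
    return target.name


# Hypothetical folders and letters, not taken from the study.
folders = [
    Category("seminars", ["seminar on logic today", "seminar on databases today"]),
    Category("lunch", ["lunch menu for friday", "lunch menu for monday"]),
]
chosen = file_letter(folders, "seminar on graphics today")
print(chosen)
```

A word-overlap distance is only a placeholder here; any measure that approximates the typicality judgments collected in the study could be substituted without changing the filing and splitting logic.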
Text categorization can be seen as the basis for organization and retrieval of information. Automatic categorization may be used to organize relatively unstructured documents and make it easier to use other methods of creating and presenting the documents, e.g., associative methods such as hypertext, which are more natural to a user [3][9].
Adding something like measurement of typicality to an individual letter may give e-mail a new "sense". This might make the (manual or automatic) categorization of letters easier, hopefully more natural, and also more like categorization of letters in traditional postal mail.
The author is grateful to Nahid Shahmehri for fruitful discussions and the subjects for their time, and he would also like to thank the reviewers for their useful comments.
[13] F. Ström, Svenska ordspråk. Albert Bonniers förlag, Stockholm, 1929, 74 p.
Note! Most of the author's publications are available via
http://www.ida.liu.se/~juhta/publications.html