“Computation in Linguistics: A Case Book”
APPLIED PROBLEMS—CONTENT PROCESSING
A Modifiable Routine for Connecting Related Sentences of English Text
1.0 INTRODUCTION
This is a report on a computational approach to the relations between the sentences of a text. These relations organize the text into paragraph-like groups of sentences, each group dealing with a separate topic. To make the discussion concrete, a computer application which produces text abstracts is described. This application makes use of the relations between sentences to extract sentences from the sentence groups in order to form an abstract. The emphasis will be on the way in which these relations organize the sentences of the text rather than on the programming details required to set up the application on an actual computer.
The easiest way to follow a computational approach is to perform the computations. First we will discuss how and why the computations work on the text. Then, an example will be given of the application of the computational rules to a short text. The computation makes use of a dictionary and rules for when and how to use the dictionary. The text determines when to use the dictionary.
The computational task is in some ways similar to reading a text in a foreign language with the help of a dictionary. In both cases, the text determines when the dictionary is to be used. When several alternatives are given by the dictionary, the text determines which one is the appropriate choice. The result of using our dictionary, however, is not an understanding of the sentences of the text, but an explicit diagram of the significant relations between sentences of the text. This diagram is used to produce abstracts.
2.0 PRACTICAL AND THEORETICAL MOTIVATION
For purposes of this discussion, an abstract is a selection of sentences from a text which is sufficient to allow a reader to decide if he should read the entire text. When this selection is done by a computer whose processing is directed and organized entirely by the words and sentences of the text and a computer program, the result will be called an auto-abstract.
The most obvious way to prepare an abstract would be to ask a person capable of understanding the text to select a set of sentences which would represent it. The author of the text is such a person. The practical goal of auto-abstracting1 is to make it unnecessary to use a person who understands the text. Instead, auto-abstracting procedures make use of the organization imposed on the text by its author. This organization is based on the clues given by the author to help the reader follow the text. The various auto-abstracting procedures which have been developed make use of different aspects of this organization. But all assume that there is sufficient organization of the text, either by vocabulary statistics or sentence arrangements, to recognize the important sentences which represent the text.
The manual equivalent of auto-abstracting would be to go through a text on an unfamiliar topic and underline those sentences which seem important. Even when the subject matter is unfamiliar, it is still possible to recognize the sentences that hold the text together.
The procedure to be described here focuses on those clues which indicate that groups of sentences should be considered together and independently of other sentences. The result is very much as if the text had been reparagraphed, but the paragraphing is much more detailed than usual. Each group of sentences includes either a topic or summary sentence which is identified by the way the sentences are grouped together. By extracting these special sentences, an auto-abstract is produced.
The relations between sentences which make possible this sort of organization of text are discussed in the body of the paper. The point of practical interest is that they can be recognized without understanding the text as a whole or any individual sentence. This means that an abstractor need not understand the text or be familiar with the subject matter in order to select out the important sentences for representing the text. If clues to these relations can be precisely stated, a computer program can be written to select them.
However, this claim must be qualified. No auto-abstracting procedure can guarantee that the sentences it selects will be interesting to a specialist in the subject matter. ‘Important’ is defined entirely by the particular text and its organization. Selection of interesting sentences requires knowledge of the subject matter, in other words, information relating the sentences from more than a single text. Insofar as the author treated interesting sentences as important, they will be in the auto-abstract. If the requirement were that the auto-abstract include interesting sentences instead of important sentences, there might be no sentences extracted, since it is quite possible that there are no interesting sentences in the text.
The theoretically pertinent point for linguists is that the sentences of the text need not be understood in order to recognize the important sentences. Just as the task of syntactic analysis does not require knowledge of the meaning of the individual words in a sentence, the analysis of a text does not require knowledge of the meaning of the individual sentences which are related to each other. Since the most obvious way to recognize the important sentences is to understand them, this suggests that at least part of the meaning of a sentence depends on the overall organization of the text rather than the other way around.
This dependence is made computationally explicit by the decision to select certain sentences for the auto-abstract. The computation can be varied to determine the consequences of adding, deleting, or otherwise reordering and reinterpreting the clues to the organization of the text. This makes auto-abstracting procedures useful experimental devices for testing specific ideas of how text is organized.
3.0 HOW TEXT IS ORGANIZED
The auto-abstracting procedure to be discussed links the sentences of the text on the basis of clues which occur in the sentences. Figure 11.1 is an example of the diagram which results. Because of the way the links are set up, it is possible to construct a variety of paths through the sentences of the text by traversing the links in different ways. Just as a reader may decide to skip individual sentences or whole sections of a text, these alternative paths go through less than the complete set of sentences of the text.
Links are assigned so that there is always a principal path through the text. Every sentence of the text is either on the principal path or is linked to a sentence on the principal path, directly or indirectly. These links to the principal path connect side paths which link to the skipped sentences. The principal path can be enlarged to include these sentences by restoring the side paths.
This way of organizing sentences is familiar to anyone who has written from a sentence outline. The procedure is to write out a list of sentences covering the main topics to be discussed. The list is then interrupted to insert additional sentences. These additional sentences expand on particular sentences in the outline and may themselves be expanded on. By repeating the process of expansion, the entire text is eventually produced. However, the original basic list is never added to. 2
Consider, for example, the following section from a sentence outline for a description of a diplomatic crisis:
In this outline, sentences of equal importance are placed directly under each other. Less important sentences are indented and placed under the sentence to which they are related. The same relations between sentences as are represented in the sentence outline can be recognized in text using the procedures described in sections 4 through 8. Although the diagram resulting from these procedures takes a form different from that of a sentence outline, the same structures occur.
The principal path is the set of sentences to which all the other sentences are related. Accordingly, sentences I, II, III are on the principal path, sentences 1, 2, 3 are a related sidepath, and 2a, 2b, 2c is a sidepath related to the first sidepath. The addition or deletion of sidepaths in going through the text represents decisions as to how deep to go in the outline for sentences.
It is not possible to freely insert sentences in a sentence outline when preparing text. If sentences were freely expanded, the organization of the outline as an outline would be destroyed. The same restrictions on using a sentence outline as a basis for preparing text also apply to the diagram of Figure 11.1. They are stated in detail in section 7.
Basically, there are two different ways to use a sentence outline for the preparation of text. The first treats the outline as an incomplete sketch of the text. It takes an incomplete sentence outline and expands it to increase the number of sentences included in the actual text. This is the equivalent of the procedure for restoring sentences to an auto-abstract as described in section 9.
In terms of this procedure, a sentence can be ‘expanded’ by placing an additional sentence directly under it so as to align the pair of sentences. This was done when sentence 2b was placed under sentence 2a in the outline. Alternatively, a sentence can be expanded by indenting another sentence under it as was done with sentence 2a under 2. Any sentence on the outline can be expanded by indenting any number of times to add additional sentences to the outline. In addition, any sentence, except those on the basic list, can be expanded by aligning any number of times to add additional sentences to the outline. Both of these methods of expansion are represented by ‘vertical paths’ in the diagram.
The second way to use a sentence outline takes it as the source of the sentences available for inclusion in the actual text. Applying this second technique, the sentence outline may include more sentences than occur in the actual text. It is the equivalent of the procedure for forming an abstract as described in section 8. As each sentence in the outline is encountered, a decision is made whether or not to include it in the text and where to look in the outline for the next sentence. The sentence chosen can prevent the inclusion of other sentences from the outline.
The freedom to expand on sentences in the outline does not apply to the use of expansions in the text: (1) Whenever a sentence is not included in the text from the outline, no expansion on it may be included in the text. Thus if sentence 1 is not included in the text, sentence 1a cannot be included either. (2) As soon as a sentence is included which is an expansion by alignment of a preceding sentence and is not immediately adjacent to that sentence, no additional expansions can be made on the preceding sentences which have not already been included in the text. Thus the list of sentences which expand sentence 2 by indenting to include sentences 2a-2b-2c cannot be further expanded as soon as sentence 3 is included in the text. The sentence sequence 2-2a-3-2b cannot occur, but the sequence 2-2a-2b-3 may. As a result of the list 2-2a-2b-2c in the text being closed to the addition of more sentences, no sentence following 3 can be a member of the same list of sentences which includes 2c unless the list also includes II. This restriction is the basis for constructing ‘bow paths’ in the diagram. The concept of bow path is explained in section 7.
The sentence outline requires that every sentence be either on the principal path or related to it. In text, however, this is not a necessary requirement. The outline corresponding to the text may include another outline which is not connected to it and which produces an independent text. The following sentence could serve as such an additional sentence outline for our example:
The private life of the leader of Nation Xau is notorious.
There is no obvious sentence to which it could be related in our outline. If it were used in the actual text, it would probably function as an excursus which adds nothing about the diplomatic crisis and may entertain the reader.
The bow paths in the sentence diagram function as such an extra outline. Thus sentences B-4-8-11-12 in the diagram may be treated as a principal path for an independent text which has been inserted in the text. In order to treat this independent outline as part of the overall outline, the ‘dummy sentence’ was inserted. The dummy sentence is a sentence number that acts as a structural element in the diagram but does not correspond to any actual sentence in the text. It functions exactly like the footnote convention in scholarly prose. Although it does not occur as a sentence in the actual text, it preserves the requirement of the outline organization that every sentence is either on the principal path or related to a sentence on the principal path. Sentences on a bow path have only an indirect relation to the other sentences in the text. This is through a ‘dummy sentence’ which has been inserted to allow that connection.
4.0 THE COMPONENTS OF THE COMPUTATION
The various components of the sentence-linking computation are:
1)A dictionary of clues (cf. sections 5, 6 and dictionary appendix)
The clues in a sentence instruct the computational device to
a) make a prediction, and
b) set up a test for the prediction.
If a prediction is satisfied, a link is constructed between the two sentences.
2)A routing procedure (cf. sections 6 and 7, Figures 11.2, 11.3, 11.4)
This instructs the computational device
a) to continue or terminate testing the current prediction;
b) to assign links to pairs of sentences and modify previously constructed links where necessary; and
c) where to look for the next clue which the procedure designates.
3)Text to be processed (cf. Figure 11.5)
As the sentences of the text are processed, various kinds of information are required. Syntactical information is required so that clues can be properly used in making predictions. The information associated with the clues in the dictionary is also utilized in making predictions. Finally, as links are constructed between the sentences, they are recorded as intermediate results which are utilized by the processing along with subsequent clues to determine what action to take next.
The text functions as a kind of scratchpad in which are recorded:
a) the results of a syntax analysis which brackets noun phrases and indicates the main verbs of sentences for use in making predictions;
b) the values of the clues as supplied by the dictionary; and
c) the links constructed as the result of successful prediction.
The sample text given in Figure 11.5 has had the results of a syntax analysis and the values of the clues in the dictionary inserted for the convenience of the reader.
4) A procedure for producing a principal path using the text diagram clues (cf. sections 7 and 8, Figure 11.1.)
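The scratchpad role of the text (component 3) can be pictured as one record per sentence. The following is a minimal sketch under our own assumptions: the names `SentenceRecord` and `record_link`, the field layout, and the placeholder clue value are hypothetical and are not taken from the program described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SentenceRecord:
    """Scratchpad entry for one sentence (hypothetical layout)."""
    number: int
    text: str
    noun_phrases: List[str] = field(default_factory=list)      # a) bracketed noun phrases
    main_verb: Optional[str] = None                            # a) main verb of the sentence
    clue_values: Dict[str, str] = field(default_factory=dict)  # b) clue -> dictionary value
    links: List[int] = field(default_factory=list)             # c) numbers of linked sentences


def record_link(scratchpad: Dict[int, SentenceRecord], a: int, b: int) -> None:
    """Record a satisfied prediction as a link between sentences a and b."""
    if b not in scratchpad[a].links:
        scratchpad[a].links.append(b)
    if a not in scratchpad[b].links:
        scratchpad[b].links.append(a)


pad = {
    1: SentenceRecord(1, "Gestalt theory is a psychological theory.",
                      noun_phrases=["Gestalt theory", "a psychological theory"],
                      main_verb="is"),
    2: SentenceRecord(2, "However, this is merely a psychological theory.",
                      noun_phrases=["this", "a psychological theory"],
                      main_verb="is",
                      clue_values={"however": "<value from dictionary>"}),
}
record_link(pad, 2, 1)  # a prediction made in sentence 2 is satisfied in sentence 1
```

Keeping the links on the same record as the syntax results and clue values reflects the point that later predictions consult earlier intermediate results.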
5.0 CLUES
Consider the following set of sentences:
1)Nitrogen is an element.
2)Nitrogen is an element of the inert gases.
3)Nitrogen is not an element of the inert gases.
As additional words are inserted into the first sentence of the set, it becomes increasingly unable to stand alone without additional sentences to provide it with context. The first sentence is an assertion but the last sentence, because of the qualifications placed on the assertion, requires more discussion to make its occurrence meaningful. Though still a grammatical sentence of the same kind (in some significant way), it has changed from an independent sentence to a text-dependent sentence. Because it is now a text-dependent sentence, 3 the reader expects additional sentences as explication.
The following set of sentences starts with a text-dependent sentence:
1)However, this is merely a psychological theory.
2)This is merely a psychological theory.
3)This is a psychological theory.
4)Gestalt theory is a psychological theory.
As words are deleted or replaced, the sentence becomes increasingly independent. The first sentence in the set is a dependent sentence which definitely requires additional text. The last sentence is a relatively independent assertion of fact.
The words which were deleted, replaced, or added are clues. These clues signal the independent or dependent status of sentences. In addition, they make it possible to recognize other sentences in the text to which a dependent sentence is especially closely related. These two functions make it possible to organize the text into larger units than individual sentences.
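The way clues signal dependence can be illustrated by counting them in the second set of sentences. This is only a sketch: the clue list `EXAMPLE_CLUES` is a hypothetical one assembled from the words mentioned in this section, not the dictionary of the appendix, and a simple count is our own simplification of what the clues do.

```python
import re
from typing import List, Set

# Hypothetical clue list, limited to words used as examples in this section.
EXAMPLE_CLUES: Set[str] = {"however", "merely", "this", "that", "not"}


def clues_in(sentence: str, clues: Set[str] = EXAMPLE_CLUES) -> List[str]:
    """Return the clue words occurring in a sentence, in order of occurrence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w in clues]


sentences = [
    "However, this is merely a psychological theory.",
    "This is merely a psychological theory.",
    "This is a psychological theory.",
    "Gestalt theory is a psychological theory.",
]
counts = [len(clues_in(s)) for s in sentences]
print(counts)  # [3, 2, 1, 0]
```

The falling count mirrors the move from a clearly dependent sentence to a relatively independent one.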
In general, a potential clue is either
1)A word or construction which can be replaced by any one of a small list of items of which it is a member, such that this replacement does not modify the sentence structure in a significant way. An example would be the word ‘this’, which can be replaced by the word ‘that’ in the first sentence of the preceding example.
2)A word or construction which can be deleted without destroying the grammatical function of at least one of the constructions in which it occurs. Examples would be the words ‘however’ and ‘merely’ in the second set of sentences. Their deletion still leaves a grammatical sentence.
The clues seem to serve to relate the clauses of the text to other clauses in the text. Clues are not significantly involved in the grammatical structure of the clauses and phrases in which they occur since they can always be either deleted or replaced without disrupting the structures. Clues relate a dependent sentence to the other sentences in the text in two ways: (1) they arouse the expectation that there is another sentence in the text to which a dependent sentence is especially related, and (2) they permit the reader to make a prediction about this other sentence so he can recognize it if he encounters it in the text.
Predictions are either predictions about the recurrence of nouns which occur in noun phrases marked by clues or about the recurrence of a clause type as specified by a clue marking a clause. The predictions based on noun phrases are quite specific. For example, the phrase ‘this task’ suggests that there has been an earlier use of the word ‘task’. Predictions based on clauses are much less specific, though they are also testable. An example is a clause introduced by the clue ‘however’. This suggests only that there was an earlier clause to which the clause is related. While this is a relatively unspecific prediction, it can fail, since there may be no clause preceding in the text.
6.0 MAKING PREDICTIONS
Predictions are used to set up paths through the text. If a prediction is satisfied, a link is set up connecting the two sentences concerned. The following information is supplied in the dictionary for each class of clues, as the basis for making a prediction: The clues tell where to look for related sentences, how to decide whether or not the related sentence has been encountered, and when to give up trying to find the related sentence.
Each prediction is based on the occurrence of a clue and the phrase or clause structure in which it occurs. If the structure marked by the clue is a noun phrase, the clue determines what part of the noun phrase must reoccur and under what conditions it must reoccur in another sentence in order to be treated as a match. When a match is made, a link is constructed joining the two sentences as related. If the marked structure is a clause, the clue determines what clause type must reoccur and under what conditions it should be matched in another sentence to set up the link. In addition, for both noun phrases and clauses, the clue determines how far and in what direction the text must be searched for a match.
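A noun-phrase prediction with a direction and a distance limit might be tested along the following lines. This is a sketch under stated assumptions: each sentence is reduced to the set of nouns heading its noun phrases, the distance is a plain count of sentences, and the function name `search_for_noun` is hypothetical.

```python
from typing import List, Optional, Set


def search_for_noun(sentences: List[Set[str]], start: int, noun: str,
                    direction: str, max_distance: Optional[int] = None) -> Optional[int]:
    """Return the index of the nearest sentence containing `noun`, or None.

    `sentences[i]` holds the nouns of sentence i; the search begins next to
    `start` and moves LEFT or RIGHT for at most `max_distance` sentences.
    """
    if direction not in ("LEFT", "RIGHT"):
        raise ValueError("direction must be 'LEFT' or 'RIGHT'")
    step = -1 if direction == "LEFT" else 1
    index = start + step
    travelled = 1
    while 0 <= index < len(sentences):
        if max_distance is not None and travelled > max_distance:
            return None  # give up: the search distance is exhausted
        if noun in sentences[index]:
            return index  # prediction satisfied: a link can be constructed
        index += step
        travelled += 1
    return None  # ran off the end of the text: the prediction fails


# 'this task' in the third sentence predicts an earlier use of 'task'.
text = [{"task"}, {"computer"}, {"task"}]
print(search_for_noun(text, 2, "task", "LEFT"))                  # 0
print(search_for_noun(text, 2, "task", "LEFT", max_distance=1))  # None
```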
As stated above, the information used to make these predictions is contained in a dictionary. An abridged version of the dictionary is given in the appendix. Its entries include only those clues used in processing the sample text. This dictionary was abridged from a larger dictionary which has been used to process other texts.
The dictionary contains an entry for each clue; each field of the dictionary entry is devoted to a single piece of information used in setting up and testing the prediction. The following is the organization of the dictionary entry or format into separate fields of information. Field 1 specifies the conditions under which the occurrence of a clue should be ignored. Fields 2 and 3 specify what must be matched to satisfy the prediction. Fields 4 and 5 specify the direction and distance in which the text must be searched to test the prediction. The general format for the information in the dictionary entries is:
Field 1: | Field 2, 3: | Field 4, 5: |
Conditions when clue is ignored | Specify item to be matched to satisfy prediction | Determine direction and distance of text search for item |
The information placed in these different fields can be interpreted as instructions to the computational device. Each field of the format acts as a subinstruction. Thus, the dictionary entry for the clue ‘would’ is:
NOT-1 /SELF /TYPE-Ø /LEFT /NEXT
which tells the computational device to do as follows:
Field-1 NOT-1 | do not specify a prediction for this clue which makes use of TYPE unless the clue precedes the main verb of the sentence. |
Field-2 SELF | make a match on the clause type. Clause type is determined by the clue itself and specified by Field-3 |
Field-3 TYPE | the clause-type specified by the clue as TYPE-Ø |
Field-4 LEFT | search left in the text to make the match |
Field-5 NEXT | do not search beyond the next sentence in the appropriate direction |
‘Making a match’ on clause type means finding a clause having the specified type. Similarly, making a match on a noun means finding a noun phrase containing the specified noun.
An abridged version of the dictionary and a list of the subinstructions is given in the dictionary appendix.
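The five-field format lends itself to a small record type. The sketch below splits an entry on the slash separators used in the example for ‘would’; the class `ClueEntry`, its attribute names, and the function `parse_entry` are our own hypothetical labels for Fields 1 through 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClueEntry:
    ignore_condition: str  # Field 1: when the clue is to be ignored
    match_item: str        # Field 2: what is to be matched
    match_spec: str        # Field 3: further specification of the match
    direction: str         # Field 4: direction of the text search
    distance: str          # Field 5: how far to search


def parse_entry(raw: str) -> ClueEntry:
    """Split a slash-separated dictionary entry into its five fields."""
    # Whitespace inside a field is dropped, so 'TYPE - Ø' and 'TYPE-Ø' agree.
    fields = ["".join(part.split()) for part in raw.split("/")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}: {raw!r}")
    return ClueEntry(*fields)


would = parse_entry("NOT-1 /SELF /TYPE-Ø /LEFT /NEXT")
print(would.direction, would.distance)  # LEFT NEXT
```

Treating each field as a separate attribute corresponds to reading each field as a subinstruction.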
7.0 TESTING PREDICTIONS AND CONSTRUCTING PATHS
Properly interpreted, the dictionary information serves as an instruction to the computational device to search the text in order to test whether a given prediction is satisfied. The test is based on both the dictionary entry and a record of the previous predictions which had been satisfied. If a prediction made in one sentence is satisfied in another sentence, the pair of sentences is connected by a link. In this way, a record of the success or failure of these predictions is used to construct a diagram from these links.
The failure of a prediction is also used in constructing the diagram. A prediction can fail under two circumstances. Either the text is searched without satisfying the predictions so that no link is constructed, or the text search specified by the dictionary may be aborted before the text has been completely searched. In the latter case, predictions which might have been satisfied if more of the full text had been searched do not result in a link.
The searches to test predictions are aborted in order to prevent links from being constructed which could not be used by our auto-abstracting procedure. The procedure requires that a special kind of path be formed from the links produced by the searches. This path is called the principal path. The principal path is formed of links between nonadjacent sentences. Only those links are used which do not connect sentences occurring between linked nonadjacent sentences. The sentences connected by these links serve to organize the text into major subunits just as did the sentences on the principal path in the sentence outline of section 3. Further complications in the assignment of sentences to the principal path are discussed in section 8.
The principal path has the property that every sentence of the text is either:
1)a sentence on the path;
2)directly linked to a sentence on the path; or
3)indirectly linked to a sentence on the path (i.e., by a link to a sentence which is not itself on the path).
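This property can be checked mechanically once the links are known. The following sketch treats the links as an undirected graph and reports any sentence that cannot be reached from the path; the function name `uncovered_sentences` and the numbering of sentences from 1 are assumptions made for illustration, and dummy sentences are left out.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def uncovered_sentences(n_sentences: int,
                        links: Iterable[Tuple[int, int]],
                        principal_path: Iterable[int]) -> Set[int]:
    """Return sentences neither on the path nor linked to it, directly or indirectly."""
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    reached = set(principal_path)
    queue = deque(reached)
    while queue:  # breadth-first walk outward from the principal path
        current = queue.popleft()
        for other in neighbours[current]:
            if other not in reached:
                reached.add(other)
                queue.append(other)
    return set(range(1, n_sentences + 1)) - reached


# Sentence 4 has no link at all, so the property fails for it.
print(uncovered_sentences(5, [(1, 2), (2, 3), (1, 5)], [1, 5]))  # {4}
```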
The restrictions which abort searches arise as follows:
1)the predictions in each sentence are tested before moving to the right in the text to the next sentence;
2)when two nonadjacent sentences are connected, the sentences occurring between them are called ‘bracketed sentences’. As soon as the sentences have been bracketed, no new link can be constructed between them and the other sentences of the text. However, the bracketed sentences may have links constructed between them, and these links, or any other links constructed before the sentences were bracketed, are preserved.
The results of previous searches must be available each time a new prediction is tested. This is because as a result of these restrictions a previously successful search may be used to abort a later search.
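The bracketing restriction can be sketched as a record of links that refuses any new link joining a bracketed sentence to a sentence outside its brackets. The class `LinkRecord` is hypothetical, and one point is our own reading of the rule: the two bracketing sentences themselves are counted as lying outside the brackets.

```python
from typing import List, Tuple


class LinkRecord:
    """Keeps earlier links and uses them to abort later ones (a sketch)."""

    def __init__(self) -> None:
        self.links: List[Tuple[int, int]] = []
        self.brackets: List[Tuple[int, int]] = []  # linked nonadjacent pairs

    @staticmethod
    def _inside(sentence: int, bracket: Tuple[int, int]) -> bool:
        left, right = bracket
        return left < sentence < right

    def try_link(self, a: int, b: int) -> bool:
        """Construct the link a-b unless a bracket forbids it."""
        if a == b:
            return False
        left, right = min(a, b), max(a, b)
        for bracket in self.brackets:
            # Exactly one end inside the brackets: the link would cross them.
            if self._inside(left, bracket) != self._inside(right, bracket):
                return False
        self.links.append((left, right))
        if right - left > 1:  # nonadjacent: the sentences between are now bracketed
            self.brackets.append((left, right))
        return True


record = LinkRecord()
record.try_link(2, 3)         # adjacent link, made before any bracketing
record.try_link(1, 5)         # brackets sentences 2, 3 and 4
print(record.try_link(4, 6))  # False: 4 is bracketed, 6 is outside
print(record.try_link(3, 4))  # True: both sentences are bracketed
```

The link 2-3, constructed before the bracketing, stays in the record, in keeping with the statement that earlier links are preserved.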
Additional restrictions are imposed on the links after all the searches have been completed. In order to construct a principal path, various links connecting sentences are shifted or deleted. The result is that the diagram of the links between the sentences is modified. The modifications are:
1)If a sentence is connected by searches to immediately neighboring sentences to its left and right, the link to the left is deleted.
2)If two nonadjacent sentences are linked, a dummy sentence is inserted on the link connecting them. As was stated above, a dummy sentence is a sentence number that acts as a structural element in the diagram but does not correspond to any actual sentence in the text. The function of the dummy sentence is to provide a sentence to which the (bracketed) sentences occurring between the nonadjacent sentences can eventually be linked, if they have not previously been linked to a sentence outside the brackets. This is necessary since (1) it would be arbitrary to connect these sentences with either of the nonadjacent sentences rather than the other, (2) every sentence in the text is required to be connected either directly or indirectly to a sentence on the principal path. The sentences ‘bracketed’ by the nonadjacent sentences and the dummy sentence are linked together in a closed loop or ‘bow’. The sentences on the bow are connected to the principal path through the dummy sentence.
3)If a sentence has not been linked to any other sentence or is only linked as the result of searches from adjacent sentences (i.e., by vertical links in the diagram), it is not linked either directly or indirectly to a sentence on the principal path. Therefore an additional link is constructed which directly or indirectly connects it to the principal path. The sentence is placed when possible on the path which includes both its nearest neighbors, so that it occurs between these neighbors. The more complex conditions which may arise are dealt with in the flowchart of figures 11.2, 11.3, and 11.4. The resulting additional link or links are represented in the diagram by dotted line(s). (See examples in Figure 11.1.)
The detailed way in which these restrictions apply to processing is shown in the flowchart of figures 11.2, 11.3, and 11.4. The flowchart is intended as a summary of this section. It determines on the basis of the current clue and whatever processing has already been done: (1) whether to continue or terminate testing a prediction, (2) how to record this decision in the diagram of the text, (3) what the next prediction for testing should be.
Some additional possible restrictions on the processing are given in section 9.
8.0 PROCESSING TEXT PATHS
8.1 Finding the principal path. The diagram produced by processing a short text according to the flowchart of figures 11.2, 11.3, and 11.4 is shown in Figure 11.1. This summarizes the results of making and testing predictions for the text shown in Figure 11.5. The diagram is a record of the paths connecting the sentences of the text.
The diagram is a set of paths constructed out of the three types of links produced by processing the text: horizontal links which connect nonadjacent sentences, bow links which connect sentences bracketed by nonadjacent sentences, vertical links which connect adjacent sentences. A path is a connected set of links, all of the same type. Each of these paths has an interpretation in terms of the relative information content of the sentences on it.
In terms of information content, the most important kind of path is the principal path. This is the set of horizontal links connecting nonadjacent sentences in the diagram of the text. Sentences occurring between a pair of nonadjacent linked sentences are called ‘bracketed sentences’. Sentences are assigned to the principal path only if the sentences have not been connected to either of the bracketing sentences prior to the construction of the horizontal link causing the bracketing. The principal path connects only those linked nonadjacent sentences which are not also bracketed sentences. The informational importance of the principal path derives from two facts: (1) its sentences are not dependent on other sentences of the text, (2) all the other sentences of the text are directly or indirectly dependent on the sentences of the principal path. (It should be noted that these properties also characterize those sentences of the text which are not connected by any kind of link to other sentences of the text as a result of the processing described in section 7. For this reason, these unconnected sentences are also assigned to the principal path.)
Any pair of linked nonadjacent sentences has a more significant information content in terms of the text than the sentences they bracket. This is because there is no sentence in the text on which bracketed sentences can be clearly shown to be dependent. As a result the bow path on which the bracketed sentences occur can be deleted without altering the basic organization of the text. The bow path provides a kind of detour in the text. The sentences on the bow path take up a side issue or comment on the sentences which bracket them. They are not needed by the discussion which includes the sentences which bracket them.
Vertical links connect a sentence to another sentence which is either more directly linked to the principal path or is on the principal path. Sentences on a vertical path elaborate the content of the sentence which is more directly linked to the main path.
If this interpretation of the relative information content of the sentences on the different kinds of paths is correct, dropping all bow paths and vertical paths should produce a reduced version of the text which would provide the basic information content of the text. This was done to the diagram in Figure 11.1b to produce the diagram in Figure 11.1c. Figure 11.1c has reduced the set of paths given in Figure 11.1b to provide only the principal path. The sentences on the principal path can be read as a plausible summary or abstract of the content of the 28-sentence stretch excerpted from D. H. Lawrence in Figure 11.5:
(1) But in Germany, in weird post-war Germany, he seemed snuffed out again. (15) Now, however, some of the coldness of numbed Germany seemed to have got into her breast too. (16) Another world! (21) Phillip shivered and looked yellower. (27) And she felt quite cold about Phillip’s shivering. (28) Let him shiver.
This version would be more readable if clues relating the included sentences to sentences in the full text were omitted. The resulting version is:
1. He seemed snuffed out again.
15. Some of the coldness of numbed Germany seemed to have got into her breast too.
18. It was another world!
21. Phillip shivered and looked yellower.
27. She felt quite cold about Phillip’s shivering.
28. Let him shiver.
8.2 Restoring paths. An important test of our interpretation of the relative information content of the different kinds of paths is that it can be used to produce successively more informative versions of the text using only the information provided by the diagram of the principal path in Figure 11.1c. In this diagram, arrowheads were used to indicate a link to a sentence from a deleted vertical path and dummy sentences were used to indicate a link to a deleted bow path. According to our interpretation of the relative information content:
1)vertical paths are more informative than bow paths; and
2)sentences which are more directly linked to the principal path are more informative than sentences which are less directly linked.
This interpretation suggests the following cycle for gaining access to and restoring paths. The cycle proceeds through the principal path from left to right and:
1) restores all vertical paths whose arrowheads are encountered;
2) restores all bow paths whose dummy sentence is encountered;
3) restores all arrowheads for vertical paths not yet restored on the restored paths; and
4) repeats the above cycle on the enlarged path until the original set of paths is restored.
As the cycle is repeated, first the elaborations and side remarks or detours on the sentences of the principal path are supplied, and then the elaborations and detours on the referent are restored. The result is a very natural way to restore the information content of the text, which suggests that the diagram accurately organizes the sentences of the text in relation to their information value.
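The restoring cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the processing actually used: the `DeletedPath` record, the `restore_cycles` function, and in particular the anchor sentences assigned to each deleted path are hypothetical, since the anchors are given by Figure 11.1c rather than by the running text. Each pass restores every path whose arrowhead or dummy sentence sits on a sentence that is already visible; paths attached to newly restored sentences wait for the next pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletedPath:
    kind: str         # "vertical" (marked by an arrowhead) or "bow" (marked by a dummy sentence)
    anchor: int       # visible sentence carrying the arrowhead or dummy sentence (assumed)
    sentences: tuple  # sentence numbers the path brings back


def restore_cycles(principal_path, deleted_paths):
    """Yield the sorted list of visible sentences after each pass of the cycle."""
    visible = set(principal_path)
    pending = list(deleted_paths)
    while pending:
        # Markers are only encountered on sentences visible at the start of the pass.
        reachable = [p for p in pending if p.anchor in visible]
        if not reachable:
            break  # nothing left that is linked to the visible paths
        # Vertical paths are taken before bow paths within a pass.
        reachable.sort(key=lambda p: p.kind != "vertical")
        for path in reachable:
            visible.update(path.sentences)
        pending = [p for p in pending if p not in reachable]
        yield sorted(visible)


# Illustrative data loosely following sentences 3-14 of the sample text.
# The anchor values are assumptions made for this sketch.
principal = [1, 15, 16, 21, 27, 28]
deleted = [
    DeletedPath("bow", 1, (4, 8, 11, 12)),
    DeletedPath("vertical", 4, (5, 6, 7)),
    DeletedPath("vertical", 11, (10, 9)),
    DeletedPath("vertical", 12, (13, 14)),
    DeletedPath("bow", 4, (3,)),
]
versions = list(restore_cycles(principal, deleted))
```

Under these assumed anchors, `versions` holds two successively fuller selections of sentences: the first adds the bow path 4-8-11-12, and the second adds the paths attached to it.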
9.0 EXTENSION OF PROCESSING TO MATCHES ON NONIDENTICAL SPELLING
As we go through the restoring cycle, each new path as it is brought in has a list of sentences on it. These lists have some special properties in terms of information content. Each list is informationally highly coherent and relatively independent of the other lists of sentences brought back by the restoring cycle. A list deals with a specific topic and can be read informatively without considering the other lists of the text.
Consider, for example, the lists which can be brought back and defined by the restoring cycle for sentences 3-14. First, the restoring cycle brings back a bow path with sentences 4-8-11-12 on it. These sentences deal with Phillip’s feeling of unreality. They function as a principal path in relation to the lists of sentences brought back by the next cycle and provide the equivalent of topic sentences for these lists. Next, the following vertical paths are restored: sentences 4-5-6-7, dealing with his personal need for Kathy; sentences 11-10-9, dealing with the consequences of her presence for his feelings of unreality; sentences 12-13-14, dealing with the effect of his declarations on Kathy. Finally, another bow path is restored containing sentence 3 which describes the feelings of unreality.
The coherence of lists makes possible an extension of processing to construct links between words of different spelling. This was done informally when matching for clause type, but a specific extension of processing to extend the conditions under which matches are plausible is illustrated in what follows.
After lists have been constructed, additional searches seem reasonable within lists using the rules already presented in Figures 11.2, 11.3, and 11.4. The plan for extending the processing is to attempt to validate plausible matches between differently spelled words for a specific text. If it is possible to construct a link between two vocabulary items which have been treated as equivalent in other texts, the match is considered to be validated.
The processing to execute this plan is:
1) make searches on words with unsatisfied predictions. Use the same procedure that was used originally, i.e. that described in the flowchart of Figures 11.2, 11.3, and 11.4;
2) if a search encounters a word which has been treated as its lexical equivalent in other texts, construct a supplementary link;
3) delete this link if it would connect two previously constructed lists.
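The three steps above can be sketched as follows. This is a minimal Python illustration, not the flowchart procedure itself: the function name `validate_equivalents`, the data layout, and the nearest-sentence-first search order are all assumptions standing in for the searches of Figures 11.2, 11.3, and 11.4. The word pair strand and filament is used only as sample data.

```python
def validate_equivalents(sentences, list_of, unsatisfied, equivalents):
    """Return supplementary links (from_sentence, to_sentence, word, equivalent).

    sentences   -- dict: sentence number -> words occurring in that sentence
    list_of     -- dict: sentence number -> identifier of the list it belongs to
    unsatisfied -- (sentence number, word) pairs whose predictions were not satisfied
    equivalents -- set of frozenset word pairs treated as equivalent in other texts
    """
    links = []
    for source, word in unsatisfied:
        # Assumed search order: nearest sentences first (a stand-in for the flowchart).
        for target in sorted(sentences, key=lambda n: (abs(n - source), n)):
            if target == source:
                continue
            match = next(
                (w for w in sentences[target] if frozenset((word, w)) in equivalents),
                None,
            )
            if match is None:
                continue
            # Step 2 constructs the link; step 3 deletes it again if it would
            # connect two previously constructed lists.
            if list_of[source] == list_of[target]:
                links.append((source, target, word, match))
            break  # a search ends at its first match
    return links


# Hypothetical sample data.
sentences = {1: ["strand"], 2: ["leaf"], 3: ["filament"], 7: ["filament"]}
list_of = {1: "A", 2: "A", 3: "A", 7: "B"}
unsatisfied = [(1, "strand"), (7, "filament")]
equivalents = {frozenset(("strand", "filament"))}
links = validate_equivalents(sentences, list_of, unsatisfied, equivalents)
```

Here the search from sentence 1 finds filament in sentence 3 of the same list, so the pair is validated; the search from sentence 7 finds strand only in a different list, so that link is dropped.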
In order to accomplish this processing, a list of lexical equivalents from other texts would be required. How to produce such lists will not be discussed here. A thesaurus is an example of such a list. In terms of any specific text, it simply represents a set of hypotheses to be tested and validated.
The fact of list coherence is the basis for validating these entries for a particular text. The reasoning is that pairs of equivalents characterize text for particular subject matter areas. Therefore, if two words have previously been treated as equivalent, we are less surprised if they are equivalent in sentences of the same text than if the sentences in which they occur are drawn from more than one document. This is because sentences from a single text are more likely to be dealing with the same subject area. This likelihood is greatly increased when they are in the same list, because this would lead us to expect them to be dealing with the same topic. Accordingly, if two such words of different spelling can be used to construct a link within a list, their equivalence has been validated for the text in which they occur.
One result of constructing this supplementary link is to introduce into the text diagram additional structure which might otherwise have gone unrecognized. A second result is to provide information about the semantic organization of the text. The lexical equivalents which can be validated for the text lead us to certain assumptions about the topics it deals with. These help us in interpreting it. For example, if the pair of words strand and filament are validated as lexical equivalents for a specific text by this procedure, it will not be surprising that the text deals with botany, but it will be surprising if the text deals with geography.
The dictionary does not provide information about lexical equivalence for the clues used in matching on clause type. These clues merely mark a clause as dependent and indicate the direction in which text should be searched in order to test the prediction that there is another clause on which it is dependent. Unlike the noun clues, no way is provided for deciding whether or not the clause has been encountered. The first clause encountered in the appropriate direction is treated as satisfying the prediction. Since all clauses are treated as being of the same type for satisfying the prediction, matches on clause type should strictly be considered as tentative. Additional information could have been provided in the dictionary to distinguish the different kinds of clauses marked for clause type. This would have required the following additional step in the processing in order to validate the tentative links. If the link previously constructed as a tentative link fails because of the dictionary information, delete the link and replace it with a special link to indicate a weaker tie. The result of this processing would be to subdivide the lists further. The special link would indicate the breaks within lists.
When this extension of processing is applied to nouns, it provides semantic information as well as additional list organization to the text. The fact that certain pairs of lexical items are validated for a specific text is significant information about the semantic organization of the text. In the case of nouns, such lexical equivalents are traditionally called synonyms. Although it is not conclusive, by the same reasoning the lack of match between the words ‘creature’ (sentence 13) and ‘beast’ (sentence 16) in the sample text is also information about the semantic organization of the text. The two words are synonyms in some texts but in this text they do not function as synonyms since they cannot be validated. Although they refer to the same individual, in fact, ‘creature’ is used in affectionate retrospect while ‘beast’ refers to the harsh present of the story.
10.0 CONCLUSION
A computational approach has been used here to recognize the important or significant sentences of a text without knowing their meaning. It was first necessary to construct a description of the organization of the text. The application of the same computational approach to other texts provides a good test of insights into text organization. The prime significance of the computational approach is that it allows us to convert our insights into processing which has testable consequences. The computational method described in this paper has been experimentally applied to a variety of texts, both fiction and technical. Portions of the technique have been programmed for use in a computer information retrieval system.
The computational approach can be extended to other texts only if additional entries are made to the abridged dictionary provided here. Each dictionary entry represents a minor hypothesis about how text is organized. The flowchart represents a major hypothesis about how these minor hypotheses work together to organize the text. Hypotheses can be entered into the processor either as dictionary entries or modifications of the flowchart.
If a hypothesis can be inserted into the dictionary without changing the number of fields in the format or the instructions which occur in particular fields, it is a minor hypothesis. If its entry into the dictionary causes wrong processing, the minor hypothesis is invalidated. If the hypothesis is not a minor hypothesis, it is a major hypothesis which may lead to modification in the flowchart as well as the dictionary.
It is by such a process of trial and error within a developing framework of hypotheses that the technique discussed in this paper was developed.
DICTIONARY APPENDIX
The functions of the dictionary are described in section 6. The dictionary used in processing the sample text is given at the end of this section. The dictionary is arranged by type of processing and thus avoids repetition because all the clues listed under a particular format have the same information and processing rules associated with them.
The unabridged version of the dictionary includes two items not entered into the abridged version. Neither of these items is used in the processing of the sample text. The information omitted is from Field-3, which additionally specifies the matches proposed by Field-2:
1) Only the single entry TYPE-Ø is given for TYPE, although in fact a number of clause types could be distinguished and utilized in different ways in the processing. A possible extension of the processing which utilizes a greater variety of clause types is discussed in section 9.
2) Not all nouns make equally good matches for linking sentences as the dictionary might suggest. There is a class of nouns whose major function is to provide something to be modified. An example would be the noun issue in the expression clarification of the nuclear issue. Here the modifier is lexically significant while the noun itself is relatively insignificant. The unabridged dictionary distinguishes between these two kinds of nouns in order to organize the processing so that these more significant matches are attempted before the less significant matches. Although this type of processing is necessary for technical prose, it was not needed in the D. H. Lawrence selection and so is not entered into the dictionary.
The formats are named with different letters of the alphabet. Under each format the clues are listed in alphabetical order. The format letters have been inserted into the text to the right of the clue they are associated with. A single clue may have more than one dictionary format associated with it. In that case the convention applies that the earlier alphabet symbol should be tried first. If it succeeds the other is not tried. Format letters have been assigned so that noun matches will be tried before clause-type matches.
Slashes separate the different fields for each entry. An instruction field is ignored in the processing when marked with a dash. As soon as a match is made on a given search, that search is terminated. Instructions in a given field to search in both directions are executed. The order in which searches are made is arbitrary. In the case of a match, only the first successful search in that direction is completed.
The subinstructions are:
Field | Subinstruction | Meaning
Field 1 | NOT-1 | do not specify a prediction for this clue which makes use of TYPE-Ø unless the clue precedes the main verb of the sentence
Field 1 | DASH | No instruction given
Field 2 | SELF | match on clause type. Clause type is determined by the clue and specified in Field-3
Field 2 | NPL | match on a noun of the same spelling as the noun to the left of the clue in the noun phrase containing the clue
Field 2 | NPR | match on a noun of the same spelling as the noun to the right of the clue in the noun phrase containing the clue
Field 3 | TYPE-Ø | the clause type specified by the clue is TYPE-Ø. This is equivalent to a match on the adjacent sentence if there is any. This is true because in this version of the dictionary all clauses have been assigned to this type.
Field 3 | DASH | No instruction given
Field 4 | LEFT | make a search through the sentences to the left of the current sentence
Field 4 | RIGHT | make a search through the sentences to the right of the current sentence
Field 5 | END | make a search to the end of the text in order to test the prediction of Fields 2 and 3
Field 5 | NEXT | make a search no further than the end of the ‘next’ sentence of the text where ‘next’ is determined by Field 4
In any field, a dash means that there is no instruction of any kind given by that field.
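As an illustration of the entry layout just described, the following Python sketch reads one slash-separated entry into its five fields and treats a dash as "no instruction". The `Entry` record, its attribute names, the `parse_entry` function, and the two sample entry strings are hypothetical; they are built from the subinstructions listed above and are not taken from the dictionary itself. How an instruction to search in both directions is written in Field 4 is not shown here, so the sketch does not try to handle it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    restriction: Optional[str]  # Field 1, e.g. NOT-1
    match: Optional[str]        # Field 2: SELF, NPL, or NPR
    clause_type: Optional[str]  # Field 3, e.g. TYPE-Ø
    direction: Optional[str]    # Field 4: LEFT or RIGHT
    extent: Optional[str]       # Field 5: END or NEXT


def parse_entry(text: str) -> Entry:
    """Split a slash-separated entry; a field marked with a dash becomes None."""
    parts = [part.strip() for part in text.split("/")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, found {len(parts)}: {text!r}")
    return Entry(*(None if part in ("-", "DASH") else part for part in parts))


# Hypothetical entries, assembled only from the subinstructions listed above.
clause_entry = parse_entry("NOT-1/SELF/TYPE-Ø/LEFT/NEXT")
noun_entry = parse_entry("-/NPR/-/LEFT/END")
```

Keeping the number of fields fixed mirrors the distinction drawn in the conclusion: a hypothesis that fits this five-field format without new instructions is a minor one.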
The formats are:
Miscellaneous:
In addition the dictionary may include information of the kind discussed in section 9. An example of such thesaurus information is the entry he = him, which is to be interpreted as he matches him when searches are made on him.
NOTES
1. The more usual term for this process in information retrieval is auto-extracting. Auto-abstracting normally implies that the sentences extracted from the text have been altered to increase their informational value. Nevertheless I have preferred to use this term because of its more familiar associations.
2. Programming languages for dealing with such list structures have been developed. A good introduction to the subject is Newell, A. et al., Information Processing Language-V Manual, The Rand Corporation (Englewood Cliffs, New Jersey, 1964). A programming language for a similar application to the one described in this paper is reported on under the name CORAL in Sutherland, I. E., Sketchpad: A Man-machine Communication System, Lincoln Report TR-396 (Cambridge, Massachusetts: MIT, Electronic System Laboratory, January 1963).
3. A review of work on similar dependency phenomena is given in Viola Waterhouse, ‘Independent and Dependent Sentences’, International Journal of American Linguistics 29:1.45-54 (1963). Rules dealing with similar dependencies are discussed in Z. S. Harris, ‘Co-occurrence and Transformation in Linguistic Structure’, Lg. 33:3.283-340 (1957), see especially the section entitled Transformations in Sentence Sequences. An account of dependencies, also embedded in a computational framework, is presented in John C. Olney, ‘Some Patterns observed in the Contextual Specialization of Word Senses’, Information Storage and Retrieval 2.79-101.
Figure 11.1 Notational Conventions for Paths Through Text
Figure 11.2 FLOWCHART FOR CONSTRUCTING A TEXT
Figure 11.3
Figure 11.4
Figure 11.5. Sample Text Excerpted from the short story ‘The Border Line’ by D. H. Lawrence