Branka Kosovac, PhD candidate, University of British Columbia; email: branka@civil.ubc.ca
Dana J. Vanier, PhD, National Research Council Canada; email: Dana.Vanier@nrc.ca
Thomas M. Froese, PhD, University of British Columbia; email: tfroese@civil.ubc.ca
SUMMARY: The paper describes a method used to collect terms needed for the development of a thesaurus in the roofing domain. This work is part of a larger effort to investigate the potential of thesauri as an aid in product modeling and as a tool for information management in model-based systems. Extractor, a software module that extracts keyphrases from documents, was used for collecting candidate thesaurus terms from Internet sources. The principal advantage of the Internet as a source of candidate terms is that it reflects the language that is actually used in communications concerning buildings and that it covers the widest range of different views on the domain. The advantage of using Extractor or similar software is that it allows processing huge text corpora available on the Internet while eliminating irrelevant terms. The methodology used was found to be highly useful, although it was not sufficient by itself for constructing a thesaurus for the architecture, engineering, construction and facilities management industries, as considerable human intervention was required. Some possibilities for customizing the software and for partially automating a thesaurus construction process are suggested.
KEYWORDS: thesauri, Internet, automatic indexing software, thesaurus construction.
Extractor (Extractor 2000) is a machine-learning-based software module, developed by the Interactive Information Group of the National Research Council Canada, that scans an electronic document and extracts keyphrases best describing the document's subject matter. In this project, the authors used Extractor 2.0 as a support tool for collecting and selecting terms to be included in the proposed thesaurus.
In the process described, Extractor was used for a specific task and under given circumstances. Time and resource constraints did not allow full exploitation of Extractor's capabilities. These constraints also precluded testing corpora (document collections) large enough to provide statistically valid results. The work described is therefore exploratory in nature and cannot be considered a study that evaluates the performance of Extractor. However, the patterns noticed in the analysis of the results can point to possible uses of the software for related purposes and suggest possibilities for further research and development.
Since the mid-eighties, there have been considerable research efforts related to automatic extraction of semantic relationships and automatic construction of structured thesauri, e.g. Chodorow et al. (1985), Fox et al. (1988). Despite some claims of successful R&D projects, such as MindNet (Richardson et al., 1998), wide use of full automation in thesaurus construction is still unrealistic. One simple but important reason is that the procedure requires special corpora: in almost all projects, machine-readable dictionaries have been used. As a comprehensive, authoritative and up-to-date source for a given subject domain is rarely available, it can be expected that computers will be used only as support tools in this part of the process for some time to come.
Thesauri address the general task of mapping related concepts, a task that arises frequently in model-based systems. They provide formalisms and approaches for mapping concepts, such as the addition of relative specificity of relationships (e.g., one term might be described as related to, but more narrowly defined than, another term). Thesauri can be used to map not only related words and phrases, but also a wider range of representational elements such as semantic models, heterogeneous forms of computer-based data, etc. Roles in which thesauri may be useful in model-based systems include the following:
On the other hand, the TC/CS has a thoroughly elaborated structure and well-defined inter-term relationships that facilitate the addition of new terms, given that their exact meaning is known. Furthermore, it is available in electronic form on the World Wide Web (http://www.cisti.nrc.ca/irc/thesaurus/), thus allowing easy searching for known terms. For these reasons the "bottom up" approach suggested earlier seemed to be the most appropriate approach for this work.
As one of the primary intended uses of the proposed thesaurus is indexing and retrieval of Internet/intranet sources, the most useful source of terms would be corpora available on the Internet. The Extractor 2.0 documentation pointed to the suitability of the software for performing "literature scanning" of Internet sources as it integrates HTML and e-mail filters and permits processing of large corpora by extracting only relevant terms.
First, the initial query, which was used to test the harvesting of corpora with automatically generated queries that combine synonyms of the top term (summum genus) with its immediate narrower terms, was as follows:
("low-slope roof*" OR "flat roof*") AND ("built up" OR BUR OR "multi ply" OR "single ply")
(where BUR is an abbreviation for built-up roofs). This query did not provide optimum recall and precision within some search services, and processing a sample large enough to compensate for these deficiencies was unrealistic. The query was therefore modified to the following:
("flat roof*" OR "low slope") AND ("built up" OR BUR OR roofing OR membrane*)
This better reflected the content and the language of the relevant documents. Although the new query did not eliminate all the "noise", nor did it ensure absolute recall, the retrieved sets of documents seemed acceptable for the purpose.
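The way such queries combine synonyms of the top term with its immediate narrower terms can be sketched in a few lines of Python. This is only an illustration: the function names `quote` and `build_query` are hypothetical, and the quoting rule (quote multi-word terms, leave single tokens bare) is an assumption inferred from the two queries shown above, not a description of the tool actually used in the study.

```python
def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes; leave single tokens bare."""
    return f'"{term}"' if " " in term else term


def build_query(top_term_synonyms, narrower_terms):
    """Combine two OR-groups with AND, as in the queries shown above.

    top_term_synonyms: synonyms of the top term (summum genus)
    narrower_terms: its immediate narrower terms (or replacements for them)
    """
    left = " OR ".join(quote(t) for t in top_term_synonyms)
    right = " OR ".join(quote(t) for t in narrower_terms)
    return f"({left}) AND ({right})"


# The initial and the modified query from the text.
initial = build_query(
    ["low-slope roof*", "flat roof*"],
    ["built up", "BUR", "multi ply", "single ply"],
)
modified = build_query(
    ["flat roof*", "low slope"],
    ["built up", "BUR", "roofing", "membrane*"],
)
print(initial)
print(modified)
```

Keeping the two term groups as plain lists makes it easy to swap terms in and out when a query turns out to retrieve too much noise in a particular search service.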
Second, the initial strategy was to process documents using the lowest setting for the number of keyphrases, identify terms that should be added as stop phrases (i.e., terms such as "roofs" or "roofing" that appeared as keyphrases in most of the documents, preventing extraction of more specific terms), cluster documents based on the rest of the keyphrases, process them with the highest setting for the number of keyphrases, and repeat the process until the desired level of specificity was achieved. However, the inability to frequently customize the software by adding new stop phrases and the laborious task of processing the same documents more than once made this strategy unfeasible. Using the new release with the maximum setting for the number of keyphrases and simply removing the most frequent terms proved to yield satisfactory results.
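The simpler strategy, removing the most frequent terms after extraction, can be illustrated with a minimal sketch. It assumes that keyphrases have already been extracted and are available as one list per document. The function name `drop_frequent_terms` and the threshold of more than half of the documents are assumptions made for the example; the study does not state a numeric cut-off.

```python
from collections import Counter


def drop_frequent_terms(doc_keyphrases, max_doc_share=0.5):
    """Remove keyphrases that occur in more than max_doc_share of documents.

    doc_keyphrases: list of keyphrase lists, one list per document.
    Returns the filtered lists and the set of removed (lower-cased) phrases.
    """
    n_docs = len(doc_keyphrases)
    doc_freq = Counter()
    for phrases in doc_keyphrases:
        # Count each phrase once per document, ignoring case.
        doc_freq.update({p.lower() for p in phrases})

    too_frequent = {p for p, n in doc_freq.items() if n / n_docs > max_doc_share}
    kept = [
        [p for p in phrases if p.lower() not in too_frequent]
        for phrases in doc_keyphrases
    ]
    return kept, too_frequent


# Invented sample data, for illustration only.
docs = [
    ["roofing", "membranes", "flashings"],
    ["Roofing", "roofs", "ballast"],
    ["roofing", "roofs", "EPDM"],
]
kept, removed = drop_frequent_terms(docs)
print(sorted(removed))
print(kept)
```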
It was observed that long documents, which tended to abound with very specific terms, yielded only very general terms in the Extractor output. Although the keyphrases derived by Extractor reflected the subject of the documents well, they did not include the specific terms that would be more useful for the purpose. Efforts to automatically divide long texts into meaningful sub-documents did not prove feasible with most of the Web documents. Division was, therefore, done only on a small number of scientific papers that tended to be well organized and to have a better HTML structure, thus allowing easy division by searching for heading tags.
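Division by heading tags can be sketched with Python's standard `html.parser` module. This is a minimal illustration under the assumption that the document uses `h1` to `h6` elements consistently; the class name `HeadingSplitter` and the function `split_by_headings` are hypothetical and do not describe the procedure actually used in the study.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSplitter(HTMLParser):
    """Collect text into sections, starting a new section at each heading tag."""

    def __init__(self):
        super().__init__()
        # The first section holds any text that precedes the first heading.
        self.sections = [{"heading": "", "text": []}]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:  # HTMLParser lower-cases tag names
            self.sections.append({"heading": "", "text": []})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        chunk = " ".join(data.split())  # normalize whitespace
        if not chunk:
            return
        section = self.sections[-1]
        if self._in_heading:
            section["heading"] = (section["heading"] + " " + chunk).strip()
        else:
            section["text"].append(chunk)


def split_by_headings(html):
    """Return (heading, text) pairs, one per non-empty sub-document."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return [
        (s["heading"], " ".join(s["text"]))
        for s in parser.sections
        if s["heading"] or s["text"]
    ]
```

Each resulting pair could then be passed to the keyphrase extraction step as a separate sub-document. Pages that mark headings only with font changes, as many Web documents do, would not be split by this approach, which is consistent with the difficulty noted above.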
("flat roof*" OR low-slope) AND (built-up OR BUR OR roofing OR membrane*)
The services were searched in alphabetical order, taking care to avoid duplicates. Where relevant documents from a certain site were grouped together, only the first one was used, in order to avoid over-representing the language of a single author and the frequent appearance of the same corporate names and trademarks.
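The selection rule described above (skip duplicates and keep only the first document from each site) can be sketched as follows. Treating the host name of the URL as the "site" is an assumption made for this example, as are the function name `first_hit_per_site` and the sample URLs.

```python
from urllib.parse import urlsplit


def first_hit_per_site(urls):
    """Keep only the first URL seen for each host, preserving search order."""
    seen_hosts = set()
    kept = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        if host not in seen_hosts:
            seen_hosts.add(host)
            kept.append(url)
    return kept


# Hypothetical hit list, in the order returned by the search services.
hits = [
    "http://www.example.com/roofing/a.html",
    "http://example.com/roofing/b.html",
    "http://example.org/flat-roofs.html",
    "http://www.example.com/roofing/a.html",
]
print(first_hit_per_site(hits))
```

Because exact duplicates always share a host with an earlier hit, the same pass also removes them.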
Messages from newsgroup archives were not processed separately, but a small number of documents of this type were included in the AltaVista hits.
After the identification of exact matches, terms from the list were also searched for occurrence as:
The results from the general search services were then compared with those from selected collections. There were no significant differences noted in the relevance of extracted keywords that would justify the laborious task of searching, evaluating, and selecting sources. The quantity and diversity of documents that can be easily retrieved by general search services can successfully compensate for the quality of selection. As the first 20 documents from the compiled collection described in Section 2.3.1 did not bring new terms to the list, this collection was not further processed and it is not included in the final results. However, the documents have been saved for later comparison of scholarly and natural language terms and for exploration of specific sub-areas.
The final list consists of 1054 terms (2423 occurrences) extracted from 176 documents. Almost half of the terms extracted (49%) were single words. They accounted for almost all of the top 4% most frequent terms, with only two exceptions that would normally be included in stop phrases. The usual practice of treating such terms separately was not followed, as most of the terms were also identified as single-word terms in the TC/CS.
The large proportion of terms that occurred only once (78%) can be explained by the insufficient size of the sample. In order to extract relevant terms from this group, the terms were searched for component words and stems that could also be found in other terms. Terms found in this way were ranked higher in the list as more relevant. Rough scanning of the remaining single-occurrence terms found very few terms that were relevant to the field. For that reason, these terms were excluded from further processing. It is important to note, however, that this group contained a substantially smaller percentage of terms marked as ill-defined, significantly contributing to the overall average of only 1% ill-defined terms.
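The screening of single-occurrence terms can be illustrated with a small sketch. It makes several assumptions that are not in the study: "other terms" is taken to mean terms that occur more than once, the stemmer is a deliberately crude suffix stripper rather than a real stemming algorithm, and the names `crude_stem`, `stems` and `promote_singletons` and the sample counts are invented for the example.

```python
def crude_stem(word):
    """Very rough stemmer: strip a few common English suffixes (illustrative only)."""
    word = word.lower()
    for suffix in ("ings", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(term):
    """Set of crude stems of the component words of a term."""
    return {crude_stem(w) for w in term.replace("-", " ").split()}


def promote_singletons(term_counts):
    """Split single-occurrence terms into promoted and remaining terms.

    A singleton is promoted when it shares at least one stem with a term
    that occurs more than once (an assumption about what "other terms" means).
    """
    repeated_stems = set()
    for term, n in term_counts.items():
        if n > 1:
            repeated_stems |= stems(term)

    promoted, rest = [], []
    for term, n in term_counts.items():
        if n == 1:
            (promoted if stems(term) & repeated_stems else rest).append(term)
    return promoted, rest


# Invented counts, for illustration only.
counts = {
    "membranes": 74,
    "flashings": 16,
    "membrane adhesive": 1,
    "counter-flashing": 1,
    "book catalogue": 1,
}
promoted, rest = promote_singletons(counts)
print(promoted)
print(rest)
```

The remaining terms in `rest` correspond to the group that, in the study, was scanned manually and then excluded from further processing.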

Fig. 1: List of terms with more than two occurrences compared with TC/CS
Fig. 2 shows the breakdown of terms with more than two occurrences that had no matches or close matches in the existing thesaurus: 12% (or 3% of all the terms extracted) were proper names, 8% (or 2% of all the terms) were acronyms, and 19% (or 5% of all the terms) were marked as ill-defined.

Fig. 2: Breakdown of terms not found in TC/CS
The majority of mismatches, however, do not indicate irrelevance of the terms but more often the outdatedness of the TC/CS. The frequent occurrence of the term "membranes", for example, and of phrases containing "membranes" that are not found in the TC/CS reflects changes in the field of low-slope roofing and its terminology. The noise-making terms, i.e. extracted terms that do not belong to the low-slope roofing domain, come from specific kinds of documents, mostly glossaries and book catalogues. Such documents can easily be excluded from the beginning by modifying the initial query.
The relatively high coincidence of terms may, on the other hand, indicate a lack of more specific terms that would be required for developing a microthesaurus. Whether this is the case can be established only later in the process.
The matches were found in all semantic classes and in various hierarchies, showing a broad coverage of the domain. Completeness of coverage, however, is yet another problem that cannot be properly evaluated at this point but only after organizing terms into hierarchies (Petersen 1990).
The number of phrases marked as ill-defined (see the explanation in Section 2.3.3) was 1% of all the extracted terms. Since total automation of the thesaurus construction process is not being considered, and since ill-defined phrases cause no serious consequences, this percentage can be considered negligible. Therefore, increasing the number of keyphrases even above the maximum that was possible in version 2.0 (in order to retrieve more specific terms) would probably be safe. In most cases the lack of more specific terms in the Extractor output will not represent a deficiency; terms too specific to be the subject of a document are rarely included in a thesaurus. In addition, their presence might make one of the most important decisions in thesaurus construction, namely where to stop, even more difficult. However, if a microthesaurus is being constructed, or if more specific terms are needed for any other reason, these terms may be obtained by processing larger corpora or by narrowing the searches for the Internet documents to be processed.
Since the time of the study, several new versions of Extractor have been released. Anyone interested in implementing the method described in this paper should review new features of the software by visiting the Extractor Web site or by contacting the software's authors.
Original data and full results of the study are available from the authors of this paper.
Biegel, S. (1989). Roofing Materials, Encyclopedia of Architecture, Design, Engineering & Construction, Vol. 4, American Institute of Architects, 314-319.
Chodorow, M., Byrd, R. and Heidorn, G. (1985). Extracting semantic hierarchies from a large on-line dictionary, Proceedings of the 23rd Annual Meeting of the ACL, 299-304.
Extractor (2000). National Research Council of Canada, Interactive Information Group, Ottawa, Canada. Available from: http://extractor.iit.nrc.ca [Accessed February 7, 2000]
FacilitiesNet (1998). Trade Press Publishing, Milwaukee, WI, USA. Available from: http://www.facilitiesnet.com [Accessed November 27, 1999]
Fox, E.A., Nutter, J.T., Ahlswede, T., Evens, M. and Markowitz, J. (1988). Building a large thesaurus for information retrieval, 2nd Conference on Applied Natural Language Processing, Association for Computational Linguistics, (Ballard B., ed.), Bell Communications Research, Morristown, NJ, 101-108.
Gilchrist, A. (1971). The Thesaurus in Retrieval, Aslib, London.
Kosovac, B. (1998). Internet/Intranet and Thesauri, Canadian Institute for Scientific and Technical Information, Internal Report, National Research Council Canada, Ottawa, Canada. Available from: http://www.nrc.ca/irc/thesaurus/roofing/report_b.html [Accessed November 27, 1999]
Lancaster, F.W. (1986). Vocabulary Control for Information Retrieval, Information Resources Press, Arlington, VA, USA.
Petersen, T. (1990). Developing a new thesaurus for art and architecture, Library Trends, Vol. 38, No. 4, 644-658.
Richardson, S.D., Dolan, W.B. and Vanderwende, L. (1998). MindNet: acquiring and structuring semantic information from text, Microsoft Research Technical Publications (MSR-TR-98-23). Available from: ftp://ftp.research.microsoft.com/pub/tr/tr-98-23.doc [Accessed January 31, 2000]
Roofing Resources (1998). National Research Council of Canada, Institute for Research in Construction, Ottawa, Canada. Available from: http://www.nrc.ca/irc/roofing/roofing.html [Accessed November 27, 1999]
TC/CS (1978). Canadian Thesaurus of Construction Science and Technology, Department of Industry, Trade and Commerce, Government of Canada, Ottawa. Available from: http://www.cisti.nrc.ca/irc/thesaurus/ [Accessed November 27, 1999]
Vanier, D.J. (1994). Canadian thesaurus of construction science and technology: A hypercard stack, Proceedings of the Joint CIB Workshops on Computers and Information in Construction (Montreal, Que., Canada), (CIB Proceedings, Vol. 165), 559-564.
| OCCURRENCES | TERM | CODES | NOTE |
| --- | --- | --- | --- |
| 158 | roofing(s) | =, PL, $ | |
| 139 | roof(s) | = | |
| 74 | membrane(s) | PH, Q | |
| 50 | materials | = | |
| 32 | installation | =, PH, PL | installation(activity) |
| 32 | products | =, PH | products(agents) |
| 27 | performance | =, PH | |
| 27 | water | =, PH | |
| 26 | design | =, PH, PL | |
| 25 | insulation | PH | |
| 25 | roofing system(s) | | |
| 24 | requirements | =, GT, PH | |
| 22 | asphalt | = | |
| 19 | deck(s) | PH | |
| 19 | repair | ~ | repairing |
| 17 | maintenance | =, ~, + | maintenance(restoring) |
| 17 | manufacturer | = | |
| 16 | flashing(s) | =, + | |
| 14 | coating(s) | =, PL | coating(process) |
| 13 | fasteners | PH, Q | |
| 13 | inspection | =, GT | |
| 13 | joint(s) | =, + | joints(junctions) |
| 13 | specifications | =, GT, PH | |
| 12 | BUR | A | built up roofings found |
| 12 | costs | = | |
| 11 | construction | | |
| 11 | felts | =, PH, $ | |
| 11 | structures | = | structures(buildings) structures(construction) structures(non building) |
| 11 | waterproofing | PH | |
| 11 | wind | =, PH | |
| 10 | flat roof(s) | = | |
| 10 | properties | =, GT, PL | property(quality) |
| 9 | components | =, Q | |
| 9 | moisture | PH | |
| 9 | sheets | =, + | sheets(shape) |
| 9 | slope(s) | =, Q | |
| 9 | standards | =, PH | |
| 8 | industry | =, PH | |
| 8 | install | * | |
| 8 | projects | GT, PH | |
| 8 | replacement | PH | replacement value |
| 8 | resistance | =, PH | non-descriptor |
| 8 | walls | =, PH | |
| 8 | warranties | =, PH | |
| 7 | air | = | |
| 7 | ballast | =, $ | ballast(gravel) |
| 7 | building owners | = | |
| 7 | flexible membrane | | |
| 7 | market(s) | = | |
| 7 | modified bitumen(s) | | |
| 7 | panel(s) | = | |
| 7 | seams | PH | |
| 6 | applications | =, PH, $ | |
| 6 | built-up roofing | = | |
| 6 | consideration | | |
| 6 | drain(s) | PH | |
| 6 | evaluation | =, GT | |
| 6 | installing | * | |
| 6 | life | =, GT, PH | |
| 6 | load(s) | =, PH | |
| 6 | protected membranes | | |
| 6 | roofing contractor(s) | | contractors |
| 6 | shingles | = | out of scope |
| 6 | substrate | | |
| 6 | temperature | =, GT | |
| 6 | vapour barrier(s) | =, + | |
| 5 | assemblies | = | assemblies(components) |
| 5 | base flashing(s) | =, $ | |
| 5 | Bitumen(s) | = | |
| 5 | built-up roof | ~ | built up roofings |
| 5 | condition(s) | GT, PL | |
| 5 | drainage | =, PH | |
| 5 | EPDM | ~ | epdm rubber |
| 5 | experience | =, $ | ND for "skill" |
| 5 | facilities | PH, Q | |
| 5 | leak | ~ | leakage |
| 5 | low slope | | |
| 5 | pressure | PH | |
| 5 | preventative maintenance | | but prevention & maintenance |
| 5 | protection | =, GT | |
| 5 | roof decks | ~ | roof deckings |
| 5 | single | * | |
| 5 | thermosets | ~ | thermosetting resin (ND) |
| 4 | built-up | * | |
| 4 | coal(-)tar | = | |
| 4 | damage | = | |
| 4 | flow | = | flow(fluids) |
| 4 | layer | PH, $ | |
| 4 | movement | PH | ND for motion |
| 4 | recommendations | =, GT | |
| 4 | reinforcing | ~, PH, $ | |
| 4 | requiring | ~, $ | |
| 4 | Roofing Contractors Association | N | |
| 4 | roofing membranes | | |
| 4 | thermal insulation | = | |
| 4 | vapour retarders | | |
| 3 | Aduron | N | |
| 3 | architect(s) | = | |
| 3 | barrier | PH, $ | ND |
| 3 | base coat(s) | $ | |
| 3 | building envelope | | |
| 3 | Canada | N | |
| 3 | coating system | | but "coating(process)" |
| 3 | condensation | =, PH | |
| 3 | contract(s) | = | |
| 3 | control | =, GT | |
| 3 | counter(-)flashing(s) | | |
| 3 | cracks | = | cracks(fissures) ND |
| 3 | differential movement(s) | = | ND |
| 3 | edge(s) | =, GT, PH | edge(boundary) |
| 3 | equipment | = | |
| 3 | flat | PH, * | |
| 3 | foot | PH | |
| 3 | gravel | = | |
| 3 | installer | PH | |
| 3 | liner | | |
| 3 | measurement(s) | =, GT, PH | |
| 3 | metal roof | ~ | metal roofings |
| 3 | minimum | PH | |
| 3 | National Research Council | N | |
| 3 | owner(s) | =, PH | |
| 3 | parapet | PH | |
| 3 | polymer | = | ND for "polymeric materials" |
| 3 | reinforcement | PH | |
| 3 | review | | irrelevant |
| 3 | SPF | A | |
| 3 | steel | = | |
| 3 | surfacing | = | |
| 3 | waterproofing membrane | ~ | waterproof membranes (ND) |
=: exact match found in TC/CS
~: close match found in TC/CS
GT: term represents a general term in TC/CS, i.e. a term that does not have a fixed hierarchical level but can be associated with terms of varying degrees of specificity
+: term has a further developed hierarchy of narrower and part terms in TC/CS
PH: word found as a part of phrases included as terms in TC/CS
Q: term used as a qualifier in TC/CS
A: acronym
N: proper name
*: term "ill-defined", i.e. cannot stand alone as a thesaurus term
PL: terms may have a different meaning when used in plural
$: matches that have the same form but different meaning
ND: non-descriptor, i.e. term pointing to a descriptor (authorized term) used as a thesaurus entry