Text is essential to the cataloging process. In particular, every image and video on the Web has a unique Web address and possibly associated HTML tags, which provide valuable clues for interpreting the visual information. We process the Web addresses, or URLs, and HTML tags in the following ways to index the images and videos.
Images and videos are published on the Web in two forms: inlined and referenced. The HTML syntax differs in the two cases. To inline, or embed, an image or video in a Web document, the following code is included in the document: <img src=URL alt=[alt text]>, where URL gives the relative or absolute address of the image or video. The optional alt attribute specifies the text that may appear in place of the image or video while the browser is loading it or when the browser has trouble finding or displaying the visual data. Alternatively, images and videos may be referenced from parent Web pages using the following code: <a href=URL>[hyperlink text]</a>, where the optional [hyperlink text] provides the highlighted text that describes the object pointed to by the hyperlink, in this case an image or video.
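The two publication forms can be told apart while parsing a page. The following Python sketch is an illustration only, not the system's actual parser: the class name MediaLinkParser and the extension list MEDIA_EXTENSIONS are assumptions introduced here, and only the standard-library html.parser module is used.

```python
from html.parser import HTMLParser

# Hypothetical list of file extensions treated as images or videos (an assumption).
MEDIA_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".mpg", ".mov", ".avi")


class MediaLinkParser(HTMLParser):
    """Collects (form, URL, text) triples for inlined and referenced media."""

    def __init__(self):
        super().__init__()
        self.found = []      # list of (form, url, associated text)
        self._href = None    # URL of the media hyperlink currently open, if any
        self._text = []      # pieces of hyperlink text collected so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Inlined form: the alt text is optional.
            self.found.append(("inlined", attrs["src"], attrs.get("alt") or ""))
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(MEDIA_EXTENSIONS):
            # Referenced form: start collecting the hyperlink text.
            self._href, self._text = attrs["href"], []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.found.append(("referenced", self._href, "".join(self._text).strip()))
            self._href = None


parser = MediaLinkParser()
parser.feed('<img src="pics/dog37.jpg" alt="my dog"> '
            '<a href="clips/launch.mpg">shuttle launch</a>')
for form, url, text in parser.found:
    print(form, url, text)
```

Hyperlinks whose target does not end in one of the listed extensions are ignored in this sketch; a real crawler would more likely decide by MIME type.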
The terms are extracted from the image and video URLs, alt text and hyperlink text by chopping the text at non-alphabetic characters. For example, the URL of an image or video has the following form
where [...] denotes an optional argument. For example, several typical URLs are
Terms are extracted from the directory and file strings using
and
where
where
. For example,
The extracted terms are used in two ways. First, they allow text-based searching via string matching. After extracting the terms, the system indexes the images and videos directly using inverted files. The process of file inversion is illustrated in Table 2. For example, if the user enters the query term ``animal'', the images and videos with IMID = 259503 and 106441 are retrieved. Second, certain terms, called key-terms, are used to map the images and videos to subject classes, as we explain shortly.
| IMID | Terms |
| 121216 | nasa, clipart |
| 259503 | animal, dog |
| 151285 | astronomy, nasa |
| 106441 | animal, clipart |

| Terms | IMID |
| animal | 259503, 106441 |
| astronomy | 151285 |
| clipart | 121216, 106441 |
| dog | 259503 |
| nasa | 121216, 151285 |
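The file inversion shown in Table 2 can be sketched in a few lines of Python. This is a minimal illustration using the table's own four records; the function name invert and the in-memory dictionaries are assumptions, not the storage format used by the system.

```python
from collections import defaultdict

# Forward index copied from Table 2: IMID -> terms.
forward_index = {
    121216: ["nasa", "clipart"],
    259503: ["animal", "dog"],
    151285: ["astronomy", "nasa"],
    106441: ["animal", "clipart"],
}


def invert(forward):
    """Build the inverted file: term -> list of IMIDs, in insertion order."""
    inverted = defaultdict(list)
    for imid, terms in forward.items():
        for term in terms:
            if imid not in inverted[term]:   # avoid duplicate postings
                inverted[term].append(imid)
    return dict(inverted)


inverted_index = invert(forward_index)
print(inverted_index["animal"])           # [259503, 106441]
print(inverted_index.get("rocket", []))   # [] (term not in the archive)
```

A query term is then answered with a single dictionary lookup rather than a scan over all records.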
A directory name is a phrase extracted from the URLs that groups images and videos by location on the Web. The directory name consists of the directory portion of the URL, namely,
. For example,
The directory names are also used by the system to map images and videos to subject classes.
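As a rough illustration of how terms and a directory name might be derived from a URL, the following Python sketch splits the URL path into its directory and file parts and chops them at non-alphabetic characters. The function names, the lowercasing of terms, the dropping of the file extension, and the example URL are all assumptions made for this sketch rather than details taken from the system.

```python
import posixpath
import re
from urllib.parse import urlparse


def chop(text):
    """Split text at non-alphabetic characters and lowercase the pieces."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def directory_name(url):
    """Return the directory portion of the URL path, without slashes at the ends."""
    return posixpath.dirname(urlparse(url).path).strip("/")


def url_terms(url):
    """Chop the directory and file strings of a URL into terms."""
    path = urlparse(url).path
    stem, _extension = posixpath.splitext(path)   # assumption: ignore the extension
    return chop(stem)


url = "http://www.example.com/space/solar-system/saturn02.gif"   # hypothetical URL
print(directory_name(url))   # space/solar-system
print(url_terms(url))        # ['space', 'solar', 'system', 'saturn']
```

The same chop function could be applied unchanged to alt text and hyperlink text.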
A key-term is a manually identified term that corresponds to one or more subject classes. The key-term dictionary contains the set of key-terms and their corresponding mappings to subject classes. We build the key-term dictionary in a semi-automated process. In the first stage, the term histogram for the image and video archive is computed. Then the terms are ranked by frequency and presented for manual assessment, so that the most frequent terms are inspected first. The goal of the manual assessment is to determine whether a term can be added to the key-term dictionary. To make the decision, we consider the descriptive ability of the term and its possible correspondence to one or more subject classes. Terms with multiple meanings make poor key-terms. For example, the term ``rock'' is not a good key-term because it may refer to stone, to rock music, or to several other things. Once a term and its mappings are added to the key-term dictionary, it applies to all existing and new images and videos.
| Table 3(b): Descriptive key-terms and mappings | | |
| key-term | count | mapping to subject |
| planet | 1175 | astronomy/planets |
| music | 922 | entertainment/music |
| texture | 831 | graphics/textures |
| aircraft | 458 | transportation/aircraft |
| travel | 344 | travel |
| astronomy | 320 | astronomy |
| gorilla | 273 | animals/gorillas |
| starwars | 204 | entertainment/movies/films/starwars |
| soccer | 195 | sports/soccer |
| dinosaur | 180 | animals/dinosaurs |
| porsche | 139 | transportation/automobiles/porsches |
Table 3 lists a sample of the terms extracted in the initial experiments, in which 500,000 images and videos were cataloged. Notice in Table 3(a) that some of the most common terms are not sufficiently descriptive of the visual information, e.g., the terms ``image'' and ``picture''. However, the terms in Table 3(b) clearly indicate the subject of the images and videos, e.g., the terms ``aircraft'', ``gorilla'' and ``porsche''. These key-terms are extremely useful for classifying the images and videos into subject classes. For example, we added the key-terms and corresponding subject mappings illustrated in Table 3(b) to the key-term dictionary.
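The two stages of building and applying the key-term dictionary can be sketched as follows. This Python fragment is illustrative only: the per-image term lists and their IMIDs are invented for the example, the two dictionary entries are taken from Table 3(b), and the manual assessment step is represented simply by writing the dictionary out by hand.

```python
from collections import Counter

# Hypothetical terms extracted per image/video (IMIDs and terms are made up).
terms_by_imid = {
    1: ["image", "planet", "saturn"],
    2: ["image", "gorilla"],
    3: ["picture", "planet"],
    4: ["image", "rock"],
}

# Stage 1: term histogram over the archive, ranked by frequency for manual review.
histogram = Counter(t for terms in terms_by_imid.values() for t in terms)
ranked = histogram.most_common()

# Stage 2: the reviewer keeps only descriptive, unambiguous terms.
# "image" and "rock" are rejected; the mappings below follow Table 3(b).
key_term_dictionary = {
    "planet": ["astronomy/planets"],
    "gorilla": ["animals/gorillas"],
}


def subjects_for(terms, dictionary):
    """Return the sorted subject classes reached by any key-term among `terms`."""
    return sorted({s for t in terms for s in dictionary.get(t, [])})


subjects_by_imid = {
    imid: subjects_for(terms, key_term_dictionary)
    for imid, terms in terms_by_imid.items()
}
print(ranked[:2])
print(subjects_by_imid)
```

Because the dictionary is consulted at mapping time, a key-term added later applies to previously cataloged items as soon as the mapping is recomputed.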
In a similar process, the directory names are inspected and manually mapped to subject classes. Very often an entire directory of images/videos corresponds to a particular topic and can be mapped to one or more subject classes. Similar to the process for key-term identification, the system computes the histogram of directory names and presents it for manual inspection. A directory that sufficiently groups images and videos related to a particular topic is then mapped to the appropriate subject classes.
In Section 6.1, we demonstrate that these methods of key-term and directory name identification and subject mapping provide excellent performance in classifying the images and videos by subject. We also hope that by incorporating some results of natural language processing [9], in addition to using visual features, we can further improve and automate the subject classification process.
A subject class, or subject, is an ontological concept that represents the semantic content of an image or video, e.g., ``basketball''. A subject taxonomy is an arrangement of subject classes into an is-a hierarchy. We are developing a new subject taxonomy for image and video subject matter, a portion of which is illustrated in Figure 4. The taxonomy grows in the process of inspecting the terms for key-term mappings, as described above. For example, when a new and descriptive term such as ``basketball'' is detected and added to the key-term dictionary, we add a corresponding subject class to the taxonomy if it does not already exist, e.g., ``sports/basketball''.
Figure 4: Portion of the image and video subject taxonomy.
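One simple way to represent such an is-a hierarchy is a nested dictionary keyed by class name, with slash-separated paths naming the classes. The Python sketch below is an assumption about representation made for illustration; the function names add_subject and ancestors are hypothetical and do not come from the system.

```python
def add_subject(taxonomy, path):
    """Insert a subject such as 'sports/basketball', creating missing ancestors.

    Returns True if any new class was created, False if the path already existed.
    """
    node, created = taxonomy, False
    for name in path.split("/"):
        if name not in node:
            node[name] = {}
            created = True
        node = node[name]
    return created


def ancestors(path):
    """List the is-a chain from the most general class down to `path`."""
    parts = path.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


taxonomy = {}
add_subject(taxonomy, "sports/soccer")
add_subject(taxonomy, "sports/basketball")   # creates only the 'basketball' node
print(taxonomy)                               # {'sports': {'soccer': {}, 'basketball': {}}}
print(ancestors("sports/basketball"))         # ['sports', 'sports/basketball']
```

With this layout, an item mapped to ``sports/basketball'' is also reachable from the more general class ``sports''.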
As described above, each retrieved image and video is processed and the following information tables are populated:
where special (non-alphanumeric) data types are given as follows:
The automated assignment of TYPE to the images and videos using visual features is explained in Section 5.2. Queries on the database tables IMAGES, TYPES, SUBJECTS and TEXT are performed using standard relational algebra. For example, the query ``give me all records with TYPE = video, SUBJECT = news and TERM = basketball'' can be carried out in SQL as follows, joining the tables on IMID:
SELECT TYPES.IMID
FROM TYPES, SUBJECTS, TEXT
WHERE TYPES.IMID = SUBJECTS.IMID AND TYPES.IMID = TEXT.IMID
AND TYPE = 'video' AND SUBJECT = 'news' AND TERM = 'basketball';
However, content-based queries, which involve table FV, require special processing, which is discussed in more detail in Sections 4.2 and 5.
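For the purely relational case, the example query over TYPES, SUBJECTS and TEXT can be tried out with an in-memory SQLite database. The sketch below is a toy reconstruction: the two-column layout of each table, the sample rows, and the use of SQLite are assumptions made for illustration, since the actual table definitions are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed minimal schemas: each table associates an IMID with one attribute.
# "TEXT" and "TYPE" are quoted only to keep them clearly apart from SQL type names.
conn.executescript("""
    CREATE TABLE TYPES    (IMID INTEGER, "TYPE" TEXT);
    CREATE TABLE SUBJECTS (IMID INTEGER, SUBJECT TEXT);
    CREATE TABLE "TEXT"   (IMID INTEGER, TERM TEXT);
""")

# Made-up sample rows; only IMID 1 satisfies all three conditions.
conn.executemany("INSERT INTO TYPES VALUES (?, ?)",
                 [(1, "video"), (2, "image"), (3, "video")])
conn.executemany("INSERT INTO SUBJECTS VALUES (?, ?)",
                 [(1, "news"), (2, "news"), (3, "sports")])
conn.executemany('INSERT INTO "TEXT" VALUES (?, ?)',
                 [(1, "basketball"), (2, "basketball"), (3, "basketball")])

# The three tables are joined on IMID so that all conditions refer to the same item.
rows = conn.execute(
    """
    SELECT t.IMID
    FROM TYPES AS t
    JOIN SUBJECTS AS s ON s.IMID = t.IMID
    JOIN "TEXT" AS x ON x.IMID = t.IMID
    WHERE t."TYPE" = ? AND s.SUBJECT = ? AND x.TERM = ?
    """,
    ("video", "news", "basketball"),
).fetchall()
print(rows)   # [(1,)]
```

Without the join conditions on IMID, the query would combine rows belonging to different images and videos.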