Effective Internet Search


Effective Internet Search: Excerpts

Note: Internal book hyperlinks, except those internal to this page, have been deactivated below.

Computer programmers will recognize that term variation operators are included in what are known as regular expressions in programming languages such as Perl. (Actually, the term "regular expression" comes from formal language theory, a branch of mathematics and theoretical computer science.) In general, Internet search engines do not yet have the ability to handle all the kinds of term variations that can be handled by regular expressions.

More on regular expressions: ugrad-www.cs.colorado.edu/unix/regex.html

Term variation filters [2]

  1. Wildcards, substrings, stemming [3]:
    a. Wildcards: operators that act as placeholders for yet-to-be-determined characters or groups of characters in a word.
An asterisk (*) placed at the end of a word stands for zero or more characters, up to whatever limit the search engine imposes. Thus, in various search engines, the term sound* will match terms such as sounding, soundproof, sounded, and so on.
The Find commands of word processors have many more sophisticated wildcard options than those found in most of today's Internet search engines.
In certain search engines, the asterisk can be used in the middle of a word, for, say, any 0 to 3 characters in that position in the string of characters.

Sp*l will match words such as spoil, spill, and spool.

In some search engines, a single character, such as the question mark (?) or the percent (%) character, can be used as a wildcard that corresponds to an individual character in a specific place in a word string.

In the string so??d, any characters can be used in the third and fourth positions. Possible matches include solid and sowed.

Placing an asterisk at the end of a word is also known as stubbing, which is a particular kind of wildcarding. The word stemming is sometimes used for this as well, in its more restrictive sense. Truncation is another term for wildcarding.

In some search engines, you can place wildcards anywhere in the search string, and you can use multiple wildcards in a single word.

Type an asterisk at the start or end of a word fragment to obtain words that end with or start with the specified characters, respectively. Thus, the query *man returns documents containing the words man, woman, Spiderman, Oman, and so on.

Type ? (question mark) to match a single positional character. Thus, the query car? will return documents containing words like cart, card, care, and Cary.
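
For readers who program, the wildcard behavior described above can be imitated with a regular expression. The following is only a minimal sketch: the helper names wildcard_to_regex and matching_words are invented for illustration, the asterisk is assumed to stand for zero or more word characters with no upper limit, and the question mark for exactly one. Real search engines may apply different limits.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a search-style wildcard pattern into a compiled regex.

    Assumption: '*' means zero or more word characters and '?' means
    exactly one; every other character must match literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matching_words(pattern, words):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]

print(matching_words("sound*", ["sounding", "soundproof", "sounded", "south"]))
print(matching_words("sp*l", ["spoil", "spill", "spool", "spoon"]))
print(matching_words("so??d", ["solid", "sowed", "sold"]))
print(matching_words("*man", ["man", "woman", "Oman", "many"]))
print(matching_words("car?", ["cart", "card", "car", "carts"]))
```

Matching is done against whole words, so car? does not match car (too short) or carts (too long).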

    b. Substrings: a substring match behaves like a wildcard before and/or after a particular part of a word, because the match is made on a subset of the characters in the word.
The substring oma occurs inside the word woman. Using wildcards, with the asterisk taking the place of 0 to n characters, this same substring could be represented as *oma*.
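
The equivalence between a substring match and the wildcard form *oma* can be checked in a few lines. This is an illustrative sketch only: the word list is made up, and Python's standard fnmatch module stands in for a search engine's wildcard handling.

```python
from fnmatch import fnmatchcase

# Hypothetical word list used only for this illustration.
WORDS = ["woman", "aroma", "tomato", "man", "moan"]
fragment = "oma"

# Plain substring test.
by_substring = [w for w in WORDS if fragment in w]

# The same test written as a wildcard pattern, *oma*.
by_wildcard = [w for w in WORDS if fnmatchcase(w, "*" + fragment + "*")]

print(by_substring)
print(by_substring == by_wildcard)
```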
    c. Stemming: Stemming operators are somewhat similar to wildcards at the ends of words. In fact, some search engines appear to define stemming in exactly this way, in which case the term stubbing is also used.
Of the search engines featured in this book, only MSN Search offers a word stemming choice on its advanced search form. Google applies stemming automatically, in the sense of word stubbing. Stemming can also be suppressed where desirable.

In a broader sense, however, stemming allows finding other kinds of variations on the same word, due to differences in tense or mood, or a word being the verb equivalent of a particular noun.

When the word think is entered as the search term, stemming will cause the search engine to find various connected nouns and verbs, such as thought, thinker, and thoughtless, as well.

You will get the word flew when searching for the word fly, along with flies, flying, flight, and so on.
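
The difference between stubbing and stemming in the broader sense can be sketched in code. The example below is a deliberately naive illustration and not how any particular search engine works: the suffix rules and the small IRREGULAR table are assumptions made up for this sketch. Regular endings are stripped by rule, while irregular forms such as flew need a lookup table.

```python
# Hypothetical lookup table for irregular forms (an assumption for this sketch).
IRREGULAR = {"flew": "fly", "flown": "fly", "thought": "think"}

def stem(word):
    """Reduce a word to a crude stem using a lookup table and suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # 'flies' -> 'fly'
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    # Strip a common ending, keeping at least three characters of stem.
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def same_stem(query, word):
    """True when the query and the word reduce to the same stem."""
    return stem(query) == stem(word)

for candidate in ["flies", "flying", "flew", "flight"]:
    print(candidate, same_stem("fly", candidate))
```

In this sketch flight is not matched to fly, because no rule or table entry connects them. Finding such related nouns requires a richer dictionary than simple suffix stripping.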

  2. Different spellings or phonetic matching [4]:
    a. Different spellings: The ability to automatically suggest, and even automatically include, different spellings of the same word helps to increase the number of relevant findings in some cases.
Type matherboard into an appropriate text box, and the Google findings page will respond with: "Did you mean: motherboard?"
Another kind of spelling assistance would convert between American and British English spellings, as in behavior vs. behaviour, or humor vs. humour.
This type of spelling assistance across different versions of English was not noticed in any of the search engines featured in this book.

More on spelling assistance: www.brightplanet.com/deepcontent/tutorials/Search/part7.asp#topic27

    b. Phonetic matching: This is matching based on the sound of a word, as pronounced in some dialect, rather than on its spelling. The search engines sampled in this book do not support phonetic matching, except perhaps when it is connected with spelling correction.

Entering Baylin with phonetic matching will cause the like-sounding words Bailin and Beilin to give rise to findings as well.
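
One well-known way to implement phonetic matching is the Soundex code, which maps like-sounding names to the same four-character key. The sketch below is a simplified version of American Soundex, offered only as an illustration of the idea. Nothing in this book indicates that any featured search engine uses it.

```python
# Consonant groups that Soundex treats as sounding alike.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return a four-character Soundex key such as 'B450'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # vowels have no code, so they reset the comparison
    return (first.upper() + "".join(encoded) + "000")[:4]

for name in ["Baylin", "Bailin", "Beilin"]:
    print(name, soundex(name))
```

All three names share the key B450, so a search on any one of them could be widened to include the other two.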

  3. Formatting masks [5]: Formats are often used in programs to cause data to be displayed to the user in a way that enhances readability. Formatting masks refer to the "superficial" appearance characteristics of terms.

A North American phone number consists of ten digits, in the form "(999) 999-9999", where '9' is a placeholder for any of the digits 0 to 9. In theory, this formatting mask could be used to select data on the Internet, where only documents containing a string of text consisting of exactly ten successive digits formatted with brackets, space, and dash, as in this example, would be found.
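
In programming terms, such a mask translates directly into a regular expression. The sketch below assumes, as in the example above, that '9' is the only placeholder character and that everything else in the mask is literal. The function name mask_to_regex and the sample text are invented for illustration.

```python
import re

def mask_to_regex(mask):
    """Turn a formatting mask into a regex: '9' means any digit 0 to 9,
    and every other character must appear literally."""
    body = "".join(r"\d" if ch == "9" else re.escape(ch) for ch in mask)
    # Refuse matches that are glued to extra digits on either side.
    return re.compile(r"(?<!\d)" + body + r"(?!\d)")

PHONE = mask_to_regex("(999) 999-9999")

text = "Call (613) 555-0142 today, not 6135550142 or 613-555-0142."
print(PHONE.findall(text))
```

Only the number written with parentheses, space, and dash is found. The other two contain the same ten digits but do not fit the mask.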

Entering a word in quotes, like a phrase, will cause certain search engines to become case sensitive and thereby distinguish between uppercase and lowercase letters.

Entering "Idea" as opposed to idea will result in matches to documents only when the "I" is capitalized.

In practical terms, formats are seldom used when searching for text in document files. They are basically absent from all search engines, except perhaps to find uppercase letters when required in certain positions of a search term.

Some search engines also respect the use of diacritical marks (symbols placed above or below individual letters in a word) in languages other than English. These languages use the same basic characters as English, but add diacritical marks (signs, accents, cedillas, etc.) to indicate different sounds or values of a letter, or to add a particular vowel before or after a consonant. These marks could be considered special letter formats, selected using "formatting masks."
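
Whether a search respects or ignores diacritical marks can be illustrated with Unicode normalization. The sketch below relies on Python's standard unicodedata module. The function names and the respect_diacritics switch are assumptions made for this illustration, not features of any particular search engine.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, so that 'résumé' becomes 'resume'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def terms_match(query, word, respect_diacritics):
    """Compare two terms, either honoring or ignoring diacritical marks."""
    if respect_diacritics:
        # Normalize both sides so equivalent encodings compare equal.
        return (unicodedata.normalize("NFC", query)
                == unicodedata.normalize("NFC", word))
    return strip_diacritics(query) == strip_diacritics(word)

print(terms_match("resume", "résumé", respect_diacritics=False))
print(terms_match("resume", "résumé", respect_diacritics=True))
```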

However, formatting or appearance is sometimes used to find matches for multi-media files.

Advanced image search interfaces generally allow matches to be made by image color, background pattern, or screen resolution (pixel density).

Practice Exercise:

The word satellite must occur with an uppercase "S," as in Satellite.

  • AlltheWeb: Case sensitivity is unavailable.
  • AltaVista: It is unclear whether case sensitivity is available.
  • Copernic: Case sensitivity is available as a checkbox for search within results.
  • Google: Case sensitivity is unavailable.
  • MSN Search: Case sensitivity is unavailable.

 
  4. Ignored words or characters [6]: Often called stop words, these are words that are ignored when matching terms to documents. They usually include articles (a, the), prepositions (at, to, in), various forms of the verb "to be" (been, is), and other "parts of speech." More examples: how, which, if, la, de, on, who, where, and single-letter words.
In addition to stop words, one can refer to "stop punctuation," or, more generally, to "stop characters." These are punctuation marks, special keyboard characters, or numerical digits that are ignored during the match process.

The colon (:) and digit in the phrase overview: conclusion 2, may be ignored, and treated as if they did not exist. Thus, the search is really just against the phrase overview conclusion, without the : and the 2.

Search engines apply these features automatically and often do not allow you to control them. Many, including the search engines featured in this book, do not list their stop words or stop characters either. However, whether a given word is ignored is easily verified by entering it as a search filter.

Some search engines allow you to override the disregard of stop words by placing a plus sign (+) in front of the stop word.

+the in Google will cause the search engine to include the word the when making matches.

If to, the, and in are stop words in a given search engine, and punctuation characters and digits are ignored, then the phrase hello world will be treated as equivalent to the phrase hello to the world, in +/-2020. This occurs because the following will all be ignored:

  • prepositions: to, in
  • article: the
  • special characters: comma (,), plus (+), slash (/), and minus (-)
  • digits: 2020

Once you remove the above from hello to the world, in +/-2020, you reduce the string to just hello world.
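
The reduction just described can be sketched in a few lines of code. This is an illustration under stated assumptions: the stop word set (to, the, in) is chosen to fit the example, every character other than a letter or a space is treated as a stop character, and the function name reduce_query is invented for this sketch. Actual search engines keep their own, usually unpublished, lists.

```python
import re

# Assumed stop words for this illustration only.
STOP_WORDS = {"to", "the", "in"}

def reduce_query(phrase, stop_words=STOP_WORDS):
    """Drop stop characters (punctuation, digits) and stop words."""
    # Replace anything that is not a letter or whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", phrase)
    kept = [w for w in cleaned.lower().split() if w not in stop_words]
    return " ".join(kept)

print(reduce_query("hello to the world, in +/-2020"))
print(reduce_query("overview: conclusion 2"))
```

Both the full phrase and the plain phrase hello world reduce to the same string, which is why the search engine treats them as equivalent.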


Internal Book Cross-Links
  1. Cross-links for this section:
    • Reference Section 6.1: further explains how the five featured search engines apply concepts from this section
    • Chapter 4: provides a high-level explanation of the search filter entry interfaces
  2. Cross-links with: Reference Section 6.1: Term variation filters
  3. Links for wildcards or substrings or stemming:
  4. Cross-links with: Reference Section 6.1: Different spellings or phonetic matching
  5. Cross-links with: Reference Section 6.1: Formatting masks
  6. Cross-links with: Reference Section 6.1: Ignored words or characters

 



Effective Internet Search: E-Searching Made Easy!           Baylin Systems, Inc., 2006