Black Box Betty by Frederick Rustam

Tell us what you thought of the December 2007 issue.

Moderator: Editors

Robert_Moriyama
Editor Emeritus
Posts: 2379
Joined: December 31, 1969, 08:00:00 PM
Location: Toronto, Ontario, Canada
Contact:

Re: Black Box Betty by Frederick Rustam

Post by Robert_Moriyama »

...If there was one piece that amazed me, it was the concept of a secret law. I almost laughed at the absurdity of this until I googled it. There are, in fact, secret laws in the U.S. rationalized in the battle against Terrorism. My gleeful face then turned into a visage of horror. How this could ever be constitutional amazes me to no end. I am, in a word, gobsmacked.
Silly boy. BushCo is only suspending your rights to protect your freedom. And anyway, the President is allowed to violate any law (and apparently, the President and Congress are allowed to implement any law) so long as he / they says the magic words "9-11" and / or "terror", whether the issue at hand is remotely connected to terrorism or not. (Fortunately, "9-11" has failed as a magic incantation to win the Presidency for Rudy Giuliani. Dubya wore it out last time around (with the "changing horses in midstream" thing).)

(In Canada, some terrorism suspects have been in jail for years without trial. Neither they, nor their lawyers, have been allowed to know any details of their alleged crimes as this might compromise the ability of the authorities to use the same information sources and methods in the future. I mention this so you know the U.S. isn't the only place where this kind of lunacy has been implemented.)

RM
You can't wait for inspiration. You have to go after it with a club.

Jack London (1876-1916)

Re: Black Box Betty by Frederick Rustam

Post by Robert_Moriyama »

"Sir, you are under arrest!"

"Me? Why?!?"

"I can't tell you. It's a secret..."
Actually, in Kafka's unpublished masterwork, "Trial by Metamorphosis", the protagonist was prosecuted as a suspected cockroach sympathizer. This was based upon roach-like behavior reported by his neighbors, whose long-running feud with him over an untrimmed privet hedge was considered by the authorities to be irrelevant. Also, there were rumors that his birth name had a distinctly roachish ring to it, until he had it changed to Smythe-Baddely-Dunne-By.

Of course, none of this could be disclosed to Mr. Smythe-Baddely-Dunne-By (born Helmut Blattella) because it would endanger the patriotic informants who had turned him in.

RM

RM

(RM ... RM? Oh my God -- I think this Forum topic is being tapped! Wait ... someone is pounding on the door! I think it's
Last edited by Robert_Moriyama on January 10, 2008, 12:39:12 PM, edited 1 time in total.

A Note from "Betty's" author

Post by Robert_Moriyama »

Dear Editor:

I was inspired to write "Black Box Betty" after I read "Searching versus Finding: Why Systems Need Knowledge to Find What You Really Want" by W. A. Woods of Sun Microsystems Laboratories (it's on the Web). Dr. Woods nicely sums up our Web search frustrations:
-----
Nearly everyone is familiar with the experience of searching with a Web search engine and with using a [browser's text search] to search a particular Web site once you get there.... After you have a list of hits, you typically spend a significant amount of time following links, waiting for pages to download, reading through a page to see if it has what you want, deciding that it doesn't, backing up to try another link, deciding to try another way to phrase your request, etc. Eventually you may find what you want, or you may ultimately give up and decide that you can't find it. Why is this so difficult?
-----

Why, indeed. Woods has developed a complex system which locates specific document passages which are relevant to keyword subject searches. He writes that his system is applicable only to enterprise documents (pages in a single website or text files in an intranet) and that it's not scalable up to entire-Web searching. However, one of the algorithmic features of his system could be used to improve our subjective judgment of relevance in entire-Web search results.

That algorithmic feature is based upon a principle which has been known for decades: the closer the searchwords of a multiple-searchword search occur to each other as textwords within text, the more likely they are to be semantically correlated -- i.e., subject-related to each other rather than merely co-occurring unrelatedly in the text.

(Consider a search for "West Virginia coal mining state safety regulations." This complex search is, so to speak, a jewel with three facets: (1) coal mining, (2) West Virginia, (3) state safety regulations. In a Web search for this subject, we seek webpages in which all three facets appear together -- either as the webpage's main subject, or in a passage of at least sentence or paragraph length (see below). We don't want to retrieve a webpage in which some aspect of West Virginia occurs in one place and the subject of "Pennsylvania coal mining safety regulations" occurs in another place. That's an example of co-occurrence without correlation. Yet, when we google our best for <coal>, we retrieve the above "false drop" webpage because it matches our searchword prescription. Yes, we could exclude Pennsylvania from our search -- but how about all the other coal mining states? Google allows a maximum of ten words per search.)
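
The proximity principle can be illustrated with a minimal Python sketch. It is not Woods's system or any search engine's code; the function names (tokenize, min_span) and the two sample texts are assumptions made up for illustration. It measures the smallest window of words that contains every searchword: a small window suggests the words are correlated, a large one suggests mere co-occurrence.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def min_span(text, searchwords):
    """Return the size (in words) of the smallest window that contains
    every searchword, or None if any searchword is missing."""
    wanted = {w.lower() for w in searchwords}
    if not wanted:
        return None
    tokens = tokenize(text)
    counts = {}   # searchwords currently inside the window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in wanted:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink the window from the left while it still holds every searchword.
        while len(counts) == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            ltok = tokens[left]
            if ltok in counts:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    del counts[ltok]
            left += 1
    return best

# Hypothetical sample texts: one correlated passage, one "false drop".
correlated = "West Virginia coal mining safety regulations are enforced by the state."
false_drop = ("Tourists visit West Virginia for its rivers and mountains every summer. "
              "Pennsylvania coal mining safety regulations were revised.")

words = ["virginia", "coal", "safety"]
print(min_span(correlated, words))  # 4
print(min_span(false_drop, words))  # 12
```

Both texts match the searchword prescription, but the window in the false drop is three times as wide. A raw word count is only one possible proximity measure; sentence and paragraph distance are others.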

SPECIFIC-PASSAGE RETRIEVAL
Current relevance evaluation of retrieved webpages by the Web search engines (Google, etc.) relies chiefly upon backlink analysis, a calculation of the quality and quantity of the links from other webpages to the retrieved webpages. This technology favors retrieved webpages which are wholly or mostly devoted to the searched-for subject. (Who links directly to a passage in a webpage?) But most of the webpages retrieved in a subject search contain only a passage about the searched-for subject. A passage may be a single sentence, a paragraph, or more. A passage may not be directly related to the overall subject of the webpage, although it's usually related in some way.

Web searchers must often be satisfied with specific passages because entire webpages about their sought subjects aren't available. However, even a brief mention of a sought subject may be useful if it verifies a searcher's assumption or contains an unanticipated-but-useful fact.

A complex system which can algorithmically retrieve specific passages within webpages probably won't be offered by the Web search engines anytime soon. But something the engines could now offer searchers would facilitate the identification of relevant passages in webpages: better "snippets." A snippet is that familiar brief excerpt of those parts of a retrieved webpage's text which contain our searchwords. Many snippets do a poor job of showing us what their webpages contain that is relevant to our searches.

We've all made successful Web searches, but we've also made unsuccessful searches which leave us thinking, "It's got to be there, somewhere." We've become desperate to retrieve any webpage passage which will answer our question. Given the broad subject coverage and immense size of the Web, it's difficult for us to face the reality that our sought subject may not exist within the indexed Web docuverse. We don't expect snippets to answer our search questions by themselves (although they can), but we want them to clearly point to relevant webpage passages. Too many snippets don't do this very well.

All the search engines use the same snippet format: a few words of context for each of our searchwords. These contextual excerpts are taken from the first place in the webpage text where each searchword occurs. But a snippet may not contain excerpts for all our searchwords because snippets have a maximum-allowed length, and this length may exclude excerpts for some of our searchwords. Google lets us use up to ten searchwords in a query, but there's no way that several searchword excerpts can be included in a Google snippet.
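
The snippet format described above can be approximated in a few lines of Python. This is only a sketch of the behavior as described in this note, not any engine's actual algorithm; the function name basic_snippet, the context width, and the length cap are assumptions chosen for illustration.

```python
import re

def basic_snippet(text, searchwords, context=2, max_words=10):
    """Build a snippet from a few words of context around the FIRST
    occurrence of each searchword, stopping at a maximum length."""
    words = text.split()
    # Normalized copies of the words (punctuation stripped) for matching.
    norm = [re.sub(r"\W+", "", w).lower() for w in words]
    excerpts = []
    used = 0
    for sw in searchwords:
        sw = sw.lower()
        if sw not in norm:
            continue
        i = norm.index(sw)  # first occurrence only
        piece = words[max(0, i - context): i + context + 1]
        if used + len(piece) > max_words:
            break  # length cap reached: remaining searchwords get no excerpt
        excerpts.append(" ".join(piece))
        used += len(piece)
    return " ... ".join(excerpts)  # "dumb" ellipses between excerpts

page = ("The city of New Orleans rebuilt its levees. Far away, a report on "
        "sewage treatment was filed. Later the disposal of debris continued.")
print(basic_snippet(page, ["orleans", "sewage", "disposal"]))
# of New Orleans rebuilt its ... report on sewage treatment was
```

The third searchword is silently dropped once the cap is reached, which is the limitation described above.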

Within a snippet, excerpts are separated by "dumb" ellipses ("..."). A dot-dot-dot doesn't tell us anything except that the two excerpts it separates are from different sentences. But where are those sentences located in the text? This information is important for our judgment of relevance.

Two snippet improvements could be offered as options on the Advanced Search pages of the Web search engines. The first improvement is longer snippets. The second improvement is indication of the occurrence proximity of our searchwords in the texts of retrieved webpages. For proximity indication, we need "smart" ellipses, not dumb ones.

SNIPPETS WITH EXPANDED CONTEXT
For the first occurrence of each searchword, the entire text sentence in which a searchword appears could be excerpted for the snippet, instead of the few contextual words now excerpted. This expanded context would be more informative than the current brief, arbitrarily-sized excerpts. Yes, longer snippets would give us more to read, but we'd be trading a little more snippet-reading time for better subject representation in our search results. Isn't this change worth a trial?
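
As a rough sketch of the expanded-context idea, the Python below excerpts the whole sentence in which each searchword first appears. The names split_sentences and expanded_snippet are hypothetical, and the sentence splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace), so abbreviations would confuse it.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def expanded_snippet(text, searchwords):
    """Return, in text order, the full sentence containing the first
    occurrence of each searchword (each sentence listed once)."""
    sentences = split_sentences(text)
    chosen = []  # indices of the sentences to excerpt
    for sw in searchwords:
        pattern = re.compile(r"\b" + re.escape(sw) + r"\b", re.IGNORECASE)
        for idx, sentence in enumerate(sentences):
            if pattern.search(sentence):
                if idx not in chosen:
                    chosen.append(idx)
                break  # only the first occurrence counts
    return [sentences[i] for i in sorted(chosen)]

page = ("New Orleans rebuilt its levees. The weather was mild. "
        "Treated sewage is discharged into the river. Nothing else happened.")
for sentence in expanded_snippet(page, ["orleans", "sewage"]):
    print(sentence)
# New Orleans rebuilt its levees.
# Treated sewage is discharged into the river.
```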

SEARCHWORD PROXIMITY INDICATION (SMART ELLIPSES)
For multiword searches, the occurrence proximity of our searchwords to each other in text indicates the text's probable relevance to our search. In too many retrieved webpages, our searchwords co-occur but are not semantically correlated. If you search <"new orleans" sewage disposal>, you'll get a lot of webpages which mention New Orleans and the words "sewage" and "disposal," but those pages aren't about New Orleans's sewage disposal. The huge size of Web indexes makes such co-occurrence without correlation a frequent search result. It's a phenomenon as natural as rain.

I reiterate: a specific passage can often provide the sought answer to a question -- such as "Where does New Orleans dispose of its treated sewage effluent?" -- and we don't need an entire webpage about it or even most of a webpage. In Web searches, we quickly evaluate the few algorithmically-calculated, "highly-relevant" webpages at the top of the search results (none may actually be relevant, though). Then, we deal with the rest of the results -- the "long tail" phenomenon -- a long stream in which flakes of gold may lie somewhere.

Without searchword proximity indication, we can't know how our searchwords co-occurred in the retrieved webpages, and we have to examine webpage after webpage in which there's actually no correlation between our searchwords. Some snippets do obviously indicate this subject nonrelevance, but too many don't.

In snippets, proximity labels (smart ellipses) could be used to separate the sentences in which all our searchwords first occur as textwords. The Web search engines could do this even for the brief excerpts they now offer searchers. These proximity labels could be used:
(1) ...P... would separate two *non-adjacent* excerpts from the same paragraph (if they were adjacent, no proximity indicator would be necessary).
(2) ...8P... would indicate that the next searchword excerpt occurred eight paragraphs beyond (for example).
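
The two labels above can be sketched in Python as follows. This is an illustration under stated assumptions, not a finished design: paragraphs are assumed to be separated by blank lines, sentences are split naively on end punctuation, the names first_hits and smart_snippet are hypothetical, and the label for an excerpt exactly one paragraph beyond (...1P...) is an extrapolation from the ...8P... example.

```python
import re

def first_hits(text, searchwords):
    """Locate the sentence holding the first occurrence of each searchword.
    Returns a sorted list of (paragraph_index, sentence_index, sentence)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    located = []
    for p_idx, para in enumerate(paragraphs):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", para.strip())
                     if s.strip()]
        for s_idx, sent in enumerate(sentences):
            located.append((p_idx, s_idx, sent))
    hits = []
    for sw in searchwords:
        pattern = re.compile(r"\b" + re.escape(sw) + r"\b", re.IGNORECASE)
        for loc in located:
            if pattern.search(loc[2]):
                if loc not in hits:
                    hits.append(loc)
                break  # first occurrence only
    return sorted(hits)

def smart_snippet(text, searchwords):
    """Join the excerpted sentences with proximity labels."""
    parts = []
    prev = None
    for p_idx, s_idx, sent in first_hits(text, searchwords):
        if prev is not None:
            prev_p, prev_s = prev
            if p_idx == prev_p:
                # Adjacent sentences need no label; non-adjacent ones get ...P...
                parts.append(" " if s_idx == prev_s + 1 else " ...P... ")
            else:
                # Number of paragraphs between this excerpt and the previous one.
                parts.append(f" ...{p_idx - prev_p}P... ")
        parts.append(sent)
        prev = (p_idx, s_idx)
    return "".join(parts)

page = ("New Orleans rebuilt its levees. The weather was mild. "
        "Crews hauled debris for disposal.\n\n"
        "An unrelated paragraph.\n\n"
        "Treated sewage is discharged into the river.")
print(smart_snippet(page, ["orleans", "disposal", "sewage"]))
# New Orleans rebuilt its levees. ...P... Crews hauled debris for disposal. ...2P... Treated sewage is discharged into the river.
```

A searcher reading this snippet can see at a glance that "sewage" first appears two paragraphs away from the other searchwords.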

EXPANDED SNIPPETS AND PROXIMITY INDICATION
Both expanded snippets and occurrence-proximity indication would improve the subjective relevance judgment of searchers by giving them a better picture of retrieved-but-unseen webpages. Searchers could give more attention to those snippets with close occurrence proximity and not waste time on the snippets in which their searchwords probably occur non-correlatively.

Of course, close searchword proximity isn't always an indication of semantic correlation and relevance. In the above New Orleans sewage disposal search, I retrieved a webpage containing this metaphorical passage:
-----
The infrastructure is still ravaged -- there is little reliable electricity, water, sewage disposal, gas or garbage collection -- Kabul feels like a huge, dusty campground version of New Orleans.
-----
But it's a rare document in which close occurrence proximity isn't an indication of a passage's relevance to our searchwords.

INNOVATION
Do the Web search engines view snippet improvement as an acceptable innovation? I've offered the above suggestions to each of the Big Three search engines. I received no responses; but on the feedback forms they warned me that might be the outcome. The Big Three are busy developing new services and absorbing other firms. Current Web search technology is bringing in big bucks -- why rock the boat? (When Yahoo! bought the AltaVista search engine, it eliminated proximity searching in AltaVista, probably because that useful feature would keep AltaVista competitive with Yahoo!'s house engine.)

The search engines have made steady advancements in algorithmic relevance determination, but their common snippet format seems immutable. Could this be because they want our restless eyes to stray to the moneymaking keyword ads on the right sides of their results pages? Perhaps the search-engine Ph.D.s believe that longer snippets would be too much for dumbed-down searchers to handle. Okay. Give most searchers short snippets, but give advanced searchers longer ones, and we'll see which format proves to be the winner.

The bottom line is this: too often, there's good info hiding somewhere in those 100 pages of search results which the Web search engines allow us to access. But where the heck is it?

Frederick Rustam