Monday, March 20, 2006
The Government Metadata Controversy
At the end of last year, Government Computer News published the article, "Metadata Not Essential For Search", based on the results of a Request for Information asking about "Efficient and Effective Information Sharing".
The report said that the results "overwhelmingly" support the hypothesis that "... for the majority of government information, exposing it to indexing with commercial search technology is sufficient to meet the information categorization, dissemination, and sharing needs of the public and as required by law and policy."
Of course, the word "overwhelmingly" was used because 56% of the respondents supported the hypothesis. One respondent stated that "Search technology has progressed far enough so that manual categorization and metadata tagging of textual documents is no longer necessary and any perceived gain in accessibility does not justify the cost of categorization."
This is the "Google as a Silver Bullet" argument, but it raises some interesting questions.
Precise vs. "Good Enough" Results - Can or should the government settle for "good enough" results when indexing its non-marked-up data with COTS tools out of the box?
Availability of Data - The metadata markup process (when it is not automated) may limit how soon that data is available. This is an issue that I think the government is painfully aware of.
Hybrid Solutions? A Focus on Rules for Pattern Recognition, Auto-Tagging? - This is what I call "guess metadata": metadata determined by a computer process rather than a man-in-the-loop. One thing the report didn't really focus on is how much time and effort goes into defining pattern-recognition rules for concepts and keywords so that metadata can be marked up automatically during search engine indexing. These rules are applied at indexing time, and they increase the likelihood of good results compared with a "search engine out of the box" solution.
From a business perspective, looking at the results of the respondents in this paper, we do need to recognize the pressure to get data out faster with minimal impact on the organizations making it available. In doing so, we can focus on rules and a pattern-recognition process at pre-indexing time. At indexing time, this process can tag the data with agreed-upon metadata standards, even tying elements of the data to classification taxonomies and ontologies. Of course, IMHO, any automated process still produces "guess metadata". Is "guess metadata" good enough?
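To make the "rules at pre-indexing time" idea concrete, here is a minimal sketch in Python. The rule set, taxonomy terms, and sample document are entirely hypothetical (nothing here comes from the RFI or any agency standard); the point is simply that pattern-recognition rules can stamp "guess metadata" onto a document before it ever reaches the search engine's indexer.

```python
import re

# Hypothetical rule set: each regex pattern maps to a controlled-vocabulary term.
# In a real system these rules would be derived from an agreed-upon metadata
# standard or classification taxonomy, not hard-coded like this.
TAGGING_RULES = {
    r"\b(hurricane|flood|earthquake)\b": "disaster-response",
    r"\b(budget|appropriation)s?\b": "fiscal-policy",
    r"\b(passport|visa)s?\b": "travel-documents",
}

def guess_metadata(text):
    """Return the taxonomy terms whose patterns match the document text."""
    tags = []
    for pattern, term in TAGGING_RULES.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.append(term)
    return tags

# At pre-indexing time, each document gets its "guess metadata" attached
# before being handed to the search engine's indexer.
doc = "FEMA guidance on flood and hurricane appropriations for FY06."
print(guess_metadata(doc))  # ['disaster-response', 'fiscal-policy']
```

The result is still "guess metadata" in the sense above: the tags are only as good as the rules, and writing and maintaining those rules is exactly where the time and effort goes.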
Comments:
Elsewhere in the government, this article was a real pain, because some misconstrued it even further as proof that the government doesn't need to do metadata tagging at all.
All of this stemmed from an RFI that was issued. The impression I got was that many of the search vendors responded that extra metadata wasn't necessary because their products are accustomed to working in environments where it isn't available.
The question in the RFI was whether or not their products required such metadata. The conclusion was substantially broader than just search-related tools. I think this is a simple misunderstanding that comes from broadly drawn conclusions.
Yep, I agree - the RFI results were very much vendor-driven. If you look at the paper, not a lot of government sources responded, but a lot of "industry" responded.
In your posting, you stated "One thing that the report didn't really focus on was that much time and effort is spent defining pattern recognition rules for concepts/keywords for automated markup of metadata in the search engine indexing process."
The EEIRS RFI response analysis report did address this, specifically in Appendix B. As that appendix pointed out, many of the approaches for automated categorization or tagging are based on fully automated statistical analysis of text and do not require manual creation of pattern-recognition rules. This was one of the more interesting findings of the report.
For additional information, this IEEE Computer journal article also shows how it is possible to extract high-quality metadata from a corpus using automated processes based solely on statistical analysis of text.
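For readers who want to see what "fully automated statistical analysis of text" can look like in its simplest form, here is a small sketch (plain Python, toy corpus, not drawn from the RFI report or the IEEE article): TF-IDF scoring surfaces the terms that are frequent in one document but rare in the rest of the corpus as candidate subject metadata, with no hand-written pattern rules at all.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def top_terms(corpus, doc_index, k=5):
    """Rank one document's terms by TF-IDF against the rest of the corpus."""
    docs = [Counter(tokenize(d)) for d in corpus]
    tf = docs[doc_index]
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)   # document frequency
        idf = math.log(len(docs) / df)           # rarer terms score higher
        scores[term] = (count / total) * idf     # tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpus: the top-scoring terms for a document become candidate "subject"
# metadata without any manually created pattern rules.
corpus = [
    "notice of proposed rulemaking on migratory bird hunting seasons",
    "annual budget justification for the fish and wildlife service",
    "guidance on electronic filing of migratory bird permits",
]
print(top_terms(corpus, doc_index=0, k=3))
```

Production systems layer richer statistical models (classifiers, clustering, entity extraction) on top of this idea, but the principle is the same: the metadata falls out of the statistics of the corpus rather than out of manually authored rules.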
Relating to the last posting, thanks for your comments! I said the report didn't really focus on it: the topic is mentioned in Appendix B, but if you look at the survey, the questions and the results center on "preparation of content," not on preparing rules for automated tagging. I can certainly see that some automated categorization can happen, but I can also see the value of manual categorization for specialized applications and communities of interest.
Thank you for the IEEE article as well - this is a subject that I am very interested in.