A Description of the Collaborative Research Program's Information Management Project Cluster goes here.
Project 1: Filtering Internet Information
Project 2: WikiTrust
Filtering Internet Information for Use in Biothreat Scenarios
Members: Lanbo Zhang, Yi Zhang, Carla Kuiken
A financial analyst wants to be alerted to any news that may affect the price of a stock he or she is tracking; a homeland security officer wants to be alerted to any information related to potential terror attacks; a public health analyst wants the most important and urgent information for deciding what to do next when an infectious pathogen breaks out. In all of these examples, people need an agent that can automatically pick out the information they want from a large volume of material. Adaptive information filtering is a technique that aims to achieve this goal. In an adaptive filtering system, users have relatively stable information needs, and the system must decide whether to recommend each document to a user based on how well the document matches the user profile (the user's information needs). Users can give feedback on the delivered documents to help the system improve its performance.
The core of an adaptive filtering system is learning user profiles, which is a challenge for new users who have provided very little feedback (training data). To help address this challenge, we proposed a novel user interaction mechanism based on faceted feedback, which allows users to give feedback on document facets rather than on the documents themselves. Our experimental results show that this mechanism is effective in improving filtering performance.
In real filtering systems, an individual user may have multiple interests, and different users may have overlapping interests. Existing filtering approaches based on standard machine learning models fail to capture these characteristics explicitly. The Discriminative Factored Prior Models (DFPMs) we proposed aim to model the multiple interests of each individual user and to borrow discriminative criteria from other users when learning a particular user's profile. Performance comparisons between DFPMs and existing approaches demonstrate the advantages of our models.
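The basic adaptive filtering loop described above (score a document against a user profile, deliver it if the match is good enough, then update the profile from feedback) can be sketched roughly as follows. This is only an illustrative baseline under our own assumptions, not the faceted-feedback mechanism or the DFPMs: the class name `AdaptiveFilter`, the bag-of-words profile, the cosine score, the additive update, and the default `threshold` and `learning_rate` values are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class AdaptiveFilter:
    """Toy profile-based filter (hypothetical; not the authors' models)."""

    def __init__(self, threshold=0.1, learning_rate=0.5):
        self.profile = Counter()          # term -> weight, learned from feedback
        self.threshold = threshold        # assumed delivery cutoff
        self.learning_rate = learning_rate

    def score(self, text):
        """Cosine similarity between the document and the user profile."""
        doc = Counter(tokenize(text))
        dot = sum(self.profile[term] * count for term, count in doc.items())
        norm_profile = math.sqrt(sum(w * w for w in self.profile.values()))
        norm_doc = math.sqrt(sum(c * c for c in doc.values()))
        if norm_profile == 0 or norm_doc == 0:
            return 0.0                    # empty profile: nothing matches yet
        return dot / (norm_profile * norm_doc)

    def should_deliver(self, text):
        return self.score(text) >= self.threshold

    def feedback(self, text, relevant):
        """Move the profile toward relevant documents, away from others."""
        sign = 1.0 if relevant else -1.0
        for term, count in Counter(tokenize(text)).items():
            self.profile[term] += sign * self.learning_rate * count


user_filter = AdaptiveFilter()
user_filter.feedback("anthrax outbreak reported in city", relevant=True)
print(user_filter.should_deliver("new anthrax outbreak"))   # shares profile terms
print(user_filter.should_deliver("stock prices rise"))      # shares none
```

Note that a freshly created filter has an empty profile and scores every document as 0.0, which is a small-scale picture of the new-user problem mentioned above: until some feedback arrives, the system has nothing to match against.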
WikiTrust: Turning Wikipedia Quantity into Quality
Members: Ian Pye, Bo Adler, Luca de Alfaro, Shelly Spearing, Jorge Roman
WikiTrust is a reputation system for Wikipedia authors and content. WikiTrust computes three main quantities: edit quality, author reputation, and content reputation. Edit quality measures how well each edit, that is, each change introduced in a revision, is preserved in subsequent revisions. Authors who make good-quality edits gain reputation, and text that is revised by several high-reputation authors gains reputation. Since vandalism on Wikipedia is usually performed by anonymous or new users (not least because long-time vandals end up banned), and is usually reverted within a reasonably short span of time, edit quality, author reputation, and content reputation are natural candidate features for identifying vandalism on Wikipedia.
Indeed, using the full set of features computed by WikiTrust, we have been able to construct classifiers that identify vandalism with a recall of 83.5%, a precision of 48.5%, and a false positive rate of 8%, with an area under the ROC curve of 93.4%. If we limit ourselves to the features available at the time an edit is made (when the edit quality is still unknown), the classifier achieves a recall of 77.1%, a precision of 36.9%, and a false positive rate of 12.2%, with an area under the ROC curve of 90.4%.
Using these classifiers, we have implemented a simple Web API that provides a vandalism estimate for every revision of the English Wikipedia. The API can be used both to identify vandalism that needs to be reverted and to select high-quality, non-vandalized recent revisions of any given Wikipedia article. These recent high-quality revisions can be included in static snapshots of Wikipedia, or they can be used whenever tolerance for vandalism is low (as in a school setting, or whenever the material is widely disseminated).
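The recall, precision, and false positive rate figures reported above are linked through the share of revisions that are actually vandalism, which helps explain why a precision below 50% can coexist with a high recall and a low false positive rate. The sketch below works through that arithmetic. It assumes that all three figures for a given classifier were measured on the same evaluation set; the function names are ours, and the results inherit the rounding of the reported percentages.

```python
def implied_prevalence(recall, precision, false_positive_rate):
    """Fraction of positives implied by the three reported metrics.

    With prevalence p, recall R and false positive rate F:
        precision P = R*p / (R*p + F*(1 - p))
    Solving for the odds gives p / (1 - p) = P*F / (R*(1 - P)).
    """
    odds = (precision * false_positive_rate) / (recall * (1.0 - precision))
    return odds / (1.0 + odds)


def precision_at(recall, false_positive_rate, prevalence):
    """Precision expected at a given prevalence (inverse of the above)."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Figures quoted in the text for the two classifiers.
full_features = implied_prevalence(0.835, 0.485, 0.08)
edit_time_features = implied_prevalence(0.771, 0.369, 0.122)
print(f"full feature set:   {full_features:.3f}")
print(f"edit-time features: {edit_time_features:.3f}")
```

Under the stated assumption, both sets of figures imply that roughly 8% of the evaluated revisions were vandalism. Because non-vandalism revisions outnumber vandalism by more than ten to one at that rate, even a false positive rate of 8% produces about as many false alarms as true detections, which is what holds the precision near 50%.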