Conference days:
- Sunday, May 6, 2007: Pre-conference W3C AC Meeting
- Monday, May 7, 2007: Pre-conference W3C AC Meeting and More
- Tuesday, May 8, 2007: Building the Web
- Wednesday, May 9, 2007: The Global Web
- Thursday, May 10, 2007: Mining the Web
- Friday, May 11, 2007: Weaving the Web
- Saturday, May 12, 2007: The World Wide Web and You

Daily time slots:
- am1: 8:30am to 10:00am
- break
- am2: 10:30am to 12:00 noon
- lunch break
- pm1: 1:30pm to 3:00pm
- break
- pm2: 3:30pm to 5:00pm
- pm3: 5:00pm to 5:30pm
- evening

Rooms (location; setup; capacity). HM1/HM2 = Hotel, Mezzanine 1/2; CC1/CC2 = Conference Centre, Level 1/2.
- Van Horne: CC2; hybrid; 832
- Shaughnessy: CC2; classroom; 90
- Beatty: CC2; classroom; 90
- Coleman: CC2; classroom; 90
- Theatre: CC1; theatre; 252
- Mt Stephen: HM1; rounds; 150
- Petrak: HM2; rounds; 45
- New Brunswick: HM2; classroom; 112
- Alberta: HM2; classroom; 176
- Cascade: HM2; classroom; 300
- Oak: HM1; 32-50
- Norquay: HM1; 32-50
- Frontenac: HM1; 24-40
- Champlain: HM1; 40-50
- McKenzie: HM1; 20-40
- Palliser: HM1; 24-40
- Empress: HM1; 16-28
- Lacombe: HM1; 12-28
- Alhambra: HM2; rounds; 200
- Grizzly House: offsite restaurant; 60

Tracks: W3C AC Meeting, W4A Conference, World Organization of Webmasters, Tutorials, Workshops, Plenaries, Refereed Papers, Panels, Posters, Sponsored Talks, Developers Track, W3C Track, Web History Track, Web History Display, Sponsor Exhibition, BOFs, Miscellaneous.

Sunday, May 6 (Pre-conference W3C AC Meeting)
- pm1-pm2, Cascade: W3C AC Meeting (orientation for new members)
- Evening, 6:30pm to 9:30pm, Mt Stephen: W3C AC Reception

Monday, May 7 (Pre-conference W3C AC Meeting and More)
- am1-pm2, Cascade: W3C AC Meeting (day 1 of 2)
- Lunch, Alhambra: W3C AC Lunch
- am1-pm2, Alberta: World Organization of Webmasters (WOW): Current Best Practices in Web Development and Design Tutorial (with WOW Web Professional Certification Exam Option), http://www2007.org/wow.php
- Lunch, Petrak or Riverview Lounge: WOW Lunch
- am1-pm2, New Brunswick: Web Accessibility (W4A) Conference, day 1 of 2, http://www.w4a.info/
- Lunch, New Brunswick: W4A Lunch
- Evening, 6:00pm to 8:00pm, offsite at Grizzly House: W4A Banquet

Tuesday, May 8 (Building the Web)
Full-day events (am1-pm2):
- Empress: Tutorial T0: Robust Management of Web and Grid Ontologies and Metadata, http://www2007.org/tutorial-T0.php
- Shaughnessy: Workshop W1: AIRWeb 2007: Adversarial Information Retrieval on the Web, http://airweb.cse.lehigh.edu/2007/
- Frontenac: Workshop W2: Health Care and Life Sciences Data Integration for the Semantic Web, http://neuroweb.med.yale.edu/hcls/
- McKenzie: Workshop W3: I3: Identity, Identifiers, Identifications -- Entity-Centric Approaches to Information and Knowledge Management on the Web, http://okkam.dit.unitn.it/i3/
- Palliser: Workshop W5: MobEA V: Mobile Web in the Developing World, http://www.research.att.com/%7Erjana/mobea2007.htm
- Beatty: Workshop W6: Query Log Analysis: Social and Technological Challenges, http://querylogs2007.webir.org/
- Alberta: Workshop W7: Social and Collaborative Construction of Structured Knowledge, http://km.aifb.uni-karlsruhe.de/ws/ckc2007/
- Coleman: Workshop W8: Tagging and Metadata for Social Information Organization, http://www2007.redlog.net/
- Champlain: Workshop W9: Sponsored Search Auctions, http://opim-sun.wharton.upenn.edu/ssa3/index.htm
- New Brunswick: Web Accessibility (W4A) Conference, day 2 of 2, http://www.w4a.info/
- Cascade: W3C AC Meeting (day 2 of 2)
Half-day tutorials, morning (am1-am2):
- Theatre: T1: Foundations and Challenges of Web Advertising, http://www2007.org/tutorial-T1.php
- Oak: T2: User-Centric Identity for Web Applications, http://www2007.org/tutorial-T2.php
- Lacombe: T3: Cost-Effective Engineering of Web Application Product Lines (Reuse Beyond Components: Exploiting Similarity Patterns), http://www2007.org/tutorial-T3.php
- Norquay: T4: Next Generation Web Services in Practice, http://www2007.org/tutorial-T4.php
Half-day tutorials, afternoon (pm1-pm2):
- Theatre: T5: Learning to Rank in Vector Spaces and Social Networks, http://www2007.org/tutorial-T5.php
- Oak: T6: An Integrated Approach to Evaluating Web Accessibility: Automated, Manual and User-based Testing, http://www2007.org/tutorial-T6.php
- Norquay: T7: Deploying Web-scale Mash-ups by Linking Microformats and the Semantic Web, http://www2007.org/tutorial-T7.php
- Lacombe: T8: Web Workloads: Characterization, Modeling and Application, http://www2007.org/tutorial-T8.php
Evening:
- 5:00pm to 8:00pm, Alberta (starting in Riverview Lounge): Web History Reception, http://www2007.org/webhistory.php

Wednesday, May 9 (The Global Web)
- am1, Van Horne Ballroom: WWW2007 Opening Ceremonies and Plenary Speaker: Tim Berners-Lee (MIT/W3C), "The Two Magics of Web Science", http://www2007.org/berners-lee.php
- am2-pm2, Van Horne C: Sponsor Exhibition Hall
- am2-pm2, Petrak: Web History Display (exhibit)
- am2-pm2, Shaughnessy: Web History Track: Pioneers of the Web and E-Commerce, http://www2007.org/webhistory.php
- am2, Beatty: Papers: Technology for Developing Regions, 1 of 2 (Communication in Developing Regions)
- am2, Coleman: Panel: Web Science (Phillip Hallam-Baker, Verisign), http://www2007.org/panel1.php
- am2, Theatre: Sponsored Talk or Panel (TBD)
- am2, New Brunswick: Papers: XML and Web Data, 1 of 2 (Querying and Transforming XML)
- am2, Alberta: Papers: Search, 1 of 7 (Search Potpourri)
- am2, Cascade: W3C Track: Making Mobile Web Browsing Better, http://www2007.org/prog-W3CTrack.php#wednesday
- pm1, Beatty: Panel: Web Delivery Models for Developing Regions (Amit Nanavati, IBM India), http://www2007.org/panel2.php
- pm1-pm2, Coleman: Tutorial T9: Semantic Digital Libraries (half-day), http://www2007.org/tutorial-T9.php
- pm1, Theatre: Industry Talks: Search and Data, http://www2007.org/industrytopic1.php
- pm1, New Brunswick: Papers: Browsers and User Interfaces, 1 of 2 (Personalization)
- pm1, Alberta: Papers: Search, 2 of 7 (Crawlers)
- pm1, Cascade: W3C Track: Rich Web Applications, http://www2007.org/prog-W3CTrack.php#wednesday
- pm2, Beatty: Papers: Technology for Developing Regions, 2 of 2 (Networking Issues in the Web)
- pm2, Theatre: Sponsored Talk or Panel (TBD)
- pm2, New Brunswick: Papers: Browsers and User Interfaces, 2 of 2 (Smarter Browsing)
- pm2, Alberta: Papers: Search, 3 of 7 (Web Graphs)
- pm2, Cascade: W3C Track: The Future of the Web Page, http://www2007.org/prog-W3CTrack.php#wednesday
- Evening, 6:00pm to 8:00pm, Van Horne Ballroom and Foyer: WWW2007 Welcome Reception

Thursday, May 10 (Mining the Web)
- am1, Van Horne Ballroom: WWW2007 Announcements and Plenary Speaker: Prabhakar Raghavan (Yahoo! Research), "Web N.0: What sciences will it take?", http://www2007.org/raghavan.php
- am2-pm2, Van Horne C: Sponsor Exhibition Hall
- am2-pm2, Petrak: Web History Display (exhibit)
- am2, Shaughnessy: Papers: Performance and Scalability, 1 of 2 (Scalable Systems for Dynamic Content)
- am2, Beatty: Papers: Pervasive Web and Mobility, 1 of 1
- am2, New Brunswick: Papers: XML and Web Data, 2 of 2 (Parsing, Normalizing, and Storing XML)
- am2, Coleman: Developers Track panel: Semantic Web with Data (Paul Miller, Talis), http://www2007.org/panel7.php
- am2, Theatre: Sponsored Talk or Panel (TBD)
- am2, Alberta: Papers: Data Mining, 1 of 5 (Identifying Structure in Web Pages)
- am2, Cascade: W3C Track: Advances in Semantic Web, http://www2007.org/prog-W3CTrack.php#thursday
- pm1, Shaughnessy: Panel: Scalability and Performance (Arun Iyengar, IBM Research), http://www2007.org/panel3.php
- pm1-pm2, Beatty: Tutorial T10: Model Driven Semantic Web Engineering (half-day), http://www2007.org/tutorial-T10.php
- pm1, New Brunswick: Papers: Industrial Practice and Experience, 1 of 1 (IPE)
- pm1, Coleman: Developers Track: Programming the Web, http://www2007.org/prog-Developers.php#thursday
- pm1, Theatre: Industry Panel: Mobile Web and Ajax, http://www2007.org/industrytopic2.php
- pm1, Alberta: Papers: Data Mining, 2 of 5 (Mining Textual Data)
- pm1, Cascade: W3C Track: Web of Services for Enterprise Computing, http://www2007.org/prog-W3CTrack.php#thursday
- pm2, Shaughnessy: Papers: Performance and Scalability, 2 of 2 (Performance Engineering of Web Applications)
- pm2, New Brunswick: Panel: Multimedia Metadata Standards in Semantic Web (Susan Boll, U. Oldenburg, and Raphael Troncy, CWI), http://www2007.org/panel4.php
- pm2, Coleman: Developers Track: Web Tools, http://www2007.org/prog-Developers.php#thursday
- pm2, Theatre: Panel: Web Search Privacy Issues (Greg Conti, USMA), http://www2007.org/panel5.php
- pm2, Alberta: Papers: Search, 4 of 7 (Search Quality and Precision)
- pm2, Cascade: W3C Track: Security and Usability on the Web, http://www2007.org/prog-W3CTrack.php#thursday
- Evening, 6:00pm to 9:00pm: WWW2007 Banquet (offsite at Brewster's MountView Barbecue; bus transportation provided)

Friday, May 11 (Weaving the Web)
- am1, Van Horne Ballroom: WWW2007 Announcements and Plenary Speaker: Bill Buxton (Microsoft Research/University of Toronto), "Social Networking and Web Communities", http://www2007.org/buxton.php
- am2-pm2, Van Horne C: Sponsor Exhibition Hall
- am2-pm2, Petrak: Web History Display (exhibit)
- am2, Shaughnessy: Papers: Web Engineering, 1 of 2 (Web Modeling)
- am2, Beatty: Panel: Searching Personal Content (Yoelle Marek, Google, Israel), http://www2007.org/panel6.php
- am2, Coleman: Developers Track: Linked Data, http://www2007.org/prog-Developers.php#friday
- am2, Theatre: Industry Talks: Web Issues, http://www2007.org/industrytopic3.php
- am2, New Brunswick: Papers: Semantic Web, 1 of 5 (Applications)
- am2, Alberta: Papers: Data Mining, 3 of 5 (Similarity Search)
- am2, Cascade: W3C Track: A Multimodal Web to Expand Universal Access, http://www2007.org/prog-W3CTrack.php#friday
- pm1, Shaughnessy: Papers: Web Engineering, 2 of 2 (End-User Perspective and Measurement in Web Engineering)
- pm1, Beatty: Papers: Web Services, 1 of 2 (Orchestration and Choreography)
- pm1-pm2, Coleman: Tutorial T11: Semantic Web: Technologies and Applications for the Real-World (half-day), http://www2007.org/tutorial-T11.php
- pm1, Theatre: Developers Track: Emerging Web Platforms, http://www2007.org/prog-Developers.php#friday
- pm1, New Brunswick: Papers: Semantic Web, 2 of 5 (Similarity and Extraction)
- pm1, Alberta: Papers: Search, 5 of 7 (Advertisements and Click Estimates)
- pm1, Cascade: W3C Track: Architectural Integration, http://www2007.org/prog-W3CTrack.php#friday
- pm2-pm3, Shaughnessy: Papers: Security, Privacy, Reliability, and Ethics, 1 of 3 (Defending Against Emerging Threats)
- pm2-pm3, Beatty: Papers: Web Services, 2 of 2 (SLAs and QoS)
- pm2, Theatre: Developers Track: Next Generation Web Servers, http://www2007.org/prog-Developers.php#friday
- pm2-pm3, New Brunswick: Papers: Semantic Web, 3 of 5 (Query Languages and DBs)
- pm2, Alberta: Papers: Search, 6 of 7 (Knowledge Discovery)
- pm2, Cascade: W3C Track: Query, Interchange, and Access with XML!, http://www2007.org/prog-W3CTrack.php#friday

Saturday, May 12 (The World Wide Web and You)
- am1, Van Horne Ballroom: WWW2007 Announcements and Plenary Speaker: Dick Hardt (SXIP Identity), "An Identity Story", http://www2007.org/hardt.php
- am2-pm2, Petrak: Web History Display (exhibit)
- am2, Shaughnessy: Papers: Semantic Web, 4 of 5 (Ontologies)
- am2, Beatty: Papers: E* Applications, 1 of 2 (E-Communities)
- am2, New Brunswick: Papers: Security, Privacy, Reliability, and Ethics, 2 of 3 (Passwords and Phishing)
- am2, Coleman: Developers Track: Web Data 1, http://www2007.org/prog-Developers.php#saturday
- am2, Theatre: Industry Talks: Web Enterprises, http://www2007.org/industrytopic4.php
- am2, Alberta: Papers: Search, 7 of 7 (Personalization)
- pm1, Shaughnessy: Papers: Semantic Web, 5 of 5 (Semantic Web and Web 2.0)
- pm1, Beatty: Papers: E* Applications, 2 of 2 (E-Commerce and E-Content)
- pm1, New Brunswick: Birds of a Feather (BOF) Session 1 of 2 (topic TBD), http://www2007.org/bof1.php
- pm1, Coleman: Developers Track: Web Data 2, http://www2007.org/prog-Developers.php#saturday
- pm1, Theatre: Industry Panel: Social Computing, http://www2007.org/industrytopic5.php
- pm1, Alberta: Papers: Data Mining, 4 of 5 (Predictive Modeling of Web Users)
- pm2, Shaughnessy: Papers: Security, Privacy, Reliability, and Ethics, 3 of 3 (Access Control and Trust on the Web)
- pm2, Beatty: Lightning Talks: Work in Progress Session (sign up during conference)
- pm2, New Brunswick: Birds of a Feather (BOF) Session 2 of 2 (topic TBD), http://www2007.org/bof2.php
- pm2, Coleman: Developers Track: Social Networks and Lightning Talks, http://www2007.org/prog-Developers.php#saturday
- pm2, Theatre: Sponsored Talk or Panel (TBD)
- pm2, Alberta: Papers: Data Mining, 5 of 5 (Mining in Social Networks)
- 5:00pm to 5:30pm, Van Horne: WWW2007 Closing Ceremony
Refereed Papers (abstracts)

Paper 15: Extraction and Classification of Dense Communities in the Web
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching and classifying communities in the web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense sub-graph of the web-graph (where web pages are nodes and hyper-links are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to web-graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density), and still about 80% for communities of twenty nodes with over 50% of the arcs present. At the lower extreme, the algorithm catches 35% of dense communities made of ten nodes. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group with representative keywords.
Authors: Yon Dourisboure, Filippo Geraci, and Marco Pellegrini (Institute for Informatics and Telematics of C.N.R.)
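The abstract does not spell out the authors' algorithm; as a rough illustration of what a "locally dense subgraph" is, the sketch below runs the classic greedy peeling heuristic (Charikar's 2-approximation for average-degree density) on a toy graph. It is a standard baseline for the same problem, not the paper's scalable method, and the example graph is invented.

```python
from collections import defaultdict

def densest_subgraph(edges):
    """Greedy peeling: repeatedly remove the minimum-degree node and keep
    the intermediate node set maximizing average degree |E(S)| / |S|."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = sum(len(neigh) for neigh in adj.values()) // 2
    best_density, best_nodes = m / len(nodes), set(nodes)
    while nodes:
        u = min(nodes, key=lambda x: len(adj[x]))   # lowest-degree node
        m -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.discard(u)
        if nodes and m / len(nodes) > best_density:
            best_density, best_nodes = m / len(nodes), set(nodes)
    return best_nodes, best_density

# Toy example: a 4-clique plus a pendant node.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
print(densest_subgraph(edges))   # the 4-clique {1, 2, 3, 4}, density 1.5
```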
Paper 56: Robust Methodologies for Modeling Web Click Distributions
Metrics such as click counts are vital to online businesses, but their measurement has been problematic due to the inclusion of high-variance robot traffic. We posit that by applying statistical methods more rigorous than those employed to date, we can build a robust model of the distribution of clicks, after which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions, and current industrial practice eschews this research in favor of conservative ad-hoc click-level thresholds. The prevailing belief is that such distributions are scale-free power law distributions, but using more rigorous statistical methods we find that the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten datasets from various verticals in the Yahoo domain. Since mixture models can overfit the data, we take care to use the BIC log-likelihood method, which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of "users" that visit the Yahoo portal exactly once a day; we surmise these may be robots testing internet connectivity by pinging the Yahoo main website.

Backing up our quantitative analysis is a graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: plotting in log-log space allows one to visually differentiate the various exponential distributions, and cumulative plots are much more robust to outliers. We plan to use the results of this work in applications for robot removal from web metrics business intelligence systems.
Authors: Kamal Ali and Mark Scarr (Yahoo)
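To make the model comparison concrete, here is a minimal sketch of scoring click-count data under a single-component Zipf-Mandelbrot distribution and comparing BIC values against a pure power law (BIC = k ln n - 2 ln L, lower is better). The click data, the parameter grid, and the single-component simplification are all illustrative assumptions, not the paper's mixture-fitting procedure.

```python
import math

def zipf_mandelbrot_loglik(data, s, q, kmax):
    """Log-likelihood of integer observations in 1..kmax under
    p(k) proportional to 1/(k + q)**s (q = 0 gives a pure power law)."""
    norm = sum((k + q) ** -s for k in range(1, kmax + 1))
    return sum(-s * math.log(k + q) - math.log(norm) for k in data)

def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2.0 * loglik

# Hypothetical per-user daily click counts; real data would come from logs.
clicks = [1] * 500 + [2] * 120 + [3] * 40 + [5] * 10 + [20] * 2
kmax = max(clicks)

# Crude grid search over parameters (the paper uses proper mixture fitting).
scores = {}
for name, q_grid in (("power_law", [0.0]), ("zipf_mandelbrot", [0.5, 1.0, 2.0, 5.0])):
    fits = [(zipf_mandelbrot_loglik(clicks, s, q, kmax), s, q)
            for q in q_grid for s in (1.2, 1.5, 1.8, 2.1, 2.5)]
    loglik, s, q = max(fits)
    scores[name] = bic(loglik, 1 if name == "power_law" else 2, len(clicks))

print(scores)  # the model with the lower BIC is preferred
```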
Paper 57: DIANE - An Integrated Approach to Automated Service Discovery, Matchmaking and Composition
Automated matching of semantic service descriptions is the key to automatic service discovery and binding. But when trying to find a match for a certain request, it may often happen that the request cannot be serviced by a single offer but could be handled by combining existing offers. In this case automatic service composition is needed. Although automatic composition is an active field of research, it is mainly viewed as a planning problem and treated separately from service discovery. In this paper we argue that an integrated approach to the problem is better than separating these issues as is usually done. We propose an approach that integrates service composition into service discovery and matchmaking to match service requests that ask for multiple connected effects, discuss general issues involved in describing and matching such services, and present an efficient algorithm implementing our ideas.
Authors: Ulrich Kuester and Birgitta Koenig-Ries (Friedrich Schiller University Jena), Mirco Stern and Michael Klein (Universitaet Karlsruhe)

Paper 58: Answering Relationship Queries on the Web
Finding relationships between entities on the Web, e.g., the connections between different places or the commonalities of people, is a novel and challenging problem. Existing Web search engines excel in keyword matching and document ranking, but they cannot handle many relationship queries well. This paper proposes a new method for answering relationship queries on two entities. Our method first retrieves the top Web pages for each entity from a Web search engine. It then matches these Web pages and generates an ordered list of Web page pairs, each consisting of one Web page per entity. The top-ranked Web page pairs are likely to contain the relationships between the two entities. One main challenge in the ranking process is to effectively filter out the large amount of noise in the Web pages without losing much useful information. To achieve this, our method assigns appropriate weights to terms in Web pages and intelligently identifies the potential connecting terms that capture the relationships between the two entities. Only the top potential connecting terms with large weights are used to rank Web page pairs. Finally, the top-ranked Web page pairs are presented to the searcher. For each such pair, the query terms and the top potential connecting terms are highlighted so that the relationships between the two entities can be easily identified. We implemented a prototype on top of the Google search engine and evaluated it under a wide variety of query scenarios. The experimental results show that our method is effective at finding important relationships with low overhead.
Authors: Gang Luo, Chunqiang Tang, and Ying-li Tian (IBM T.J. Watson Research Center)

Paper 67: Bridging the Gap Between OWL and Relational Databases
Schema statements in OWL are interpreted in a different way from similar statements in a relational database setting. This can lead to problems in data-centric applications, where OWL's interpretation of statements intended as constraints may be confusing and/or inappropriate. We propose an extension of OWL that attempts to mimic the intuition behind integrity constraints in relational databases. We discuss algorithms for checking constraint satisfaction for different types of knowledge bases, and show that, provided the constraints are satisfied, we can disregard them while answering a broad range of positive queries.
Authors: Boris Motik, Ian Horrocks, and Ulrike Sattler (University of Manchester)

Paper 70: DETECTIVES: DETEcting Coalition hiT Inflation attacks in adVertising nEtworks Streams
Click fraud is jeopardizing the industry of Internet advertising. Internet advertising is crucial for the thriving of the entire Internet, since it allows producers to advertise their products, and hence contributes to the well-being of e-commerce. Moreover, advertising supports the intellectual value of the Internet by covering the running expenses of content publishers' sites. Some publishers are dishonest and use automation to generate traffic to defraud the advertisers. Similarly, some advertisers automate clicks on the advertisements of their competitors to deplete their competitors' advertising budgets. This paper describes the advertising network model and focuses on the most sophisticated type of fraud, which involves coalitions among fraudsters. We build on several published theoretical results to devise the Similarity-Seeker algorithm, which discovers coalitions made by pairs of fraudsters. We then generalize the solution to coalitions of arbitrary sizes. Before deploying our system on a real network, we conducted comprehensive experiments on data samples for proof of concept. We detected numerous coalitions spanning numerous sites. Interestingly, 93% of the discovered sites were real fraudsters.
Authors: Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi (University of California, Santa Barbara)
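The abstract does not describe Similarity-Seeker in detail; the sketch below only illustrates the intuition behind its pairwise stage: flag publisher pairs whose sets of clicking sources overlap far more than independent traffic would. The logs, the Jaccard measure, and the threshold are assumptions chosen for illustration.

```python
from itertools import combinations

def suspicious_pairs(clicks_by_site, threshold=0.5):
    """Flag publisher pairs whose sets of clicking IPs overlap unusually;
    a crude stand-in for the pairwise stage of coalition detection."""
    flagged = []
    for (a, ips_a), (b, ips_b) in combinations(clicks_by_site.items(), 2):
        union = len(ips_a | ips_b)
        if union == 0:
            continue
        jaccard = len(ips_a & ips_b) / union
        if jaccard >= threshold:
            flagged.append((a, b, jaccard))
    return flagged

# Hypothetical click logs: publisher site -> set of source IPs.
logs = {
    "siteA": {"1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"},
    "siteB": {"2.2.2.2", "3.3.3.3", "4.4.4.4", "5.5.5.5"},
    "siteC": {"8.8.8.8", "9.9.9.9"},
}
print(suspicious_pairs(logs))  # siteA/siteB share 3 of 5 distinct IPs
```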
Paper 89: Dynamics of Bid Optimization in Online Advertisement Auctions
We consider the problem of online keyword advertising auctions among multiple bidders with limited budgets, and study a natural bidding heuristic in which advertisers attempt to optimize their utility by equalizing their return-on-investment across all keywords. We show that existing auction mechanisms combined with this heuristic can experience cycling (as has been observed in many current systems), and therefore propose a modified class of mechanisms with small random perturbations. This perturbation is reminiscent of the small time-dependent perturbations employed in the dynamical systems literature to convert many types of chaos into attracting motions. We show that the perturbed mechanism provably converges in the case of first-price auctions and experimentally converges in the case of second-price auctions. Moreover, the point of convergence has a natural economic interpretation as the unique market equilibrium in the case of first-price mechanisms. In the case of second-price auctions, we conjecture that it converges to the "supply-aware" market equilibrium. Thus, our results can alternatively be described as a tatonnement process for convergence to market equilibrium in which prices are adjusted on the side of the buyers rather than the sellers. We also observe that perturbation in mechanism design is useful in a broader context: in general, it can allow bidders to "share" a particular item, leading to stable allocations and pricing for the bidders, and improved revenue for the auctioneer.
Authors: Christian Borgs, Jennifer Chayes, Nicole Immorlica, and Kamal Jain (Microsoft Research), Omid Etesami (U.C. Berkeley), Mohammad Mahdian (Yahoo! Research)
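A hedged sketch of the kind of dynamics studied here: each round, an advertiser nudges bids toward equal return-on-investment across keywords, with a small random perturbation added in the spirit of the paper's perturbed mechanisms. The update rule, step size, and all numbers are invented for illustration; the paper analyzes the auction mechanism itself, not this particular bidder code.

```python
import random

def update_bids(bids, values, clicks, spend, budget, eps=0.01):
    """One illustrative round of an ROI-equalizing bid update with noise."""
    roi = {k: values[k] * clicks[k] / max(spend[k], 1e-9) for k in bids}
    target = sum(roi.values()) / len(roi)
    new_bids = {}
    for k, b in bids.items():
        step = 0.1 * (roi[k] - target)        # raise bids where ROI is high
        noise = random.uniform(-eps, eps)     # small random perturbation
        new_bids[k] = max(0.01, b * (1 + step) + noise)
    # Re-scale so the rough estimate of spend stays within budget.
    est_spend = sum(new_bids[k] * clicks[k] for k in new_bids)
    if est_spend > budget:
        scale = budget / est_spend
        new_bids = {k: b * scale for k, b in new_bids.items()}
    return new_bids

bids = {"shoes": 0.50, "boots": 0.80}
print(update_bids(bids,
                  values={"shoes": 2.0, "boots": 1.5},
                  clicks={"shoes": 100, "boots": 60},
                  spend={"shoes": 45.0, "boots": 50.0},
                  budget=90.0))
```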
Paper 91: A New Suffix Tree Similarity Measure for Document Clustering
In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Our experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new algorithm is a very effective document clustering algorithm. Compared with the results of the traditional keyword tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an average improvement of 51% in F-measure score. Furthermore, we apply the new clustering algorithm to analyzing Web documents in online forum communities: a topic-oriented clustering algorithm is developed to help people assess, classify and search the Web documents in a large forum community.
Authors: Hung Chim and Xiaotie Deng (City University of Hong Kong)

Paper 100: Extraction and Search of Chemical Formulae in Text Documents on the Web
Scientists often seek to search for Web articles related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets back articles where the exact keyword string expressing the chemical formula is found. Searching for the exact occurrence of keywords results in two problems for this domain: a) if the user searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return all documents where Helium is mentioned as well as documents where the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build one, we must solve the following problems: (1) extract chemical formulae from text documents, (2) index chemical formulae, and (3) design a ranking function for articles where the chemical formulae occur. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae to queries. We evaluate algorithms for identifying chemical formulae in documents using a classification method based on Support Vector Machines (SVM) and a probabilistic model based on Conditional Random Fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing the ranked query results much.
Authors: Bingjun Sun, Qingzhao Tan, Prasenjit Mitra, and C. Lee Giles (Pennsylvania State University)
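The CH4 versus H4C problem from the abstract comes down to comparing formulae in a canonical form rather than as raw strings. A minimal sketch under obvious simplifying assumptions (flat formulae only, no nested groups such as Ca(OH)2, and a hypothetical helper name):

```python
import re
from collections import Counter

def canonical_formula(formula):
    """Map a flat chemical formula to a sorted (element, count) tuple,
    so that 'CH4' and 'H4C' compare equal."""
    counts = Counter()
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(num) if num else 1
    return tuple(sorted(counts.items()))

assert canonical_formula("CH4") == canonical_formula("H4C")
print(canonical_formula("C6H12O6"))  # (('C', 6), ('H', 12), ('O', 6))
```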
Paper 111: Spam Double-Funnel: Connecting Web Spammers with Advertisers
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam, redirection spam, where one can identify spam pages by the third-party domains to which these pages redirect traffic. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords, one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
Authors: Yi-Min Wang and Ming Ma (Microsoft Research), Yuan Niu and Hao Chen (UC Davis)

Paper 120: Consistency-preserving Caching of Dynamic Database Content
With the growing use of dynamic web content generated from relational databases, traditional caching solutions for throughput and latency improvements are ineffective. We describe a middleware layer called Ganesh that reduces the volume of data transmitted without semantic interpretation of queries or results. It achieves this reduction through the use of cryptographic hashing to detect similarities with previous results. These benefits do not require any compromise of the strict consistency semantics provided by the back-end database. Further, Ganesh does not require modifications to applications, web servers, or database servers, and works with closed-source applications and databases. Using two benchmarks representative of dynamic web sites, measurements of our prototype show that it can increase end-to-end throughput by as much as twofold for non-data-intensive applications and by as much as tenfold for data-intensive ones.
Authors: Niraj Tolia and M. Satyanarayanan (Carnegie Mellon University)

Paper 127: Connecting the Bottom of the Pyramid: An Exploratory Case Study of India's Rural Communication Environment
This paper is based on our exploratory study of a South Indian village in the Chamrajanagar district of Karnataka. The study sought to understand the rural communication environment and villagers' communication preferences. We examined people's lifestyles, working conditions and their communication ecosystem. Our study revealed that villagers, unlike urban inhabitants, interacted with people outside the village only for specific, rather than casual, purposes. Another interesting aspect of rural communication was the marginal use of the postal system and the ubiquitous use of the pay phone, apart from word of mouth and face-to-face interactions. In fact, personal (face-to-face) interaction was usually preferred over other kinds of communication among villagers in this region, despite infrastructural constraints like poor transport services.

We also observed that communication frequency increased when the status quo changed to a situation that required immediate attention. During the analysis we identified certain social, economic and cultural communication gaps (or problems). However, these problems are clear opportunities to connect unconnected rural users by deploying new communication systems and features. Here, we highlight some of our findings and possible design avenues based on them.
Authors: Sarita Seshagiri, Aman Sagar, and Dhaval Joshi (Motorola India Research Labs)
Paper 140: Compiling Cryptographic Protocols for Deployment on the Web
Cryptographic protocols are useful for trust engineering in Web transactions. The Cryptographic Protocol Programming Language (CPPL) provides a model wherein trust management annotations are attached to protocol actions and used to constrain the behavior of a protocol participant to be compatible with its own trust policy.

The first implementation of CPPL generated stand-alone, single-session servers, making it unsuitable for deploying protocols on the Web. We describe a new compiler that uses a constraint-based analysis to produce multi-session server programs. The resulting programs run without persistent TCP connections, for deployment on traditional Web servers. Most importantly, the compiler preserves existing proofs about the protocols. We present an enhanced version of the CPPL language, discuss the generation and use of constraints, show their use in the compiler, formalize the preservation of properties, present subtleties, and outline implementation details.
Authors: Jay McCarthy and Shriram Krishnamurthi (Brown University), Joshua D. Guttman and John D. Ramsdell (MITRE Corporation)

Paper 145: A Scalable Application Placement Controller for Enterprise Data Centers
Given a set of machines and a set of Web applications with dynamically changing demands, an application placement controller decides how many instances to run for each application and where to put them, while observing all kinds of resource constraints. This problem is NP-hard. In this paper, we propose an online algorithm that uses heuristics to solve this problem efficiently. It allows multiple applications to share a single machine, and strives to maximize the total satisfied application demand, to minimize the number of application starts and stops, and to balance the load across machines. It can produce within 30 seconds high-quality solutions for hard placement problems with thousands of machines and thousands of applications. This scalability is crucial for dynamic resource provisioning in large-scale enterprise data centers. Our algorithm significantly and consistently outperforms the existing state-of-the-art algorithm under a wide variety of workloads.
Authors: Chunqiang Tang (IBM T.J. Watson Research Center), Malgorzata Steinder, Michael Spreitzer, and Giovanni Pacifici (IBM Research)

Paper 158: Is High-Quality VoD Feasible using P2P Swarming?
Digital media companies have recently started embracing P2P networks as an alternative content distribution channel. However, the drawback of current P2P swarming systems is that users need to download the full video and hence wait a long time before they can start watching it. While a lot of effort has gone into optimizing the distribution of large files, little research has been done on how to enable high-quality Video-on-Demand (VoD) functionality with P2P swarming systems. The main challenges reside in ensuring that users can start watching a movie at any point in time, while providing small start-up times, sustainable playback rates and high swarming efficiencies.

In this work, we explore the feasibility of providing high-quality VoD using P2P mesh-based networks. To this end, we investigate scheduling and pre-fetching techniques, network coding, and mesh topology management. Using both simulations and results from a real implementation, we provide evidence that high-quality VoD is feasible, and give guidelines for building play-as-you-download P2P swarming systems with high playback rates and minimum start-up delays.
Authors: Siddhartha Annapureddy (New York University), Saikat Guha (Cornell University), Dinan Gunawardena and Christos Gkantsidis (Microsoft Research), Pablo Rodriguez (Telefonica Research)
Paper 161: Exhibit: Light-weight Structured Data Publishing
It is no surprise that Semantic Web researchers and enthusiasts are excited to publish and accumulate semi-structured data on the Web. But looking beyond our community, we recognize that many, many other people also have structured data and want to publish it in rich browsing interfaces. These small-time authors fall into the same category as those early enthusiasts of the Web who were simply excited by the opportunity of using the new medium to share information that they cared about. With this insight, we created a lightweight structured data publishing framework called Exhibit that duplicates many factors we believe contributed to the original growth of the Web. We argue that appealing to this segment of the Web population, by addressing their publishing needs and desires at very low cost in many respects, lets us leverage their labor to add structure to existing Web content that has previously been authored in HTML by hand and remains hard to harvest automatically.
Authors: David Huynh and David Karger (MIT CSAIL), Rob Miller (MIT)

Paper 162: Navigation-Aided Retrieval
Users searching for information in hypermedia environments often perform querying followed by manual navigation. Yet the conventional text/hypertext retrieval paradigm does not explicitly take post-query navigation into account. This paper proposes a new retrieval paradigm, called navigation-aided retrieval (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the conventional probabilistic information retrieval paradigm, which implicitly assumes no navigation takes place.

This paper presents a formal model for navigation-aided retrieval and reports empirical results that point to the real-world applicability of the model. The experiments were performed over a large Web corpus provided by TREC, using human judgments on a new rating scale developed for navigation-aided retrieval. In the case of ambiguous queries, the new retrieval model identifies good starting points for post-query navigation. For less ambiguous queries that need not be paired with navigation, the output closely matches that of a conventional retrieval system.
Authors: Shashank Pandit (Carnegie Mellon University), Christopher Olston (Yahoo! Research)

Paper 190: Turning Portlets into Services: The Consumer Profile
Portlets strive to play at the front end the same role that Web services currently enjoy at the back end, namely, enablers of application assembly through reusable services. However, it is well known in the component community that the larger the component, the more limited the reuse. Hence, the coarse-grained nature of portlets (they also encapsulate the presentation layer) can jeopardize this vision of portlets as reusable services. To avoid this situation, this work proposes a perspective shift in portlet development by introducing the notion of an organization profile. While the user profile characterises the end user (e.g. age, name, etc.), the organization profile captures the idiosyncrasies of the organization through which the portlet is being delivered (e.g. the portal owner) as far as the portlet functionality is concerned. The user profile is dynamic and hence requires the portlet to be customised at run time. By contrast, the organization profile is known at registration time, and it is not always appropriate or possible to consider it at run time. Rather, it is better to customize the code at development time and produce an organization-specific portlet with built-in, custom functionality. In this scenario, we no longer have a portlet but a family of portlets, and the portlet provider becomes the "assembly line" of this family. This work promotes this vision by introducing an organization-aware, WSRP-compliant architecture that lets portlet consumers register and handle "family portlets" in the same way as "traditional portlets". In so doing, portlets come closer to being truly reusable services.
Authors: Oscar Diaz, Salvador Trujillo, and Sandy Perez (University of the Basque Country)
Paper 194: Do Not Crawl in the DUST: Different URLs with Similar Text
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
Authors: Ziv Bar-Yossef (Technion - Israel Institute of Technology and Google Haifa), Idit Keidar (Technion - Israel Institute of Technology), Uri Schonfeld (University of California Los Angeles)
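The rules DustBuster discovers are substring substitutions over URLs; once learned, applying them is simple string rewriting followed by grouping on the canonical form. The rules and URLs below are hypothetical examples made up for illustration, not mined ones, and the function name is an assumption.

```python
def apply_dust_rules(url, rules):
    """Apply substring-substitution rules (old -> new) in order to map a URL
    to a canonical form; URLs that collide are likely duplicates."""
    for old, new in rules:
        url = url.replace(old, new)
    return url

rules = [
    ("index.html", ""),          # trailing default page
    ("/story_", "/story?id="),   # alias between two URL schemes
    ("http://www.", "http://"),  # host alias
]

urls = [
    "http://www.example.com/news/index.html",
    "http://example.com/news/",
    "http://example.com/story_1234",
    "http://example.com/story?id=1234",
]
groups = {}
for u in urls:
    groups.setdefault(apply_dust_rules(u, rules), []).append(u)
print(groups)   # likely-duplicate URLs end up in the same group
```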
Paper 200: Towards the Theoretical Foundation of Choreography
With the growth of interest in web services, people pay more and more attention to choreography, that is, describing the collaboration of participants from a global viewpoint as they accomplish a common business goal. In this paper, based on a simple choreography language and a role-oriented process language, we study some fundamental issues related to choreography, especially its implementation, including semantics, projection and natural projection, the dominant role in choices and iterations, and so on. We develop the concept of the dominant role and propose some novel language structures related to it. The study reveals some clues about the language, semantics, specification and implementation of choreography.
Authors: Zongyan Qiu, Xiangpeng Zhao, Chao Cai, and Hongli Yang (Peking University)

Paper 207: NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks
Given a large network of online auction users and their histories of transactions, how can we spot anomalies, or even auction fraud? We describe the algorithms and system design decisions behind our proposed NetProbe system for uncovering auction fraud. We show that it is possible to do fast and scalable fraud detection in large auction networks. The main idea is to use the machinery of Markov Random Fields (MRF) and try to guess the hidden state (fraud/honest) of each participant. We describe the algorithms behind our system, which are based on belief propagation; we provide our own incremental but accurate approximations to it; and we list and justify our design decisions for efficient crawling of real auction networks. We report experiments on synthetic graphs containing as many as 7,000 nodes and 30,000 edges, where NetProbe was able to spot fraudulent nodes with over 90% precision and recall, with execution times on the order of seconds. We also report experiments on a real graph consisting of about 700,000 transactions between more than 66,000 eBay users, where NetProbe was highly effective at unearthing hidden networks of fraudsters within a realistic response time of about 6 minutes.
Authors: Shashank Pandit, Duen Horng Chau, Samuel Wang, and Christos Faloutsos (Carnegie Mellon University)

Paper 215: Detecting Near-Duplicates for Web Crawling
Near-duplicate documents are commonly found on the web. A pair of near-duplicate web pages differ from each other only in a very small portion, and the differences commonly consist of advertisements and timestamps. Such differences are irrelevant for web search. During web crawling, it is useful to quickly ascertain whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.

In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Authors: Gurmeet Manku and Arvind Jain (Google Inc.), Anish Das Sarma (Stanford University)
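Charikar's fingerprinting technique (simhash) mentioned in the abstract can be sketched in a few lines: hash each feature to f bits, accumulate signed contributions per bit, and keep the sign vector; near-duplicate pages then have fingerprints within a small Hamming distance (the paper reports 64-bit fingerprints with distance 3 working well at 8 billion pages). The sketch below uses word features and MD5 purely for illustration and omits the paper's index for fast k-bit-difference lookups.

```python
import hashlib

def simhash64(text):
    """64-bit simhash over whitespace-separated word features."""
    weights = [0] * 64
    for word in text.lower().split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

d1 = "breaking news sports scores updated hourly with advertisements"
d2 = "breaking news sports scores updated hourly with new advertisements"
d3 = "recipe for a chocolate cake with three layers"
print(hamming(simhash64(d1), simhash64(d2)))  # small: near-duplicates
print(hamming(simhash64(d1), simhash64(d3)))  # large: unrelated pages
```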
Paper 216: Optimized Query Planning of Continuous Aggregation Queries in Dynamic Data Dissemination Networks
Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some aggregation function over distributed data items, for example, to know (a) the average of the temperatures sensed by a set of sensors, or (b) the value of an index of mid-cap stocks. In these queries a client specifies a coherency or accuracy requirement as part of the query. In this paper we present a low-cost, scalable technique to answer continuous aggregation queries using a content distribution network of dynamic data items. In such a network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Just as various fragments of a dynamic web page are served by one or more nodes of a CDN, our technique involves decomposing a client query into sub-queries and executing the sub-queries on judiciously chosen data aggregators. For executing an incoherency-bounded continuous query, a query plan is required that includes the set of sub-queries, their individual incoherency bounds, and the data aggregators that can execute these sub-queries. An optimal query execution plan should satisfy the client query's coherency requirement with the least cost, measured in terms of the number of refresh messages sent from aggregators to the client. For estimating query execution cost, we build a continuous query cost model which can be used to estimate the number of messages required to satisfy the client-specified incoherency bound. Performance results using real-world traces show that our cost-based query planning leads to queries being executed using less than one third the number of messages required by existing schemes.
Authors: Rajeev Gupta (IBM India Research Lab), Krithi Ramamritham (IIT Bombay)

Paper 223: PRIVE: Anonymous Location-Based Queries in Distributed Mobile Systems
Nowadays, mobile users with positioning devices can access Location Based Services (LBS) and query about points of interest in their proximity. For such applications to succeed, privacy and confidentiality are essential. Encryption alone is not adequate; although it safeguards the system against eavesdroppers, the queries themselves may disclose the location and identity of the user. Recently, centralized architectures based on k-anonymity have been proposed, which utilize an intermediate anonymizer between the mobile users and the LBS. However, the anonymizer must be updated continuously with the current locations of all users. Moreover, its complete knowledge of the entire system poses a security threat if the anonymizer is compromised.

In this paper we address two issues: (i) we show that existing approaches may fail to provide spatial anonymity for some distributions of user locations, and we describe a novel technique which solves this problem; (ii) we propose PRIVE, a decentralized architecture for preserving the anonymity of users issuing spatial queries to LBSs. Mobile users self-organize into an overlay network with good fault-tolerance and load-balancing properties. PRIVE avoids the bottleneck caused by centralized techniques both in terms of anonymization and location updates. Moreover, state is distributed among numerous users, rendering the system resilient to attacks. Extensive experimental studies suggest that PRIVE is applicable to real-life scenarios with large populations of mobile users.
Authors: Gabriel Ghinita and Panos Kalnis (National University of Singapore), Spiros Skiadopoulos (University of Peloponnese)

Paper 225: Exploring in the Weblog Space by Detecting Informative and Affective Articles
Weblogs have become a prevalent medium through which people express themselves. In general, there are two genres of content in weblogs. The first is about the webloggers' personal feelings, thoughts or emotions; we call these affective articles. The second is about technologies and various kinds of informative news. In this paper, we present a machine learning method for classifying informative and affective articles among weblogs, treating the task as a binary classification problem. Using machine learning approaches, we achieve 92% on information retrieval performance measures including precision, recall and F1. We set up three studies on applications of this classification approach in both research and industrial settings: we use it to improve the performance of emotion classification for weblog articles, we develop an intent-driven weblog search engine based on the classification techniques to improve the satisfaction of web users, and finally we use the classifier to find weblogs containing a great deal of informative articles.
Authors: Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, and Yong Yu (Shanghai Jiao-Tong University), Qiang Yang (Hong Kong University of Science and Technology)
Paper 232: Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography
In a social network, nodes correspond to people or other social entities, and edges correspond to social links between them. In an effort to preserve privacy, the practice of anonymization replaces names with meaningless unique identifiers. We describe a family of schemes such that, even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist between specific targeted pairs of nodes.
Authors: Lars Backstrom and Jon Kleinberg (Cornell University), Cynthia Dwork (Microsoft Research)

Paper 247: Privacy-Enhancing Personalized Web Search
Personalized web search is a promising way to improve search quality by customizing search results for people with individual information goals. However, users are uncomfortable with exposing private preference information to search engines. On the other hand, privacy is not absolute, and can often be compromised if there is a gain in service or profitability to the user. Thus, a balance must be struck between search quality and privacy protection. This paper presents a scalable way for users to automatically build rich user profiles. These profiles summarize a user's interests into a hierarchical organization according to specific interests. Two parameters for specifying privacy requirements are proposed to help the user choose the content and degree of detail of the profile information that is exposed to the search engine. Experiments showed that the user profile improved search quality when compared to standard MSN rankings. More importantly, the results verified our hypothesis that a significant improvement in search quality can be achieved by sharing only higher-level user profile information, which is potentially less sensitive than detailed personal information.
Authors: Yabo Xu and Ke Wang (Simon Fraser University), Benyu Zhang and Zheng Chen (Microsoft Research Asia)

Paper 272: ActiveRDF: Object-Oriented Semantic Web Programming
Object-oriented programming is the current mainstream programming paradigm, but existing RDF APIs are mostly triple-oriented. Traditional techniques for bridging a similar gap between relational databases and object-oriented programs cannot be applied directly, given the different nature of Semantic Web data, as can be seen, for example, in the semantics of class membership, inheritance relations, and object conformance to schemas.

We present ActiveRDF, an object-oriented API for managing RDF data that offers full manipulation and querying of RDF data, does not rely on a schema, and fully conforms to RDF(S) semantics. ActiveRDF can be used with different RDF data stores: adapters have been implemented for generic SPARQL endpoints, Sesame, Jena, Redland and YARS, and new adapters can be added easily. In addition, integration with the popular Ruby on Rails framework enables fast development of Semantic Web applications.
Authors: Eyal Oren, Renaud Delbru, Sebastian Gerke, Armin Haller, and Stefan Decker (DERI, National University of Ireland, Galway)
Paper 279: XML Design for Relational Storage
Design principles for XML schemas that eliminate redundancies and avoid update anomalies have been studied recently, and several normal forms, generalizing those for relational databases, have been proposed. All of them, however, are based on the assumption of native XML storage, while in practice most XML data is stored in relational databases.

In this paper we study XML design and normalization for relational storage of XML documents. To be able to relate and compare XML and relational designs, we use an information-theoretic framework that measures the information content of relations and documents, with higher values corresponding to lower levels of redundancy. We show that the most common relational storage schemes preserve the notion of being well designed (i.e., anomaly- and redundancy-free); thus, existing XML normal forms guarantee well-designed relational storage as well. We further show that if this perfect option is not achievable, then a slight restriction on XML constraints guarantees a "second-best" relational design, according to the possible values of the information-theoretic measure. We finally consider an edge-based relational representation of XML documents, and show that while it has information-theoretic properties similar to other relational representations, it can behave significantly worse in terms of enforcing integrity constraints.
Authors: Solmaz Kolahi (University of Toronto), Leonid Libkin (University of Edinburgh)

Paper 286: Supervised Rank Aggregation
This paper is concerned with rank aggregation, the task of combining the results of individual ranking functions in meta-search. Previously, rank aggregation was performed mainly using unsupervised methods, and it is hard for the unsupervised approach to improve ranking performance by leveraging labeled data when such data is available. We propose employing a supervised learning approach to perform the task, which we refer to as "Supervised Rank Aggregation". We set up a general framework for conducting rank aggregation with supervised learning, in which learning for rank aggregation is formalized as an optimization problem that minimizes disagreements with the labeled ground-truth data. As a case study, we focus on Markov-chain-based rank aggregation. For Markov-chain-based methods, however, the optimization problem is not convex and is thus hard to solve. We transform the optimization problem into semi-definite programming and prove the correctness of the transformation. Experimental results on meta-search show that Supervised Rank Aggregation can significantly outperform existing unsupervised methods.
Authors: Yu-Ting Liu (Microsoft Research Asia and Beijing Jiaotong University), Tie-Yan Liu and Hang Li (Microsoft Research Asia), Tao Qin (Tsinghua University), Zhi-Ming Ma (Chinese Academy of Sciences)
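For background, the unsupervised Markov-chain aggregation that this paper supervises can be sketched as follows: build a transition matrix that prefers items placed higher by the input rankers, and order items by the stationary distribution. This is only an MC4-style baseline with every ranker weighted equally; the paper's contribution, learning the combination via semi-definite programming, is not shown here, and the example rankings are invented.

```python
def mc_aggregate(rankings, damping=0.15, iters=200):
    """Order items by the stationary distribution of a Markov chain in which
    a step from item i prefers items ranked above i by the input lists."""
    items = sorted({x for r in rankings for x in r})
    idx = {x: i for i, x in enumerate(items)}
    n = len(items)
    P = [[0.0] * n for _ in range(n)]
    for r in rankings:
        pos = {x: p for p, x in enumerate(r)}
        for i in items:
            for j in items:
                if i != j and pos.get(j, n) < pos.get(i, n):
                    P[idx[i]][idx[j]] += 1.0      # j is preferred to i
    for row in P:                                  # row-normalize
        s = sum(row)
        row[:] = [v / s for v in row] if s else [1.0 / n] * n
    pi = [1.0 / n] * n
    for _ in range(iters):                         # power iteration w/ teleport
        pi = [damping / n + (1 - damping) * sum(pi[i] * P[i][j] for i in range(n))
              for j in range(n)]
    return sorted(items, key=lambda x: -pi[idx[x]])

print(mc_aggregate([["a", "b", "c", "d"],
                    ["a", "c", "b", "d"],
                    ["b", "a", "c", "d"]]))  # consensus order, best first
```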
Paper 287: A Mobile Application Framework for the Geospatial Web
In this paper we present an application framework that leverages geospatial content on the World Wide Web by enabling innovative modes of interaction and novel types of user interfaces on advanced mobile phones and PDAs. We discuss the development steps currently involved in building mobile geospatial Web applications and derive three technological prerequisites for our framework: spatial query operations based on visibility and field of view, a 2.5D environment model, and a presentation-independent data exchange format for geospatial query results. We propose the Local Visibility Model as a suitable XML-based candidate and present a prototype implementation.
Authors: Rainer Simon and Peter Froehlich (Telecommunications Research Center Vienna)

Paper 308: GlobeTP: Template-Based Database Replication for Scalable Web Applications
Generic database replication algorithms do not scale linearly in throughput, as they require all update, deletion and insertion (UDI) queries to be applied to every database replica. Throughput is therefore limited to the point where the number of UDI queries alone is sufficient to overload one server. In such scenarios, partial replication of a database can help, as update queries are executed only by a subset of all servers. In this paper we propose GlobeTP, a system that employs partial replication to improve database throughput. GlobeTP exploits the fact that a Web application's query workload is composed of a small set of read and write templates. Using knowledge of these templates and their respective execution costs, GlobeTP provides database table placements that produce significant improvements in database throughput. We demonstrate the efficiency of this technique using two different industry-standard benchmarks. In our experiments, GlobeTP increases throughput by 57% to 150% compared to full replication, while using identical hardware configurations. Furthermore, adding a single query cache improves throughput by another 30% to 60%.
Authors: Tobias Groothuyse, Swaminathan Sivasubramanian, and Guillaume Pierre (Vrije Universiteit, Amsterdam)

Paper 324: Dynamic Personalized Pagerank in Entity-Relation Graphs
Extractors and taggers turn unstructured text into entity-relation (ER) graphs where nodes are entities (email, paper, person, conference, company) and edges are relations (wrote, cited, works-for). Typed proximity search of the form type=person NEAR company~"IBM", paper~"XML" is an increasingly useful search paradigm in ER graphs. Proximity search implementations either perform a Pagerank-like computation at query time, which is slow, or precompute, store and combine per-word Pageranks, which can be very expensive in terms of preprocessing time and space. We present HubRank, a new system for fast, dynamic, space-efficient proximity searches in ER graphs. During preprocessing, HubRank computes and indexes certain "sketchy" random walk fingerprints for a small fraction of nodes, carefully chosen using query log statistics. At query time, a small "active" subgraph is identified, bordered by nodes with indexed fingerprints. These fingerprints are adaptively loaded at various resolutions to form approximate personalized Pagerank vectors (PPVs), and PPVs at the remaining active nodes are then computed iteratively. We report on experiments with CiteSeer's ER graph and millions of real CiteSeer queries. Some representative numbers follow. On our testbed, HubRank preprocesses and indexes 52 times faster than whole-vocabulary PPV computation. A text index is 56 MB, while whole-vocabulary PPVs would consume 102 GB. If PPVs are truncated to 56 MB, precision compared to true Pagerank drops to 0.55; in contrast, HubRank has precision 0.91 at 63 MB. HubRank's average query time is 328 milliseconds, whereas query-time Pagerank computation takes 11 seconds on average.
Authors: Soumen Chakrabarti (IIT Bombay)
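A personalized PageRank vector (PPV), the object HubRank approximates, is ordinary PageRank whose teleport distribution is concentrated on the query's preference nodes. The sketch below computes one by plain power iteration on a tiny, made-up ER graph; HubRank's point is precisely to avoid this query-time iteration by caching fingerprints at hub nodes, so this is the slow baseline, not the system.

```python
def personalized_pagerank(adj, prefs, alpha=0.15, iters=100):
    """Power iteration for a PPV: with probability alpha the walk teleports
    to the preference nodes given in `prefs` (a distribution over nodes)."""
    nodes = list(adj)
    pr = {v: prefs.get(v, 0.0) for v in nodes}
    for _ in range(iters):
        nxt = {v: alpha * prefs.get(v, 0.0) for v in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = (1 - alpha) * pr[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:  # dangling node: return its mass to the preference nodes
                for v, p in prefs.items():
                    nxt[v] += (1 - alpha) * pr[u] * p
        pr = nxt
    return pr

# Tiny hypothetical entity-relation graph: an author node links to papers,
# papers cite papers.
graph = {"p1": ["p2"], "p2": ["p3"], "p3": [], "author": ["p1", "p2"]}
print(personalized_pagerank(graph, {"author": 1.0}))
```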
These fingerprints are adaptively loaded to various resolutions to form approximate personalized Pagerank vectors (PPVs). PPVs at the remaining active nodes are then computed iteratively. We report on experiments with CiteSeer's ER graph and millions of real CiteSeer queries. Some representative numbers follow. On our testbed, HubRank preprocesses and indexes 52 times faster than whole-vocabulary PPV computation. A text index is 56 MB. Whole-vocabulary PPVs would consume 102 GB. If PPVs are truncated to 56 MB, precision compared to true Pagerank drops to 0.55; in contrast, HubRank has precision 0.91 at 63 MB. HubRank's average query time is 328 milliseconds; query-time Pagerank computation takes 11 seconds on average.Soumen ChakrabartiIIT Bombay326Effort Estimation: How Valuable is it for a Web company to Use a Cross-company Data Set, Compared to Using Its Own Single-company Data Set?Previous studies comparing the prediction accuracy of effort models built using Web cross- and single-company data sets have been inconclusive, and as such replicated studies are necessary to determine under what circumstances a company can place reliance on a cross-company effort model. This paper therefore replicates a previous study by investigating how successful a cross-company effort model is: i) at estimating effort for Web projects that belong to a single company and were not used to build the cross-company model; ii) compared to a single-company effort model. Our single-company data set had data on 15 Web projects from a single company and our cross-company data set had data on 68 Web projects from 25 different companies. The effort estimates used in our analysis were obtained by means of two effort estimation techniques, namely forward stepwise regression and case-based reasoning. Our results were similar to those from the replicated study, showing that predictions based on the single-company model were significantly more accurate than those based on the cross-company model.Emilia MendesThe University of AucklandSergio Di MartinoUniversity of SalernoFilomena FerrucciUniversity of SalernoCarmine GravinoUniversity of Salerno339Random Web CrawlsThis paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph whose vertices are the pages and whose edges are the hypertextual links. Of course a Web crawl has a very particular structure; we recall some known results about it. We then propose a model generating similar structures. Our model simply simulates a crawl, i.e. it builds and crawls the graph at the same time. The graphs generated have many of the known properties of Web crawls. Our model is simpler than most random Web graph models, but captures the same properties. Notice that it models the crawling process rather than the page writing process of Web graph models.Toufik BennouasCriteo R&DFabien de MontgolfierUniversite Paris 7342Scaling Up All Pairs Similarity SearchGiven a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose novel optimization and indexing techniques for this problem, resulting in an algorithm that is both faster and simpler than the previous state-of-the-art approaches.
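For orientation, the all-pairs problem itself can be stated as a brute-force reference computation; the following Python sketch is purely illustrative (the dict-based sparse-vector representation and the function names are assumptions, not the authors' optimized algorithm):

from itertools import combinations
from math import sqrt

def cosine(u, v):
    # u and v are sparse vectors given as {feature: weight} dictionaries.
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def all_pairs_naive(vectors, threshold):
    # Return all pairs of ids whose cosine similarity meets the threshold.
    # The paper's contribution is avoiding this O(n^2) enumeration.
    return [(i, j) for (i, u), (j, v) in combinations(vectors.items(), 2)
            if cosine(u, v) >= threshold]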
We demonstrate the effectiveness of our algorithm on the public DBLP dataset, and on two real-world web applications: generating recommendations for the Orkut social network, and computing pairs of similar queries from search snippet data among the 5 million most frequently issued Google queries. Our algorithm is between 5 and 20 times faster than previous algorithms on these datasets.Roberto BayardoGoogleYiming MaU. California IrvineRamakrishnan SrikantGoogle366Long Distance Wireless Mesh Network Planning: Problem Formulation and SolutionSeveral research efforts as well as deployments have chosen IEEE 802.11 as a low-cost, long-distance access technology to bridge the digital divide. In this paper, we consider the important issue of planning such networks to minimize the system cost. This is a non-trivial task since it involves several sets of variables: the network topology, tower heights, antenna types to be used and their orientations, and radio transmit powers. The task is further complicated by the presence of network performance constraints and the inter-dependence among the variables. Our first contribution in this paper is the formulation of this problem in terms of the variables, constraints and the optimization criterion. Our second contribution is in identifying the dependencies among the variables and breaking down the problem into four tractable sub-parts. In this process, we extensively use domain knowledge to strike a balance between tractability and practicality. We have successfully evaluated the proposed algorithms using random input sets as well as real-life instances. We have been able to show detailed planning of network topology, required tower heights, antenna types, and transmit powers for the Ashwini project, a long distance WiFi network under deployment in Andhra Pradesh, India. In this case, we are able to achieve a cost within 2% of a lower bound estimate.Sayandeep SenIIT KanpurBhaskaran RamanIndian Institute of Technology, Kanpur391Yago: A Core of Semantic Knowledge - Unifying WordNet and WikipediaWe present YAGO, a lightweight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains roughly 900,000 entities and 5,000,000 facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from the unification of Wikipedia and WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality, by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships -- and in quantity, by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.Fabian M.
SuchanekMax-Planck-Institute for Computer ScienceGjergji KasneciMax-Planck-Institute for Computer ScienceGerhard WeikumMax-Planck-Institute for Computer Science395Integrating Value-based Requirement Engineering Models to WebML using VIP Business Modeling FrameworkRequirements engineering is emerging as an increasingly important discipline for supporting Web application development, as Web applications must satisfy diverse stakeholder needs as well as additional functional, information, multimedia and usability requirements compared to traditional software applications. Moreover, when considering innovative e-commerce applications, value-based requirements engineering is an extremely relevant methodology, which exploits the concept of economic value during the requirements engineering activity. In contrast, most of the methodologies proposed for the development of Web applications focus primarily on system design, paying less attention to requirements engineering, and specifically to value-based requirements engineering. Focusing on this aspect, the paper presents the integration of value-based requirements engineering models into WebML models using our recently proposed VIP Business Modeling Framework. The integration process is demonstrated using a well-known e-commerce application example by first presenting example VIP business models and then deriving WebML process, structural and other models from these business models.Farooque AzamBUAAZhang LiBUAARashid AhmadBUAA397Optimizing Web Search Using Social AnnotationsThis paper explores the use of social annotations to improve web search. Nowadays, many services, e.g. del.icio.us, have been developed for web users to organize and share their favorite web pages online by using social annotations. We observed that social annotations can benefit web search in two ways: 1) the annotations are usually good summaries of the corresponding web pages; 2) the count of annotations indicates the popularity of web pages. Two novel algorithms are proposed to incorporate this information into page ranking: 1) SocialSimRank (SSR) calculates the similarity between social annotations and web queries; 2) SocialPageRank (SPR) captures the popularity of web pages. Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality (popularity) of a web page from the web users' perspective. We further empirically evaluate the proposed methods with 50 manually annotated queries and 3000 auto-generated queries, on a dataset consisting of 690,482 web pages with 2,879,614 different annotations [32]. Experiments show that both SSR and SPR benefit web search significantly. By incorporating both the SPR and SSR features, the quality of search results can be improved by as much as 14.80% and 25.02% in MAP over the original performance on the two query sets, respectively.Shenghua BaoShanghai Jiao Tong UniversityXiaoyuan WuShanghai Jiao Tong UniversityBen FeiIBM China Research LabGui-Rong XueShanghai Jiao-Tong UniversityZhong SuIBM China Research LabYong YuShanghai Jiao-Tong University403Answering Bounded Continuous Search Queries in the World Wide WebSearch queries applied to extract relevant information from the World Wide Web over a period of time may be denoted as continuous search queries. The improvement of continuous search queries may concern not only the quality of retrieved results but also the freshness of results, i.e.
the time between the availability of a respective data object on the Web and the notification of a user by the search engine. In some cases a user should be notified immediately, since the value of the respective information decreases quickly; examples are news about companies that affects the value of the respective stocks, or sales offers for products that may no longer be available after a short period of time. In the document filtering literature the optimization of such queries is usually based on threshold classification. Documents above a quality threshold are returned to a user. The threshold is tuned in order to optimize the quality of retrieved results. The disadvantage of such approaches is that the amount of information returned to a user can hardly be controlled without further user interaction. In this paper we consider the optimization of bounded continuous search queries, where only the estimated best k elements are returned to a user. We present a new optimization method for bounded continuous search queries based on optimal stopping theory and compare the new method to methods currently applied by Web search systems. The new method provides results of significantly higher quality in cases where very fresh results have to be delivered.Dirk KukulenzInstitute of Information SystemsAlexandros NtoulasMicrosoft Search Labs420Reliable QoS Monitoring Based on Client FeedbackService-level agreements (SLAs) establish a contract between service providers and clients concerning Quality of Service (QoS) parameters. Without proper penalties, service providers have strong incentives to deviate from the advertised QoS, causing losses to the clients. Reliable QoS monitoring, and proper penalties computed on the basis of delivered QoS, are therefore essential for the trustworthiness of a service-oriented environment. In this paper, we present a novel QoS monitoring mechanism based on quality ratings from the clients. A reputation mechanism collects the ratings and computes the actual quality delivered to the clients. The mechanism provides incentives for the clients to report honestly, and pays special attention to minimizing cost and overhead.Radu JurcaEcole Polytechnique Federale de Lausanne EPFLWalter BinderUniversity of LuganoBoi FaltingsEcole Polytechnique Federale de Lausanne EPFL428Hierarchical, Perceptron-like Learning for Ontology-Based Information ExtractionRecent work on ontology-based Information Extraction (IE) has tried to make increased use of the knowledge from the target ontology in order to improve semantic annotation results. However, only very few approaches are able to benefit from the ontology structure; one of them is not a learning system and is thus not easy to adapt to new domains, whereas the other does not perform semantic annotation of documents, but only ontology population. This paper introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process. Hierarchical classification takes into account the relations between concepts, thus benefiting directly from the ontology. We also carry out evaluation experiments on the largest available semantically annotated corpus of 146 classes. The results clearly demonstrate the benefits of using knowledge from the ontology for ontology-based IE.
We also demonstrate the advantages of our approach over other state-of-the-art learning systems on a commonly used benchmark dataset.Yaoyong LiUniversity of SheffieldKalina BontchevaUniversity of Sheffield429An Adaptive Crawler for Locating Hidden-Web Entry PointsIn this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links, and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.Luciano BarbosaUniversity of UtahJuliana FreireUniversity of Utah433Just the Right Amount: Extracting Modules from OntologiesThe ability to extract meaningful fragments from an ontology is key for ontology re-use. We propose a definition of a module that guarantees to completely capture the meaning of a given set of terms, i.e., to include all axioms relevant to the meaning of these terms, and study the problem of extracting minimally sized modules. We show that the problem of deciding if a module is minimal is undecidable even for rather restricted sub-languages of OWL DL. Hence we propose two ``approximations'', i.e., alternative definitions of modules for a vocabulary that still provide the above guarantee, but that are possibly too strict, and that may thus result in larger modules: the first approximation is semantic and can be checked using existing DL reasoners; the second is syntactic, and can be computed in polynomial time. Finally, we report on an empirical evaluation of our syntactic approximation that demonstrates that the modules we extract are surprisingly small.Bernardo Cuenca GrauUniversity of ManchesterIan HorrocksUniversity of ManchesterYevgeny KazakovUniversity of ManchesterUlrike SattlerUniversity of Manchester435From SPARQL to Rules (and back)As the data and ontology layers of the Semantic Web stack have achieved a certain level of maturity in standard recommendations such as RDF and OWL, the current focus lies on two related aspects. On the one hand, the definition of a suitable query language for RDF, SPARQL, seems to be close to candidate recommendation status within the W3C. The establishment of the Rules layer on top of the existing stack on the other hand marks the next step to be tackled, where especially languages with their roots in Logic Programming and Deductive Databases are receiving considerable attention. The purpose of this paper is threefold. First, we discuss the formal semantics of SPARQL extending recent results in several ways. Second, we provide translations from SPARQL to Datalog with stratified negation as failure. Third, we propose some useful and easy to implement extensions of SPARQL, based on this translation. 
As it turns out, the combination serves both for direct implementations of SPARQL on top of existing rules engines and as a basis for more general rules and query languages on top of RDF.Axel PolleresDERI, National University of Ireland447A Fault Model and Mutation Testing of Access Control PoliciesTo increase confidence in the correctness of specified policies, policy developers can conduct policy testing by supplying typical test inputs (requests) and subsequently checking test outputs (responses) against expected ones. Unfortunately, manual testing is tedious and few tools exist for automated testing of XACML policies. We present a fault model for access control policies and a framework to explore it. The framework includes mutation operators used to implement the fault model, mutant generation, equivalent-mutant detection, and mutant-killing determination. This framework allows us to investigate our fault model, evaluate coverage criteria for test generation and selection, and determine a relationship between structural coverage and fault-detection effectiveness. We have implemented the framework and applied it to various XACML policies. Our experimental results offer valuable insights into choosing mutation operators in mutation testing and choosing coverage criteria in test generation and selection.Evan MartinNorth Carolina State UniversityTao XieNorth Carolina State University461Internet-Scale Collection of Human-Reviewed DataEnterprise data processing and content aggregation systems often require extensive use of human-reviewed data (e.g. for training and monitoring machine learning-based applications). Today these needs are often met by in-house efforts or offshore contracting. Emerging applications attempt to provide automation for human-reviewed data collection at Internet scale. We conduct extensive experiments to study the effectiveness of one such application. We also study the feasibility of using Yahoo! Answers, a general question-answering forum, for human-reviewed data collection.Qi SuYahoo! IncDmitry PavlovYahoo! IncJyh-Herng ChowYahoo! IncWendell BakerYahoo! Inc464Using Google Distance to Weight Approximate Ontology MatchesDiscovering mappings between concept hierarchies is widely regarded as one of the hardest and most urgent problems facing the Semantic Web. The problem is even harder in domains where concepts are inherently vague and ill-defined, and cannot be given a crisp definition. A notion of approximate concept mapping is required in such domains, but until now no such notion has been available. The first contribution of this paper is a definition of such approximate mappings: a mapping between concepts is decomposed into a number of submappings, and a sloppiness value determines the fraction of these submappings that can be ignored when establishing the mapping. A potential problem of such a definition is that, with an increasing sloppiness value, it will gradually allow mappings between any two arbitrary concepts. To improve on this trivial behaviour, we need to design a heuristic weighting which minimises the sloppiness required to conclude desirable matches, but at the same time maximises the sloppiness required to conclude undesirable matches. The second contribution of this paper is to show that a Google-based similarity measure has exactly these desirable properties. We establish these results by experimental validation in the domain of musical genres. We show that this domain does suffer from ill-defined concepts.
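For reference, the best-known Google-based similarity measure is the normalized Google distance of Cilibrasi and Vitanyi; whether the paper uses exactly this weighting is an assumption, but its standard form is

NGD(x,y) = \frac{\max\{\log f(x),\ \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x),\ \log f(y)\}}

where f(x) is the number of Web pages containing term x, f(x,y) is the number of pages containing both terms, and N is the total number of indexed pages; smaller values indicate more closely related terms.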
We take two real-life genre hierarchies from the Web, we compute approximate mappings between them at varying levels of sloppiness, and we validate our results against a hand-crafted Gold Standard. Our method makes use of the huge amount of knowledge that is implicit in the current Web, and exploits this knowledge as a heuristic for establishing approximate mappings between ill-defined concepts.Risto GligorovPhilips ResearchZharko AleksovskiPhilips ResearchWarner ten KatePhilips ResearchFrank van HarmelenVrije Universiteit Amsterdam468A Framework for Rapid Integration of Presentation ComponentsThe development of user interfaces (UIs) is one of the most time-consuming aspects in software development. In this context, the lack of proper reuse mechanisms for UIs is increasingly becoming manifest, especially as software development is more and more moving toward composite applications. In this paper we propose a framework for the integration of stand-alone modules or applications, where integration occurs at the presentation layer. Hence, the final goal is to reduce the effort required for UI development by maximizing reuse. The design of the framework is inspired by lessons learned from application integration, appropriately modified to account for the specificity of the UI integration problem. We provide an abstract component model to specify characteristics and behaviors of presentation components and propose an event-based composition model to specify the composition logic. Components and composition are described by means of a simple XML-based language, which is interpreted by a runtime middleware for the execution of the resulting composite application. A proof-of-concept prototype allows us to show that the proposed component model can also easily be applied to existing presentation components, built with different languages and/or component technologies.Jin YuUniversity of New South WalesBoualem BenatallahUniversity of New South WalesRegis Saint-PaulUniversity of New South WalesFabio CasatiUniversity of TrentoFlorian DanielPolitecnico di MilanoMaristella MateraPolitecnico di Milano469Preference-based Selection of Highly Configurable Web ServicesA key challenge for dynamic Web service selection is that Web services are typically highly configurable and service requesters often have dynamic preferences on service configurations. Current approaches, such as WS-Agreement, describe Web services by enumerating the various possible service configurations, an inefficient approach when dealing with numerous service attributes with large value spaces. We model Web service configurations and associated prices and preferences more compactly using utility function policies, which also allows us to draw from multi-attribute decision theory methods to develop an algorithm for optimal service selection. In this paper, we present an OWL ontology for the specification of configurable Web service offers and requests, and a flexible and extensible framework for optimal service selection that combines declarative logic-based matching rules with optimization methods, such as linear programming. Assuming additive price/preference functions, experimental results indicate that our algorithm introduces an overhead of only around 2 sec. compared to a random service selection, while giving optimal results.
The overhead, as a percentage of total time, decreases as the number of offers and configurations increases.Steffen LamparterInstitute AIFB, Universitaet Karlsruhe THAnupriya AnkolekarInstitute AIFB, Universitaet Karlsruhe THRudi StuderInstitute AIFB, Universitaet Karlsruhe THStephan GrimmFZI Karlsruhe479Introduction and Evaluation of Martlet, a Scientific Workflow Language for Abstracted ParallelisationThe workflow language Martlet described in this paper implements a new programming model that allows users to write parallel programs and analyse distributed data without having to be aware of the details of the parallelisation. Martlet abstracts the parallelisation of the computation and the splitting of the data through the inclusion of constructs inspired by functional programming. These allow programs to be written as an abstract description that can be adjusted automatically at runtime to match the data set and available resources. Using this model it is possible to write programs to perform complex calculations across a distributed data set, such as Singular Value Decomposition or Least Squares problems, as well as creating an intuitive way of working with distributed systems. Having described and evaluated Martlet against other functional languages for parallel computation, this paper goes on to look at how Martlet might develop. In doing so it covers both possible additions to the language itself, and the use of JIT compilers to increase the range of platforms it is capable of running on.Daniel James GoodmanOxford University481P-TAG: Large Scale Automatic Generation of Personalized Annotation TAGs for the WebThe success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free-form metadata or tags, as used in social bookmarking and folksonomy-based systems, have become more and more popular and successful. Such tags are relevant keywords associated with or assigned to a piece of information (e.g., a Web page), thus describing the item and enabling keyword-based classification. In this paper we propose P-TAG, a method which automatically generates personalized tags for Web pages. Keywords are generated based on the content of the Web page but also based on the content of the user's Desktop, thus expressing a personalized viewpoint very relevant for personal tags. We implemented and tested several algorithms for this approach and evaluated the relevance of the resulting keywords. These evaluations showed very promising results, and we are therefore confident that such a user-oriented automatic tagging approach can provide large-scale personalized metadata annotation as an important step towards realizing the Semantic Web.Paul-Alexandru ChiritaL3S / University of HannoverStefania CostacheL3S Research Center / University of HannoverSiegfried HandschuhNational University of IrelandWolfgang NejdlUniversity of Hannover485A Unified Platform for Data Driven Web Applications with Automatic Client-Server PartitioningData-driven web applications are structured into three tiers with different programming models at each tier. This division forces developers to manually partition application functionality across the tiers, resulting in complex logic, suboptimal partitioning, and expensive re-partitioning of applications. In this paper, we introduce a unified platform for automatic partitioning of data-driven web applications.
Our approach is based on Hilda, a high-level declarative programming language with a unified data and programming model for all the layers of the application. Based on run-time properties of the application, Hilda's run-time system automatically partitions the application between the tiers to improve response time while adhering to memory or processing constraints at the clients. We evaluate our methodology with traces from a real application and with TPC-W, and our results show that automatic partitioning outperforms manual partitioning without the associated development overhead.Fan YangCornell UniversityNitin GuptaCornell UniversityNicholas GernerCornell UniversityXin QiCornell UniversityAlan DemersCornell UniversityJohannes GehrkeCornell UniversityJayavel ShanmugasundaramYahoo!495A Large-scale Evaluation and Analysis of Personalized Search StrategiesAlthough personalized search has been proposed for many years and many personalization strategies have been investigated, it is still unclear whether personalization is consistently effective on different queries for different users, and under different search contexts. In this paper, we study the problem and reach some preliminary conclusions. We present a large-scale personalized search evaluation framework based on search logs and then evaluate five personalized search strategies (including two click-based and three profile-based ones) using 12 days of MSN search logs. By analyzing the results, we find that personalized search yields significant improvements over common web search on some queries, but has little effect on other queries (e.g., queries with small click entropy) and even harms search accuracy in some situations. Furthermore, we show that click-based personalization strategies perform consistently and considerably well, while profile-based ones are unstable in our experiments. We also find that both long-term and short-term contexts are very important in improving search performance for profile-based personalized search strategies.Zhicheng DouNankai University, ChinaRuihua SongMicrosoft Research AsiaJi-Rong WenMicrosoft Research Asia504Mapping-Driven XML TransformationClio is an existing schema-mapping tool that provides user-friendly means to manage and facilitate the complex task of transformation and integration of heterogeneous data such as XML over the Web or in XML databases. By means of mappings from source to target schemas, Clio can help users conveniently establish the precise semantics of data transformation and integration. In this paper we study the problem of how to efficiently implement such data transformations (i.e., generating target data from the source data based on schema mappings). We present a three-phase framework for high-performance XML-to-XML transformation based on schema mappings, and discuss methodologies and algorithms for implementing these phases. In particular, we elaborate on novel techniques such as streamed extraction of mapped source values and scalable disk-based merging of overlapping data (including duplicate elimination). We compare our transformation framework with alternative methods such as using XQuery or SQL/XML provided by current commercial databases.
The results demonstrate that the three-phase framework, simple as it is, is highly scalable and outperforms the alternative methods by orders of magnitude.Haifeng JiangIBM Almaden Research CenterHoward HoIBM Almaden Research CenterLucian PopaIBM Almaden Research CenterWook-Shin HanComputer Engineering Dept., Kyungpook National University, Korea507A High-Performance Interpretive Approach to Schema-Directed ParsingXML delivers key advantages in interoperability due to its flexibility, expressiveness, and platform-neutrality. As XML has become a performance-critical aspect of the next generation of business computing infrastructure, however, it has become increasingly clear that XML parsing often carries a heavy performance penalty, and that current, widely-used parsing technologies are unable to meet the performance demands of an XML-based computing infrastructure. Several efforts have been made to address this performance gap through the use of grammar-based parser generation. While the performance of generated parsers has been significantly improved, adoption of the technology has been hindered by the complexity of compiling and deploying the generated parsers. Through careful analysis of the operations required for parsing and validation, we have devised a set of specialized bytecodes designed for the task of XML parsing and validation. These bytecodes are designed to engender the benefits of fine-grained composition of parsing and validation that make existing compiled parsers fast, while being coarse-grained enough to minimize interpreter overhead. This technique of using an interpretive, validating parser balances the need for performance against the requirements of simple tooling and robust, scalable infrastructure. Our approach is demonstrated with a specialized schema compiler, used to generate bytecodes which in turn drive an interpretive parser. With almost as little tooling and deployment complexity as a traditional interpretive parser, the bytecode-driven parser usually demonstrates performance within 20% of the fastest fully compiled solutions.Morris MatsaIBMEric PerkinsIBMAbraham HeifetsIBMMargaret Gaitatzes KostoulasIBMDaniel SilvaIBMNoah MendelsohnIBMMichelle LegerIBM511Identifying and Discriminating Between Web and Peer-to-Peer Traffic in the Network CoreTraffic classification is the ability to identify and categorize network traffic by application type. In this paper, we consider the problem of traffic classification in the network core. Classification at the core is challenging because only partial information about the flows and their contributors is available. We address this problem by developing and evaluating a classification framework that can classify a flow using only unidirectional flow information. We validated this approach using recent full-payload packet traces that we collected and pre-classified to establish a "base truth". From our evaluation, we find that flow statistics along the server-to-client path of a TCP connection provide higher classification accuracy than flow statistics along the client-to-server path.
Because collection of the server-to-client flow statistics may not always be feasible, we developed and verified an algorithm that can estimate the missing statistics from a unidirectional packet trace.Jeffrey ErmanUniversity of CalgaryAnirban MahantiUniversity of CalgaryMartin ArlittHP Labs/University of CalgaryCarey WilliamsonUniversity of Calgary516Expertise Networks in Online Communities: Structure and AlgorithmsWeb-based communities have become an important place for people to seek and share expertise. We find that networks in these communities typically differ in their topology from other online networks such as the World Wide Web. Systems targeted to augment web-based communities by automatically identifying users with expertise, for example, need to adapt to the underlying interaction dynamics. In this study, we analyze the Java Forum, a large online help-seeking community, using social network analysis methods. We test a set of network-based ranking algorithms, including PageRank and HITS, on this large size social network in order to identify users with high expertise. We then use simulations to identify a small number of simple rules governing the question-answer dynamic in the network. These simple rules not only replicate the structural characteristics and algorithm performance on the empirically observed Java Forum, but also allow us to evaluate how other algorithms may perform in communities with different characteristics. We believe this approach will be fruitful for practical algorithm design and implementation for online expertise-sharing communities.Jun ZhangUniversity of MichiganMark AckermanUniversity of MichiganLada AdamicUniversity of Michigan520Why We Search: Visualizing and Predicting User BehaviorThe aggregation and comparison of behavioral patterns on the WWW represent a tremendous opportunity for understanding past behaviors and predicting future behaviors. In this paper, we take a first step at achieving this goal. We present a large scale study correlating the behaviors of Internet users on multiple systems ranging in size from 27 million queries to 14 million blog posts to 20,000 news articles. We formalize a model for events in these time-varying datasets and study their correlation. We have created an interface for analyzing the datasets, which includes a novel visual artifact, the DTWRadar, for summarizing differences between time series. Using our tool we identify a number of behavioral properties that allow us to understand the predictive power of patterns of use.Eytan AdarUniversity of Washington, CSEDaniel WeldUniversity of Washington, CSEBrian BershadUniversity of Washington, CSESteven GribbleUniversity of Washington530GeoTracker: Geospatial and Temporal RSS NavigationThe Web is rapidly moving towards a platform for mass collaboration in content production and consumption. Fresh content on a variety of topics, people, and places is being created and made available on the Web at breathtaking speed. Navigating the content effectively not only requires techniques such as aggregating various RSS-enabled feeds, but it also demands a new browsing paradigm. In this paper, we present novel geospatial and temporal browsing techniques that provide users with the capability of aggregating and navigating RSS enabled content in a timely, personalized and automatic manner. 
In particular, we describe a system called GeoTracker that utilizes both a geospatial representation and a temporal (chronological) presentation to help users spot the most relevant updates quickly. Within the context of this work, we provide a middleware engine that supports intelligent aggregation and dissemination of RSS feeds with personalization to desktops and mobile devices. We study the navigation capabilities of this system on two kinds of data sets, namely 2006 World Cup soccer data collected over two months and breaking news items that occur every day. We also demonstrate that the application of such technologies to the video search results returned by YouTube and Google greatly enhances a user's ability to locate and browse videos based on his or her geographical interests. Finally, we demonstrate that the location inference performance of GeoTracker compares well against machine learning techniques used in the natural language processing/information retrieval community. Despite its algorithmic simplicity, it maintains high recall.Yih-Farn ChenAT&T Labs - ResearchGiuseppe Di FabbrizioAT&T Labs - ResearchDavid GibbonAT&T Labs - ResearchRittwik JanaAT&T Labs - ResearchSerban JoraAT&T Labs - ResearchBernard RengerAT&T Labs - ResearchBin WeiAT&T Labs - Research535Investigating Behavioral Variability in Web SearchIn this paper we describe a longitudinal log-based study that investigated variability in people's interaction behavior when engaged in search-related activities on the World Wide Web. We analyze the consistency of interaction patterns for more than two thousand volunteer users over a period of five months. Findings of our analysis suggest that there are dramatic differences in variability in key aspects of the interaction within and between queries, and within and between users. Our findings also indicate the existence of at least two distinct classes of user -- navigators and explorers -- who exhibit large differences in their search behavior. These findings have implications for the design of tools to support more effective search-related interactions on the Web.Ryen WhiteMicrosoft ResearchSteven M. DruckerMicrosoft LiveLabs550Learning to Detect Phishing EmailsEach month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. This attack method, commonly known as "phishing," is most commonly initiated by sending out emails with links to spoofed websites that harvest information. We present a method for detecting these attacks, which in its most general form is an application of machine learning on a feature set designed to highlight user-targeted deception in electronic communication. This method is applicable, with slight modification, to detection of phishing websites, or of the emails used to direct victims to these sites. We evaluate this method on a set of approximately 860 such phishing emails and 6950 non-phishing emails, and correctly identify over 96% of the phishing emails while misclassifying only on the order of 0.1% of the legitimate emails.
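The kind of deception-oriented features such a filter relies on can be illustrated with a small extractor like the one below (a hypothetical Python sketch; the specific features and names are chosen for illustration and are not the paper's exact feature set):

import re

def phishing_features(body_text, links):
    # links: list of (anchor_text, href) pairs extracted from the message body.
    # Hypothetical deception-oriented features; a real system would feed them
    # to a standard classifier such as an SVM or a random forest.
    def host(href):
        parts = href.split("/")
        return parts[2] if len(parts) > 2 else ""
    return {
        "has_ip_based_url": any(re.match(r"https?://\d{1,3}(\.\d{1,3}){3}", href)
                                for _, href in links),
        "num_links": len(links),
        "anchor_href_mismatch": sum(1 for text, href in links
                                    if "http" in text.lower() and text.strip() not in href),
        "mentions_verify_or_account": bool(re.search(r"\b(verify|account|suspend)",
                                                     body_text, re.I)),
        "max_dots_in_host": max((host(href).count(".") for _, href in links), default=0),
    }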
We conclude with thoughts on the future of such techniques for identifying deception, particularly with respect to the evolutionary nature of the attacks and the information available.Ian FetteCarnegie Mellon UniversityNorman SadehCarnegie Mellon UniversityAnthony TomasicCarnegie Mellon University551Web Projections: Learning from Contextual Subgraphs of the WebResearch on web search has demonstrated the value of using information about the graphical structure of the Web in ranking search results. To date, specific graphical properties have been used in these analyses. We introduce a web projection method that generalizes prior work on graphical relationships of the Web in several ways. With this approach, we create subgraphs by projecting sets of pages and domains onto the larger web graph, and then use machine learning to construct predictive models that operate on graphical properties. We describe the method and then present experiments that illustrate the construction of predictive models of search result quality and user query reformulation.Jure LeskovecCarnegie Mellon UniversitySusan DumaisMicrosoft ResearchEric HorvitzMicrosoft Research555Exposing Private Information by Timing Web ApplicationsWe show that the time web sites take to respond to HTTP requests can leak private information, using two different types of attacks. The first directly measures response times from a web site to expose private information such as the validity of a username at a secured site or the number of private photos in a publicly viewable gallery. The second, called cross-site timing, enables a malicious web site to obtain information from the user's perspective at another site. For example, a malicious site can learn if the user is currently logged in at a victim site and, in some cases, the number of objects in the user's shopping cart. Our experiments suggest that these timing vulnerabilities are widespread. We explain in detail how and why these attacks work, and discuss methods for writing web application code that resists these attacks.Andrew BortzStanford UniversityDan BonehStanford UniversityPalash NandyStanford University557CANTINA: A Content-Based Approach to Detecting Phishing Web SitesPhishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the well-known TF-IDF algorithm used in information retrieval. We also discuss the design and evaluation of several heuristics we developed to reduce our false positive rates. Our experiments show that CANTINA is good at detecting phishing sites, correctly labeling approximately 95% of phishing sites.Yue ZhangUniversity of PittsburghJason HongCarnegie Mellon UniversityLorrie CranorCarnegie Mellon University558Toward Expressive Syndication on the WebSyndication systems on the Web have attracted vast amounts of attention in recent years. As technologies have emerged and matured, there has been a transition to more expressive syndication approaches; that is, subscribers and publishers are provided with more expressive means of describing their interests and published content, enabling more accurate information filtering. In this paper, we formalize a syndication architecture that utilizes expressive Web ontologies and logic-based reasoning for selective content dissemination.
This provides finer grained control for filtering and automated reasoning for discovering implicit subscription matches, both of which are not achievable in less expressive approaches. We then address one of the main limitations with such a syndication approach, namely matching newly published information with subscription requests in an efficient and practical manner. To this end, we investigate continuous query answering for a large subset of the Web Ontology Language (OWL); specifically, we formally define continuous queries (i.e., subscriptions) for OWL knowledge bases and present a novel algorithm for continuous query answering in a large subset of this language. Lastly, an evaluation of the query approach is shown, demonstrating its effectiveness for syndication purposes.Christian Halaschek-WienerDepartment of Computer Science, The University of MarylandJames HendlerDepartment of Computer Science, The University of Maryland560Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the CrowdsAs part of a large effort to acquire large repositories of facts from unstructured text on the Web, a seed-based framework for textual information extraction allows for weakly supervised extraction of class attributes (e.g., "side effects" and "generic equivalent" for drugs) from anonymized query logs. The extraction is guided by a small set of seed attributes, without any need for handcrafted extraction patterns or further domain-specific knowledge. The attributes of classes pertaining to various domains of interest to Web search users have accuracy levels significantly exceeding current state of the art. Inherently noisy search queries are shown to be a highly valuable, albeit unexplored, resource for Web-based information extraction, for the task of class attribute extraction as well as for named entity discovery.Marius PascaGoogle Inc.565Ontology Summarization Based on RDF Sentence GraphOntology summarization is very important to quick understanding and selection of ontologies. In this paper, we study extractive summarization of ontology, which generates an indicative summary of ontology automatically by extracting a salient part of ontology. We use a notion of RDF sentence as the basic unit of ontology. An RDF Sentence Graph is proposed to characterize the linkage between RDF sentences derived from ontology from the viewpoint of a surfer. The salience of an RDF sentence is assessed in terms of its "centrality" in the graph. We propose to summarize an ontology as a set of salient RDF sentences extracted from the ontology according to a re-ranking strategy. We compare several methods in assessing the salience of RDF sentences and give an overall evaluation of experimented results. Experiments show that the RDF Sentence Graph approach to ontology summarization is feasible.Xiang ZhangSchool of Computer Science and Engineering, Southeast University, ChinaGong ChengSchool of Computer Science and Engineering, Southeast University, P.R. ChinaYuzhong QuSchool of Computer Science and Engineering, Southeast University, P.R. China570Google News Personalization: Scalable Online Collaborative FilteringSeveral approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. 
We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from the different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable to other applications and languages with minimal effort. This paper describes our algorithms and system setup in detail, and reports results of running the recommendation engine on Google News.Abhinandan DasGoogleMayur DatarGoogleAshutosh GargGoogleShyam RajaramUniversity of Illinois at Urbana Champaign582On Anonymizing Query Logs via Token-based HashingIn this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token. We show that statistical linguistic techniques may be applied to partially compromise the anonymization. We then analyze the specific risks that arise from these partial compromises, focusing on the revelation of identity from unambiguous names, addresses, and so forth, and the revelation of facts associated with an identity that are deemed to be highly sensitive. Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete risk analysis framework for evaluating other proposals.Ravi KumarYahoo! ResearchJasmine NovakYahoo! ResearchBo PangYahoo! ResearchAndrew TomkinsYahoo! Research584Semi-Automated Adaptation of Service InteractionsIn today's Web, many functionality-wise similar Web services are offered through heterogeneous interfaces (operation definitions) and business protocols (ordering constraints defined on legal operation invocation sequences). The typical approach to enabling interoperation in such a heterogeneous setting is to develop adapters. There have been approaches for classifying possible mismatches between service interfaces and business protocols to facilitate adapter development. However, the hard job is that of identifying, given two service specifications, the actual mismatches between their interfaces and business protocols. In this paper we present novel techniques and a tool that provide semi-automated support for the identification and resolution of mismatches between service interfaces and protocols, and for generating adapter specifications. We make the following main contributions: (i) we identify mismatches between service interfaces, which leads to finding mismatches of the signature, merge/split, and extra/missing message types; (ii) we identify all ordering mismatches between service protocols and generate a tree, called a mismatch tree, for mismatches that require developers' input for their resolution. In addition, we provide semi-automated support for analyzing the mismatch tree to help in resolving such mismatches. We have implemented the approach in a tool inside IBM WID (WebSphere Integration Developer). Our experiments with some real-world case studies show the viability of the proposed approach. The methods and tool are significant in that they considerably simplify the problem of adapting services so that interoperation is possible.Hamid R. Motahari-Nezhad University of New South WalesBoualem BenatallahUniversity of New South WalesAxel MartensIBM T.J. Watson Research CenterFrancisco CurberaIBM T.J.
Watson Research CenterFabio CasatiUniversity of Trento586Learning Information Intent via ObservationUsers in an organization frequently request help by sending request messages to assistants that express an information intent: an intention to update an information system. Assistants spend a significant amount of time and effort processing these messages. For example, human resource assistants process requests to update personnel records, and executive assistants process requests to schedule conference rooms or to make travel reservations. To process the intent of a message, assistants read the message and then locate, complete, and submit a form that corresponds to the expressed intent. Automatically or semi-automatically processing the intent of a message on behalf of an assistant would ease the mundane and repetitive nature of this kind of work. For a well-understood domain, a straightforward application of natural language processing techniques can be used to build an intelligent form interface to semi-automatically process information intents. However, high performance parsers are based on machine learning algorithms that require a large collection of messages that have been labeled by an expert. The generation of a labeled corpus of messages is a major barrier to the construction of a parser. In this paper, we investigate the construction of a natural language processing system and an intelligent form system that observes an assistant processing a request. The intelligent form system then generates a labeled training corpus by interpreting the observations. This paper reports on the measurement of the performance of the machine learning algorithms based on real data. The combination of observations, machine learning and interaction design produces an effective intelligent form interface based on natural language processing.Anthony TomasicCarnegie Mellon UniversityIsaac SimmonsCarnegie Mellon UniversityJohn ZimmermanCarnegie Mellon University588Page-level Template Detection via Isotonic SmoothingWe develop a novel framework for the "page-level" template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.Deepayan ChakrabartiYahoo! ResearchRavi KumarYahoo! ResearchKunal PuneraUniversity of Texas at Austin591Multiway SLCA-based Keyword Search in XML DataKeyword search for smallest lowest common ancestors (SLCAs) in XML data has recently been proposed as a meaningful way to identify interesting data nodes in XML data whose subtrees contain an input set of keywords. In this paper, we generalize this useful search paradigm to support keyword search beyond the traditional AND semantics to include both AND and OR boolean operators as well. We first analyze properties of the LCA computation and propose more efficient algorithms to solve the traditional keyword search problem (with only AND semantics). We then extend our approach to handle general keyword search involving combinations of AND and OR boolean operators.
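For readers unfamiliar with the notion, SLCAs have a compact characterization when nodes carry Dewey labels: the LCA of two nodes is the longest common prefix of their labels, and the SLCAs are the LCAs with no other LCA below them. The following Python sketch (illustrative only; it assumes Dewey labels as integer tuples and is quadratic, unlike the paper's more efficient algorithms) computes the SLCAs for two keyword match lists:

def lca(a, b):
    # The LCA of two Dewey-labelled nodes is the longest common prefix of their labels.
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def slca(list1, list2):
    # list1, list2: Dewey labels (tuples) of the nodes matching each keyword.
    candidates = {lca(a, b) for a in list1 for b in list2}
    def is_proper_ancestor(x, y):
        return len(x) < len(y) and y[:len(x)] == x
    # Keep only the smallest LCAs: those with no candidate strictly below them.
    return [c for c in candidates
            if not any(is_proper_ancestor(c, d) for d in candidates)]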
The effectiveness of our new algorithms is demonstrated with a comprehensive experimental performance study.Chong SunNational University of SingaporeChee-Yong ChanNational University of SingaporeAmit GoenkaNational University of Singapore592The Discoverability of the WebPrevious studies have highlighted the rapidity with which new content arrives on the web. We study the extent to which this new content can be efficiently discovered in the crawling model. Our study has two parts. First, we employ a maximum cover formulation to study the inherent difficulty of the problem in a setting in which we have perfect estimates of likely sources of links to new content. Second, we relax the requirement of perfect estimates into a more realistic setting in which algorithms must discover new content using historical statistics to estimate which pages are most likely to yield links to new content. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 50% of all new content with under 3% overhead, and 100% of new content with 28% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: 26% of new content is accessible only by recrawling a constant fraction of the entire web. Of the remaining 74%, 80% of this content may be discovered within one week at a discovery cost equal to 1.3 times the cost of gathering the new content, in a model with full monthly recrawls.Anirban DasguptaYahoo! ResearchArpita GhoshYahoo! ResearchRavi KumarYahoo! ResearchChristopher OlstonYahoo! ResearchSandeep PandeyYahoo! ResearchAndrew TomkinsYahoo! Research595Defeating Script Injection Attacks with Browser-Enforced Embedded PoliciesWeb sites that accept and display content such as wiki articles or comments typically filter the content to prevent injected script code from running in browsers that view the site. The diversity of browser rendering algorithms and the desire to allow rich content make filtering quite difficult, however, and attacks such as the Samy and Yamanner worms have exploited filtering weaknesses. To solve this problem, this paper proposes a simple mechanism called Browser-Enforced Embedded Policies (BEEP). The idea is that a web site can embed a policy inside its pages that specifies which scripts are allowed to run. The browser, which knows exactly when it will run a script, can enforce this policy perfectly. We have added BEEP support to several browsers, and built tools to simplify adding policies to web applications. We found that supporting BEEP in browsers requires only small and localized modifications, modifying web applications requires minimal effort, and enforcing policies is generally lightweight.Trevor JimAT&T Labs - ResearchNikhil SwamyUniversity of MarylandMichael HicksUniversity of Maryland602Open User Profiles for Adaptive News Systems: Help or Harm?Over the last five years a range of projects has focused on progressively more elaborate techniques for adaptive news delivery. However, the adaptation process in these systems has become more complicated and thus less transparent to the users. In this paper, we concentrate on the application of open user models in adding transparency and controllability to adaptive news systems.
We present YourNews, a personalized news system that allows its users to view and edit their interest profiles, and report a study of the system. Contrary to our expectations, the study demonstrated that this ability to edit user profiles can harm the system and user performance and has to be used with caution.Jae-wook AhnUniversity of PittsburghPeter BrusilovskyUniversity of PittsburghJonathan GradyUniversity of PittsburghDaqing HeUniversity of PittsburghSue Yeon SynUniversity of Pittsburgh603Combining Classifiers to Identify Online DatabasesWe address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, including meta-searchers, database selection tools, hidden-Web crawlers, form-schema matching and merging, and the construction of online database directories. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.Luciano BarbosaUniversity of UtahJuliana FreireUniversity of Utah605Speeding up Adaptation of Web Service Compositions Using Expiration TimesWeb processes must often operate in volatile environments where the parameters of the participating service providers change during the life cycle of the process. Optimally adapting to these changes, therefore, becomes an important challenge. Adaptation requires knowledge from each of the service providers about how much each of their service parameters has changed, and using this knowledge to determine whether the Web process should make a different, and subsequently more optimal, decision. Monitoring this information change requires additional computations and may become a time-consuming process. In this paper, we use service expiration times obtained from pre-defined service level agreements to reduce the computational overhead of adaptation. We use the intuition that services whose parameters have not expired need not be considered for querying for revised information. Using two realistic scenarios, we illustrate our approach and demonstrate the associated computational savings.John HarneyLSDIS Lab, University of GeorgiaPrashant DoshiUniversity of Georgia615Homepage Live: Automatic Block Tracing for Web PersonalizationThe emergence of personalized homepage providers, e.g. Google Homepage and Microsoft Windows Live, has enabled Web users to select interesting Web contents and to aggregate them in a single Web page. The interesting web contents are usually some specific blocks of Web pages rather than a whole Web page. 
To satisfy the users' requirements, personalized homepage providers predefine a large number of candidate content blocks from which users can compose their own homepages. However, defining the content blocks requires tremendous manual effort, and the coverage is still very limited. In this paper, we propose a novel personalized homepage system, called "Homepage Live", which allows users to collect their favorite Web content blocks by drag-and-drop and organize them in a single page. Moreover, Homepage Live can also automatically trace the changes of blocks as the containing pages evolve by measuring the tree edit distance of the selected blocks. In addition, by exploiting the immutable elements of Web pages, the performance of the tracing algorithm is significantly improved. Experimental results demonstrate the effectiveness and efficiency of our proposed algorithm.Jie HanShanghai Jiaotong UniversityDingyi HanShanghai Jiaotong UniversityChenxi LinMicrosoft Research AsiaHua-Jun ZengMicrosoft Research AsiaZheng ChenMicrosoft Research AsiaYong YuShanghai Jiaotong University620A Large-Scale Study of Web Password HabitsWe report the results of a large scale study of password use and password re-use habits. The study involved half a million users over a three-month period. A client component on users' machines recorded a variety of password strength, usage and frequency metrics. This allows us to measure or estimate such quantities as the average number of passwords and average number of accounts each user has, how many passwords she types per day, how often passwords are shared among sites, and how often they are forgotten. We get extremely detailed data on password strength, the types and lengths of passwords chosen, and how they vary by site. This is the first large-scale study of its kind, and yields numerous other insights into the role passwords play in users' online experience.Dinei FlorencioMicrosoft ResearchCormac HerleyMicrosoft Research626Web Object RetrievalThe primary function of current Web search engines is essentially relevance ranking at the document level. However, a myriad of structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because of the diversity of Web site quality and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performances. 
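(Illustrative aside on the block-tracing idea in the Homepage Live entry above: re-locate a previously clipped block in a re-crawled page by preferring candidates that share an "immutable" attribute such as id and otherwise ranking them by similarity. The flat difflib ratio below stands in for a real tree edit distance, and all names and data structures are hypothetical.)

    from difflib import SequenceMatcher

    def block_signature(block):
        """block: dict with 'id' (may be None) and 'tags' (serialized tag sequence)."""
        return block.get("id"), " ".join(block["tags"])

    def trace_block(old_block, candidate_blocks):
        """Pick the candidate block that best matches the previously selected block."""
        old_id, old_sig = block_signature(old_block)
        best, best_score = None, -1.0
        for cand in candidate_blocks:
            cand_id, cand_sig = block_signature(cand)
            score = SequenceMatcher(None, old_sig, cand_sig).ratio()
            if old_id is not None and cand_id == old_id:
                score += 1.0   # immutable anchor: strong evidence it is the same block
            if score > best_score:
                best, best_score = cand, score
        return best

    old = {"id": "news", "tags": ["div", "h2", "ul", "li", "li"]}
    cands = [{"id": None, "tags": ["div", "p"]},
             {"id": "news", "tags": ["div", "h2", "ul", "li", "li", "li"]}]
    print(trace_block(old, cands)["id"])   # news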
We conclude that the hybrid model is superior because it takes into account the extraction errors at varying levels.Zaiqing NieMicrosoft Research AsiaYunxiao MaMicrosoft Research AsiaShuming ShiMicrosoft Research AsiaJi-Rong WenMicrosoft Research AsiaWei-Ying MaMicrosoft Research631Summarizing Email Conversations with Clue WordsWith the ever-increasing popularity of email, email overload becomes a major problem for email users. Email summarization is one way not only to solve this problem, but also to make use of one's email corpus. In this paper, we propose a new framework for email summarization. One novelty is to use a fragment quotation graph to try to capture an email conversation. The second novelty is to use clue words to measure the importance of sentences in conversation summarization. Based on clue words and their scores, we propose a method called CWS, which is capable of producing a summary of any length as requested by the user. We provide a comprehensive comparison of CWS with various existing methods on the Enron data set. Preliminary results suggest that CWS provides better summaries than existing methods.Giuseppe CareniniDept. of Computer Science, University of British ColumbiaRaymond NgDept. of Computer Science, University of British ColumbiaXiaodong ZhouDept. of Computer Science, University of British Columbia632Measuring Semantic Similarity between Words Using Web Search EnginesSemantic similarity measures play important roles in information retrieval and Natural Language Processing. In information retrieval, semantic similarity measures are used in automatic query suggestion and expansion. Previous work on semantic web-related applications such as community mining, relation extraction, and automatic metadata extraction has used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. Semantic similarity is a dynamic phenomenon that changes over time and across domains. In this paper, we propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. We propose a method that exploits page counts and text snippets returned by a Web search engine to measure semantic similarity between words. We define various similarity scores for two given words P and Q, using the page counts for the queries P, Q and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets. These different similarity scores are integrated using support vector machines to produce a robust semantic similarity measure. Experimental results on the Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. 
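(Illustrative aside: Jaccard- and PMI-style scores over page counts for P, Q, and P AND Q, in the spirit of the similarity scores just described; the exact definitions and smoothing in the paper may differ, and the counts and corpus size below are hard-coded stand-ins for live search engine results.)

    from math import log2

    N = 10_000_000_000          # assumed number of indexed pages (rough stand-in)

    def web_jaccard(p, q, pq):
        return 0.0 if pq == 0 else pq / (p + q - pq)

    def web_pmi(p, q, pq):
        if pq == 0:
            return 0.0
        return log2((pq / N) / ((p / N) * (q / N)))

    # hypothetical page counts for "apple", "fruit", "apple AND fruit"
    p, q, pq = 1_200_000_000, 800_000_000, 150_000_000
    print(round(web_jaccard(p, q, pq), 4), round(web_pmi(p, q, pq), 2))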
Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a community mining task, and improves accuracy in an entity disambiguation task, thereby verifying the capability of the proposed measure to capture semantic similarity using web content.Danushka BollegalaThe University of TokyoYutaka MatsuoAISTMitsuru IshizukaThe University of Tokyo635The Complex Dynamics of Collaborative TaggingThe debate within the Web community over the optimal means by which to organize information often pits formalized classifications against distributed collaborative tagging systems. A number of questions remain unanswered, however, regarding the nature of collaborative tagging systems, including whether coherent categorization schemes can emerge from unsupervised tagging by users. This paper uses data from tagged sites on the social bookmarking site del.icio.us to examine the dynamics of collaborative tagging systems. In particular, we examine whether the distribution of the frequency of use of tags for "popular" sites with a long history (many tags and many users) can be described by a power law distribution, often characteristic of what are considered complex systems. We produce a generative model of collaborative tagging in order to understand the basic dynamics behind tagging, including how a power law distribution of tags could arise. We empirically examine the tagging history of sites in order to determine how this distribution arises over time and what patterns appear prior to a stable distribution. Lastly, by focusing on the high-frequency tags of a site where the distribution of tags is a stabilized power law, we show how tag co-occurrence networks for a sample domain of tags can be used to analyze the meaning of particular tags given their relationship to other tags.Harry HalpinUniversity of EdinburghValentin RobuCWI AmsterdamHana ShepherdPrinceton University649CSurf: A Context-Driven Non-Visual Web-BrowserWeb sites are designed for a graphical mode of interaction. Sighted users can "cut to the chase" and quickly identify relevant information in Web pages. On the contrary, individuals with visual disabilities have to use screen-readers to browse the Web. As screen-readers process pages sequentially and read through everything, Web browsing can become strenuous and time-consuming. Although the use of shortcuts and searching offers some improvements, the problem still remains. In this paper, we address the problem of information overload in non-visual Web access using the notion of context. Our prototype system, CSurf, embodying our approach, provides the usual features of a screen-reader. However, when a user follows a link, CSurf captures the context of the link using a simple topic-boundary detection technique, and uses it to identify relevant information on the next page with the help of a Support Vector Machine, a statistical machine-learning model. Then, CSurf reads the Web page starting from the most relevant section, identified by the model. We conducted a series of experiments to evaluate the performance of CSurf against the state-of-the-art screen-reader, JAWS. Our results show that the use of context can potentially save browsing time and substantially improve the browsing experience of visually disabled people.Jalal MahmudStony Brook UniversityYevgen BorodinStony Brook UniversityI.V. 
RamakrishnanStony Brook University656Analyzing Web Access Control PoliciesXACML has emerged as a popular access control language on the Web, but because of its rich expressiveness, it has proved difficult to analyze in an automated fashion. Previous attempts to analyze XACML policies either use propositional logic or full first-order logic. In this paper, we present a formalization of XACML using Description Logics (DL). This formalization allows us to extend the subset of XACML supported by propositional logic-based analysis tools; we also provide a new analysis service (policy redundancy). Mapping XACML to description logics allows us to use off-the-shelf DL reasoners for analysis tasks such as policy comparison, policy verification and querying. We provide an empirical evaluation of a policy analysis tool implemented on top of the open source reasoner Pellet.Vladimir KolovskiUniversity of Maryland - College ParkJames HendlerDepartment of Computer Science, The University of MarylandBijan ParsiaUniversity of Manchester, UK664Information Flow Modeling based on Diffusion Rate for Prediction and RankingInformation flows in a network where individuals influence each other. The diffusion rate captures how efficiently the information can diffuse among the users in the network. We propose an information flow model that leverages diffusion rates for: (1) prediction - identify where information should flow to, and (2) ranking - identify who will most quickly receive the information. For prediction, we measure how likely information is to propagate from a specific sender to a specific receiver during a certain time period. Accordingly, a rate-based recommendation algorithm is proposed that predicts who will most likely receive the information during a limited time period. For ranking, we estimate the expected time for information diffusion to reach a specific user in a network. Subsequently, a DiffusionRank algorithm is proposed that ranks users based on how quickly information will flow to them. Experiments on two datasets demonstrate the effectiveness of the proposed algorithms at both improving recommendation performance and ranking users by the efficiency of information flow.Xiaodan SongNEC Laboratories AmericaYun ChiNEC Labs AmericaKoji HinoNEC Laboratories AmericaBelle TsengNEC Laboratories America, Inc.669Communication as Information-Seeking: The Case for Mobile Social Software for Developing RegionsIn this paper, we describe several findings from a multi-year, multi-method study of how information and communication technologies have been adopted and adapted in Central Asia. We have found that mobile phone usage is outpacing Internet adoption, that access to the Internet is primarily through public access sites carrying with it issues regarding privacy and surveillance, that people rely on their social networks as information sources, that public institutions tend to be fairly weak as citizen resources, and that information seeking and communication are conflated in people's usage patterns with different technologies. In addition, in the developed world social networking software has grown rapidly and shown itself to have significant potential for mobilizing a population. 
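(Illustrative aside on the rate-based intuition in the information-flow entry above: if information flows from a sender to a receiver at some rate, one simple model of the probability of receipt within a window of t time units is an exponential waiting time, and users can then be ranked by expected receipt time. These generic formulas are offered only as illustration, not as the paper's estimators.)

    from math import exp

    def receive_prob(rate_per_day, window_days):
        """P(receiver gets the item within the window) under an exponential waiting time."""
        return 1.0 - exp(-rate_per_day * window_days)

    def diffusion_rank(rates):
        """rates: {user: diffusion rate from the source}; rank by expected receipt time 1/rate."""
        return sorted(rates, key=lambda u: 1.0 / rates[u])

    rates = {"alice": 0.9, "bob": 0.2, "carol": 0.05}
    print(receive_prob(rates["bob"], 7))   # ~0.75 within a week
    print(diffusion_rank(rates))           # ['alice', 'bob', 'carol']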
Based on the collected findings from Central Asia and on observed patterns of technology usage in other parts of the world, we conclude that exploring mobile social software holds significant potential as an ICT that meshes well with preexisting patterns of communication and information seeking and also leverages the most predominant pattern of technology adoption. Many of the findings from this research echo results from studies in other geographic areas, and so we anticipate that much of this research will be relevant to developing regions generally.Beth KolkoUniversity of WashingtonEmma RoseUniversity of WashingtonErica JohnsonUniversity of Washington676Analysis of Topological Characteristics of Huge Online Social Networking ServicesSocial networking services are a fast-growing business on the Internet. However, it is unknown if online relationships and their growth patterns are the same as in real-life social networks. In this paper, we compare the structures of three online social networking services: Cyworld, MySpace, and orkut, each with more than 10 million users. We have access to complete data on Cyworld's ilchon (friend) relationships and analyze its degree distribution, clustering property, degree correlation, and evolution over time. We also use Cyworld data to evaluate the validity of the snowball sampling method, which we use to crawl and obtain partial network topologies of MySpace and orkut. Cyworld, the oldest of the three, demonstrates a changing scaling behavior over time in its degree distribution. The latest Cyworld data's degree distribution exhibits a multi-scaling behavior, while those of MySpace and orkut have simple scaling behaviors with different exponents. Very interestingly, each of the two exponents corresponds to a different segment in Cyworld's degree distribution. Certain online social networking services encourage online activities that cannot be easily copied in real life; we show that they deviate from close-knit online social networks which show a similar degree correlation pattern to real-life social networks.Yong-Yeol AhnKAISTSeungyeop HanKAISTHaewoon KwakKAISTSue MoonKAISTHawoong JeongKAIST680Topic Sentiment Mixture: Modeling Facets and Opinions in WeblogsIn this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topic. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. 
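(Illustrative aside on the degree-distribution analysis above: a standard check for simple power-law scaling is the continuous maximum-likelihood estimate of the exponent, alpha = 1 + n / sum(ln(d_i / d_min)); multi-scaling shows up as different estimates over different degree ranges. This is the textbook estimator, not the study's own fitting procedure, and the sample degrees are made up.)

    from math import log

    def powerlaw_alpha(degrees, d_min=1):
        """Continuous MLE of the power-law exponent for degrees >= d_min."""
        tail = [d for d in degrees if d >= d_min]
        return 1.0 + len(tail) / sum(log(d / d_min) for d in tail)

    sample = [1, 1, 2, 2, 3, 5, 8, 13, 40, 200]
    print(round(powerlaw_alpha(sample, d_min=1), 2))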
The TSM model is quite general; it can be applied to any text collection with a mixture of topics and sentiments, and thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction.Qiaozhu MeiUniversity of Illinois at Urbana-ChampaignXu LingUniversity of Illinois at Urbana-ChampaignMatthew WondraUniversity of Illinois at Urbana-ChampaignHang SuVanderbilt UniversityChengXiang ZhaiUIUC686Demographic Prediction based on User's Browsing BehaviorDemographic information plays an important role in personalized web applications. However, it is usually not easy to obtain this kind of personal data, such as age and gender. In this paper, we make a first attempt to predict users' gender and age from their Web browsing behaviors, in which the webpage view information is treated as a hidden variable to propagate demographic information between different users. There are three main steps in our approach: First, learning from the web-page click-through data, Web pages are associated with users' (known) age and gender tendency through a discriminative model; Second, users' (unknown) age and gender are predicted from the demographic information of the associated Web pages through a Bayesian framework; Third, based on the fact that Web pages visited by similar users may be associated with similar demographic tendency, and users with similar demographic information would visit similar web pages, a smoothing component is employed to overcome the data sparseness of the web click-through log. Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvements on gender prediction and 50.3% on age prediction in terms of macro F1, compared with baseline algorithms.Jian HuMicrosoft Research AsiaHua-Jun ZengMicrosoft Research AsiaHua LiMicrosoft Research AsiaCheng NiuMicrosoft Research AsiaZheng ChenMicrosoft Research Asia692A Content-Driven Reputation System for the WikipediaOn-line forums for the collaborative creation of bodies of information are a phenomenon of rising importance; the Wikipedia is one of the best-known examples. The open nature of such forums could benefit from a notion of reputation for its authors. Author reputation could be used to flag new contributions from low-reputation authors, and it could be used to allow only authors with good reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia would also provide an incentive to give high-quality contributions. We present in this paper a novel type of content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits and text additions they make to Wikipedia articles are long-lived, and they lose reputation when their changes are undone in short order. We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, and of being undone.B. 
Thomas AdlerUniversity of California Santa CruzLuca de AlfaroUniversity of California Santa Cruz702Supporting End-Users in the Creation of Dependable Web ClipsWeb authoring environments enable end-users to create applications that integrate information from other web sources. Users can create web sites that include built-in components to dynamically incorporate, for example, weather information, stock quotes, or the latest news from different web sources. Recent surveys conducted among end-users have indicated an increasing interest in creating such applications. Unfortunately, web authoring environments do not provide support beyond a limited set of built-in components. This work addresses this limitation by providing end-user support for "clipping" information from a web site to incorporate it into the end-user site. The support consists of a mechanism to identify the clipping target with multiple markers to increase robustness, and a dynamic assessment of the retrieved information to quantify its reliability. The clipping approach has been integrated as a feature into a popular web authoring tool on which we present the results of two preliminary studies.Sandeep LingamUniversity of Nebraska - LincolnSebastian ElbaumUniversity of Nebraska - Lincoln733Towards Effective Browsing of Large Scale Social AnnotationsThis paper is concerned with the problem of browsing social annotations. Today, many services (e.g., Del.icio.us, Flickr) are provided based on social annotations. These services help users manage and share their favorite URLs, photos, etc. However, due to the exponential increase of social annotations, more and more users face the problem of finding desired resources in large annotation data. Existing methods such as tag clouds and annotation matching work well only when the annotation scale is relatively small. Thus, an effective approach for browsing the large scale annotations and the associated resources is in great demand by both users and service providers. In this paper, we propose a novel algorithm, namely Effective Large Scale Annotation Browser (ELSABer), to browse the large-scale social annotations. With the help of ELSABer, users can browse the large-scale annotations in a semantic, hierarchical and efficient way. More specifically, 1) the semantic relations among annotations are explored for similar resource browsing; 2) the hierarchical relations among annotations are also constructed for top-down browsing; 3) the power-law distribution of social annotations is studied for efficient browsing. By incorporating personal and time information, ELSABer can be further extended for personalized and time-related browsing. A prototype system is also implemented based on ELSABer and shows promising results.Rui LiShanghai Jiao Tong UniversityShenghua BaoShanghai Jiao Tong UniversityBen FeiIBM China Research LabZhong SuIBM China Research LabYong YuShanghai Jiao Tong University748MyXDNS: A Request Routing DNS Server With Decoupled Server SelectionThis paper presents the architecture and the preliminary evaluation of a request routing DNS server that decouples server selection from the rest of DNS functionality. Our DNS server, which we refer to as MyXDNS, exposes well-defined APIs for uploading an externally computed server selection policy and for interacting with an external network proximity service. 
With MyXDNS, researchers can explore their own network proximity metrics and request routing algorithms without having to worry about DNS internals. Furthermore, MyXDNS is based on the open-source MyDNS and is available to the public. Stress-testing of MyXDNS indicated that it achieves its flexibility at an acceptable cost: a single MyXDNS running on a low-end server can process 3000 req/sec with sub-millisecond response even in the presence of continuous updates to the server selection policy.Hussein AlzoubiCase Western Reserve UniversityMichael RabinovichCase Western Reserve UniversityOliver SpatscheckAT&T Labs - Research752Robust Web Page Segmentation for Mobile Terminal Using Content-Distances and Page Layout InformationThe demand for browsing information from general Web pages using a mobile phone is increasing. However, since the majority of Web pages on the Internet are optimized for browsing from PCs, it is difficult for mobile phone users to obtain sufficient information from the Web. Therefore, a method to reconstruct PC-optimized Web pages for mobile phone users is essential. An example approach is to segment the Web page based on its structure, and utilize the hierarchy of the content elements to regenerate a page suitable for mobile phone browsing. In our previous work, we have examined a robust automatic Web page segmentation scheme which uses the distance between content elements based on the relative HTML tag hierarchy, i.e., the number and depth of HTML tags in Web pages. However, this scheme has the problem that the content-distance based on the order of HTML tags does not always correspond to the intuitive distance between content elements on the actual layout of a Web page. In this paper, we propose a hybrid segmentation method which segments Web pages based on both the content-distance calculated by the previous scheme, and a novel approach which utilizes Web page layout information. Experiments conducted to evaluate the accuracy of Web page segmentation show that the proposed method can segment Web pages more accurately than conventional methods. Furthermore, implementation and evaluation of our system on a mobile phone show that our method can realize superior usability compared to commercial Web browsers.Gen HattoriKDDI R&D Laboratories Inc.Keiichiro HoashiKDDI R&D Laboratories Inc.Kazunori MatsumotoKDDI R&D Laboratories Inc.Fumiaki SugayaKDDI R&D Laboratories Inc.753Efficient Search Engine MeasurementsWe address the problem of measuring relevance-neutral search quality metrics, like corpus size, index freshness, and density of duplicates in the index. The recently proposed estimators for such metrics [Bar-Yossef and Gurevich, WWW2006][Broder et al, CIKM 2006] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so-called "document degrees". We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated. Building on an idea from [Broder et al, CIKM 2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. 
We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising quality.Ziv Bar-YossefTechnion - Israel Institute of Technology and Google HaifaMaxim GurevichTechnion - Israel Institute of Technology759Compare&Contrast: Using the Web to Discover Comparable Cases for News StoriesComparing and contrasting is an important strategy people employ to understand new situations and create solutions for new problems. Similar events can provide hints for problem solving, as well as larger contexts for understanding the specific circumstances of an event. Lessons can be learned from past experience, insights can be gained about the new situation from familiar examples, and trends can be discovered among similar events. As the largest knowledge base for human beings, the Web provides both an opportunity and a challenge to discover comparable cases in order to facilitate situation analysis and problem solving. In this paper, we present Compare&Contrast, a system that uses the Web to discover comparable cases for news stories, documents about similar situations but involving distinct entities. The system analyzes a news story given by the user and builds a model of the story. With the story model, the system dynamically discovers entities comparable to the main entity in the original story and uses these comparable entities as seeds to retrieve web pages about comparable cases. The system is domain-independent, does not require any domain-specific knowledge engineering efforts, and deals with the complexity of unstructured text and noise on the web in a robust way. We evaluated the system with an experiment on a collection of news articles and a user study.Jiahui LiuNorthwestern UniversityEarl WagnerNorthwestern UniversityLarry BirnbaumIntelligent Information Laboratory, Northwestern University764Optimal Audio-Visual Representations for Illiterate Users of ComputersWe present research leading toward an understanding of the optimal audio-visual representation for illustrating concepts for illiterate and semi-literate users of computers. In our user study, which to our knowledge is the first of its kind, we presented each of 13 different health symptoms to 200 illiterate subjects in one representation randomly selected among the following ten: text, static drawings, static photographs, hand-drawn animations, and video, each with and without voice annotation. The goal was to see how comprehensible these representation types were for an illiterate audience. We used a methodology for generating each of the representations tested in a way that fairly stacks one representational type against the others. Our main results are that (1) richer information is not necessarily better understood overall; (2) voice annotation generally helps in speed of comprehension, but bimodal audio-visual information can be confusing for the target population; (3) the relative value of dynamic imagery versus static imagery depends on other factors. An analysis of these statistically significant results and additional detailed results is also provided.Indrani MedhiMicrosoft Research IndiaArchana PrasadMicrosoft Research IndiaKentaro ToyamaMicrosoft Research, India776Explorations in the Use of Semantic Web Technologies for Product Information ManagementMaster data refers to core business entities a company uses repeatedly across many business processes and systems (such as lists or hierarchies of customers, suppliers, accounts, products, or organizational units). 
Product information is the most important kind of master data, and product information management (PIM) is becoming critical for modern enterprises because it provides a rich business context for various applications. Existing PIM systems are insufficiently flexible and scalable for on-demand business, and too weak to completely capture and use the semantics of master data. This paper explores how to use semantic web technologies to enhance a collaborative PIM system by simplifying modeling and representation while preserving enough dynamic flexibility. Furthermore, we build a semantic PIM system using one of the state-of-the-art ontology repositories and summarize the challenges we encountered based on our experimental results, especially on performance and scalability. We believe that our study and experiences are valuable for both the semantic web and master data management communities.Jean-Sebastien BrunnerIBM China Research LaboratoryLi MaIBM China Research LaboratoryChen WangIBM China Research LaboratoryLei ZhangIBM China Research LaboratoryDaniel C. Wolfson IBM Software GroupYue PanIBM China Research LaboratoryKavitha Srivinas IBM T.J. Watson Research Center777The Two Cultures: Mashing Up Web 2.0 and the Semantic Web (position paper)There is a common perception that there are two competing visions about the future evolution of the Web: the Semantic Web and Web 2.0 visions. We believe that the technologies and core strengths of these visions are complementary, rather than in competition. In fact, both technologies need each other in order to scale beyond their own strongholds. The Semantic Web can learn from Web 2.0's focus on community and interactivity, while Web 2.0 can draw from Semantic Web infrastructure to facilitate things like mashups. In order to demonstrate the complementarity of the two ideas, we outline a "semantic" weblogs vision in this paper that enhances a Web 2.0 scenario with semantic technologies. This vision can be realized in the short term and shows clear advantages of the Semantic Web in the open field. We discuss the challenges and highlight open issues that arise from this vision, such as caching in the presence of dynamically generated pages. As a result, we outline research questions and implementation issues that will be of high relevance in the near future and finish with some thoughts about the future web ecosystem.Anupriya AnkolekarUniversitat KarlsruheMarkus KroetzschUniversitat KarlsruheThanh TranUniversitat KarlsruheDenny VrandecicUniversitat Karlsruhe784Predicting Clicks: Estimating the Click-Through Rate for New AdsSearch engine advertising has become a significant aspect of the Web browsing experience. The order in which a search engine displays ads greatly affects the probability that a user will see and click on each ad. Consequently, the ranking has a strong impact on the revenue the search engine receives from the ads. Further, showing the user an ad that they prefer to click on also improves user satisfaction. For these reasons, it is crucially important to be able to estimate the click-through rate of ads in the system. For ads that have been repeatedly displayed, this is empirically measurable, but when ads initially appear, other means must be used. We show that we can use features of ads, keywords, and advertisers to learn a model that accurately predicts the click-through rate for an ad. We also show that using our model improves the convergence and performance of an advertising system. 
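(Illustrative aside on the "learn a model from ad, keyword, and advertiser features" step above: a minimal logistic-regression sketch over a few toy features, assuming scikit-learn is available; the features, data, and model choice are illustrative assumptions, not the system described in the paper.)

    from sklearn.linear_model import LogisticRegression

    # toy features per ad: [ad text length, keyword match score, advertiser historical CTR]
    X = [[20, 0.9, 0.08], [45, 0.2, 0.01], [30, 0.7, 0.05], [60, 0.1, 0.02],
         [25, 0.8, 0.07], [50, 0.3, 0.015]]
    y = [1, 0, 1, 0, 1, 0]          # 1 = clicked when shown, 0 = not clicked

    model = LogisticRegression().fit(X, y)
    new_ad = [[28, 0.85, 0.0]]      # a brand-new ad with no click history yet
    print(model.predict_proba(new_ad)[0][1])   # estimated click-through probability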
As a result, our model would improve both revenues and user satisfaction.Matthew RichardsonMicrosoft ResearchEwa DominowskaMicrosoftRobert RagnoMicrosoft Research785SPARQ2L: Towards Support For Subgraph Extraction Queries in RDF DatabasesMany applications in analytical domains often need to "connect the dots", i.e., query the structure of data. In bioinformatics for example, it is typical to want to query about interactions between proteins. The aim of such queries is to "extract" relationships between entities, i.e., paths from a data graph. Often, such queries will specify certain constraints that qualifying results must satisfy, e.g., paths involving a set of mandatory nodes. Unfortunately, most present-day Semantic Web query languages, including the current draft of the anticipated recommendation SPARQL, lack the ability to express queries about arbitrary path structures in data. In addition, many systems that support some limited form of path queries rely on main-memory graph algorithms, limiting their applicability to very large graphs. In this paper, we present an approach for supporting Path Extraction queries. Our proposal comprises (i) a query language SPARQ2L which extends SPARQL with path variables and path variable constraint expressions, and (ii) a novel query evaluation framework based on efficient algebraic techniques for solving path problems, which allows path queries to be efficiently evaluated on disk-resident RDF graphs. The effectiveness of our proposal is demonstrated by a performance evaluation of our approach on both real-world and synthetic datasets.Kemafor AnyanwuLSDIS lab, University of GeorgiaAngela MadukoUniversity of GeorgiaAmit ShethWright State University788Visibly Pushdown Automata for Streaming XMLWe propose the study of visibly pushdown automata (VPA) for processing XML documents. VPAs are pushdown automata where the input determines the stack operation, and XML documents are naturally visibly pushdown with the VPA pushing onto the stack on open-tags and popping the stack on close-tags. In this paper we demonstrate the power and ease visibly pushdown automata give in the design of streaming algorithms for XML documents. We study the problems of type-checking streaming XML documents against SDTD schemas, and the problem of typing tags in a streaming XML document according to an SDTD schema. For the latter problem, we consider both pre-order typing and post-order typing of a document, which dynamically determine types at open-tags and close-tags respectively as soon as they are met. We also generalize the problems of pre-order and post-order typing to prefix querying. We show that a deterministic VPA yields a one-pass algorithm for computing the set of all answers to any query that has the property that a node satisfying the query is determined solely by the prefix leading to the node. 
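(Illustrative aside on the "push on open-tags, pop on close-tags" discipline above: a toy stack-driven pass over a token stream that checks well-nesting and reports maximum depth; a real VPA-based type-checker would additionally run the schema's state machine on the stack symbols. The token format and names are assumptions made for illustration.)

    def check_stream(tokens):
        """tokens: sequence of ('open', tag) / ('close', tag) / ('text', data).
        The input symbol alone decides the stack action, as in a visibly pushdown automaton."""
        stack, max_depth = [], 0
        for kind, value in tokens:
            if kind == "open":
                stack.append(value)
                max_depth = max(max_depth, len(stack))
            elif kind == "close":
                if not stack or stack.pop() != value:
                    return False, max_depth      # mismatched or stray close-tag
        return not stack, max_depth              # well-nested iff the stack is empty

    doc = [("open", "book"), ("open", "title"), ("text", "VPA"),
           ("close", "title"), ("close", "book")]
    print(check_stream(doc))   # (True, 2)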
All the streaming algorithms we develop in this paper are based on the construction of deterministic VPAs, and hence, for any fixed problem, the algorithms process each element of the input in constant time, and use space O(d), where d is the depth of the document.Viraj KumarUniversity of Illinois at Urbana-ChampaignParthasarathy MadhusudanUniversity of Illinois at Urbana-ChampaignMahesh ViswanathanUniversity of Illinois at Urbana-Champaign790Towards Domain-Independent Information Extraction from Web TablesTraditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of table tags. The multitude of different HTML implementations of web tables makes these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The topological and style information obtained in this way allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."Wolfgang GatterbauerVienna University of TechnologyPaul BohunskyVienna University of TechnologyMarcus HerzogVienna University of TechnologyBernhard KroeplVienna University of TechnologyBernhard PollakVienna University of Technology793Navigating the Intranet with High PrecisionDespite the success of web search engines, search over large enterprise intranets still suffers from poor result quality. Earlier work that compared intranets and the Internet from the viewpoint of keyword search has pointed to several reasons why the search problem is quite different in these two domains. In this paper, we address the problem of providing high-quality answers to navigational queries in the intranet (e.g., queries intended to find product or personal home pages, service pages, etc.). Our approach is based on offline identification of navigational pages, intelligent generation of term-variants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. Using a testbed of 5.5M pages from the IBM intranet, we present evaluation results that demonstrate that for navigational queries, our approach of using custom indices produces results of significantly higher precision than those produced by a general-purpose search algorithm.Huaiyu ZhuIBM ResearchAlexander LoeserSAP ResearchSriram RaghavanIBM Almaden Research CenterShivakumar VaithyanathanIBM Research794Querying and Maintaining a Compact XML StorageAs XML database sizes grow, the amount of space used for storing the data and auxiliary supporting data structures becomes a major factor in query and update performance. This paper presents a new storage scheme for XML data that supports all navigational operations in near-constant time. In addition to supporting efficient queries, the space requirement of the proposed scheme is within a constant factor of the information-theoretic minimum, while insertions and deletions can be performed in near-constant time as well. 
As a result, the proposed structure features a small memory footprint that increases cache locality, whilst still supporting standard APIs, such as DOM, and necessary database operations, such as queries and updates, efficiently. Analysis and experiments show that the proposed structure is space and time efficient.Raymond WongNational ICT Australia and University of New South WalesFranky LamNational ICT Australia and University of New South WalesWilliam ShuiNational ICT Australia and University of New South Wales800Efficient Search in Large Textual Collections with RedundancyCurrent web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages. In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a significant amount of redundancy. Our approach results in very significant reductions in index size and query processing costs on such collections, and it is orthogonal to, and can be combined with, existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate several different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on a search engine query log and a large collection consisting of multiple crawls.Jiangong ZhangPolytechnic UniversityTorsten SuelPolytechnic University801Subspace: Secure Cross-Domain Communication for Web MashupsCombining data and code from third-party sources has enabled a new wave of web mashups that add creativity and functionality to web applications. However, browsers are poorly designed to pass data between domains, often forcing web developers to abandon security in the name of functionality. To address this deficiency, we developed Subspace, a novel cross-domain communication mechanism that allows efficient communication across domains without sacrificing security. Our prototype requires only a small JavaScript library, and works across all major browsers. We believe Subspace can serve as a new secure communication primitive for web mashups.Collin JacksonStanford UniversityHelen WangMicrosoft 854Towards Environment Generated Media: Object-participation-type Weblog in Home Sensor NetworkEnvironment generated media (EGM) are defined here as media generated from massive and/or incomprehensible environmental data by compressing the data into averages or representative values and/or by converting them into such user-friendly media as text, figures, charts, and animations. 
As an application of EGM, an object-participation-type weblog is introduced, where anthropomorphic indoor objects with sensor nodes post weblog entries and comments about what happened to them in a sensor-networked environment.Takuya Maekawa NTTYutaka Yanagisawa NTTTakeshi Okadome NTT855A Password Stretching Method using User Specific SaltsIn this paper, we present a password stretching method based on user-specific salts. Our scheme takes a similar time to stretch a password as a recent password stretching algorithm, but both the complexity of a pre-computation attack and the storage needed to hold the pre-computed results increase by a factor of 10^8 over that algorithm.Changhee Lee INITECHHeejo Lee Korea University857A No-Frills Architecture for Lightweight Answer RetrievalIn a new model for answer retrieval, document collections are distilled offline into large repositories of facts. Each fact constitutes a potential direct answer to questions seeking a particular kind of entity or relation, such as questions asking about the date of particular events. Question answering becomes equivalent to online fact retrieval, which greatly simplifies the de facto system architecture.Marius Pasca Google Inc.858MedSearch: A Specialized Search Engine for Medical InformationPeople are thirsty for medical information. Existing Web search engines cannot handle medical search well because they do not consider its special requirements. Often a medical information searcher is uncertain about his exact questions and unfamiliar with medical terminology. Therefore, he prefers to pose long queries, describing his symptoms in plain English, and receive comprehensive, relevant information from search results. This paper presents MedSearch, a specialized medical Web search engine, to address these challenges. MedSearch can assist ordinary Internet users in searching for medical information by accepting queries of extended length, providing diversified search results, and suggesting related medical phrases.Gang Luo IBM T.J. Watson Research CenterChunqiang Tang IBM T.J. Watson Research CenterHao Yang IBM T.J. Watson Research CenterXing Wei University of Massachusetts Amherst859Generative Models for Name DisambiguationName ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we present an efficient framework based on two novel topic-based models, extended from Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. Experiments indicate that our approach consistently outperforms other unsupervised methods including spectral and DBSCAN clustering. Scalability is addressed by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.Yang Song The Pennsylvania State UniversityJian Huang The Pennsylvania State UniversityIsaac G. Councill The Pennsylvania State UniversityJia Li The Pennsylvania State UniversityC. Lee Giles The Pennsylvania State University860The Largest Scholarly Semantic Network...Ever.Scholarly entities, such as articles, journals, authors and institutions, are now mostly ranked according to expert opinion and citation data. The Andrew W. 
Mellon Foundation-funded MESUR project at the Los Alamos National Laboratory is developing metrics of scholarly impact that can rank a wide range of scholarly entities on the basis of their usage. The MESUR project starts with the creation of a semantic network model of the scholarly community that integrates bibliographic, citation, and usage data collected from publishers and repositories world-wide. It is estimated that this scholarly semantic network will include approximately 50 million articles, 1 million authors, 10,000 journals and conferences, 500 million citations, and 1 billion usage-related events, making it the largest scholarly semantic network ever created. The developed scholarly semantic network will then serve as a standardized platform for the definition and validation of new metrics of scholarly impact. This poster describes the MESUR project's data aggregation and processing techniques, including the OWL scholarly ontology that was developed to model the scholarly communication process.Johan Bollen Los Alamos National LaboratoryMarko Rodriguez Los Alamos National LaboratoryHerbert Van de Sompel Los Alamos National Laboratory, Research Library861SRing: A Structured Non-DHT P2P Overlay Supporting String Range QueriesThis paper presents SRing, a structured non-DHT P2P overlay that efficiently supports exact and range queries on multiple attribute values. In SRing, all attribute values are interpreted as strings formed by a base alphabet and are published in lexicographical order. Two virtual rings are built: N-Ring is built in a skip-list way for range partition and queries; D-Ring is built in a small-world way for the construction of N-Ring. A leave-and-join based load balancing method is used to balance range overload in a network with heterogeneous nodes.Xiaoping Sun Institute of Computing Technology, Graduate School of Chinese Academy of SciencesXue Chen Institute of Computing Technology, Graduate School of Chinese Academy of Sciences862SCAN: A Small-World Structured P2P Overlay for Multi-Dimensional QueriesThis paper presents a structured P2P overlay SCAN that augments the CAN overlay with long links based on Kleinberg's small-world model in a d-dimensional Cartesian space. The construction of long links does not require an estimate of the network size. Queries in a multi-dimensional data space can achieve O(2log(dn^1/d)) hops by equipping each node with O(log(dn^1/d)) long links and O(d) short links.Xiaoping Sun Institute of Computing Technology, Graduate School of Chinese Academy of Sciences863First-order Focused CrawlingThis paper reports a new general framework for focused web crawling based on relational subgroup discovery. Predicates are used explicitly to represent the relevance clues of the unvisited pages in the crawl frontier, and then first-order classification rules are induced using a subgroup discovery technique. The learned relational rules with sufficient support and confidence will guide the crawling process afterwards. We present the many interesting features of our proposed first-order focused crawler, together with promising preliminary experimental results.Qingyang Xu Jilin UniversityWanli Zuo Jilin University867Construction by Linking: The Linkbase MethodThe success of many innovative Web applications is not based on the content they produce, but on how they combine and link existing content. 
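(Illustrative aside on the small-world construction mentioned in the SCAN entry above: Kleinberg-style long links are chosen with probability proportional to distance^-r for some exponent r. The one-dimensional ring version below uses names and parameters chosen only for illustration and is not the SCAN construction itself.)

    import random

    def pick_long_link(node, ring_size, r=1.0):
        """Choose a long-range neighbour on a ring with P(v) proportional to distance(node, v)^-r."""
        others = [v for v in range(ring_size) if v != node]
        def dist(v):
            d = abs(v - node)
            return min(d, ring_size - d)         # ring (wrap-around) distance
        weights = [dist(v) ** -r for v in others]
        return random.choices(others, weights=weights, k=1)[0]

    random.seed(0)
    print([pick_long_link(0, 64) for _ in range(5)])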
Older Web Engineering methods lack flexibility in the sense that they rely strongly on a priori knowledge of existing content structures and do not take into account initially unknown content sources. We propose the adoption of principles that are also found in Component-based Software Engineering to assemble highly extensible solutions from reusable artifacts. The main contribution of our work is a support system, consisting of a central service that manages n:m relationships between arbitrary Web resources, and of Web application components that realize navigation, presentation, and interaction for the linked content.Johannes Meinecke University of KarlsruheFrederic Majer University of KarlsruheMartin Gaedke Chemnitz University of Technology868Search Engines and their Public Interfaces: Which APIs are the Most Synchronized?Researchers studying commercial search engines often collect data using the application programming interface (API) or by "scraping" results from the web user interface (WUI), but anecdotal evidence suggests the interfaces produce different results. We provide the first in-depth quantitative analysis of the results produced by the Google, MSN and Yahoo API and WUI interfaces. After submitting a variety of queries to the interfaces for 5 months, we found significant discrepancies in several categories. Our findings suggest that the API indexes are not older, but they are probably smaller for Google and Yahoo. Researchers may use our findings to better understand the differences between the interfaces and choose the best API for their particular type of queries.Frank McCown Old Dominion UniversityMichael Nelson Old Dominion University872An Information State-Based Dialogue Manager for Making Voice Web SmarterIn this paper we propose the integration of intelligent component technologies (natural language and discourse management) in voice web interfaces to make them smarter. We describe how we have integrated reusable components of dialogue management and language processing in a multilingual voice system to improve its friendliness and portability. The dialogue management component deals with complex dialogue phenomena, such as user-initiative dialogues, and follows the information state-based theory. The resulting dialogue system supports friendly communication (through the telephone and the web) in several languages: English, Spanish, Catalan and Italian. The dialogue system has been adapted to guide users in accessing online public administration services.Marta Gatius Technical University of Catalonia (UPC), SpainMeritxell Gonzalez Technical University of CataloniaElisabet Comelles Thera-clic873Building and Managing Personalized Semantic PortalsIn this paper, we present a generic semantic portal, SEMPort, which provides better user support with personalized views, semantic navigation, ontology-based search and three different kinds of semantic hyperlinks. SEMPort also supplies distributed content editing/provision in real time. As a case study, SEMPort is tested on the Course Modules Web Page (CMWP) of the School of Electronics and Computer Science (ECS).Melike Sah MissWendy Hall Prof.876Classifying Web SitesIn this paper, we present a novel method for the classification of Web sites. This method exploits both the structure and the content of Web sites in order to discern their functionality. It can distinguish between eight of the most relevant functional classes of Web sites. 
We show that a pre-classification of Web sites utilizing structural properties considerably improves a subsequent textual classification with standard techniques. We evaluate this approach on a dataset comprising more than 16,000 Web sites with about 20 million crawled and 100 million known Web pages. Our approach achieves an accuracy of 92% for the coarse-grained classification of these Web sites.Christoph Lindemann University of LeipzigLars Littig University of Leipzig880Learning Information Diffusion Process on the WebMany text documents on the Web are not original creations but are forwarded or copied from other source documents. The phenomenon of document forwarding or transmission between various web sites is denoted as Web information diffusion. This paper focuses on mining information diffusion processes for specific topics on the Web. A novel system called LIDPW is proposed to address this problem using machine learning techniques. The source site and source document of each document are identified, and the diffusion process, composed of a series of diffusion relationships, is visually presented to users. The effectiveness of LIDPW is validated on a real data set. A preliminary user study is performed and the results show that LIDPW does help users monitor the information diffusion process of a specific topic and discover the diffusion start and diffusion center of the topic.Xiaojun Wan Peking UniversityJianwu Yang Peking University884A Probabilistic Semantic Approach for Discovering Web ServicesService discovery is one of the challenging issues in service-oriented computing. Currently, most existing service discovery and matching approaches are based on a keyword-based strategy. However, this method is inefficient and time-consuming. In this paper, we present a novel approach for discovering web services. Based on the current dominating mechanisms of discovering and describing Web Services with UDDI and WSDL, the proposed approach utilizes Probabilistic Latent Semantic Analysis (PLSA) to capture semantic concepts hidden behind words in the query and advertisements in services, so that service matching can be carried out at the concept level. We also present related algorithms and preliminary experiments to evaluate the effectiveness of our approach.Jiangang Ma Victoria University, AustraliaJinli Cao La Trobe University, AustraliaYanchun Zhang Victoria University, Australia885Search Engine Retrieval of Changing InformationIn this paper we analyze the Web coverage of three search engines, Google, Yahoo and MSN. We conducted a 15-month study collecting 15,770 Web content or information pages linked from 260 Australian federal and local government Web pages. The key feature of this domain is that new information pages are constantly added but the 260 web pages tend to provide links only to the more recently added information pages. Search engines list only some of the information pages and their coverage varies from month to month. Meta-search engines do little to improve coverage of information pages, because the problem is not the size of web coverage, but the frequency with which information is updated. 
We conclude that organizations such as governments which post important information on the Web cannot rely on all relevant pages being found with conventional search engines, and need to consider other strategies to ensure important information can be found.Yang Sok Kim University of TasmaniaByeong Ho Kang University of TasmaniaPaul Compton The University of New South WalesHiroshi Motoda Osaka University886Modeling User Behavior in Recommender Systems based on Maximum EntropyWe propose a model for user purchase behavior in online stores that provide recommendation services. We model the purchase probability given recommendations for each user based on the maximum entropy principle, using features that deal with recommendations and user interests. The proposed model enables us to measure the effect of recommendations on user purchase behavior, and this effect can be used to evaluate recommender systems. We show the validity of our model using the log data of an online cartoon distribution service, and measure the recommendation effects for evaluating the recommender system.Tomoharu Iwata NTTKazumi Saito NTTTakeshi Yamada NTT887Personalized Social and Real-Time Collaborative SearchThis paper presents Adaptive Web Search (AWS), a novel search technique that combines personalized, social, and real-time collaborative search. Preliminary empirical results from a small sample suggest that an AWS prototype built on the WAMP platform using the Yahoo! Web Search API generates more relevant results and allows faster discovery of information.Mukesh Dalal LDI893Comparing Apples and Oranges: Normalized PageRank for Evolving GraphsPageRank is the best-known technique for link-based importance ranking. The computed importance scores, however, are not directly comparable across different snapshots of an evolving graph. We present an efficiently computable normalization for PageRank scores that makes them comparable across graphs. Furthermore, we show that the normalized PageRank scores are robust to non-local changes in the graph, unlike the standard PageRank measure.Klaus Berberich Max-Planck Institute for InformaticsSrikanta Bedathur Max-Planck Institute for InformaticsMichalis Vazirgiannis INRIA/FUTURSGerhard Weikum Max-Planck Institute for Informatics896How Naga Uncoils: Searching with Entities and RelationsCurrent keyword-oriented search engines for the World Wide Web do not allow users to specify the semantics of queries. We address this limitation with NAGA, a new semantic search engine. NAGA builds on a large semantic knowledge base of binary relationships (facts) derived from the Web. NAGA provides a simple, yet expressive query language to query this knowledge base. The results are then ranked with an intuitive scoring mechanism. We show the effectiveness and utility of NAGA by comparing its output with that of Google on some interesting queries.Gjergji Kasneci Max-Planck-Institute for Computer ScienceFabian M. Suchanek Max-Planck-Institute for Computer ScienceMaya Ramanath Max-Planck-Institute for Computer ScienceGerhard Weikum Max-Planck-Institute for Computer Science897Towards Efficient Dominant Relationship Exploration of the Product Items on the WebIn recent years, search engines have been widely employed to find useful information on the Web, as they efficiently exploit the hyperlinks between web pages, which define a natural graph structure that yields a good ranking.
Unfortunately, current search engines cannot effectively rank such relational data, which exist on dynamic websites supported by online databases. In this study, to rank such structured data (i.e., find the ``best'' items), we propose an integrated online system consisting of a compressed data structure that encodes the dominant relationships of the relational data. Efficient querying strategies and an updating scheme are devised to facilitate the ranking process. Extensive experiments illustrate the effectiveness and efficiency of our methods. As such, we believe the work in this paper can be complementary to traditional search engines.Zhenglu Yang University of TokyoLin Li University of TokyoBotao Wang University of TokyoMasaru Kitsuregawa University of Tokyo898Ontology Engineering Using Volunteer LaborWe describe an approach designed to reduce the costs of ontology development through the use of untrained, volunteer knowledge engineers. Results are provided from an experiment in which volunteers were asked to judge the correctness of automatically inferred subsumption relationships in the biomedical domain. The experiment indicated that volunteers can be recruited fairly easily but that their attention is difficult to hold, that most do not understand the subsumption relationship without training, and that incorporating learned estimates of trust into voting systems is beneficial to aggregate performance.Benjamin Good University of British ColumbiaMark D. Wilkinson University of British Columbia899A Management and Performance Framework for Semantic Web ServersThe unification of Semantic Web query languages under the SPARQL standard and the development of commercial-quality implementations are encouraging industries to use semantic technologies for managing information. Current implementations, however, lack the performance monitoring and management services that the industry expects. In this paper, we present a performance and management framework interface to a generic SPARQL web server. We leverage existing standards for instrumentation to make the system ready to manage through existing monitoring applications, and we provide a performance framework with the distinct feature of providing measurement results through the same SPARQL interface used to query data, eliminating the need for special interfaces.Malena Mesarina Hewlett Packard LaboratoriesVenugopal Srinivasmurthy Hewlett Packard CompanyNic Lyons Hewlett-PackardCraig Sayers Hewlett Packard Laboratories900Acquiring Ontological Knowledge from Query LogsWe present a method for acquiring ontological knowledge using search query logs. We first use query logs to identify important contexts associated with terms belonging to a semantic category; we then use these contexts to harvest new words belonging to this category. Our evaluation on selected categories indicates that the method works very well for harvesting terms, achieving 85% to 95% accuracy in categorizing newly acquired terms.Satoshi Sekine New York UniversityHisami Suzuki Microsoft Research901Towards Multi-granularity Multi-facet E-Book RetrievalGenerally speaking, digital libraries have multiple granularities of semantic units: book, chapter, page, paragraph and word. However, there are two limitations of current eBook retrieval systems: (1) the granularity of retrievable units is either too big or too small, and intermediate scales such as chapters and paragraphs are ignored; (2) the retrieval results should be grouped by facets to facilitate users' browsing and exploration.
To overcome these limitations, we propose a multi-granularity, multi-facet eBook retrieval approach.Chong Huang Graduate University, Chinese Academy of SciencesYonghong Tian Institute of Computing Technology, Chinese Academy of SciencesZhi Zhou Institute of Computing Technology, Chinese Academy of SciencesTiejun Huang Peking University902A Kernel based Structure Matching for Web Services SearchThis paper describes a kernel-based Web Services (abbreviated as service) matching mechanism for service discovery and integration. The matching mechanism tries to exploit the latent semantics in the structure of services. Using textual similarity and n-spectrum kernel values as low-level and mid-level features, we build a model to estimate the functional similarity between services, whose parameters are learned by a Ranking-SVM. The experimental results show that our approach improves several metrics for the retrieval of services.Jianjun Yu Beihang UniversityShengmin Guo Beihang UniversityHao Su Beihang UniversityHui Zhang Beihang UniversityKe Xu Beihang University904SPath: A Path Language for XML SchemaXML is increasingly being used as a typed data format, and it therefore becomes more important to gain access to the type system, which very often is an XML Schema. The XML Schema Path Language (SPath) presented in this paper provides access to XML Schema components by extending the well-known XPath language to also include the domain of XML Schemas. Using SPath, XML developers gain access to XML Schemas and thus can more easily develop software which is type- or schema-aware, and thus more robust.Erik Wilde UC BerkeleyFelix Michel ETH Zurich907Efficient Training on Biased Minimax Probability Machine for Imbalanced Text ClassificationThe Biased Minimax Probability Machine constructs a classifier which deals with imbalanced learning tasks. In this paper, we propose a Second Order Cone Programming based algorithm to train the model. We outline the theoretical derivation of the biased classification model, and use the proposed strategy to address text classification tasks where negative training documents significantly outnumber the positive ones. We evaluated the learning scheme in comparison with traditional solutions on three different datasets. Empirical results show that our method is more effective and robust in handling imbalanced text classification problems.Xiang Peng Department of Computer Science and Engineering, The Chinese University of Hong KongIrwin King Department of Computer Science and Engineering, The Chinese University of Hong Kong908BlogScope: Spatio-temporal Analysis of the BlogosphereWe present BlogScope (www.blogscope.net), a system for analyzing the Blogosphere. BlogScope is an information discovery and text analysis system that offers a set of unique features. Such features include spatio-temporal analysis of blogs, flexible navigation of the Blogosphere through information bursts, keyword correlations and burst synopses, as well as enhanced ranking functions for improved query answer relevance. We describe the system, its design and the features of the current version of BlogScope.Nilesh Bansal University of TorontoNick Koudas University of Toronto909Towards Extracting Flickr Tag SemanticsWe address the problem of extracting the semantics of tags -- short, unstructured text-labels assigned to resources on the Web -- based on each tag's metadata patterns.
In particular, we describe an approach for extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website supporting time and location (latitude/longitude) metadata. The approach can be generalized to other domains where text terms can be extracted and associated with metadata patterns, such as geo-annotated web pages.Tye Rattenbury Yahoo! Research BerkeleyNathan Good Yahoo! Research BerkeleyMor Naaman Yahoo! Research Berkeley910Generating Efficient Labels to Facilitate Web AccessibilityFor many users with a disability, it can be difficult or impossible to use a computer mouse to navigate the web. An alternative way to select elements on a web page is the label typing approach, in which users select elements by typing part of the label. In most cases, these labels are specified by the page authors, but some selectable elements do not have an obvious textual description, thus requiring that a label be generated. The set of element labels on a web page must be both efficient to select by text input and meaningful to the user. This paper discusses our approach to this problem, using page structural analysis and user history to determine the important elements of a page, and then matching this information with the efficiency model of the input device.Leo Spalteholz University of VictoriaKin Fun Li University of VictoriaNigel Livingston University of Victoria911Automatic Search Engine Performance Evaluation with Click-through Data AnalysisPerformance evaluation is an important issue in Web search engine research. Traditional evaluation methods rely on substantial human effort and are therefore quite time-consuming. Using click-through data analysis, we propose an automatic search engine performance evaluation method. This Cranfield-like method automatically generates navigational query topics and answers based on a search user's query and click behavior. Experimental results based on a commercial Chinese search engine's user logs show that the automatic method produces evaluation results similar to those of traditional assessor-based methods.Yiqun Liu Tsinghua UniversityYupeng Fu Tsinghua UniversityMin Zhang Tsinghua UniversityShaoping Ma Tsinghua UniversityLiyun Ru Sohu Incorporation912Query Topic Detection for ReformulationIn this paper, we show that most multiple-term queries include more than one topic and that users usually reformulate their queries by topic instead of by term. In order to provide empirical evidence on users' reformulation behavior and to help search engines better handle the query reformulation problem, we focus on detecting internal topics in the original query and analyzing users' reformulations of those topics. We utilize Interaction Information (II) to measure the degree to which a subquery constitutes a topic, based on the local search results. The experimental results on query logs show that most users reformulate queries at the topical level, and that our proposed II-based algorithm is a good method for detecting topics in original queries.Xuefeng He Peking UniversityJun Yan Microsoft Research AsiaJinwen Ma Peking UniversityNing Liu Microsoft Research AsiaZheng Chen Microsoft Research Asia913EOS: Expertise Oriented Search Using Social NetworksIn this paper, we present the design and implementation of our expertise-oriented search system, EOS (http://www.arnetminer.net). EOS is a researcher social network system.
It has gathered information about half a million computer science researchers from the Web and constructed a social network among the researchers through their co-authorship. In particular, the relationships in the social network are used both in ranking experts for a given topic and in searching for associations between researchers. Our experimental results demonstrate that the proposed methods for expert finding and association search in a social network are both more effective and more efficient than the baseline methods.Juanzi Li Tsinghua UniversityJie Tang Tsinghua UniversityJing Zhang Tsinghua UniversityQiong Luo Hong Kong University of Science and TechnologyYunhao Liu Hong Kong University of Science and TechnologyMingcao Hong Tsinghua University914Measuring Credibility of Users in an E-learning EnvironmentLearning Villages (LV) is an e-learning platform on which people hold online discussions and frequently cite one another's postings. In this paper, we propose a novel method to rank credible authors in the LV system. We first propose a k-EACM graph to describe the article citation structure in the LV system. We then build a weighted graph model, the k-UCM graph, to reveal the implicit relationships between authors hidden behind the citations among their articles. Furthermore, we design a graph-based ranking algorithm, the Credit Author Ranking (CAR) algorithm, which can be applied to rank nodes in a graph with negative edges. Finally, we perform experimental evaluations through simulations. The results illustrate that the proposed method works well for ranking the credibility of users in the LV system.Wei Wei The Royal Institute of TechnologyJimmy Lee The Chinese University of Hong KongIrwin King The Chinese University of Hong Kong916Delay Tolerant Applications for Low Bandwidth and Intermittently Connected Users: the aAQUA ExperienceWith the explosive growth and spread of the Internet, web access from mobile and rural users has become significant. But these users face problems of low bandwidth and intermittent Internet connectivity. To make the benefits of the Internet reach the common man in developing countries, the accessibility and availability of information have to be improved. aAQUA is an online multilingual, multimedia agricultural portal for disseminating information from and to rural communities. Considering resource-constrained rural environments, we have designed and implemented an offline solution which provides an online experience to users in disconnected mode. Our solution is based on heterogeneous database synchronization, which involves only a small synchronization payload, ensuring efficient use of the available bandwidth. Offline aAQUA has been deployed in the field, and systematic studies of our solution show that the user experience has improved tremendously not only in disconnected mode but also in connected mode.Saurabh Sahni IIT BombayKrithi Ramamritham IIT Bombay918EPCI: Extracting Potentially Copyright Infringement Texts from the WebIn this paper, we propose EPCI, a new system for extracting potentially copyright-infringing texts from the Web.
EPCI extracts them in the following way: (1) generating a set of queries based on a given copyright-reserved seed-text, (2) submitting every query to search engine APIs, (3) gathering the search-result Web pages in ranked order until the similarity between the given seed-text and the result pages becomes less than a given threshold value, and (4) merging all the gathered pages and then re-ranking them in order of their similarity. Our experimental results using 40 seed-texts show that EPCI is able to extract, on average, 132 potentially copyright-infringing Web pages per copyright-reserved seed-text with 94% precision.Takashi Tashiro Waseda UniversityTakanori Ueda Waseda UniversityTaisuke Hori Waseda UniversityYu Hirate Waseda UniversityHayato Yamana Waseda University921System for Reminding a User of Information Obtained through a Web Browsing ExperienceWe often forget, and thus waste, the varied information obtained through our own past web browsing. We propose a system for reminding a user of information obtained through past web browsing experience. The system extracts keywords from the content of the currently viewed web page and retrieves contexts from past web browsing related to those keywords. We define a context as a sequence of web browsing in which many pages related to the keyword were viewed intensively, because we assume that much information connected to the current content was obtained in that sequence. This information includes not only which pages were viewed but also how they were found and what knowledge was acquired from them. Concretely, as a user browses web pages, the proposed system automatically shows a list of contexts judged to be important and related to the current page. If the user selects a context, its details are shown graphically with marks indicating characteristic activities.Tetsushi Morita NTTTetsuo Hidaka NTTAkimichi Tanaka NTTYasuhisa Kato NTT923A Search-based Chinese Word Segmentation MethodIn this paper, we propose a novel Chinese word segmentation method which leverages the huge deposit of Web documents and search technology. It simultaneously solves the ambiguous phrase boundary resolution and unknown word identification problems. Evaluations demonstrate its effectiveness.Xin-Jing Wang IBM China Research LabWen Liu Huazhong University of Science and TechnologyYong Qin IBM China Research Lab926Semantic Personalization of Web Portal ContentsEnriching Web applications with personalized data is of major interest for facilitating user access to the published contents and, therefore, for guaranteeing successful user navigation. We propose a conceptual model for extracting personalized recommendations based on user profiling, ontological domain models, and semantic reasoning. The approach offers a high-level representation of the designed application based on a domain-specific metamodel for Web applications called WebML.Christina Tziviskou Politecnico di MilanoMarco Brambilla Politecnico di Milano927A Browser for a Public-Domain SpeechWebA SpeechWeb is a collection of hyperlinked applications which are accessed remotely by speech browsers running on end-user devices. Links are activated through spoken commands. Despite the fact that protocols and technologies for creating and deploying speech applications have been readily available for several years, we have not seen the development of a Public-Domain SpeechWeb.
In this paper, we show how freely available software and commonly used communication protocols can be used to change this situation.Richard Frost University of WindsorXiaoli Ma University of WindsorYue Shi University of Windsor929Summary Attributes and Perceived Search QualityWe conducted a series of experiments in which surveyed web search users answered questions about the quality of search results on the basis of the result summaries. Summaries shown to different groups of users were editorially constructed so that they differed in only one attribute, such as length. Some attributes had no effect on users' quality judgments, while in other cases, changing an attribute had a "halo effect" which caused seemingly unrelated dimensions of result quality to be rated higher by users.Daniel Rose A9.comDavid Orr Yahoo! Inc.Raj Gopal Prasad Kantamneni Yahoo! Inc.930Review Spam DetectionIt is now common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products. They are used by potential customers to find the opinions of existing users before deciding to purchase a product. They are also used by product manufacturers to identify problems with their products and to find competitive intelligence information about their competitors. Unfortunately, this importance of reviews also provides a strong incentive for spam, which contains false positive or maliciously negative opinions. In this paper, we make an attempt to study review spam and spam detection. To the best of our knowledge, there is still no reported study on this problem.Nitin Jindal University of Illinois at ChicagoBing Liu University of Illinois at Chicago931A Novel Clustering-based RSS AggregatorIn recent years, different commercial Weblog subscription systems have been proposed to return stories from users' subscribed feeds. In this paper, we propose a novel clustering-based RSS aggregator called the RSS Clusgator System (RCS) for Weblog reading. Note that an RSS feed may cover several different topics, and a user may only be interested in a subset of these topics. In addition, there could be many different stories from multiple RSS feeds that discuss a similar topic from different perspectives. A user may be interested in this topic but not know how to collect all the feeds related to it. In contrast to many previous works, we cluster all stories in RSS feeds into a hierarchical structure to better serve readers. In this way, users can easily find all the stories they are interested in. To keep the system current, we propose a flexible time window for incremental clustering. RCS utilizes both link information and content information for efficient clustering. Experiments show the effectiveness of RCS.Xin Li Peking UniversityJun Yan Microsoft Research AsiaZhihong Deng Peking UniversityLei Ji Microsoft Research AsiaWeiguo Fan Virginia Polytechnic Institute and State UniversityBenyu Zhang Microsoft Research AsiaZheng Chen Microsoft Research Asia932Exploit Sequencing Views in Semantic Cache to Accelerate XPath Query EvaluationIn XML databases, materializing queries and their results into views in a semantic cache can improve the performance of query evaluation by reducing computational complexity and I/O cost. Although there are a number of proposals of semantic caches for XML queries, the issues of fast cache lookup and compensation query construction could be further studied.
In this paper, based on sequential XPath queries, we propose fastCLU, a fast Cache LookUp algorithm, and effiCQ, an efficient Compensation Query constructing algorithm, to solve these two problems. Reported experimental results show that our algorithms outperform previous algorithms and can achieve good query evaluation performance.Jianhua Feng Tsinghua UniversityNa Ta Department of Computer Science and Technology, Tsinghua UniversityYong Zhang Department of Computer Science and Technology, Tsinghua UniversityGuoliang Li Tsinghua University934XML-Based XML Schema AccessXML Schema's abstract data model consists of components, which are the structures that eventually define a schema as a whole. XML Schema's XML syntax, on the other hand, is not a direct representation of the schema components, and it proves to be surprisingly hard to derive a schema's components from the XML syntax. The Schema Component XML Syntax (SCX) is a representation which attempts to map schema components as faithfully as possible to XML structures. SCX serves as the starting point for applications which need access to schema components and want to do so using standardized and widely available XML technologies.Erik Wilde UC BerkeleyFelix Michel ETH Zurich935Towards Service Pool Based Approach for Services Discovery and SubscriptionThere are many functionally identical web services on the Internet and within some large-scale organizations. They provide consumers with more choices according to their personalized QoS requirements. However, in current web service discovery and subscription, consumers spend too much time on manual selection and cannot easily benefit from the wide QoS spectrum brought by the proliferating services. In this paper, we propose a QoS-aware discovery and subscription approach to free consumers from time-consuming human-computer interactions while helping them negotiate QoS with multiple service providers. The core of this approach is the service pool, a virtual service that groups functionally identical services together and dispatches consumer requests to the proper service in terms of QoS requirements. Based on our previous work on the service pool, this paper makes two main contributions: first, we investigate pool construction mechanisms based on similarity retrieval and formalize the representation of the service pool; second, we propose a formalized QoS model from the consumer's perspective and design a QoS-aware discovery algorithm to select the most adequate provider from the service pool according to the consumer's QoS requirements.
Xuanzhe Liu Peking UniversityLi Zhou Peking UniversityGang Huang Peking UniversityHong Mei Institute of Software, Peking University936Causal Relation of Queries from Temporal LogsIn this paper, we study a new problem of mining causal relations between queries in search engine query logs. A causal relation between two queries means that an event on one query is the cause of some event on the other query. We first detect events in query logs using an efficient statistical frequency threshold. The causal relations between queries are then mined from the geometric features of the events. Finally, the Granger Causality Test (GCT) is utilized to further re-rank the causal relations of queries according to their GCT coefficients. In addition, we develop a 2-dimensional visualization tool to display the detected relationships between events in a more intuitive way. The experimental results on the MSN search engine query logs demonstrate that our approach can accurately detect the events in temporal query logs and effectively detect the causal relations between queries.Yizhou Sun Peking UniversityNing Liu Microsoft Research AsiaKunqing Xie Peking UniversityShuicheng Yan University of IllinoisBenyu Zhang Microsoft Research AsiaZheng Chen Microsoft Research Asia937Electoral Search Using the VerkiezingsKijker: An Experience ReportThe Netherlands had parliamentary elections on November 22, 2006. We built a system which helped voters make an informed choice among the many participating parties.
One of the most important pieces of information in the Dutch election and the subsequent coalition government formation is the party program, a text document with an average length of 40 pages. Our system provides the voter with focused access to party programs, enabling her to make a topic-wise comparison of parties' viewpoints. We complemented this type of access (``What do the parties promise?'') with access to news (``What happens around these topics?'') and blogs (``What do people say about them?''). We describe the system, including design and technical details, and user statistics.Valentin Jijkoun University of AmsterdamMaarten Marx University of AmsterdamMaarten de Rijke University of AmsterdamFrank van Waveren ISLA, University of Amsterdam939Web Page Classification with Heterogeneous Data FusionWeb pages are more than text; they contain much contextual and structural information, e.g., the title, metadata, the anchor text, etc., each of which can be seen as a data source or a representation. Due to the different dimensionality and different representational forms of these heterogeneous data sources, simply putting them together would not greatly enhance classification performance. We observe that, via a kernel function, data sources of different dimensions and types can be represented in a common format, the kernel matrix, which can be seen as a generalized similarity measure between web pages. In this sense, a kernel learning approach is employed to fuse these heterogeneous data sources. The experimental results on a collection from the ODP database validate the advantages of the proposed method over any single data source and over the uniformly weighted combination of heterogeneous data sources.Zenglin Xu The Chinese University of Hong KongIrwin King The Chinese University of Hong KongMichael R. Lyu The Chinese University of Hong Kong940Integrating Web Directories by Learning their StructuresDocuments on the Web are often organized using category trees by information providers (e.g. CNN, BBC) or search engines (e.g. Google, Yahoo!). Such category trees are commonly known as Web directories. The category tree structures from different internet content providers may be similar to some extent but are usually not exactly the same. As a result, it is desirable to integrate these category trees together so that web users only need to browse through a unified category tree to extract information from multiple providers. In this paper, we address this problem by capturing the structural information of multiple category trees, which embed the knowledge of professionals in organizing the documents. Our experiments with real Web data show that the proposed technique is promising.Christopher Yang The Chinese University of Hong KongJianfeng Lin The Chinese University of Hong Kong941Identifying Ambiguous Queries in Web SearchIt is widely believed that some queries submitted to search engines are by nature ambiguous (e.g., java, apple). However, few studies have investigated how many queries are ambiguous and how we can automatically identify an ambiguous query. This paper deals with these issues. First, we construct a taxonomy of query ambiguity, and ask human annotators to manually classify queries based upon it. From the manually labeled results, we found that query ambiguity is to some extent predictable. We then use a supervised learning approach to automatically classify queries as being ambiguous or not.
Experimental results show that we can correctly identify 87% of labeled queries with a machine learning approach. Finally, we estimate that about 16% of queries in a real search log are ambiguous.Ruihua Song Shanghai Jiao Tong University, ChinaZhenxiao Luo Fudan University, ChinaJi-Rong Wen Microsoft Research AsiaYong Yu Shanghai Jiao Tong UniversityHsiao-Wuen Hon Microsoft Research Asia947On Ranking Techniques for Desktop SearchThis paper addresses the desktop search problem by considering various techniques for ranking the results of a search query over the file system. First, basic ranking techniques, which are based on a single file feature (e.g., file name, file content, access date, etc.), are considered. Next, two learning-based ranking schemes are presented, and are shown to be significantly more effective than the basic ranking methods. Finally, a novel ranking technique based on query selectiveness is considered for use during the cold-start period of the system. This method is also shown to be empirically effective, even though it does not involve any learning.Sara Cohen Technion - Israel Institute of TechnologyCarmel Domshlak Technion - Israel Institute of TechnologyNaama Zwerdling Technion - Israel Institute of Technology950Finding Community Structure in Mega-scale Social NetworksThe community analysis algorithm proposed by Clauset, Newman, and Moore (the CNM algorithm) finds community structure in social networks. Unfortunately, the CNM algorithm does not scale well and its use is practically limited to networks with up to 500,000 nodes. We show that this inefficiency is caused by merging communities in an unbalanced manner, and that a simple heuristic that attempts to merge community structures in a balanced manner can dramatically improve community structure analysis. The proposed techniques are tested using data sets obtained from an existing social networking service that hosts 5.5 million users. We have tested three variations of the heuristic. The fastest method processes an SNS friendship network with 1 million users in 5 minutes (70 times faster than CNM) and another friendship network with 4 million users in 35 minutes. Another variation processes a network with 500,000 nodes in 50 minutes (7 times faster than CNM), finds community structures with improved modularity, and scales to a network with 5.5 million users.Ken Wakita Tokyo Institute of TechnologyToshiyuki Tsurumi Tokyo Institute of Technology953Mirror Site Maintenance Based on Evolution Associations of Web DirectoriesMirroring Web sites is a well-known technique commonly used in the Web community. A mirror site should be updated frequently to ensure that it reflects the content of the original site. Existing mirroring tools apply page-level strategies to check each page of a site, which is inefficient and expensive. In this paper, we propose a novel site-level mirror maintenance strategy. Our approach studies the evolution of Web directory structures and mines association rules between ancestor-descendant Web directories. Discovered rules indicate the evolution correlations between Web directories. Thus, when maintaining the mirror of a Web site (directory), we can optimally skip subdirectories that are negatively correlated with it with respect to undergoing significant changes.
The preliminary experimental results show that our approach improves the efficiency of the mirror maintenance process significantly while sacrificing only slightly the ``freshness'' of the mirrors.Ling Chen L3S/University of HannoverSourav Bhowmick Nanyang Technological UniversityWolfgang Nejdl L3S/University of Hannover955Understanding Web Search via a Learning ParadigmInvestigating whether one can view Web searching as a learning process, we examined the searching characteristics of 41 participants engaged in 246 searching tasks. We classified the searching tasks according to an updated version of Bloom's taxonomy, a six-level categorization of cognitive learning. Results show that Applying takes the most searching effort as measured by queries per session and specific topics searched per session. The lower-level categories of Remembering and Understanding exhibit searching characteristics similar to the higher-order learning of Evaluating and Creating. It appears that searchers rely primarily on their internal knowledge for Evaluating and Creating, using searching primarily for fact checking and verification. The implication is that the commonly held notion that Web searchers have simple information needs may not be correct. We discuss the implications for Web searching, including designing interfaces to support exploration.Bernard J. Jansen Pennsylvania State UniversityBrian Smith Pennsylvania State UniversityDanielle Booth Pennsylvania State University956Life is Sharable: Mechanisms to Support and Sustain Blogging Life ExperienceRecent trends in the development of mobile devices, wireless communications, sensor technologies, weblogs, and peer-to-peer communications have prompted new design opportunities for enhancing social interactions. This paper introduces our preliminary experience in designing a prototype utilizing the aforementioned technologies to share life experience. Users equipped with camera phones coupled with short-range communication technology, such as RFID, can capture life experience and share it as weblogs with other people. However, in reality, this is easier said than done. The success of weblogs relies on the active participation and willingness of people to contribute. To encourage active participation, a ranking system, AgreeRank, is specifically developed to motivate contributors.Yun-Maw Cheng Department of Computer Science and Engineering, Tatung UniversityTzu-Chuan Chou Institute of Information Science, Academia SinicaWai Yu Virtual Engineering Centre, Queen's University BelfastLi-Chieh Chen Department of Industrial Design, Tatung UniversityChing-Long Yeh Department of Computer Science and Engineering, Tatung UniversityMeng-Chang Chen Institute of Information Science, Academia Sinica958A Link-Based Ranking Scheme for Focused SearchThis paper introduces a novel link-based ranking algorithm based on a model of focused web surfers. FocusedRank is described and compared to implementations of PageRank and Topic-Sensitive PageRank, and a user study is conducted to measure the relevance and precision of each. Our results are shown to be statistically significant, warranting further research into link-based ranking schemes for focused search.Tony Abou-Assaleh GenieKnows.comTapajyoti Das GenieKnows.comWeizheng Gao GenieKnows.comYingbo Miao GenieKnows.comPhilip O'Brien GenieKnows.comZhen Zhen GenieKnows.com959Development of a Semantic Web Based Mobile Local Search SystemThis paper describes the development of a semantic web based mobile local search system.
It illustrates the first application of semantic web technology in mobile communication. The semantic web based mobile local search system described in this paper establishes an ontology for 13 fields, which represent user inquiries. The ontology for these 13 fields consists of 1,715 classes and 898,400 individuals. Finally, the system provides a movement service for reaching a Point of Interest in the shortest time by downloading geographic information through a navigation service based on a local search service. The system is very attractive and provides the fastest, most optimal movement path to the user because the downloaded geographic information includes real-time traffic information.Joo-Seong Jeon KTF R&D GroupGi-Jeong Lee KTF R&D Group963Exploration of Query Context for Information RetrievalA number of existing information retrieval systems propose the notion of query context, which combines knowledge of the query and the user into retrieval to reveal the most exact description of the user's information needs. In this paper we interpret query context as a document consisting of sentences related to the current query. This kind of query context is used to re-estimate the relevance probabilities of top-ranked documents and then re-rank them. The experiments show that the proposed context-based approach for information retrieval can greatly improve the relevance of search results.Keke Cai Zhejiang UniversityChun Chen Zhejiang UniversityJiajun Bu Zhejiang UniversityPeng Huang Zhejiang UniversityZhiming Kang Zhejiang University964Bayesian Network based Sentence Retrieval ModelThis paper makes an intensive investigation of the application of Bayesian networks to sentence retrieval and introduces three Bayesian network based sentence retrieval models, with and without consideration of term relationships. Term relationships are considered from two perspectives: the first observes relationships between pairs of terms, and the second focuses on relationships between terms and term sets. Experiments have proven the efficiency of Bayesian networks in sentence retrieval. In particular, the retrieval results that take the second kind of term relationship into consideration perform best in improving retrieval precision.Keke Cai Zhejiang UniversityJiajun Bu Zhejiang UniversityChun Chen Zhejiang UniversityKangmiao Liu Zhejiang UniversityWei Chen Zhejiang University966Image Annotation by Hierarchical Mapping of FeaturesIn this paper, we propose a novel approach to image annotation that constructs a hierarchical mapping between low-level visual features and text features, utilizing the relations within and across both visual features and text features. Moreover, we propose a novel annotation strategy that maximizes both the accuracy and the diversity of the generated annotation by generalizing or specializing the annotation in the corresponding annotation hierarchy. Experiments with 4500 scientific images from Royal Society of Chemistry journals show that the proposed annotation approach produces satisfactory results at different levels of annotation.Qiankun Zhao PSUPrasenjit Mitra The Pennsylvania State UniversityC. Lee Giles Pennsylvania State University968Crawling Multiple UDDI Business RegistriesAs Web services proliferate, the size and number of UDDI Business Registries (UBRs) are likely to increase.
The ability to discover Web services of interest across multiple UBRs then becomes a major challenge, especially when using the primitive search methods provided by existing UDDI APIs. Clients do not have the time to endlessly search accessible UBRs to find appropriate services, particularly when operating via mobile devices. Although there have been numerous standards with the potential of enhancing the discovery of Web services, searching for relevant Web services across multiple UBRs raises a number of concerns such as performance, efficiency, reliability, and, most importantly, the quality of returned results. Finding services of interest should be time-effective and highly productive. This paper addresses issues relating to the efficient access and discovery of Web services across multiple UBRs and introduces a novel exploration engine, the Web Service Crawler Engine (WSCE). WSCE is capable of crawling multiple UBRs and enables the establishment of a centralized Web services repository that can be used to discover Web services much more efficiently. The paper presents experimental validation, results, and analysis of the proposed ideas.Eyhab Al-Masri University of GuelphQusay Mahmoud University of Guelph969Brand Awareness and the Evaluation of Search ResultsWe investigate the effect of search engine brand (i.e., the identifying name or logo that distinguishes a product from its competitors) on the evaluation of system performance. Our research is motivated by the large amount of search traffic directed to a handful of Web search engines, even though most are of equal technical quality with similar interfaces. We conducted a laboratory study with 32 participants measuring the effect of four search engine brands while controlling for the quality of search engine results. Using average relevance ratings, there was a 25% difference between the most highly rated search engine and the lowest, even though the search engine results were identical in both content and presentation. We discuss implications for search engine marketing and the design of search engine quality studies.Bernard J. Jansen Pennsylvania State UniversityMimi Zhang Pennsylvania State UniversityYing Zhang Pennsylvania State University970Discovering the Best Web ServiceMajor research challenges in discovering Web services include provisioning of services across multiple or heterogeneous registries, differentiating between services that share similar functionalities, improving end-to-end Quality of Service (QoS), and enabling clients to customize the discovery process. The proliferation and interoperability of this multitude of Web services have led to the emergence of new standards for how services can be published, discovered, or used (i.e., UDDI, WSDL, SOAP). Such standards can potentially provide many of these features and much more; however, there are technical challenges associated with existing standards. One of these challenges is the client's ability to control the discovery process across accessible service registries to find services of interest. This work proposes a solution to this problem and introduces the Web Service Relevancy Function (WsRF), used for measuring the relevancy ranking of a particular Web service based on QoS metrics and client preferences.
We present experimental validation, results, and analysis of the presented ideas.Eyhab Al-Masri University of GuelphQusay Mahmoud University of Guelph972Web Mashup Scripting LanguageThe Web Mashup Scripting Language (WMSL) enables an end-user (you), working from his or her browser without needing any other infrastructure, to quickly write mashups that integrate any two or more web services on the Web. The end-user accomplishes this by writing a web page that combines HTML, metadata in the form of mapping relations, and a small piece of code, or script. The mapping relations enable not only the discovery and retrieval of the WMSL pages, but also a new programming paradigm that abstracts many programming complexities away from the script writer. Furthermore, the WMSL Web pages or scripts that disparate end-users write can be harvested by crawlers to automatically generate the concepts needed to build lightweight ontologies containing the local semantics of a web service and its data model, to extend context ontologies or middle ontologies, and to develop links, or mappings, between these ontologies. This enables an open-source model of building ontologies based on the WMSL Web pages or scripts that end users write.Marwan Sabbouh The Mitre CorporationJeff Higginson Mitre CorporationSalim Semy The MITRE CorporationDanny Gagne MITRE Corporation973Utility Analysis for Topically Biased PageRankPageRank is known to be an efficient metric for computing general document importance on the Web. While it is commonly used as a one-size-fits-all measure, its ability to produce topically biased ranks has not yet been explored in detail. In particular, it has remained unclear at what granularity of "topic" the computation of biased page ranks makes sense. In this paper we present the results of a thorough quantitative and qualitative analysis of biasing PageRank on Open Directory categories. We show that the MAP quality of Biased PageRank generally increases with the ODP level up to a certain point, thus supporting the use of more specialized categories to bias PageRank on, in order to improve topic-specific search.Christian Kohlschuetter L3S / University of HannoverPaul - Alexandru Chirita L3S / University of HannoverWolfgang Nejdl L3S / University of Hannover974On Building Graphs of Documents with Artificial AntsWe present an incremental algorithm for building a neighborhood graph from a set of documents. This algorithm is based on a population of artificial agents that imitate the way real ants build structures with self-assembly behaviors. We show that our method outperforms standard algorithms for building such neighborhood graphs (up to 2230 times faster on the tested databases, with equal quality) and how the user may interactively explore the graph.Hanane Azzag Laboratoire d'Informatique de l'Universite de Paris-NordJulien Lavergne Laboratoire d'Informatique de l'Universite de Tours, FranceGilles Venturini Laboratoire d'Informatique de l'Universite de ToursChristiane Guinot CE.R.I.E.S.976Exploring Social Dynamics in Online Media SharingIt is now feasible to view media at home as easily as text-based pages were viewed when the Web first appeared. This development has led to the emergence of media sharing and search services providing hosting, indexing and access to large, online media repositories. Many of these sharing services also have a social aspect to them. This paper provides an initial analysis of the social interactions on a video sharing and search service.
Initial results show that many users do not form social networks in the community and a very small number do not appear to contribute to the wider community. However, it does seem that people who use the tools available to form social connections do so often. This shows some hope for the future, and the possibility of leveraging these social networks to aid users of these services, e.g., in searching for new media.Martin Halvey UCD DublinMark Keane UCD Dublin977Deriving Knowledge from Figures for Digital LibrariesFigures in digital documents contain important information. Current digital libraries do not summarize and index information available within figures for document retrieval. We present our system for automatic categorization of figures and extraction of data from 2-D plots. A machine-learning-based method is used to categorize figures into a set of predefined types based on image features. An automated algorithm is designed to extract data values from solid line curves in 2-D plots. The semantic type of figures and extracted data values from 2-D plots can be integrated with textual information within documents to provide more effective document retrieval services for digital library users. Experimental evaluation has demonstrated that our system can produce results suitable for real world use.Xiaonan Lu The Pennsylvania State UniversityJames Wang The Pennsylvania State UniversityPrasenjit Mitra The Pennsylvania State UniversityC. Lee Giles Pennsylvania State University978AutoPerf: An Automated Load Generator and Performance Measurement Tool for Multi-tier Software SystemsWe present a load generator and performance measurement tool (AutoPerf) which requires minimal input and configuration from the user, and produces a comprehensive capacity analysis as well as a server-side resource usage profile of a Web-based distributed system, in an automated fashion. The tool requires only the workload and deployment description of the distributed system, and automatically sets typical parameters that load generator programs need, such as the maximum number of users to be emulated, the number of users for each experiment, warm-up time, etc. The tool also does all the co-ordination required to generate a critical type of measure, namely, resource usage per transaction or per user for each software server. This is a necessary input for creating a performance model of a software system.Shrirang Shirodkar Computer Science and Engg, IIT BombayVarsha Apte Computer Science and Engg, IIT Bombay979Image Collector III: A Web Image-Gathering System with Bag-of-KeypointsWe propose a new system to mine visual knowledge on the Web. There is a huge amount of image data as well as text data on the Web. However, mining image data from the Web has received less attention than mining text data, since handling the semantics of images is much more difficult. In this paper, we propose introducing a recent image recognition technique, the bag-of-keypoints representation, into the Web image-gathering task. Through experiments we show that the proposed system greatly outperforms our previous systems and Google Image search.Keiji Yanai Univ. of Electro-Communications982The Use of XML to Express a Historical Knowledge BaseSince conventional historical records have been written assuming human readers, they are not well-suited for computers to collect and process automatically. If computers could understand descriptions in historical records and process them automatically, it would be easy to analyze them from different perspectives.
In this paper, we review a number of existing frameworks used to describe historical events, and make a comparative assessment of these frameworks in terms of usability, based on the deep cases of Fillmore's case grammar. Based on this assessment, we propose a new description framework, and have created a microformat vocabulary set suitable for that framework.Katsuko T. Nakahira Nagaoka University of TechnologyMasashi Matsui Nagaoka University of TechnologyYoshiki Mikami Nagaoka University of Technology983On Automated Composition for Web ServicesThe main objective of composing web services is to identify usable web services through discovery and to orchestrate or assemble selected services according to the goal specification. We formulate and study a framework to compose web services through discovery and orchestration for a given goal service. Composition algorithms with or without a goal service invocation request are developed. Two strategies are developed to tighten the goal service. Two notions of completeness are defined to measure how thoroughly the algorithms can find a composition.Zhongnan Shen University of California at Santa BarbaraJianwen Su University of California at Santa Barbara984Toward Automating Regression Test Selection for Web ServicesThis paper reports a safe regression test selection (RTS) approach that is designed for verifying Web services in an end-to-end manner. The safe RTS technique has been integrated into a systematic method that monitors distributed code modifications and automates the RTS and RT processes.Michael Ruth University of New OrleansShengru Tu University of New Orleans987Adaptive Faceted Browser for Navigation in Open Information SpacesOpen information spaces have several unique characteristics such as their changeability, large size, complexity and diverse user base. These result in novel challenges during user navigation, information retrieval and data visualization in open information spaces. We propose a method of navigation in open information spaces based on an enhanced faceted browser with support for dynamic facet generation and adaptation based on user characteristics.Michal Tvarozek Slovak University of Technology in BratislavaMaria Bielikova Slovak University of Technology in Bratislava988An Assessment of Tag Presentation TechniquesWith the growth of social bookmarking, a new approach to metadata creation called tagging has emerged. In this paper we evaluate the use of such tags. The main goal of our evaluation is to investigate the effect of some of the different properties that can be utilized in presenting tags, e.g., alphabetization, using larger fonts, etc. We show that a number of these factors can affect the ease with which users can find tags and use the tools for presenting tags to users.Martin Halvey UCD DublinMark Keane UCD Dublin989Determining the User Intent of Web Search Engine QueriesDetermining the user intent of Web searches is a difficult problem due to the sparse data available concerning the searcher. In this paper, we examine a method to determine the user intent underlying Web search engine queries. We qualitatively analyze samples of queries from seven transaction logs from three different Web search engines containing more than five million queries. From this analysis, we identified characteristics of user queries based on three broad classifications of user intent.
The classifications of informational, navigational, and transactional represent the type of content destination the searcher desired as expressed by their query. We implemented our classification algorithm and automatically classified a separate Web search engine transaction log of over a million queries submitted by several hundred thousand users. Our findings show that more than 80% of Web queries are informational in nature, with about 10% each being navigational and transactional. In order to validate the accuracy of our algorithm, we manually coded 400 queries and compared the classification to the results from our algorithm. This comparison showed that our automatic classification has an accuracy of 74%. Of the remaining 25% of the queries, the user intent is generally vague or multi-faceted, pointing to the need for probabilistic classification. We illustrate how knowledge of searcher intent might be used to enhance future Web search engines.Bernard J. Jansen Pennsylvania State UniversityDanielle Booth Pennsylvania State UniversityAmanda Spink Queensland University of Technology993Monitoring the Evolution of Cached Content in Google and MSNIn this paper, we describe a capture-recapture experiment conducted on Google's and MSN's cache directories. The anticipated outcome of this work was to monitor evolution rates in these web search services as well as measure their ability to index and maintain fresh and up-to-date results in their cache directories. We also intended to employ Yahoo! in our experiments.
However, we could not estimate the refresh rate of Yahoo!, since this search service does not publish information about when the cached version was taken, while Google and MSN do.Ioannis Anagnostopoulos University of the Aegean995Mobile Shopping Assistant: Integration of Mobile Applications and Web ServicesThe goal of this poster is to describe our implementation of a new architecture enabling efficient integration between mobile phone applications and Web Services. Using this architecture, we have implemented a mobile shopping assistant, described further below. In order to build this architecture, we designed an innovative XML compression mechanism to facilitate data exchange between mobile phones and Web Services. We also designed a smart connection manager to control asynchronous communication for all possible channels of a mobile phone. In addition, we used diverse input modes in order to extend users' access to Web Services.Huaigu Wu SAP LabsYuri Natchetoi SAP Canada Labs-Montreal996Estimating the Cardinality of RDF Graph PatternsMost RDF query languages allow for graph structure search through a conjunction of triples, which is typically processed using join operations. A key factor in optimizing joins is determining the join order, which depends on the expected cardinality of intermediate results. This work proposes a pattern-based summarization framework for estimating the cardinality of RDF graph patterns. We present experiments on real world and synthetic datasets which confirm the feasibility of our approach.Angela Maduko University of GeorgiaKemafor Anyanwu University of GeorgiaAmit Sheth Kno.e.sis Center, Wright State UniversityPaul Schliekelman University of Georgia997Spam and Popularity Ratings for Combating Link SpamWe present a new approach for propagating spam scores in web graphs, in order to combat link spam. The resulting spam rating is then used for propagating popularity scores like PageRank. Both propagations work even in the presence of censure links that represent distrust. Initial testing using a C++ prototype on small examples shows more reasonable results than other published approaches.Mukesh Dalal LDI998The ScratchPad: Sensemaking Support for the WebThe World Wide Web is a powerful platform for a wide range of information tasks. Dramatic advances in technology, such as improved search capabilities and the AJAX application model, have enabled entirely new web-based applications and usage patterns, making many tasks easier to perform than ever before. However, few tools have been developed to assist with sensemaking tasks: complex research behaviors in which users gather and comprehend information from many sources to answer potentially vague, non-procedural questions. Sensemaking tasks are common and include, for example, researching vacation destinations or deciding how to invest. This paper presents the ScratchPad, an extension to the standard browser interface that is designed to capture, organize, and exploit the information discovered while performing a sensemaking task.David Gotz IBM Research1000Preserving XML Queries during Schema EvolutionIn XML databases, new schema versions may be released as frequently as once every two weeks. This poster describes a taxonomy of changes for XML schema evolution. It examines the impact of those changes on schema validation and query evaluation.
Based on that study, it proposes guidelines for XML schema evolution and for writing queries in such a way that they continue to operate as expected across evolving schemas.Mirella Moro University of California, RiversideSusan Malaika IBM Silicon Valley LabLipyeow Lim IBM T.J. Watson Research Center1001Simple Authentication for the WebAutomated email-based password reestablishment (EBPR) is an efficient, cost-effective means to deal with forgotten passwords. In this technique, email providers authenticate users on behalf of web sites. This method works because web sites trust email providers to deliver messages to their intended recipients. Simple Authentication for the Web (SAW) improves upon this basic approach to user authentication to create an alternative to password-based logins. SAW: 1) Removes the setup and management costs of passwords at sites that accept the risks of EBPR; 2) Provides single sign-on without a specialized identity provider; 3) Thwarts all passive attacks.Tim van der Horst Brigham Young UniversityKent E. Seamons Brigham Young University1003Mining Contiguous Sequential Patterns from Web LogsFinding Contiguous Sequential Patterns (CSP) is an important problem in Web usage mining. In this paper we propose a new data structure, the UpDown Tree, for CSP mining. An UpDown Tree combines a suffix tree and a prefix tree for efficient storage of all the sequences that contain a given item. The special structure of the UpDown Tree ensures efficient detection of CSPs. Experiments show that the UpDown Tree improves CSP mining in terms of both time and memory usage compared to previous approaches.Jinlin Chen Queens College, CUNYTerry Cook Graduate Center, CUNY1004Using d-gap Patterns for Index CompressionSequential patterns of d-gaps exist pervasively in inverted lists of Web document collection indices due to the cluster property. In this paper the information of d-gap sequential patterns is used as a new dimension for improving inverted index compression. We first detect d-gap sequential patterns using a novel data structure, the UpDown Tree. Based on the detected patterns, we further substitute each pattern with its pattern ID in the inverted lists that contain it. The resulting inverted lists are then coded with an existing coding scheme. Experiments show that this approach can effectively improve the compression ratio of existing codes.Jinlin Chen Queens College, CUNYTerry Cook Graduate Center, CUNY1005U-Rest: An Unsupervised Record Extraction SysTemWe demonstrate a system that extracts record sets from record-list web pages with no direct human supervision. Our system, U-REST, reframes the problem of unsupervised record extraction as a two-phase machine learning problem with a clustering phase, where structurally similar regions are discovered, and a record cluster detection phase, where discovered groupings of regions are ranked by their likelihood of being records. This framework simplifies the record extraction task, and allows for independent analysis of the algorithms and the underlying features. In our work, we survey a large set of features under this simplified framework.
We conclude with a preliminary comparison of U-REST against similar systems and show improvements in the extraction accuracy.Yuan Kui Shen MIT CSAILDavid Karger MIT CSAIL1006Learning Ontologies to Improve the Quality of Automatic Web Service MatchingThis paper presents a novel technique that significantly improves the quality of semantic Web service matching by (1) automatically generating ontologies based on Web service descriptions and (2) using these ontologies to guide the mapping between Web services. The experimental results indicate that with our unsupervised approach we can eliminate up to 70% of incorrect matches that are made by dictionary-based approaches.Hui Guo Stony Brook UniversityAnca Ivan IBM T.J. Watson Research CenterRama Akkiraju IBM T.J. Watson Research Center1008Query-Driven Indexing for Peer-to-Peer Text RetrievalWe describe a query-driven indexing framework for scalable text retrieval over structured P2P networks. To cope with the bandwidth consumption problem that has been identified as the major obstacle for full-text retrieval in P2P networks, we truncate posting lists associated with indexing features to a constant size, storing only the top-k ranked document references. To compensate for the loss of information caused by the truncation, we extend the set of indexing features with carefully chosen term sets. Indexing term sets are selected based on the query statistics extracted from query logs to index only those combinations that are a) frequently present in user queries and b) non-redundant w.r.t. the rest of the index. The distributed index is compact and efficient as it constantly evolves, adapting to the current query popularity distribution. Moreover, it is possible to trade off the storage/bandwidth requirements against the query answering quality by tuning the indexing parameters. Our theoretical analysis and experimental results indicate that we can indeed achieve scalable P2P text retrieval for very large document collections and deliver good retrieval performance.Gleb Skobeltsyn Ecole Polytechnique Federale de Lausanne (EPFL)Toan Luu Ecole Polytechnique Federale de Lausanne (EPFL)Ivana Podnar Zarko University of ZagrebMartin Rajman Ecole Polytechnique Federale de Lausanne (EPFL)Karl Aberer Ecole Polytechnique Federale de Lausanne (EPFL)1009Providing Session Management as Core Business ServiceIt is extremely hard for a global organization with services over multiple channels to capture a consistent and unified view of its data, services, and interactions. While SOA and web services are addressing integration and interoperability problems, it is painful for an operational organization with legacy systems to quickly switch to service-based methods. We need methods to combine the advantages of traditional (i.e. web, desktop, or mobile) application development environments and service-based deployments. In this paper, we focus on the design and implementation of session management as a core service to support business processes and go beyond application-specific sessions and web sessions. We develop local session components for different platforms and complement them with a remote session service that is independent of applications and platforms.
We aim to close the gap between the two worlds by combining their performance, availability and interoperability advantages.Ismail Ari Hewlett-Packard LaboratoriesJun Li Hewlett-Packard LaboratoriesRiddhiman Ghosh Hewlett-Packard LaboratoriesMohamed Dekhil Hewlett-Packard Laboratories1010XML-Based Multimodal Interaction Framework for Contact Center ApplicationsIn this paper, we consider a way to represent contact center applications as a set of multiple XML documents written in different markups, including VoiceXML and CCXML. Applications can comprise a dialog with IVR, call routing and agent scripting functionalities. We also consider how such applications can be executed in a run-time contact center environment.Nikolay Anisimov Genesys Telecommunication LabsBrian Galvin Genesys Telecommunication LabsHerbert Ristock Genesys Telecommunication Labs1012Adaptive Record Extraction From Web PagesWe describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering for obtaining data extraction patterns. We compare our method experimentally to the state-of-the-art, and show that our approach is very competitive for rigidly-structured records (such as product descriptions) and far superior for loosely-structured records (such as entries on blogs).Justin Park University of CalgaryDenilson Barbosa University of Calgary1014Multi-factor Clustering for a Marketplace Search InterfaceSearch engines provide a small window to the vast repository of data they index and against which they search. They try their best to return the documents that are of relevance to the user. However, depending on how specific the user is, or how clear the user is on what he/she is searching for, a large number of results may be returned. Users struggle to manage this vast result set looking for the items of interest. Clustering search results is one way of alleviating this navigation pain. In this paper we describe a clustering system that enables clustering of search results in an online marketplace search system. The online marketplace we focus on poses unique challenges in indexing and findability. We describe the design and implementation of a clustering system to help cluster search results in such an environment.Neel Sundaresan eBay Research LabsKavita Ganesan eBay Research LabsRoopnath Grandhi eBay Research Labs1015Generation, Documentation and Presentation of Mathematical Equations and Symbolic Scientific Expressions Using Pure HTML and CSSThis paper describes a comprehensive method for presenting mathematical equations and expressions using only pure HTML and CSS. This method renders the equations portable and editable and contrasts with previous procedures that represent equations as whole graphic objects. Methods for generating and documenting the equations using HTML and JavaScript are also described such that the equations can be interpreted and converted to or from other formats such as LaTeX, MathML, or a linear representation.Kehinde Alabi SUNY at Stony Brook1017Web4CE: Accessing Web-based Applications on Consumer DevicesIn a world where all devices will be interconnected, the boundaries between the different devices will start to disappear. Devices will be able to access each other's applications; sessions can be suspended on one device and resumed on another device; devices can serve as each other's input and output device, and all devices will be able to connect to the Internet.
This will give true mobility to the user as he/she will not be restricted to the time and location where he/she accesses an application. Of course, we need a variety of different mechanisms and technologies to enable this, such as: an infrastructure for discovering clients and servers in a network; remote rendering of UIs on other devices in the network; mechanisms to exchange capability information between devices, and to adapt the UI based on these capabilities; mechanisms to deal with session migration; and support for a wide range of consumer devices, ranging from mobile phones to high-end TVs. This requires technologies that cross different domains, i.e. the PC domain, mobile domain, and TV domain. Several major companies within these different domains have decided to work together on these issues. One of the results is a framework for remote user interfaces for both UPnP networks and the Internet. This framework is called Web4CE (a.k.a. CEA-2014) [1], and has been accepted as the baseline remote user interface technology within the Digital Living Network Alliance (DLNA) [2], which is a large industry-wide effort for creating true interoperability between network-enabled devices. This paper provides a short overview of the Web4CE framework, and some of the use cases that it enables.Walter Dees Philips ResearchPaul Shrubsole Philips Research1018GigaHash: Scalable Minimal Perfect Hashing for Billions of URLsA minimal perfect hash function maps a static set of n keys onto the range of integers {0,1,2,...,n-1}. We present a scalable, high-performance algorithm based on random graphs for constructing minimal perfect hash functions (MPHFs). For a set of n keys, our algorithm outputs a description of h in expected time O(n). The evaluation of h(x) requires three memory accesses for any key x, and the description of h takes up 0.89n bytes (7.13n bits). This is the best (most space efficient) known result to date. Our approach reduces the space requirement to 43% of the well-known previous minimal perfect hashing (MPH) scheme due to Czech, Havas and Majewski (1992) and to 77% of a more recent algorithm due to Botelho, Kohoyakawa, and Ziviani (2005). Using a simple heuristic and Huffman coding, the space requirement is further reduced to 0.79n bytes (6.86n bits). We present a high-performance architecture that is easy to parallelize and scales well to very large data sets encountered in internet search applications. Experimental results on a one-billion-URL dataset obtained from Live Search crawl data show that the proposed algorithm (a) finds an MPHF for one billion URLs in less than 4 minutes, and (b) requires only 6.86 bits/key for the description of h.Kumar Chellapilla Microsoft Live LabsAnton Mityagin Microsoft Live LabsDenis Charles Microsoft Live Labs1019Academic Web Search Engine - Generating a Survey AutomaticallyGiven a document repository, a search engine is very helpful for retrieving information. Currently, vertical search is a hot topic, and Google Scholar is an example of academic search. However, most vertical search engines only return a flat ranked list without an effective presentation of results for users.
We study this problem and have designed a vertical search engine prototype, Dolphin, in which flexible user-oriented templates can be defined and survey-like results are presented according to the template.Ye Wang Fudan UniversityZhihua Geng Fudan UniversitySheng Huang Fudan UniversityXiaoling Wang Fudan UniversityAoying Zhou Fudan University1023Altering Document Term Vectors for Classification - Ontologies as Expectations of Co-occurrenceIn this paper we extend the state of the art in utilizing background knowledge for supervised classification by exploiting the semantic relationships between terms explicated in ontologies. Preliminary evaluations indicate that the new approach generally improves precision and recall and reveals patterns indicating the usefulness of such background knowledge.Meenakshi Nagarajan Kno.e.sis Center, Wright State UniversityAmit Sheth Kno.e.sis Center, Wright State UniversityMarcos Aguilera HP Labs, Palo Alto, CAKimberly Keeton HP Labs, Palo Alto, CAArif Merchant HP Labs, Palo Alto, CAMustafa Uysal HP Labs, Palo Alto, CA1024A Link Classification based Approach to Website Topic Hierarchy GenerationHierarchical models are commonly used to organize a Website's content. A Website's content structure can be represented by a topic hierarchy, a directed tree rooted at a Website's homepage in which the vertices and edges correspond to Web pages and hyperlinks. In this work, we propose a new method for constructing the topic hierarchy of a Website. We model the Website's link structure using a weighted directed graph, in which the edge weights are computed using a classifier that predicts if an edge connects a pair of nodes representing a topic and a sub-topic. We then pose the problem of building the topic hierarchy as finding the shortest-path tree and the directed minimum spanning tree in the weighted graph. We have done extensive experiments using real Websites and obtained very promising results.Nan Liu The Chinese University of Hong KongChristopher Yang The Chinese University of Hong Kong1026Visualizing Structural Patterns in Web CollectionsWe present a tool, DescribeX, suitable for exploring and visualizing the structural patterns present in collections of XML documents. DescribeX can be employed by developers to interactively discover, for example, those XPath expressions that will actually return elements known to occur in the collection. Many collections of XML documents present in the Web are difficult to describe because they use different schemas, the schemas used may be extended through namespaces, and the document instances are often complex and ad-hoc in structure. Collected feeds are an example of web collections that comprise documents with multiple schemas (e.g. Atom, RSS, and RDF), in multiple versions (e.g. RSS 1.0, RSS 2.0, etc.), which have been further extended by schemas from several namespaces (e.g. Dublin Core, iTunes Podcast, Microsoft Simple List Extensions). Another example not involving feeds is a collection created from traces of web service requests.M.S. Ali University of TorontoMariano Consens University of TorontoFlavio Rizzolo University of Toronto1027Towards a Scalable Search and Query Engine for the WebCurrent search engines either do not fully leverage semantically rich datasets or specialise in indexing just one domain-specific dataset.
The search engine described in this paper uses the RDF data model to enable interactive query answering over large amounts of richly structured and interlinked data collected from many disparate sources on the Web.Aidan Hogan DERI GalwayAndreas Harth Digital Enterprise Research InstituteJurgen Umbrich DERI GalwayStefan Decker DERI Galway1030Designing Efficient Sampling Techniques to Detect Webpage UpdatesDue to resource constraints, Web archiving systems and search engines usually have difficulties keeping the entire local repository synchronized with the Web. We advance the state of the art of sampling-based synchronization techniques by answering a challenging question: Given a sampled webpage and its change status, which other webpages and how many of them are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and the popularity of the webpages. We run extensive experiments on a large dataset of approximately 300,000 webpages to demonstrate that updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in terms of detecting important changes.Qingzhao Tan Pennsylvania State UniversityZiming Zhuang The Pennsylvania State UniversityPrasenjit Mitra The Pennsylvania State UniversityC. Lee Giles Pennsylvania State University1033Summarization of Online Image Collections via Implicit FeedbackThe availability of map interfaces and location-aware devices makes a growing amount of unstructured, geo-referenced information available on the Web. In particular, ten million tagged, geo-referenced photos are now available on Flickr, a photo-sharing website. We show how we analyze Flickr images to generate aggregate knowledge in the form of "representative tags" for arbitrary areas in the world. We display these tags on a map interface in an interactive web application along with images associated with each tag. We then use the aggregate user interactions with the tags and images to learn which images best describe the area shown on the map.Shane Ahern Yahoo! Research BerkeleySimon King Yahoo! Research BerkeleyMor Naaman Yahoo! Research BerkeleyRahul Nair Yahoo! Research Berkeley1034A Large-Scale Study of Robots.txtSearch engines largely rely on Web robots to collect information from the Web. Due to the unregulated open-access nature of the Web, robot activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforcement standard, ethical robots (and many commercial ones) will follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from education, government, news, and business domains. Five crawls have been conducted in succession to study the temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robot rules at the Web scale. The results also show that the usage of robots.txt has increased over time.Yang Sun The Pennsylvania State UniversityZiming Zhuang The Pennsylvania State UniversityC. Lee Giles The Pennsylvania State University1035Automatic Searching of Tables in Digital LibrariesTables are ubiquitous. Unfortunately, no search engine supports table search.
In this paper, we propose a novel table-specific search engine, TableSeer, to facilitate table extraction, indexing, searching, and sharing. In addition, we propose an extensive set of medium-independent metadata to precisely present tables. Given a query, TableSeer ranks the returned results using an innovative ranking algorithm, TableRank, with a tailored vector space model and a novel term weighting scheme. Experimental results show that TableSeer outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results.Ying Liu Pennsylvania State UniversityKun Bai Pennsylvania State UniversityPrasenjit Mitra The Pennsylvania State UniversityC. Lee Giles Pennsylvania State University1037A Novel Collaborative Filtering-Based Framework for Personalized Services in M-CommerceWith the rapid growth of wireless technologies and handheld devices, m-commerce is becoming a promising research area. Personalization is especially important to the success of m-commerce. This paper proposes a novel collaborative filtering-based framework for personalized services in m-commerce. The framework extends our previous work by using OLAP to represent the relationships among user, content and context information, and adopting a multi-dimensional collaborative filtering model to perform inference. It provides a powerful and well-founded mechanism for personalization in m-commerce. It is implemented in an existing m-commerce platform, and experimental results demonstrate its feasibility and correctness.Qiudan Li Institute of Automation, Chinese Academy of Sciences, BeijingChunheng Wang Institute of Automation, Chinese Academy of Sciences, BeijingGuanggang Geng Institute of Automation, Chinese Academy of Sciences, BeijingRuwei Dai Institute of Automation, Chinese Academy of Sciences, Beijing1038A Cautious Surfer for PageRankThis work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.Lan Nie Lehigh UniversityBaoning Wu Lehigh UniversityBrian Davison Lehigh University1039A Clustering Method for Web Data with Multi-Type Interrelated ComponentsTraditional clustering algorithms work on "flat" data, making the assumption that the data instances can only be represented by a set of homogeneous and uniform features. Much real-world data, however, is heterogeneous in nature, comprising multiple types of interrelated components. We present a clustering algorithm, K-SVMeans, that integrates the well-known K-Means clustering with the highly popular Support Vector Machines (SVM) in order to utilize the richness of the data. Our experimental results on authorship analysis of two real-world datasets show that K-SVMeans achieves better clustering performance than homogeneous data clustering.Levent Bolelli The Pennsylvania State UniversitySeyda Ertekin The Pennsylvania State UniversityDing Zhou The Pennsylvania State UniversityC. Lee Giles The Pennsylvania State University1042GeoTV: Navigating Geocoded RSS to Create an IPTV ExperienceThe Web is rapidly moving towards a platform for mass collaboration in content production and consumption from three screens: computers, mobile phones, and TVs.
While there has been a surge of interest in making Web content accessible from mobile devices, there is a significant lack of progress when it comes to making the web experience suitable for viewing on a television. Towards this end, we describe a novel concept, GeoTV, in which we explore a framework by which web content can be presented or pushed in a meaningful manner to create an entertainment experience for the TV audience. Fresh content on a variety of topics, people, and places is being created and made available on the Web at breathtaking speed. Navigating fresh content effectively on TV demands a new browsing paradigm that requires few mouse clicks or user interactions from the remote control. Novel geospatial and temporal browsing techniques are provided in GeoTV that give users the capability of aggregating and navigating RSS-enabled content in a timely, personalized and automatic manner for viewing in an IPTV environment. This poster is an extension of our previous work on GeoTracker, which utilizes both a geospatial representation and a temporal (chronological) presentation to help users spot the most relevant updates quickly within the context of a Web-enabled environment. We demonstrate 1) the usability of such a tool, which greatly enhances a user's ability to locate and browse videos based on his or her geographical interests, and 2) various innovative interface designs for showing RSS-enabled information in an IPTV environment.Yih-Farn Chen AT&T Labs - ResearchGiuseppe Di Fabbrizio AT&T Labs - ResearchDavid Gibbon AT&T Labs - ResearchRittwik Jana AT&T Labs - ResearchSerban Jora AT&T Labs - ResearchBernard Renger AT&T Labs - ResearchBin Wei AT&T Labs - Research1044Behavior Based Web Page EvaluationThis paper describes our efforts to investigate factors in users' browsing behavior in order to automatically evaluate web pages that the user shows interest in. To evaluate web pages automatically, we developed a client-side logging/analyzing tool: the GINIS Framework. This work focuses primarily on client-side user behavior using a customized web browser and AJAX technologies. First, GINIS unobtrusively gathers logs of user behavior through the user's natural interaction with the web browser. Then it analyses the logs and extracts effective rules to evaluate web pages using the C4.5 machine learning system. Eventually, GINIS becomes able to automatically evaluate web pages using these learned rules.Ganesan Velayathan The Graduate University for Advanced Studies, National Institute of Informatics (NII)Seiji Yamada National Institute of Informatics1046Tag Clouds for Summarizing Web Search ResultsIn this paper, we describe an application, PubCloud, that uses tag clouds for the summarization of results from queries over the PubMed database of biomedical literature. PubCloud responds to queries of this database with tag clouds generated from words extracted from the abstracts returned by the query. The results of a user study comparing the PubCloud tag-cloud summarization of query results with the standard result list provided by PubMed indicated that the tag cloud interface is advantageous in presenting descriptive information and in reducing user frustration but that it is less effective at the task of enabling users to discover relations between concepts.Byron Kuo iCAPTURE Centre, UBCThomas Hentrich iCAPTURE Centre, UBC and Simon Fraser UniversityBenjamin M.
Good University of British ColumbiaMark Wilkinson iCAPTURE Centre, UBC1047Extensible Schema Documentation with XSLT 2.0XML Schema documents are defined using an XML syntax, which means that the idea of generating schema documentation through standard XML technologies is intriguing. We present X2Doc, a framework for generating schema documentation solely through XSLT. The framework uses SCX, an XML syntax for XML Schema components, as an intermediate format and produces XML-based output formats. Using a modular set of XSLT stylesheets, X2Doc is highly configurable and carefully crafted towards extensibility. This proves especially useful for composite schemas, where additional schema information such as Schematron rules is embedded into XML Schemas.Felix Michel ETH ZurichErik Wilde UC Berkeley1048Anchor-based Proximity MeasuresWe present a family of measures of proximity of an arbitrary vertex in a directed graph to a pre-specified subset of vertices, called the anchor, based on two different uses of the connectivity structure of the graph. We consider a web-specific application of the above measures with two disjoint anchors ("good" and "bad" web pages) and study the accuracy of these measures in this context.Amruta Joshi Yahoo! ResearchRavi Kumar Yahoo! ResearchBenjamin Reed Yahoo! ResearchAndrew Tomkins Yahoo! Research1053Extending WebML towards Semantic WebAvailable methodologies for developing Semantic Web applications do not fully exploit the potential deriving from interaction with ontological data sources. Here we introduce an extension of the WebML modeling framework to fulfill most of the design requirements emerging for the new area of the Semantic Web. We generalize the development process to support Semantic Web applications and we introduce a set of new primitives for ontology importing and querying.Federico Michele Facca Politecnico di MilanoMarco Brambilla Politecnico di Milano1056Sliding Window Technique for the Web Log AnalysisThe results of Web query log analysis may be significantly shifted depending on the fraction of agents (non-human clients) that were not excluded from the log. To detect and exclude agents, Web log studies use threshold values for the number of requests submitted by a client during the observation period. However, different studies use different observation periods, and a threshold assigned to one period is usually incomparable with a threshold assigned to another period. We propose a uniform method that works equally well across different observation periods. The method is based on the sliding window technique: a threshold is assigned to the sliding window rather than to the whole observation period. In addition, we estimate the sub-optimal values of the parameters of the method: the window size and the threshold.Nikolai Buzikashvili Institute of System Analysis, Russian Academy of Science1057Parallel Crawling for Online Social NetworksGiven a huge online social network, how do we retrieve information from it through crawling? Even better, how do we improve the crawling performance by using parallel crawlers that work independently of each other? In this paper, we present a framework of parallel crawlers for online social networks, utilizing a centralized queue. To show how this works in practice, we describe our implementation of the crawlers for an online auction website. The crawlers work independently; therefore, the failure of one crawler does not affect the others at all. The framework ensures that no redundant crawling would occur.
Using the crawlers that we built, we visited a total of approximately 11 million auction users, about 66,000 of whom were completely crawled.Duen Horng Chau Carnegie Mellon UniversityShashank Pandit Carnegie Mellon UniversitySamuel Wang Carnegie Mellon UniversityChristos Faloutsos Carnegie Mellon University920Collaborative ICT for Indian Business ClustersIndian business clusters have contributed immensely to the country's industrial output, poverty alleviation and employment generation. However, with recent globalization, these clusters can lose out to international competitors if they do not continuously innovate and take advantage of the new opportunities that are available through economic liberalization. In this paper, we discuss how information and communication technologies (ICT) can help in improving the productivity and growth of these clusters.Soumya Roy Motorola India Research LabsShantanu Biswas Motorola India Research Labs