Ph.D. Candidate (my Curriculum Vitae and Research Statement)
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
5000 Forbes Ave., Newell-Simon Hall, Pittsburgh, PA 15213, USA
jielu@cs.cmu.edu

I am a Ph.D. candidate in the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University.
I am a member of the Distributed Information Retrieval Group, led by my advisor, Prof. Jamie Callan.

I expect to graduate in May 2007. I will then join the Intelligent Multimedia Interaction group at the IBM T. J. Watson Research Center, where my work will focus on personalized and context-sensitive information retrieval.
Here are my curriculum vitae (html, pdf, doc, txt), research statement (html, pdf, doc, txt), reference list (html, pdf, doc, txt), and a PDF file that includes all three of them.

Here are some numbers to reach me (I assume that you can read a clock):
Office:
Home:
Here is a sketch of what I look like at work (still true after all these years):
My favorite gift shop -- Little Apple at Neverland
Last updated in March 2007.

Here are some of the places I have been in the U.S., Canada, and Europe. About one third of the trips were for "business" (attending conferences, project meetings, etc.), and the rest were just for fun. Although visiting these places was fun, especially those outside the U.S., the visas I sometimes needed in order to go and return were no fun at all.

Amsterdam (2004, 2005) | Brussels (2004) | Paris (2002, 2004)
London (2004) | Toronto (2003, 2004) | Santiago de Compostela, Spain (2005)

New Hampshire | Massachusetts | Connecticut | Washington DC | Maryland | Rhode Island
New York | Pennsylvania | Ohio | Tennessee | Louisiana | Florida | California | Colorado

 

Publications

"Content-based peer-to-peer network overlay for full-text federated search" (in press) by Jie Lu and Jamie Callan.
8th RIAO Conference on Large-Scale Semantic Access to Content (RIAO '07), 2007.

"User modeling for full-text federated search in peer-to-peer networks" by Jie Lu and Jamie Callan.
29th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'06), 2006.

"Full-text federated search of text-based digital libraries in peer-to-peer networks" by Jie Lu and Jamie Callan.
Journal of Information Retrieval, Volume 9, Number 4, 2006.

"Combining multiple resources, evidences and criteria for genomic information retrieval" by Luo Si, Jie Lu and Jamie Callan.
Text Retrieval Conference (TREC'06), 2006.

"Full-text federated search in peer-to-peer networks" by Jie Lu.
Technical report CMU-LTI-05-197, Language Technologies Institute, Carnegie Mellon University, 2005.

"Federated search of text-based digital libraries in hierarchical peer-to-peer networks" by Jie Lu and Jamie Callan.
27th European Conference on Information Retrieval Research (ECIR'05), 2005.

"Federated search of text-based digital libraries in hierarchical peer-to-peer networks" by Jie Lu and Jamie Callan.
Peer-to-Peer IR Workshop of the 27th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), 2004.

"Merging retrieval results in hierarchical peer-to-peer networks" by Jie Lu and Jamie Callan.
27th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), 2004.

"Content-based retrieval in hybrid peer-to-peer networks" by Jie Lu and Jamie Callan.
12th International Conference on Information and Knowledge Management (CIKM'03), 2003.

"Distributed information retrieval with skewed database size distributions" by Luo Si, Jie Lu and Jamie Callan.
National Conference on Digital Government Research (dg.o2003), 2003.

"Reducing storage costs for federated search of text databases" by Jie Lu and Jamie Callan.
National Conference on Digital Government Research (dg.o2003), 2003.

"Pruning long documents for distributed information retrieval" by Jie Lu and Jamie Callan.
11th International Conference on Information and Knowledge Management (CIKM'02), 2002.

Research

Federated search in distributed environments
My dissertation research develops an integrated framework of network overlay, network evolution, and search models for full-text ranked retrieval in P2P networks. Multiple directory services maintain full-text representations of resources located in their network neighborhoods, and provide local resource selection and result merging services. The network overlay model defines a network structure that extends previous peer functionalities and integrates search-enhancing properties of interest-based locality, content-based locality, and small-world to explicitly support full-text federated search. The network evolution model provides autonomous and adaptive topology evolution algorithms to construct a network structure with the desired content distribution, navigability, and load balancing, without centralized control or semantic annotations. The network search model addresses the problems of resource representation, resource selection, and result merging based on the unique characteristics of P2P networks, and balances effectiveness against cost. The framework is a comprehensive and practical solution to full-text ranked retrieval in large-scale, distributed, and dynamic environments with heterogeneous, open-domain content.
The models developed as integrated parts of the framework for full-text federated search in P2P networks can also be used in other applications, such as organizing the server farm in large-scale centralized search, managing online communities and social networks, and improving meta-search and personalized search.

Genomic information retrieval
My work in genomic information retrieval focuses on combining multiple resources, evidence, and criteria for query expansion and result ranking. Acronyms, aliases, and synonyms are extracted from external biomedical resources such as AcroMed, LocusLink, and UMLS to create lexicons. Based on the associations among terms and phrases in these lexicons, several term-weighting schemes are designed to assign weights to expansion terms from different sources. For result ranking, different scoring criteria are used to evaluate evidence at the document, passage, and term-matching granularities, and the resulting scores are combined using a weighted linear combination to produce the final ranking. Evaluation results show that the query expansion technique based on external biomedical resources is effective, and that result ranking that combines multiple scoring criteria and evidence consistently performs better than result ranking based on a single criterion.
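As a rough illustration of the weighted linear combination mentioned above, the sketch below combines per-document scores from three granularities into one ranking score. The function names, the criterion keys ("document", "passage", "term"), and the example weights are hypothetical and chosen only for illustration; they are not the weights or the scoring functions used in the actual system.

```python
from typing import Dict, List


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear combination of per-criterion scores for one document."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return document ids sorted by combined score, best first."""
    return sorted(
        candidates,
        key=lambda doc_id: combine_scores(candidates[doc_id], weights),
        reverse=True,
    )


# Hypothetical weights and scores, for illustration only.
example_weights = {"document": 0.5, "passage": 0.3, "term": 0.2}
example_candidates = {
    "d1": {"document": 1.0, "passage": 0.0, "term": 0.0},
    "d2": {"document": 0.4, "passage": 0.8, "term": 0.9},
}
example_ranking = rank(example_candidates, example_weights)
```

In this toy example "d2" outranks "d1" even though "d1" has the higher document-level score, because the passage and term-matching evidence also contribute to the combined score.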

Automatic duplicate detection
The task of duplicate detection in large public comment datasets is to detect exact-duplicate and near-duplicate documents in comments made by the public about proposed federal regulations. Exact-duplicate and near-duplicate comments are typically created by copying and editing form letters provided by organized interest groups and lobbies. To utilize the domain knowledge about the creation process of duplicate documents, a new fuzzy match edit operation is introduced in my work to match sentences with minor word differences. The degree of fuzzy match between sentences is measured using traditional information retrieval techniques. A modified edit distance method is proposed to compare documents at the sentence granularity based on the edit operations of substitution, insertion, deletion, and fuzzy match. By combining the complementary strengths of a similarity-based approach commonly used in IR (flexibility and efficiency) and a string-based approach which measures the effort required to transform one document into another (accuracy), more effective and robust performance can be achieved for detecting near duplicates.
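The sentence-level edit distance with a fuzzy match operation can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method as implemented in my work: the fuzzy match degree is approximated here by cosine similarity over raw term counts, and the threshold of 0.8, the unit costs for substitution, insertion, and deletion, and the fuzzy match cost of one minus the similarity are all hypothetical choices.

```python
import math
import re
from collections import Counter
from typing import List


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the term-count vectors of two sentences."""
    ta = Counter(re.findall(r"\w+", a.lower()))
    tb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(
        sum(v * v for v in tb.values())
    )
    return dot / norm if norm else 0.0


def sentence_edit_distance(
    doc_a: List[str], doc_b: List[str], fuzzy_threshold: float = 0.8
) -> float:
    """Edit distance over sentences with an extra fuzzy match operation."""
    n, m = len(doc_a), len(doc_b)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)  # delete all sentences of doc_a
    for j in range(1, m + 1):
        dist[0][j] = float(j)  # insert all sentences of doc_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sa, sb = doc_a[i - 1], doc_b[j - 1]
            sim = cosine(sa, sb)
            if sa == sb:
                pair_cost = 0.0        # exact match
            elif sim >= fuzzy_threshold:
                pair_cost = 1.0 - sim  # fuzzy match (assumed cost)
            else:
                pair_cost = 1.0        # substitution
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,            # deletion
                dist[i][j - 1] + 1.0,            # insertion
                dist[i - 1][j - 1] + pair_cost,  # match, fuzzy match, or substitution
            )
    return dist[n][m]


# Hypothetical form letter and a lightly edited copy.
form_letter = ["I oppose the proposed rule.", "Please protect our rivers."]
edited_copy = ["I strongly oppose the proposed rule.", "Please protect our rivers."]
near_duplicate_distance = sentence_edit_distance(form_letter, edited_copy)
```

With these assumed costs, a sentence that differs by one added word costs far less than a full substitution, so a lightly edited form letter stays close to its source while unrelated documents remain far apart.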