William W. Cohen
Biography
William Cohen received his bachelor's degree in Computer Science from
Duke University in 1984, and a PhD
in Computer Science from Rutgers
University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from
April 2000 to May 2002 Dr. Cohen worked at Whizbang Labs, a company
specializing in extracting information from the web. Dr. Cohen is a
member of the board of the International Machine Learning
Society, is an Associate Editor for the journal Artificial
Intelligence, and an action editor for the Journal of Machine Learning
Research. In the past he has also served as an action editor
for the journal Machine
Learning and the Journal of
Artificial Intelligence Research. He was General Chair for the
2008 International Machine
Learning Conference, held July 6-9 at the University of Helsinki,
in Finland;
Program Co-Chair of the 2006 International
Machine Learning Conference; and Co-Chair of the 1994
International Machine Learning Conference. Dr. Cohen was also the
co-Chair for the 3rd
Int'l AAAI Conference on Weblogs and Social Media, which was held
May 17-20, 2009 in San Jose. He is a AAAI Fellow, and
in 2008 he won the SIGMOD "Test of
Time" Award for the most influential SIGMOD paper of 1998.
Dr. Cohen has also served on more than 20 other program committees or
advisory committees.
Dr. Cohen's research interests include information integration and
machine learning, particularly information extraction, text
categorization and learning from large datasets. He holds seven
patents related to learning, discovery, information retrieval, and
data integration, and is the author of more than 100 publications.
I'm currently involved with:
- Querendipity, an adaptive personal
information management system for biologists.
- SEAL, a Google-Sets-like bootstrapping tool written by my former student Richard Wang.
- SimStudent, a project that adds learning-by-demonstration to CTAT.
- SLIF, a system that analyzes the text and images
in online journal articles to find information about the subcellular localization of proteins.
- Minorthird, an
open-source Java package of information extraction software.
Demos:
-
Measure twice, cut once - Vitor and Ramnath have developed a Thunderbird
plugin that implements recipient
recommendation and leak
detection for email. It modifies Thunderbird by adding an
additional pane that pops up after you send a message, giving you one
final chance to fix any errors in your recipient list. There's a
brief writeup on how to use it, but it's
pretty self-explanatory: just download it, open Thunderbird, and go to
the tools->addon menu to install. After you've installed it, you
train by opening your folder of "Sent" mail and pressing the "train"
button. (This took about an hour for my 9000+ old messages.)
-
Ramesh
Nallapati has put together two nice demos of his multiscale topic tomography topic-modeling technique, one
for articles from Science,
and one with cancer-related
articles from PubMed.
-
Here are two movies that demo SimStudent, a programming-by-demonstration
system for constructing cognitive tutors, built by Noboru Matsuda.
Software:
- Minorthird is an
open-source Java package of information extraction and text
classification learning tools.
-
I am now distributing a standalone tool, built on Minorthird, for
annotating biomedical text. This is particularly aimed at annotating
figure captions but might be useful for other text as well. The jar file for this is rather large
(17M), as it includes a Minorthird jar. There is documentation available for this,
and some sample data.
-
My former student Vitor Carvalho distributes the poetically named Jangada and
Ciranda,
which are also standalone apps built on top of Minorthird, to analyze
email messages.
-
SecondString is
another open-source Java package, of approximate string matching
techniques.
- SLIPPER and WHIRL are
now being distributed via Rutgers University. They are free for research
purposes.
- To get a copy of RIPPER, please send mail to my evil twin brother,
wcohen -AT- gmail.com.
As an alternative to that ancient code: I haven't used it myself, but
I've heard good things about
J-RIP, a Ripper clone written for WEKA.
The following datasets are available for anyone to use for research
purposes:
-
100,000+ bibliography entries, in the original BibTeX format, converted to an EndNote-like format, and in a featurized format, for experiments with matching (60M).
-
A 56k-node, 200k-edge graph containing data from SGD and PubMed, used in Querendipity.
- 617
messages from 20 Newsgroups, annotated for reply bodies and
signatures, prepared by my former student Vitor Carvalho
-
Two subsets of the Enron data, annotated with person names,
prepared by my student Einat
Minkov.
- Enron email dataset
(400Mb, once you get there) contains 800,000+ emails from 150+ users
organized into 4700+ folders.
- Some more email data: about two
thousand messages released to the public as part of the ongoing investigation
of US Attorney firings at the Dept of Justice. This is very
strange data---the original email is released as scanned printouts in
PDF (?!), so most of the text is not available. There are links to
copies of the PDF, some manually added annotations, and an (apparently
manually reconstructed) social network graph. About 1.5Mb (in Excel
format). From Mark
Johnson, and a network of volunteers.
- A collection of various extraction datasets
in Minorthird format (6Mb), including about 1000 Enron emails tagged
for person names and temporal expressions.
- classify.tar.gz (0.4Mb) contains
nine problems in which the goal is to classify short entity names.
This data was used in Joins that Generalize: Text Classification
Using WHIRL (KDD-98).
- ranking.tar.gz (8Mb) contains the
data used for the meta-search experiments in my JAIR paper Learning to Order
Things (with Rob Schapire and Yoram Singer).
- match.tar.gz (0.7Mb) contains a suite of
labeled entity-name matching and clustering problems
(i.e. problems for which the correct matches/clusters are provided),
in a single consistent format. In most cases WHIRL's performance is
given as a benchmark. These are also distributed in the RIDDLE
Repository. Extraction-oriented versions of some of this data (i.e.,
represented as a problem of extracting data from a website, rather
than matching two datasets) are available on the RISE Repository.
- whirl-bench.tgz (1.1Mb) contains some
more WHIRL-format entity name matching problems.
- Predictively Modeling Social Media,
invited talk given at
the 1st International Workshop on Mining Social Media, co-located with the 13th Conference of the Spanish Association for Artificial Intelligence (CAEPIA-TTIA 2009).
- Matching and clustering product descriptions
using learned similarity metrics, invited talk given at
the IJCAI 2009 Workshop on Information Integration on the Web, July 2009. (Powerpoint; 6.7M)
- Open information extraction talks:
- Embodied Cognition and Knowledge:
Integration of Heterogeneous Databases without Common Domains Using
Queries Based on Textual Similarity, talk given for my 10-year
"Test of Time" Award at SIGMOD-2008 (Powerpoint; 11Mb)
- Using Machine Learning to Discover
and Understand Structured Data, invited talk given at LinkedData
2008. (Powerpoint; 6Mb)
- Machine Learning for Personal Information
Management, invited talk given at ICMLA-2007. (Powerpoint; 8Mb)
- A Framework for Learning to Query Heterogeneous Data,
invited talk given at IQIS 2006. (Powerpoint; 8Mb)
- On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Actual Data, invited talk given
at DB/IR Day 2006. (Powerpoint; 6Mb)
- A Century Of Progress On Information
Integration: A Mid-Term Report, an overview of information
integration, focusing modestly on my own work, given as an invited
talk at WebDB-2005. (Powerpoint;
12Mb)
- Tutorials:
- Information extraction (PowerPoint;
4.8Mb), aimed at folks somewhat familiar with statistical NLP
methods. And thanks to Thierry Poibeau, there's also a version en francais (did I get that right, Thierry?)
Two earlier versions of this are also still around, both
given with Andrew McCallum at recent conferences, KDD-2003 (PowerPoint; 6.8Mb) and NIPS-2002.
- Text classification
(PowerPoint; 3Mb), given at a CALD Summer Course.
- Collaborative
filtering (PowerPoint; 9.1Mb), given at a DIMACS workshop.
- A mini-course on record linkage and matching:
- Other technical talks:
-
Spring 2010: 10-802 (Analysis of Social Media), 10:30-11:50pm Tues & Thurs, 4102 Gates Building.
-
Fall 2009: 10-707
(Information Extraction), 1:30-2:50pm Mon & Wed, 5222 Gates
Building.
-
Spring 2008: 10-601 (Machine Learning)
with Tom Mitchell, on 3-4:30
Mon & Wed in Wean Hall 5409.
-
Fall 2007: Analysis of Social
Media, Machine Learning 10-802 and LTI 11-772, with Natalie Glance
(of Google Pittsburgh) - a brand-new seminar course. 4:30-6:30
Tuesdays in Wean Hall 4623.
- Note: This site is the shattered remains of a once-beautiful wiki,
created by the students of 10-802, generously hosted for free by
ScribbleWiki, tragically lost (due to
a combination of RAID drive failures and low-bidder backup schemes),
and then largely recovered using
Warrick
from various internet caches and archives.
-
Fall 2007: Current Topics
in Computational Biology (Journal Club), 02-701. (Announcements). Thursdays from 4:00-5:00 in 411
Mellon Institute (after Cell & Systems Modeling).
-
Spring 2007: Information Extraction, Machine
Learning 10-707 and LTI 11-748 - back by popular demand for the first time since 2004!
-
Fall 2006: Current Topics in Computational Biology (Journal Club), 02-701.
(Announcements)
-
Spring 2006: Read the Web, CALD 10-709.
-
June 21,23,25, 2005: A mini-course on Minorthird. Materials are below.
- Slides, notes, and sample files from first
day's lecture.
- Slides, notes, and sample files from second
day's lecture.
- Powerpoint slides from third
day's lecture.
- Jar file for minorThird, if you
only want to run the code, not compile it or read it.
The installation process here is:
- Install Java 1.4 or higher (actually, JRE is all you need).
- Download the jar for minorThird
and stick it in some directory.
- Optionally, download the sample data
repository and unpack it into the same directory.
- Change to that same directory and
then run Minorthird with the command
java -Xmx500M -jar minorthird.jar
What will pop up will be a small launch pad that can be used to
start any of the UI programs. You can also start a particular
main by specifying minorthird.jar as your classpath, for
instance:
java -Xmx500M -cp minorthird.jar edu.cmu.minorthird.ui.Help
- If you want to do a real install, here's the home page on Sourceforge, and
a document on how to do a CVS
install of Minorthird.
- Spring 2004: "Learning to Turn Words into Data:
Machine Learning Approaches to Information Extraction and Information Integration", CALD 10-707 and LTI 11-748.
- Here's an RSS feed of my papers, created with Dapper. Here's a pointer to my DBLP page.
- A Computer Scientist's Guide To Biology is no longer
available from this web page, but is now available from Springer. Here are the TOC,
introduction, index, and a sample chapter, from a late draft of
the book; and also all the figures
from the book in PowerPoint and all the figures in
PDF. (The figures are a little prettier than the ones in the
final book, which is black and white, not color).
- ICML
2006 Proceedings are available in print, for the true aficionado
of fine learning-related research. It's well worth the money for the
cover art alone (of course, all the papers are also available on-line
for free.)
- Recent and selected publications. These
are some representative publications for which on-line copies can be
distributed.
- All publications. Here is a more-or-less
complete chronological list of my publications. The bibliography
includes pointers to on-line versions when I can provide them, but
unfortunately copyright restrictions don't allow me to make all of my
publications available on-line. Of course, reprints are always
available from me on request.
- Publications by topic:

Recent papers I'm keeping in HTML or PDF (which requires Adobe
Acrobat Reader to view). Older papers are mostly in Postscript.
For Windows, I use the GSView reader for
postscript. Most of these papers are viewable in several formats in
ResearchIndex.
- Ramnath Balasubramanyan, LTI PhD student
- Frank Lin, LTI PhD student
- Bhavana Dalvi, LTI PhD student
(co-advised with Jamie Callan)
- Ni Lao, LTI PhD student
(co-advised with Eric Xing)
- Nan Li, HCII PhD student
(co-advised with Ken Koedinger)
- Tae Yano, LTI PhD student
(co-advised with Noah Smith)
- Katie Rivard, research programmer/analyst
- Richard C. Wang,
(former LTI PhD student co-advised with Bob Frederking, now at Google).
- Andrew Arnold
(former MLD PhD student, now at WorldQuant)
- Noboru Matsuda
(former postdoc, co-supervised with Ken Koedinger,
now System Scientist in CMU's HCII)
- Einat Minkov-Manela (formerly Einat Minkov,
former LTI PhD student, now at Nokia)
- Vitor Rocha de Carvalho (former LTI PhD student, now at Microsoft)
- Ja-Hui Chang
(visiting faculty from National Central University, Taiwan, 2007-2008)
- Zhenzhen Kou (former MLD PhD student, now at Yahoo!)
- Gustavo Lacerda
(former research assistant, co-supervised with Noboru Matsuda and Ken Koedinger, now at UBC)
- Ramesh Nallapati
(former postdoc, co-supervised with John Lafferty, now at Stanford)
- Edoardo Airoldi
(former MLD/Stats PhD student, co-advised with Steve Fienberg)
- Pradeep Ravikumar
(former MLD PhD student, co-advised with Steve Fienberg)
- I have been an external committee member for the PhD theses of John Zelle (degree from U
Texas), Misha
Bilenko (from U Texas), Daniel Kudenko
(Rutgers), Chumki Basu (Rutgers), Ananlada Chotimongkol (CMU), Wei-Hao
Lin (CMU), Cenk Gazen (CMU), David Nadeau (U Ottawa), and Ben van
Durme (Rochester), and the Master's theses of Mehrbod Sharifi (CMU) and
Weam Abu-Zaki (CMU). I am currently an external committee member for
Jon Elsas (CMU), Partha Talukdar (U Penn),
Andy Carlson (CMU), Swapna Sundaran (U
Pitt), and Michael
Heilman (CMU).
- I also have collaborated recently and frequently with Tom Mitchell, Bob Murphy,
and Anthony Tomasic.
William Cohen
Associate Research Professor
Machine Learning Department
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213
Wean Hall 5317 Gates 8217
(shipping address: 6105 Gates Hillman Complex)
412-268-7664 (voice) / 412-268-2205 (fax)
Assistant: Sharon Cavlovich, sharonw+@cs.cmu.edu, 412-268-5196
Official CMU Contact Info
My preferred email address is: wcohen AT cs DOT cmu DOT edu
Obscure fact: two of my papers made Citeseer's list of most-cited machine learning papers,
and one made the list of most-cited database papers.
For those many friends whose research I have built on, be warned.
My full name, "William Weston Cohen", is an anagram of the phrase "I
now cite shallow men". (From Sara Cohen - no
relation! - comes this warning: "Women's rights activists would
probably request you to use the following anagram instead: 'I shall
now cite women'".)
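For the skeptical, both anagram claims are easy to check mechanically. Here is a minimal sketch in Python; the helper names `letters` and `is_anagram` are made up for illustration, and the check simply compares letter counts while ignoring case, spaces, and punctuation.

```python
from collections import Counter


def letters(text: str) -> Counter:
    """Count the alphabetic characters in text, ignoring case."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def is_anagram(a: str, b: str) -> bool:
    """Return True if a and b use exactly the same letters the same number of times."""
    return letters(a) == letters(b)


FULL_NAME = "William Weston Cohen"

# The two phrases quoted in the text above.
for phrase in ("I now cite shallow men", "I shall now cite women"):
    print(f"{phrase!r}: {is_anagram(FULL_NAME, phrase)}")
```

Both phrases should come out as anagrams of the full name, since each uses the same 18 letters.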
I am often praised for my highly artistic and functional web site
designs. An example is the site for SC Indexing, a professional book
indexer. However, I accept few clients - this one happens to be
my wife.
Through my advisor, Alex Borgida, I can trace my "academic lineage" back to luminaries like
Leibniz and Alfred Whitehead.
Poetry anyone?