BioNLP.org exists to help
researchers in their work on natural language processing (NLP) for
articles in the biomedical literature.
This is a brand-new version of the site, simple and packed with
links to resources - far different from the previous one.
This site was started by Bob Futrelle in
early 2001. Bob retired from Northeastern University in
2011 and is hard at work developing his own NLP system.
The BioNLP mailing list
The list is probably the most useful day-to-day resource. The
message archive contains more than 2,000 messages starting in
2001. The list includes announcements, discussions, and pointers
to resources such as software, text databases, and conferences.
You can join the mailing list here: http://mail.bionlp.org/mailman/listinfo/bionlp_bionlp.org
The message archive is here: http://mail.bionlp.org/pipermail/bionlp_bionlp.org/
The archives are indexed by Google, and searchable here:
PubMed itself supports limited phrase search, sometimes
reporting, "Quoted phrase not found.", even when Google finds
this email archive item for further comments on the
Google Scholar below does a good job of harvesting papers on the web,
including references to them.
GATE - General Architecture for Text Engineering (The
University of Sheffield)
GATE is a mature, powerful, and widely used
system for working with text.
It is free and open source. There is substantial
documentation including numerous courses.
NLTK - The Natural Language Toolkit - A free Python-based
set of tools
The site http://nltk.org
has not been responsive for me (mid-May 2013).
But the following Google Site appears to have almost
and points to large collections of code, data,
documentation, courses, and more.
There is an
excellent book that leads the reader through using the system,
along with explaining numerous aspects of natural language
processing.
The National Centre for Text Mining
(NaCTeM) (University of Manchester)
NaCTeM is the first publicly funded text
mining centre in the world.
The website includes links to text mining
services provided by NaCTeM; software tools, both those
developed by the NaCTeM team and by other text mining groups;
seminars, general events, conferences and workshops; tutorials
and demonstrations; text mining publications.
An annotated list of NLP and corpus resources from
The list is extensive, 600 lines long, and reasonably
up to date.
Stanford's own software is written in Java: http://nlp.stanford.edu/software/
Linguistic Data Consortium (LDC) (University of Pennsylvania)
The LDC supports language-related education,
research and technology development by creating and sharing
linguistic resources: data, tools and standards. A number
of their largest resources are available for a fee or through
Freely available collections of
BioMed Central data mining site
As of 16 May 2013 BioMed Central (with Chemistry Central and
SpringerOpen) has published 160,020 articles of peer-reviewed
research, all of which are covered by our open access license
agreement which allows free distribution and re-use of the
full-text article, including the highly structured XML
version. The entire XML set can be downloaded as a zip
(I use the XMLs in my personal research after
applying their XSLT preview stylesheet. - Bob Futrelle)
The PubMed Central Open Access Subset (ncbi.nlm.nih.gov)
This contains additional articles beyond the large BioMed
They offer four tar.gz files containing XML (and only XML) for
all the articles in the PMC open access subset.
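Once one of these tar.gz files has been downloaded, the XML files can be read directly from the archive without unpacking it to disk. The following is a minimal sketch using only the Python standard library. The function name iter_article_titles is hypothetical, and the element name article-title is an assumption about the XML schema, so check it against the actual files before relying on it.

```python
import tarfile
import xml.etree.ElementTree as ET


def iter_article_titles(archive):
    """Yield (member_name, title) for each XML file in a tar.gz archive.

    `archive` may be a filesystem path or an open binary file object.
    The <article-title> element name is an assumption about the schema.
    """
    kwargs = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        for member in tar:
            # Skip directories and anything that is not an XML file.
            if not (member.isfile() and member.name.endswith(".xml")):
                continue
            root = ET.parse(tar.extractfile(member)).getroot()
            title = root.find(".//article-title")
            # Titles may contain inline markup, so join all nested text.
            text = "".join(title.itertext()).strip() if title is not None else None
            yield member.name, text
```

Files that rely on entities defined in an external DTD may fail to parse with this simple approach; a more tolerant parser may be needed in that case.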
Finding BioNLP-related conferences
A useful strategy is to search the BioNLP mail
archives for terms such as 'Conference', 'Workshop', or 'Proceedings'.
Adding a year to your search terms can help to narrow the
results.
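For anyone working from a local copy of the archive rather than a web search, the same strategy can be sketched in a few lines of Python. This is only an illustration: the function name find_messages is hypothetical, and the sample subject lines are made up for the example, not taken from the real archive.

```python
import re


def find_messages(subjects, terms, year=None):
    """Return the subjects that mention any of the terms (case-insensitive).

    If `year` is given, the subject must also contain that year.
    """
    term_pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    year_pattern = re.compile(r"\b%d\b" % year) if year is not None else None
    hits = []
    for subject in subjects:
        if not term_pattern.search(subject):
            continue
        if year_pattern and not year_pattern.search(subject):
            continue
        hits.append(subject)
    return hits


# Made-up subject lines, for illustration only.
subjects = [
    "CFP: BioNLP Workshop 2013",
    "New corpus released",
    "Proceedings of the 2012 workshop now online",
]
print(find_messages(subjects, ["Conference", "Workshop", "Proceedings"], year=2013))
```

Without the year argument, both the workshop announcement and the proceedings notice match; adding the year keeps only the first.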
Some notable books
- Daniel Jurafsky and
James H. Martin (2008). Speech and Language Processing,
2nd edition. Pearson Prentice Hall. ISBN 978-0-13-187321-6.
- Christopher D. Manning
and Hinrich Schütze (1999). Foundations of Statistical
Natural Language Processing. The MIT Press. ISBN
- Steven Bird, Ewan Klein, and Edward Loper
(2009). Natural Language Processing with Python.
O'Reilly Media. ISBN 978-0-596-51649-9.
- Christopher D. Manning,
Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction
to Information Retrieval. Cambridge University
Press. ISBN 978-0-521-86571-5. Official html and pdf versions
available without charge.
- Ian H. Witten, Eibe
Frank, and Mark A. Hall (2011). Data Mining: Practical
Machine Learning Tools and Techniques, Third Edition.
Morgan Kaufmann. ISBN 978-0-12-374856-0.
Two relational database systems -
MySQL and PostgreSQL
Beyond these there is a slew of NoSQL
There are dozens of books about both of these systems.
data that's worth anything needs to be persisted.
"MySQL Community Edition is a freely downloadable version of the
world's most popular open source database ...."
"PostgreSQL is a powerful, open source object-relational
database system. It has more than 15 years of active
development and a proven architecture that has earned it a
strong reputation for reliability, data integrity, and
(I now use PostgreSQL thanks to the prompting
of my son, Joe
Futrelle. Works for me. It includes its own
GUI management tool, pgAdmin3. I typically use only two
fields per table, a column-oriented approach. The manual
for PostgreSQL is extensive, more than 2,000 pages! There's a
nice little PostgreSQL book that I find useful: PostgreSQL:
Up and Running, http://shop.oreilly.com/product/0636920025061.do)
Site updated May 16, 2013 by
Bob Futrelle - Developed using SeaMonkey and BBEdit.
email: bob then dot then futrelle at gmail.com