


Text Mining

Text mining and computational analysis of textual materials are available to researchers at Georgetown University. Some of the services the Library subscribes to and recommends are listed below.

Tools for text mining are listed first, followed by corpora for text mining.


HathiTrust Research Center

The HTRC is a tool for beginner to intermediate text mining. It enables computational analysis of nearly 3 million texts in the HathiTrust Digital Library as well as texts uploaded by the user.

Voyant

Voyant is a tool for beginner to intermediate text mining. It is web-based and allows basic text mining of texts uploaded by the user. Users can also enter a URL to text mine an entire webpage.

SameDiff

SameDiff is a tool for beginner text mining. It is a web-based tool for comparing two texts. It generates lists of the words the two texts share and the words distinct to each. It also uses a cosine similarity algorithm to rate the similarity of the documents.
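
For readers curious about what such a comparison involves, the following Python sketch shows one common way to list shared and distinct words and to compute cosine similarity from raw word counts. It is an illustration only: the function names `tokenize` and `compare_texts` are hypothetical, the tokenization rule is an assumption, and SameDiff's own implementation may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def compare_texts(text_a, text_b):
    """Return (common words, words only in A, words only in B, cosine similarity)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    common = sorted(set(counts_a) & set(counts_b))
    only_a = sorted(set(counts_a) - set(counts_b))
    only_b = sorted(set(counts_b) - set(counts_a))

    # Cosine similarity: dot product of the two word-count vectors
    # divided by the product of their lengths (Euclidean norms).
    dot = sum(counts_a[word] * counts_b[word] for word in common)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    similarity = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    return common, only_a, only_b, similarity


common, only_a, only_b, score = compare_texts("the cat sat", "the dog sat")
print(common, only_a, only_b, round(score, 3))
```

A score near 1 means the two texts use words in very similar proportions. A score of 0 means they share no words.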

Google Ngram Viewer

Google Ngram Viewer is a beginner tool for text mining. It is a web-based tool where users can search for a word or phrase across the Google Books corpus. The tool returns a chart of occurrences of the word or phrase over time.
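
The idea of counting a phrase over time can be tried on a small collection of your own. The Python sketch below tallies how often a phrase appears in texts grouped by year. The function name `phrase_counts_by_year` and the sample documents are hypothetical, and the sketch does not use or reproduce the Google Ngram Viewer's data or methods.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def phrase_counts_by_year(documents, phrase):
    """Count occurrences of `phrase` per year.

    `documents` is an iterable of (year, text) pairs. Years with no
    matches are kept with a count of 0 so they still appear on a chart.
    """
    target = tokenize(phrase)
    n = len(target)
    counts = {}
    for year, text in documents:
        tokens = tokenize(text)
        matches = 0
        if n:
            # Slide a window of n tokens across the text and compare.
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == target:
                    matches += 1
        counts[year] = counts.get(year, 0) + matches
    return dict(sorted(counts.items()))


# Made-up sample data for illustration only.
sample = [
    (1900, "The steam engine and the steam ship"),
    (1900, "A steam engine"),
    (1950, "No match here"),
]
print(phrase_counts_by_year(sample, "steam engine"))
```

The resulting dictionary of years and counts can be passed to any charting tool to draw a line over time.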

MALLET

MALLET (Machine Learning for Language Toolkit) is a tool for intermediate to advanced text mining. It is a command-line tool for natural language processing, document classification, clustering, topic modeling, and information extraction.



JSTOR Data for Research

JSTOR provides datasets for the journals and pamphlets in its digital library to researchers interested in text mining.

HathiTrust Digital Library

The Georgetown University Library is a member of HathiTrust, a partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Georgetown users can log into HathiTrust's digital library to download millions of titles for free.

Project Gutenberg

Project Gutenberg provides free access to over 54,000 eBooks. The eBooks can be downloaded as plain text files, enabling easy text mining with a range of tools.
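
As a starting point for working with a downloaded plain text file, the Python sketch below removes the license text that surrounds the book and counts the most frequent words. It assumes the file marks the book's beginning and end with lines starting `*** START OF` and `*** END OF`. Check your own file, since the markers can vary. The function names and sample text are hypothetical.

```python
import re
from collections import Counter


def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the start and end marker lines.

    Assumption: markers begin with '*** START OF' and '*** END OF'.
    If a marker is missing, that side of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end])


def top_words(text, n=10, stopwords=frozenset()):
    """Return the n most frequent words, skipping any in `stopwords`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords).most_common(n)


# Made-up sample standing in for a downloaded file.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Call me Ishmael. Call me.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer"
)
body = strip_gutenberg_boilerplate(sample)
print(top_words(body, n=2))
```

To use a real file, read it into a string first, for example with `pathlib.Path("book.txt").read_text(encoding="utf-8")`, where `book.txt` is a placeholder file name.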

University of Oxford Text Archive

The University of Oxford Text Archive (OTA) is a repository of digital literary and linguistic resources for research and teaching. Users can download texts as XML files, plain text files, or view the text in Voyant with the direct link provided in each entry of the OTA catalog.
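
If you download a text as XML and your tool expects plain text, the markup needs to be removed first. The Python sketch below shows one way to do that with the standard library. It assumes the main text sits inside an element named `body` and falls back to the whole document otherwise. The function name `xml_to_plain_text` and the sample XML are hypothetical, and real files may be structured differently.

```python
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_string):
    """Return the text content of an XML document with the tags removed.

    Assumption: if an element named 'body' exists (with or without a
    namespace), only its text is returned, so header metadata is skipped.
    """
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        # Tags with a namespace look like '{namespace}body'.
        if isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == "body":
            root = elem
            break
    text = "".join(root.itertext())
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(text.split())


# Made-up sample document for illustration only.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><title>Sample</title></teiHeader>"
    "<text><body><p>Hello <hi>world</hi>.</p></body></text>"
    "</TEI>"
)
print(xml_to_plain_text(sample))
```

The plain text that results can then be loaded into a word-counting script or another text mining tool.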



Have questions or want help getting started? Contact us: