Introduction
Text and Data Mining refers to the computer-aided harvesting and analysis of a corpus of data. A corpus can include the full text of a book or the entire body of an author's work, journal articles, social media posts, census data, and more. The goals of Text and Data Mining activities are to find patterns, discover relationships, and analyze semantics that suggest new meanings.
UNB Libraries can assist with developing a text or data mining project, including:
- Negotiating licenses for access to resources;
- Developing agreements with providers of texts;
- Consulting on project planning and tool selection;
- Helping with training.
Text Analysis and Mining Tools
There can be a learning curve to using the following tools effectively. Please contact Erik Moore (ecmoore@unb.ca) or Julie Morris (julie.morris@unb.ca) with questions or for guidance.
Subscribed resources
- Constellate
Constellate is a browser-based tool for creating datasets from collections such as JSTOR; it also teaches and facilitates text analysis on those datasets. A number of collections can be analyzed, including the user’s own content. Users can create 50,000-item datasets, view all the visualizations, and access the Constellate Lab. The platform provides content and tools together in one place, alongside a defined curriculum and tutorials, live classes taught by text analysis experts, and a community you can connect to for inspiration and guidance.
Subscribed multi-user unlimited access
- Gale Digital Scholar Lab
Gale Digital Scholar Lab equips students and scholars with text and data mining resources, visualization tools, and methodology suggestions. The incremental process of Build, Clean, and Analyze supports newcomers and experienced users alike as they interpret both Gale Primary Sources and their own documents.
Subscribed multi-user unlimited access
Free resources
- Google Ngram Viewer: https://books.google.com/ngrams
When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books. Learn more here: https://books.google.com/ngrams/info
- OpenRefine: https://openrefine.org
A powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
- Voyant Tools: https://voyant-tools.org/
An open-source, web-based application that supports scholarly reading and interpretation of texts or a corpus. Voyant was conceived to enhance reading through lightweight text analytics such as word frequency lists, frequency distribution plots, and KWIC (keyword in context) displays.
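For readers who want to see what lightweight text analytics of this kind involve, the following is a minimal Python sketch of a word frequency list and a KWIC (keyword in context) display. It is an illustration only and does not reflect how Voyant Tools is implemented. The function names (tokenize, word_frequencies, kwic), the simple tokenizing rule, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most common words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(top_n)


def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = tokenize(text)
    target = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


# Hypothetical sample text; replace it with the contents of your own corpus.
sample = (
    "The press was free, and the free press printed the news. "
    "News of the day filled the newspapers."
)

print(word_frequencies(sample, top_n=3))
for left, word, right in kwic(sample, "press", window=2):
    print(f"{left:>20} [{word}] {right}")
```

A real project would usually remove common words such as "the" before counting and would read the corpus from files rather than from a string.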
Text Corpora with Data Mining Rights
The following resources include text and data mining rights as part of their license agreements, sometimes with conditions. To learn the details of a specific resource's agreement, please contact:
- Joanne Smyth, Director, Collections Strategy and Scholarly Communication: jsmyth@unb.ca
- Linda Roulston, Electronic Licensing Librarian: lroulsto@unb.ca
Subscribed Resources
- 17th and 18th century Burney newspapers collection (Gale)
"The newspapers, pamphlets, and books gathered by the Reverend Charles Burney (1757-1817) represent the largest and most comprehensive collection of early English news media. The present digital collection, that helps chart the development of the concept of 'news' and 'newspapers' and the 'free press', totals almost 1 million pages and contains approximately 1,270 titles. Many of the Burney newspapers are well known, but many pamphlets and broadsides also included have remained largely hidden. These treasures can now be searched, browsed and discovered again within Gale Digital Collections."
Unlimited simultaneous users.
- 17th and 18th century Nichols newspapers collection (Gale)
"The 17th and 18th Century Nichols Newspapers Collection features the newspapers, periodicals, pamphlets and broadsheets that form the Nichols newspaper collection held at the Bodleian library in Oxford, UK. All 296 volumes of bound material, covering the period 1672-1737 are presented in digitized format here.
This collection charts the history of the development of the press in England and provides invaluable insight into 17th-18th century England."
Unlimited simultaneous users.
- British Library Newspapers - Part I & IV (Gale)
"Sourced from the extensive holdings of the British Library, British Library Newspapers delivers a wide range of irreplaceable local and regional voices to reflect the social, political, and cultural events of the eighteenth, nineteenth, and twentieth centuries. With more than 160 newspaper titles, the series is comprised of approximately 5.5 million pages of historic content, from articles to advertisements. UNB Libraries provides access to:
Part I: 1800-1900
Ranging from early tabloids like the Illustrated Police News to radical papers like the Chartist Northern Star, publications in Part I span a vast range of national, regional, and local interests.
Part IV: 1732-1950
From key early newspaper titles like the Stamford Mercury to what is possibly the oldest magazine in the world still in publication, the Scots Magazine, Part IV offers key local and regional perspectives."
Unlimited simultaneous users.
- Cambridge Core (eBooks & eJournals)
Cambridge Core provides full text for eJournals in the sciences, social sciences, and humanities, as well as access to selected eBooks purchased by UNB Libraries.
- Early English Books Online (EEBO via ProQuest)
EEBO is based on the microfilm collections curated by the Ann Arbor publisher Eugene B. Power (1905-1993), founder of what became University Microfilms International (UMI). Power’s first foreign project established the microfilming operation at the British Museum in 1942 and, since then, more than 200 libraries worldwide have contributed to the microfilm collection.
Following its digital launch in 1998, Early English Books Online now contains page images of virtually every work printed in England, Ireland, Scotland, Wales and British North America, as well as works in English printed elsewhere between 1473 and 1700.
Unlimited simultaneous users.
- Eighteenth Century Collections Online (ECCO)
A comprehensive digital edition of The Eighteenth Century microfilm set, which has aimed to include every significant English-language and foreign-language title printed in the United Kingdom, along with thousands of important works from the Americas, between 1701 and 1800. Consists of books, pamphlets, broadsides, and ephemera. Subject categories include history and geography; fine arts and social sciences; medicine, science, and technology; literature and language; religion and philosophy; law; general reference. Also included are significant collections of women writers of the eighteenth century, collections on the French Revolution, and numerous eighteenth-century editions of the works of Shakespeare. Where they add scholarly value or contain important differences, multiple editions of each individual work are offered. Allows searching Early English Books Online as an option.
Unlimited simultaneous users.
- English Poetry Database
The English Poetry Database "contains poems in English from Anglo-Saxon times to the end of the nineteenth century by writers from the British Isles. The database covers the works of 1,257 named poets and many items by different anonymous hands."
Unlimited simultaneous users.
- Institute of Physics Publishing (IOP)
The Institute of Physics Publishing promotes research and the advancement of knowledge in physics and physics-related fields. This resource provides access to eBooks, as well as current and archival journal content.
Unlimited simultaneous users.
- JSTOR Archival Collection
JSTOR provides access to back issues of a variety of scholarly journals. UNB Libraries currently subscribes to the Arts & Sciences (I through X) collections, along with the Life Sciences and Ireland collections.
Unlimited simultaneous users.
- JSTOR Current Collection
In addition to being an archive, JSTOR offers current access to a range of titles from various publishers. UNB has access to current and archival content for almost 50 of these journals.
Unlimited simultaneous users.
- JSTOR eBooks
Books at JSTOR offers more than 35,000 ebooks from renowned scholarly publishers, integrated with journals and primary sources on JSTOR's easy-to-use platform. UNB subscribes to selected eBook titles.
Unlimited simultaneous users.
- JSTOR Open Access (eBooks & Archival eJournals)
JSTOR Open Access offers more than 2,000 ebook titles now available from publishers such as University of California Press, Cornell University Press, NYU Press, and University of Michigan Press, and JSTOR will continue to add new titles. In addition, all journal content in JSTOR published prior to 1923 in the United States and prior to 1870 elsewhere is freely available to anyone, anywhere in the world. These open access books and archival journals are freely available for anyone in the world to use.
Unlimited simultaneous users.
- Literature Online
Literature Online offers full-text access to rare and inaccessible works and up-to-date reference resources, in addition to the full text of poetry, drama, and prose fiction from the seventh century to the present day. Materials are included from almost every period and genre of English literature as well as many works by 20th century authors. Contemporary criticism is available through the Annual Bibliography of English Language and Literature (ABELL).
Unlimited simultaneous users.
- Making of the Modern World: Part I & Part II (Gale)
"The Making of the Modern World is an extraordinary series which covers the history of Western trade, encompassing the coal, iron, and steel industries, the railway industry, the cotton industry, banking and finance, and the emergence of the modern corporation." UNB Libraries provides access to:
Part I, The Goldsmiths'-Kress Collection, 1450-1850
Offers ways of understanding the expansion of world trade, the Industrial Revolution, and the development of modern capitalism, supporting research in a variety of disciplines. Users have access to an abundance of rare books and primary source materials, many of which are the only known copy of the work.
Part II, 1851-1914
Takes The Making of the Modern World series to the end of the nineteenth century. Comprised mainly of primary source documents such as monographs, reports, correspondence, speeches, and surveys, this collection broadens Gale’s international coverage of social, economic, and business history, as well as political science, technology, industrialisation, and the birth of the modern corporation.
Purchased multi-user unlimited access
- Market share reporter (Gale)
Market Share Reporter (MSR) is a compilation of published market share data about companies, brands, products, commodities, services and facilities in U.S. and international markets. The 2016 online edition and every second year's subsequent online edition are available through Gale Virtual Reference Library. Data is compiled from periodical sources (newspapers, magazines, newsletters, government reports etc.) over the previous three to four years. Entries feature a descriptive title, data and a market description, a list of producers/products, and the original sources. Entries in MSR are stored mainly by name of the report; reports can be found by keyword or by using the Advanced Search feature.
Unlimited simultaneous users.
- Nineteenth Century Collections Online (NCCO eBooks)
Nineteenth Century Collections Online is a digitization and publishing program focusing on primary source collections of the long nineteenth century. The program includes a variety of content types--monographs, newspapers, pamphlets, manuscripts, ephemera, maps, statistics, and more--and unites them in one central, cross-searchable location. 12 collections are now available:
Individual titles in these collections are available for discovery in our eBooks search or in UNBWorldCat:
• Asia and the West: Diplomacy and Cultural Exchange
• British Politics and Society
• British Theatre, Music, and Literature: High and Popular Culture
• Children's Literature and Childhood
• European Literature, 1790-1840: The Corvey Collection
• Mapping the World: Maps and Travel Literature
• Religion, Society, Spirituality, and Reform
• Science, Technology, and Medicine: 1780-1925, Part II
Individual titles in these collections can only be discovered in the NCCO site:
• Europe and Africa: Commerce, Christianity, Civilization, and Conquest
• Photography: The World through the Lens
• Science, Technology, and Medicine: 1780-1925, Part I
• Women: Transnational Networks
Unlimited simultaneous users.
- Oxford University Press Journals
Oxford Journals is a division of Oxford University Press, which is a department of Oxford University. We publish well over 230 academic and research journals covering a broad range of subject areas, two-thirds of which are published in collaboration with learned societies and other international organizations.
Unlimited simultaneous users.
- Past Masters (Intelex)
InteLex Past Masters is comprised of 100+ full-text humanities and sciences databases that make available cohesive collections of editions, in both original language and in English translation, of seminal figures in the humanities and sciences.
Unlimited simultaneous users.
- ProQuest Historical Newspapers
ProQuest Historical Newspapers offers full-text and full-image articles for newspapers dating back to the 19th century. As part of the ProQuest Historical Newspapers program, every issue of each title includes the complete paper, cover-to-cover, with full-page and article images in downloadable PDF. Includes The New York Times (1851-2007), The Wall Street Journal (1889-1993), and Washington Post (1877-1994).
Unlimited simultaneous users.
- ProQuest Historical Newspapers: The Globe and Mail
Canada's Heritage from 1844 contains complete coverage of The Globe and Mail newspaper from 1844 through 2011. Coverage includes major events in Canadian history, images, advertisements, classifieds, cartoons, birth/death notices and the full content of the Report on Business section first published in 1962.
Unlimited simultaneous users.
- SAGE Journals Online
"SAGE Publications is an independent international publisher of journals, books, and electronic media. Since its inception in 1965, SAGE Publications has been a leader in publishing high-caliber titles for academic researchers in the social sciences."
Unlimited simultaneous users.
- ScienceDirect
ScienceDirect offers comprehensive coverage of literature across all fields of science, medicine and technology. All previous ScienceDirect journal collections have been merged into this single collection, along with select purchased eBook titles.
Unlimited simultaneous users.
- Scopus
Scopus, a multidisciplinary online resource, will be invaluable to students and faculty in various fields of study within the sciences, health sciences and the social sciences. Scopus offers full-text linking and abstracting-and-indexing information, including peer-reviewed titles from international publishers, Open Access journals, conference proceedings, trade publications, and quality web sources.
Unlimited simultaneous users.
- SpringerLink
The SpringerLink service provides access to electronic journals in a variety of subjects, including "life sciences, chemical sciences, geosciences, computer science, mathematics, medicine, physics & astronomy, engineering, environmental sciences, law, and economics."
[NOTE: pre-1996 Archival content now accessible when available]
Unlimited simultaneous users.
- Taylor & Francis Online - eJournals
"Taylor & Francis Group collaborates with researchers, scholarly societies, universities and libraries worldwide to bring knowledge to life. Our journals program encompasses over 1,600 titles and as one of the world’s leading publishers of scholarly journals our content spans all areas of Humanities, Social Sciences, Science and Technology."
View journal titles in collection:
• Taylor & Francis Online - eJournals (Archive 2017 purchased access)
• Taylor & Francis Online - eJournals (CRKN Medical Library subscribed access)
• Taylor & Francis Online - eJournals (CRKN S&T Library subscribed access)
• Taylor & Francis Online - eJournals (CRKN SSH Library subscribed access)
• Taylor & Francis Online - eJournals (UNB Select Titles Archive 2017 purchased access)
Unlimited simultaneous users.
- Times Digital Archives (Gale)
The Times Digital Archive allows users to search and view online The Times (London) newspaper from 1785-1985.
NOTE: The Times is not published on Sunday, and The Sunday Times, a distinct newspaper, is not included in this database.
Unlimited simultaneous users.
- War on Poverty and Office of Economic Opportunity: Part III Administration of Antipoverty Programs & Civil Rights, 1964-67 (Gale)
"This collection brings together a series of Office of Economic Opportunity (OEO) collections that highlight efforts to meld the issue of civil rights and antipoverty initiatives:
1) Alphabetical File of Samuel Yette, 1964-1966: Yette was the Special Assistant to the Director of Civil Rights. Among his records are correspondence, reports, antipoverty program analyses, minutes of meetings, transcripts of testimonies, and other material.
2) Program Files, 1964-1967: These records consist of correspondence, weekly reports on civil rights matters, reports by civil rights coordinators, equal employment opportunity guidelines, and more.
3) Records Relating to the Administration of the Civil Rights Program in the Regions, 1965-1966: These records arranged by region > state > local areas and cities consist of correspondence between regional coordinators, various civil rights groups, labor organizations, members of Congress, and community groups regarding the activities of the OEO."
Original Microform Title: The War on Poverty and the Office of Economic Opportunity; Part 3: Administration of Antipoverty Programs and Civil Rights, 1964-1967
Unlimited simultaneous users.
- Wiley Online Library
Wiley Online Library hosts the world's broadest and deepest multidisciplinary collection of online resources covering life, health and physical sciences, social science, and the humanities. It delivers seamless integrated access to over 4 million articles from 1500 journals. UNB also subscribes to select eBook titles.
Unlimited simultaneous users.
Digitized Newspapers
Likewise, UNB Libraries makes available for text and data mining, with some conditions, digital back files of New Brunswick's big three daily newspapers, The Telegraph Journal, The Moncton Times-Transcript, and The Daily Gleaner. For more information, please contact James MacKenzie, Director, Advanced Digital Research and Scholarship (jmackenz@unb.ca).
Free Resources
Copyright Considerations
Frequently Asked Questions
What is the status of Text and Data Mining activities under the current Canadian Copyright regime?
The Canadian Copyright Act does not address text and data mining. The federal government has signalled that it intends to consider changes to the Act for this type of research, but there is no clear regulation at this point.
For more information see ‘A Consultation on a Modern Copyright Framework for Artificial Intelligence and the Internet of Things’: https://www.ic.gc.ca/eic/site/693.nsf/eng/00316.html
What are the applications and limits of the fair dealing doctrine?
For more information, contact Josh Dickison, Copyright Officer and Manager of Digital Delivery at UNB Libraries: copyright@unb.ca
UNB Libraries may forward you to, or ask you to contact, the Office of Research Services at ors@unb.ca if a research license agreement is needed, or for assistance on different licenses and their allowed uses.
More Information
-
- Erik Moore
- UNB Fredericton
- ecmoore@unb.ca
-
- Julie Morris (They/Them or She/Her)
- Collections Analysis/Bibliometrics Librarian
- UNB Fredericton
- julie.morris@unb.ca
- (506)-447-3220
-
- Marc Bragdon
- Head, Harriet Irving Library Research Commons
- UNB Fredericton
- mbragdon@unb.ca
- WhatsApp: 506-440-3793