Description of Document Collections

Some document collections can be distributed by NIST at no cost to participants in TAC tasks that require the collections. Other collections must be obtained directly from other organizations.

  • AQUAINT
    The AQUAINT corpus of English News Text consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires. The collection spans the years 1999-2000 (1996-2000 for Xinhua documents). The AQUAINT collection is distributed by the Linguistic Data Consortium (LDC catalog number LDC2002T31).

  • AQUAINT-2
    The AQUAINT-2 collection is a subset of the LDC English Gigaword Third Edition (LDC catalog number LDC2007T07). It comprises approximately 2.5 GB of text (about 907K documents) spanning October 2004 through March 2006. Articles are in English and come from a variety of sources, including Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Los Angeles Times-Washington Post News Service, New York Times, and the Associated Press. Each document has an ID consisting of a source code, the date on which the document was delivered to LDC, and four digits that differentiate documents from the same source on the same date (e.g., document NYT_ENG_20050311.0029 was received from the New York Times on March 11, 2005). The AQUAINT-2 collection itself is distributed by the Linguistic Data Consortium (LDC catalog number LDC2008T25).
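
    The ID layout described above can be split into its three parts mechanically. The following is a minimal Python sketch, not an official tool: the regular expression is an assumption generalized from the single example NYT_ENG_20050311.0029 (uppercase source code with one underscore, eight-digit date, four-digit sequence), and the names DocId and parse_doc_id are hypothetical.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# Assumed layout, inferred from the example NYT_ENG_20050311.0029:
#   <source code>_<YYYYMMDD>.<four-digit sequence number>
DOC_ID_PATTERN = re.compile(
    r"^(?P<source>[A-Z]+_[A-Z]+)_(?P<date>\d{8})\.(?P<seq>\d{4})$"
)


class DocId(NamedTuple):
    """Hypothetical container for the three parts of a document ID."""
    source: str      # source code, e.g. "NYT_ENG"
    delivered: date  # date the document was delivered to LDC
    sequence: int    # distinguishes documents from the same source and date


def parse_doc_id(doc_id: str) -> DocId:
    """Split an AQUAINT-2 style document ID into source, date, and sequence."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"not an AQUAINT-2 style document ID: {doc_id!r}")
    # strptime also rejects impossible dates such as 20051340.
    delivered = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return DocId(match.group("source"), delivered, int(match.group("seq")))


if __name__ == "__main__":
    print(parse_doc_id("NYT_ENG_20050311.0029"))
```

    IDs that do not fit the assumed pattern raise ValueError rather than being silently accepted, which makes deviations from the pattern easy to spot when scanning a whole collection.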

  • Blog06
    The Blog06 collection is distributed by the University of Glasgow and is the same collection that was used in the TREC 2006 Blog Track. The University of Glasgow collected the Blog06 documents by polling 100,649 RSS and Atom feeds over an 11-week period (December 6, 2005 - February 21, 2006). A blog document is defined as a blog post plus its follow-up comments (a permalink). Each document in the permalinks collection is the raw HTML content from the Web, wrapped in a <DOC>...</DOC> pair. Immediately after the opening <DOC> tag come several metadata tags, including <DOCNO>, which contains the document ID. NIST distributes a small number of Blog06 test documents to TAC track participants for tasks that do not require the entire Blog06 collection. Organizations that need the entire Blog06 collection must obtain it directly from the University of Glasgow. (See http://ir.dcs.gla.ac.uk/test_collections/.)
    • Blog06 README
    • The TREC Blogs06 Collection : Creating and Analysing a Blog Test Collection. Craig Macdonald and Iadh Ounis. DCS Technical Report TR-2006-224. Department of Computing Science, University of Glasgow. 2006. [PDF]
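
    The <DOC>...</DOC> wrapping described above lends itself to a simple extraction loop. The Python sketch below is illustrative only: it assumes the literal strings <DOC> and </DOC> never occur inside the wrapped HTML, it treats everything after <DOCNO> as opaque (other metadata tags plus the raw page), and the function name iter_documents and the sample IDs are hypothetical. The Blog06 README remains the authoritative description of the format.

```python
import re
from typing import Iterator, Tuple

# Non-greedy match so that each <DOC>...</DOC> pair is captured separately.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def iter_documents(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (document ID, remainder of the wrapper) for each <DOC> block.

    The remainder still contains any other metadata tags followed by the
    raw HTML; separating those would require the tag list from the README.
    """
    for doc_match in DOC_RE.finditer(text):
        body = doc_match.group(1)
        docno_match = DOCNO_RE.search(body)
        if docno_match is None:
            continue  # skip blocks without an ID instead of guessing one
        yield docno_match.group(1), body[docno_match.end():]


if __name__ == "__main__":
    # Made-up sample; real Blog06 IDs and metadata tags will differ.
    sample = (
        "<DOC>\n<DOCNO> EXAMPLE-0001 </DOCNO>\n"
        "<html><body>first post</body></html>\n</DOC>\n"
    )
    for doc_id, rest in iter_documents(sample):
        print(doc_id, len(rest))
```

    Regular expressions are used here rather than an XML parser because the wrapped content is raw HTML from the Web and cannot be expected to be well-formed.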

  • TIPSTER
    The TIPSTER collection (Disks 1-3) is distributed by NIST to TAC track participants at no cost, upon request. The collection is also distributed by the Linguistic Data Consortium (LDC catalog number LDC93T3A).
    • Disk 1: Includes material from the Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), the Associated Press (1989), Department of Energy abstracts, and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
    • Disk 2: Includes material from the Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), the Associated Press (1988), and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
    • Disk 3: Includes material from the San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis.

  • TREC
    The TREC collection (Disks 4-5) is distributed by NIST to NLP researchers at no cost, upon request. To obtain it, follow the NIST procedure for requesting the TREC Collection.
    • Disk 4: Includes material from the Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
    • Disk 5: Includes material from the Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).

  • WashingtonPost
    The TREC Washington Post Corpus contains approximately 840,000 news articles and blog posts from January 2012 through December 2022. The articles are stored in JSON format and include the following:
    • title
    • byline
    • date of publication
    • kicker (a section header)
    • article text broken into paragraphs
    • links to embedded images and multimedia (for 2012-2017 documents)
    Individual TAC tracks distribute selected documents from this collection, which have also been converted into other formats.
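
    Because the articles are JSON, the fields listed above can be read with a standard JSON parser. The Python sketch below is a hedged illustration, not a description of the actual schema: the key names ("title", "byline", "date", "kicker", "paragraphs"), the flat record layout, and the one-object-per-line input are all assumptions made for the example, and summarize_article is a hypothetical helper. Check the corpus documentation for the real field names before relying on it.

```python
import json
from typing import Any, Dict, List


def summarize_article(json_line: str) -> Dict[str, Any]:
    """Extract the listed fields from one JSON-encoded article.

    All key names below are hypothetical placeholders for whatever the
    corpus actually uses; missing keys yield None or an empty text.
    """
    record = json.loads(json_line)
    paragraphs: List[str] = record.get("paragraphs", [])
    return {
        "title": record.get("title"),
        "byline": record.get("byline"),
        "date": record.get("date"),
        "kicker": record.get("kicker"),
        "paragraph_count": len(paragraphs),
        # Rejoin the paragraphs with blank lines to recover running text.
        "text": "\n\n".join(paragraphs),
    }


if __name__ == "__main__":
    # Made-up record used only to exercise the function.
    line = json.dumps({
        "title": "Example headline",
        "byline": "By A. Reporter",
        "date": "2012-01-01",
        "kicker": "Local",
        "paragraphs": ["First paragraph.", "Second paragraph."],
    })
    print(summarize_article(line))
```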

