Description of Document Collections
Some document collections can be distributed by NIST at no cost
to participants in TAC tasks that require the collections.
Other collections must be obtained directly from other organizations.
The AQUAINT corpus of English News Text consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires. The collection spans the years 1999-2000 (1996-2000 for Xinhua documents). The AQUAINT collection is distributed by the Linguistic Data Consortium (LDC catalog number LDC2002T31).
The AQUAINT-2 collection is a subset of the LDC English Gigaword Third Edition (LDC catalog number LDC2007T07). It comprises approximately 2.5 GB of text (about 907K documents) spanning October 2004 - March 2006. Articles are in English and come from a variety of sources, including Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Los Angeles Times-Washington Post News Service, New York Times, and the Associated Press. Each document has an ID consisting of a source code, the date on which the document was delivered to LDC, and four digits that differentiate documents from the same source on the same date (e.g., document NYT_ENG_20050311.0029 was received from the New York Times on March 11, 2005). The AQUAINT-2 collection itself is distributed by the Linguistic Data Consortium under LDC catalog number LDC2008T25.
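The AQUAINT-2 document ID layout described above can be split into its three parts with a short script. The following is a minimal sketch, not an official tool: the names `DocId` and `parse_doc_id` are hypothetical, and the regular expression is an assumption inferred from the single example ID (an uppercase source code of the form `XXX_XXX`, an eight-digit `YYYYMMDD` date, a period, and four digits).

```python
import re
from datetime import date
from typing import NamedTuple


class DocId(NamedTuple):
    """Parts of an AQUAINT-2 document ID (hypothetical helper type)."""
    source: str      # source code, e.g. "NYT_ENG"
    delivered: date  # date the document was delivered to LDC
    sequence: int    # four-digit number separating same-source, same-date documents


# Assumed pattern, inferred from the example NYT_ENG_20050311.0029.
_DOC_ID = re.compile(r"([A-Z]+_[A-Z]+)_(\d{4})(\d{2})(\d{2})\.(\d{4})")


def parse_doc_id(doc_id: str) -> DocId:
    """Split an ID into source code, delivery date, and sequence number."""
    match = _DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not a recognized AQUAINT-2 document ID: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    # date() raises ValueError itself if the digits do not form a real date.
    return DocId(source, date(int(year), int(month), int(day)), int(seq))
```

Calling `parse_doc_id("NYT_ENG_20050311.0029")` yields the source code `NYT_ENG`, the date March 11, 2005, and the sequence number 29. IDs that do not fit the assumed pattern raise `ValueError` rather than being silently accepted.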
The Blog06 collection is distributed by the University of Glasgow and
is the same collection as was used in the TREC 2006 Blog Track. The
University of Glasgow collected Blog06 documents by polling 100,649
RSS and Atom feeds over an 11 week period (December 6, 2005 -
February 21, 2006). A blog document is defined to be a blog post plus
its follow-up comments (a permalink). Each document in the permalinks
collection is the raw HTML content from the Web wrapped between a
<DOC>...</DOC> pair. Just after <DOC>, there are
some informational metadata tags, including the <DOCNO> section
which contains the document ID. NIST distributes a small number of
Blog06 test documents to TAC track participants for those tasks in
which the entire Blog06 collection is not required. However,
organizations wishing access to the entire Blog06 collection must
obtain it directly from the University of Glasgow. (See http://ir.dcs.gla.ac.uk/test_collections/.)
- Blog06 README
- The TREC Blogs06 Collection: Creating and Analysing a Blog Test Collection. Craig Macdonald and Iadh Ounis. DCS Technical Report TR-2006-224. Department of Computing Science, University of Glasgow. 2006.
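The Blog06 permalinks layout described above (raw HTML wrapped in a <DOC>...</DOC> pair, with a <DOCNO> section near the top) can be read with a few lines of Python. This is a rough sketch under stated assumptions: the function name `iter_docs` is hypothetical, the <DOCNO> section is assumed to be closed by a matching </DOCNO> tag, and the raw HTML is assumed never to contain a literal </DOC>. Consult the Blog06 README for the authoritative format.

```python
import re
from typing import Iterator, Tuple

# One record: everything between <DOC> and the next </DOC>.
_DOC = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# The document ID; a closing </DOCNO> tag is an assumption.
_DOCNO = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def iter_docs(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (document ID, rest of record) for each <DOC> record in text.

    The "rest of record" is whatever follows the <DOCNO> section: any
    remaining metadata tags plus the raw HTML of the permalink.
    """
    for record in _DOC.finditer(text):
        body = record.group(1)
        docno = _DOCNO.search(body)
        if docno is None:
            continue  # skip records without an ID instead of guessing one
        yield docno.group(1), body[docno.end():]
```

Because the regular expressions are applied to an in-memory string, this approach suits the small test sets NIST distributes; a streaming reader would be a better fit for the full collection.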
The TIPSTER collection (Disks 1-3) is distributed by NIST to TAC track participants at no cost, upon request. The collection is also distributed by the Linguistic Data Consortium (LDC catalog number LDC93T3A).
- Disk 1: Includes material from the Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), the Associated Press (1989), Department of Energy abstracts, and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
- Disk 2: Includes material from the Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), the Associated Press (1988), and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
- Disk 3: Includes material from the San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis.
The TREC collection (Disks 4-5) is distributed by NIST to NLP researchers at no cost, upon request. Please follow the procedure to request the TREC Collection.
- Disk 4: Includes material from the Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
- Disk 5: Includes material from the Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).