Automatic Document Classification

doi:10.5281/zenodo.4136728

Informationskompetenz - Basiskompetenz in der Informationsgesellschaft, Proceedings des 7. Internationalen Symposiums für Informationswissenschaft (ISI 2000), Darmstadt, 8.-10. November 2000.

Published October 27, 2020 | Version v1

Conference paper (open access)

(Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical pattern recognition, or neural network approaches are used to construct classifiers automatically. In this paper we thoroughly evaluate a wide variety of these methods on a document classification task for German text. We evaluate different feature construction and selection methods and various classifiers. Our main results are: (1) feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); (2) surprisingly, our morphological analysis does not improve classification quality compared to a letter 5-gram approach; (3) Support Vector Machines are significantly better than all other classification methods.
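The feature side of the pipeline summarized above (letter 5-gram construction followed by feature selection) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the selection criterion, so a chi-square score is assumed here, and the function names and the toy German snippets in the usage note are hypothetical. The classifier itself (for example a Support Vector Machine) is left out.

```python
from collections import Counter


def letter_ngrams(text, n=5):
    """Return the set of overlapping letter n-grams of a lowercased text."""
    s = " ".join(text.lower().split())  # collapse runs of whitespace
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def chi_square_scores(docs, labels, target, n=5):
    """Score every n-gram by the chi-square statistic of its 2x2
    presence/category table for one target category.
    Assumption: chi-square stands in for the unspecified selection method."""
    total = len(docs)
    pos = sum(1 for y in labels if y == target)
    neg = total - pos
    df_pos, df_neg = Counter(), Counter()  # document frequencies per side
    for text, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(letter_ngrams(text, n))
    scores = {}
    for g in set(df_pos) | set(df_neg):
        a = df_pos[g]   # target documents containing g
        b = df_neg[g]   # other documents containing g
        c = pos - a     # target documents without g
        d = neg - b     # other documents without g
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[g] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, target, k=10, n=5):
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels, target, n)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]
```

For example, calling `select_features(["fussball tor", "aktie kurs"], ["sport", "finanz"], "sport", k=5)` returns the five 5-grams that best separate the two toy categories. Only the selected n-grams would then be passed to the classifier, which is where the reduction in training time and overfitting reported in the abstract would come from.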

Files

isi2000_9.pdf (476.6 kB, md5:ee7517aa74937d61282501c276b5bbab)

Resource type

Conference paper

Publisher

Zenodo

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: October 27, 2020
Modified: October 27, 2020
