Automated Text Categorization: Techniques and Applications

Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell'Informazione
Consiglio Nazionale delle Ricerche
Pisa, Italy
E-mail: fabrizio@iei.pi.cnr.it

Abstract:

In this talk I will present a number of research efforts we have
recently undertaken at ISTI-CNR in the field of automated text
categorization, a discipline at the crossroads of information
retrieval and machine learning which is concerned with tagging natural
language texts with category labels from a predefined set.

Novel techniques we have investigated include "supervised indexing", a
technique in which information on the membership of training documents
in categories is used not only to learn a text classifier, but also to
determine the term weights that are to be input to the
classifier-learning mechanism and to the classifiers themselves, once
they have been built.
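
As a rough illustration of the idea, the sketch below replaces the
usual unsupervised collection statistic with a score computed from
the category labels of the training documents. The choice of the
chi-square statistic, the tf-times-score combination, and the names
chi_square and supervised_weights are assumptions made for this
example only; the abstract does not say which weighting function is
used.

```python
from collections import Counter


def chi_square(term, category, docs, labels):
    """Chi-square statistic between term presence and category membership.

    docs is a list of token lists; labels gives one category per document.
    """
    n = len(docs)
    pairs = list(zip(docs, labels))
    a = sum(1 for d, l in pairs if term in d and l == category)          # term, in category
    b = sum(1 for d, l in pairs if term in d and l != category)          # term, other category
    c = sum(1 for d, l in pairs if term not in d and l == category)      # no term, in category
    e = sum(1 for d, l in pairs if term not in d and l != category)      # no term, other category
    denom = (a + b) * (c + e) * (a + c) * (b + e)
    return 0.0 if denom == 0 else n * (a * e - b * c) ** 2 / denom


def supervised_weights(doc_tokens, term_scores):
    """Weight each term of a document by tf times its supervised score."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * term_scores.get(t, 0.0) for t in tf}


# Toy training set (hypothetical data).
train_docs = [["goal", "match"], ["match", "team"],
              ["vote", "election"], ["election", "match"]]
train_labels = ["sport", "sport", "politics", "politics"]

vocabulary = {t for d in train_docs for t in d}
scores = {t: chi_square(t, "sport", train_docs, train_labels)
          for t in vocabulary}

# The same scores weight training documents and, later, unseen ones.
weights = supervised_weights(["election", "election", "goal"], scores)
```

Note that chi-square is symmetric: it rewards terms that are strongly
associated with the category and terms that are strongly associated
with its complement alike, which is one reason other functions might
be preferred in practice.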

Novel applications of text categorization we have investigated include:
    a) the automated extension of specialized thesauri by the
    application of text categorization techniques to term
    representations obtained by "term indexing" methods dual to the
    usual text indexing ones;
    b) "automated survey coding", i.e., the automated classification
    of human subjects into topical categories based on the answers
    they have given in response to an open-ended question in a
    questionnaire.
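
The duality mentioned in item a) can be pictured as follows: where
text indexing represents a document by the terms that occur in it,
term indexing represents a term by the documents in which it occurs.
The sketch below is only one plausible reading of that idea, using
raw frequencies; the function names and the toy data are
hypothetical.

```python
from collections import Counter, defaultdict


def index_documents(docs):
    """Usual text indexing: map each document id to {term: frequency}."""
    return {doc_id: dict(Counter(tokens)) for doc_id, tokens in docs.items()}


def index_terms(docs):
    """Dual term indexing: map each term to {document id: frequency}."""
    term_vectors = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            term_vectors[term][doc_id] = freq
    return dict(term_vectors)


# Toy corpus (hypothetical data).
corpus = {
    "d1": ["heart", "artery"],
    "d2": ["heart", "heart", "valve"],
}

doc_vectors = index_documents(corpus)
term_vectors = index_terms(corpus)
```

Under this reading, the term vectors play the role that document
vectors play in ordinary text categorization, and the categories are
the thesaurus entries to which new terms might be assigned.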