calculate_idf

pyhelpers.text.calculate_idf(raw_documents, rm_punc=False)

Calculate inverse document frequency.

Parameters:
  • raw_documents (Iterable | Sequence) – a sequence of raw text documents, e.g. a list of strings

  • rm_punc (bool) – whether to remove punctuation from the input textual data, defaults to False

Returns:

term frequency (TF) of each input document (one dict per document), and the inverse document frequency (IDF) of each term across all documents

Return type:

tuple[list[dict], dict]

Examples:

>>> from pyhelpers.text import calculate_idf

>>> raw_doc = [
...     'This is an apple.',
...     'That is a pear.',
...     'It is human being.',
...     'Hello world!']

>>> docs_tf_, corpus_idf_ = calculate_idf(raw_doc, rm_punc=False)
>>> docs_tf_
[{'This': 1, 'is': 1, 'an': 1, 'apple': 1, '.': 1},
 {'That': 1, 'is': 1, 'a': 1, 'pear': 1, '.': 1},
 {'It': 1, 'is': 1, 'human': 1, 'being': 1, '.': 1},
 {'Hello': 1, 'world': 1, '!': 1}]

>>> corpus_idf_
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 '.': 0.0,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453,
 '!': 0.6931471805599453}

>>> docs_tf_, corpus_idf_ = calculate_idf(raw_doc, rm_punc=True)
>>> docs_tf_
[{'This': 1, 'is': 1, 'an': 1, 'apple': 1},
 {'That': 1, 'is': 1, 'a': 1, 'pear': 1},
 {'It': 1, 'is': 1, 'human': 1, 'being': 1},
 {'Hello': 1, 'world': 1}]

>>> corpus_idf_
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453}
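
The values above are consistent with the smoothed formula idf(t) = ln(N / (1 + df(t))), where N is the number of documents and df(t) is the number of documents containing term t: with N = 4, a term found in one document gets ln(4/2) ≈ 0.693, and a term found in three documents ('is', '.') gets ln(4/4) = 0. This formula is inferred from the example output only; it is not stated by the documentation. The following is a minimal sketch under that assumption, not the library's implementation. The name `sketch_tf_idf` and its regex tokenizer are hypothetical stand-ins, and the actual tokenizer used by calculate_idf may differ.

```python
import math
import re
from collections import Counter


def sketch_tf_idf(raw_documents, rm_punc=False):
    """Hypothetical re-creation of the example output; not pyhelpers code."""
    # Assumed tokenizer: runs of word characters, plus each punctuation
    # mark as its own token unless rm_punc is True.
    pattern = r"\w+" if rm_punc else r"\w+|[^\w\s]"

    # Term frequency: raw token counts, one dict per document.
    docs_tf = [dict(Counter(re.findall(pattern, doc))) for doc in raw_documents]

    # Document frequency: number of documents in which each token occurs.
    doc_freq = Counter(token for tf in docs_tf for token in tf)

    # Assumed smoothed IDF, inferred from the example values: ln(N / (1 + df)).
    n_docs = len(docs_tf)
    corpus_idf = {t: math.log(n_docs / (1 + df)) for t, df in doc_freq.items()}

    return docs_tf, corpus_idf


raw_doc = [
    'This is an apple.',
    'That is a pear.',
    'It is human being.',
    'Hello world!']

docs_tf, corpus_idf = sketch_tf_idf(raw_doc, rm_punc=False)
print(corpus_idf['This'])  # 0.6931471805599453
print(corpus_idf['is'])    # 0.0
```

Note that the 1 in the denominator means a term occurring in every document would receive a negative value under this assumed formula.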