calculate_tf_idf

pyhelpers.text.calculate_tf_idf(raw_documents, rm_punc=False)

Calculate the term frequency–inverse document frequency (TF-IDF) of the tokens in the input textual data.

Parameters

raw_documents (Iterable or Sequence) – a sequence of textual data, one string per document
rm_punc (bool) – whether to remove punctuation from the input textual data, defaults to False

Returns

TF-IDF of the input textual data, as a mapping from each token to its TF-IDF value

Return type

dict

Examples:

>>> from pyhelpers.text import calculate_tf_idf

>>> raw_doc = [
...     'This is an apple.',
...     'That is a pear.',
...     'It is human being.',
...     'Hello world!']

>>> docs_tf_idf_ = calculate_tf_idf(raw_documents=raw_doc)
>>> docs_tf_idf_
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 '.': 0.0,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453,
 '!': 0.6931471805599453}

>>> docs_tf_idf_ = calculate_tf_idf(raw_documents=raw_doc, rm_punc=True)
>>> docs_tf_idf_
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453}
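
The values above (0.0 for 'is' and '.', ln 2 ≈ 0.693 for every other token) are consistent with a smoothed inverse document frequency of ln(N / (1 + df)), where N is the number of documents and df is the number of documents containing the token. The following is a minimal sketch under that assumption, not the pyhelpers source: the function name sketch_tf_idf, the regular-expression tokenisation and the corpus-wide term count are all hypothetical choices made for illustration. Every token in the example occurs at most once per document, so the example output cannot confirm how the library counts term frequency.

```python
import math
import re
from collections import Counter


def sketch_tf_idf(raw_documents, rm_punc=False):
    """Illustrative re-implementation (hypothetical; not the pyhelpers code)."""
    # Assumed tokenisation: runs of word characters, plus each punctuation
    # mark as a separate token unless rm_punc is set.
    pattern = r"\w+" if rm_punc else r"\w+|[^\w\s]"
    docs_tokens = [re.findall(pattern, doc) for doc in raw_documents]

    n_docs = len(docs_tokens)
    # Term frequency: occurrences of each token across the whole corpus.
    tf = Counter(tok for tokens in docs_tokens for tok in tokens)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for tokens in docs_tokens for tok in set(tokens))

    # Assumed smoothed IDF: ln(N / (1 + df)).
    return {tok: tf[tok] * math.log(n_docs / (1 + df[tok])) for tok in tf}


raw_doc = [
    'This is an apple.',
    'That is a pear.',
    'It is human being.',
    'Hello world!']

sketch = sketch_tf_idf(raw_doc)
sketch_no_punc = sketch_tf_idf(raw_doc, rm_punc=True)
```

Under this assumption, 'is' and '.' each appear in 3 of the 4 documents, so their IDF is ln(4 / 4) = 0, while a token found in a single document gets ln(4 / 2) = ln 2.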