calculate_tfidf

pyhelpers.text.calculate_tfidf(documents, **kwargs)

Calculate TF-IDF (Term Frequency-Inverse Document Frequency) for the given textual documents.

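TF (Term Frequency) measures how frequently a term appears in a document relative to its length. IDF (Inverse Document Frequency) measures how rare a term is across the entire corpus of documents: a term that occurs in most documents receives a low weight, while a term confined to a few documents receives a high one.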
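To make the weighting concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names `tokenize` and `sketch_tfidf` are hypothetical, and the formula (corpus-wide raw term counts multiplied by ln(N / (1 + df)), where N is the number of documents and df is the number of documents containing the term) is an assumption inferred from the values shown in the Examples section of this entry.

```python
import math
import string
from collections import Counter


def tokenize(text, lowercase=True):
    """Strip ASCII punctuation and split on whitespace (a simplifying assumption)."""
    if lowercase:
        text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation)).split()


def sketch_tfidf(documents, lowercase=True):
    """Illustrative TF-IDF: raw corpus-wide counts times ln(N / (1 + df)).

    The formula is an assumption chosen to match the documented example output.
    """
    docs_tokens = [tokenize(doc, lowercase) for doc in documents]
    n_docs = len(docs_tokens)

    # Term frequency: raw count of each term, summed over all documents.
    tf = Counter(token for tokens in docs_tokens for token in tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(token for tokens in docs_tokens for token in set(tokens))

    return {term: count * math.log(n_docs / (1 + df[term]))
            for term, count in tf.items()}


documents = ['This is an apple.', 'That is a pear.', 'It is human being.', 'Hello world!']
scores = sketch_tfidf(documents)
print(scores['is'])     # 0.0, because ln(4 / (1 + 3)) = ln(1)
print(scores['apple'])  # ln(4 / (1 + 1)) = ln(2), about 0.693
```

Under this assumed formula, 'is' occurs in three of the four documents and so scores zero, while every term that occurs in exactly one document scores ln(2). A term present in every document would receive a slightly negative weight, because N / (1 + N) is below 1.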

Parameters:
  • documents (Iterable | Sequence) – A sequence of text documents, each given as a string.

  • kwargs – [Optional] Additional keyword arguments passed to calculate_idf() (for example, lowercase); see also count_words().

Returns:

A dictionary mapping each term in the input documents to its TF-IDF value.

Return type:

dict

Examples:

>>> from pyhelpers.text import calculate_tfidf
>>> documents = [
...     'This is an apple.',
...     'That is a pear.',
...     'It is human being.',
...     'Hello world!']
>>> tfidf = calculate_tfidf(documents)
>>> tfidf
{'this': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'that': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'it': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'hello': 0.6931471805599453,
 'world': 0.6931471805599453}
>>> tfidf = calculate_tfidf(documents, lowercase=False)
>>> tfidf
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453}
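
Because the result is an ordinary dict, standard Python tools are enough to filter or rank terms. The following is a minimal sketch that uses a hand-copied subset of the first example's output rather than a live library call; the names `informative` and `ranked` are illustrative only.

```python
# Hand-copied subset of the first example's output (not a live library call).
tfidf = {
    'this': 0.6931471805599453,
    'is': 0.0,
    'an': 0.6931471805599453,
    'apple': 0.6931471805599453,
}

# Drop terms whose weight is zero, i.e. terms that do not help tell documents apart.
informative = {term: score for term, score in tfidf.items() if score > 0}

# Rank by score (highest first), breaking ties alphabetically.
ranked = sorted(informative.items(), key=lambda item: (-item[1], item[0]))
print([term for term, _ in ranked])  # ['an', 'apple', 'this']
```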