calculate_idf
- pyhelpers.text.calculate_idf(documents, lowercase=True, ignore_punctuation=True, stop_words=None, smoothing_factor=1, log_base=None, **kwargs)
Calculate Inverse Document Frequency (IDF) for a sequence of textual documents.
- Parameters:
documents (Iterable | Sequence) – A sequence of textual data.
lowercase (bool) – Whether to convert the documents to lowercase before calculating IDF; defaults to True.
ignore_punctuation (bool) – Whether to exclude punctuation marks from the textual data; defaults to True.
stop_words (list[str] | bool | None) – List of words to be excluded from the IDF calculation. If stop_words=None (default), no words are excluded. If stop_words=True, NLTK's built-in stopwords are used.
smoothing_factor (int | float) – Factor added to the denominator in the IDF formula. For smaller corpora, use smoothing_factor=1 (default) to prevent IDF values from becoming too extreme (e.g. zero for terms that appear in every document); this adjustment yields more stable IDF values that still reflect term rarity. For larger corpora, use smoothing_factor=0 for the standard IDF calculation, which measures how terms are distributed across documents and is generally sufficient.
log_base (int) – The logarithm base in the IDF formula; when log_base=None (default), math.e is used.
kwargs – [Optional] Additional parameters for the function nltk.word_tokenize(); see also the function count_words(), which calculates term frequencies (TF) for each document.
- Returns:
Tuple containing:
Term frequency (TF) of each document as a list of dictionaries, where each dictionary represents TF for one document.
Inverse document frequency (IDF) as a dictionary, where keys are unique terms across all documents and values are their IDF scores.
- Return type:
tuple[list[dict], dict]
Examples:
>>> from pyhelpers.text import calculate_idf
>>> documents = [
...     'This is an apple.',
...     'That is a pear.',
...     'It is human being.',
...     'Hello world!']
>>> docs_tf, corpus_idf = calculate_idf(documents)
>>> docs_tf
[{'this': 1, 'is': 1, 'an': 1, 'apple': 1},
 {'that': 1, 'is': 1, 'a': 1, 'pear': 1},
 {'it': 1, 'is': 1, 'human': 1, 'being': 1},
 {'hello': 1, 'world': 1}]
>>> corpus_idf
{'this': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'that': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'it': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'hello': 0.6931471805599453,
 'world': 0.6931471805599453}
>>> docs_tf, corpus_idf = calculate_idf(documents, ignore_punctuation=False)
>>> docs_tf
[{'this': 1, 'is': 1, 'an': 1, 'apple': 1, '.': 1},
 {'that': 1, 'is': 1, 'a': 1, 'pear': 1, '.': 1},
 {'it': 1, 'is': 1, 'human': 1, 'being': 1, '.': 1},
 {'hello': 1, 'world': 1, '!': 1}]
>>> corpus_idf
{'this': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 '.': 0.0,
 'that': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'it': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'hello': 0.6931471805599453,
 'world': 0.6931471805599453,
 '!': 0.6931471805599453}
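The scores above are consistent with an IDF formula of the form idf(t) = log(N / (df(t) + smoothing_factor)), where N is the number of documents and df(t) is the number of documents containing term t. The following is a minimal sketch of that calculation, assuming this formula; the function name naive_idf and its whitespace tokenisation are illustrative only and are not part of pyhelpers, which delegates tokenisation to nltk.word_tokenize():

```python
import math
import string
from collections import Counter


def naive_idf(documents, smoothing_factor=1, log_base=None):
    """Toy IDF sketch: idf(t) = log(N / (df(t) + smoothing_factor))."""
    # Naive tokenisation: lowercase, split on whitespace, strip punctuation.
    tokenised = [
        [w.strip(string.punctuation) for w in doc.lower().split()]
        for doc in documents
    ]
    # Term frequency per document, dropping empty strings left by stripping.
    docs_tf = [dict(Counter(w for w in doc if w)) for doc in tokenised]

    n_docs = len(documents)
    log = math.log if log_base is None else (lambda x: math.log(x, log_base))
    corpus_idf = {}
    for tf in docs_tf:
        for term in tf:
            if term not in corpus_idf:
                # Document frequency: how many documents contain this term.
                df = sum(1 for d in docs_tf if term in d)
                corpus_idf[term] = log(n_docs / (df + smoothing_factor))
    return docs_tf, corpus_idf


documents = [
    'This is an apple.', 'That is a pear.', 'It is human being.', 'Hello world!']
docs_tf, corpus_idf = naive_idf(documents)
print(corpus_idf['is'])     # in 3 of 4 docs: ln(4 / (3 + 1)) = 0.0
print(corpus_idf['apple'])  # in 1 of 4 docs: ln(4 / (1 + 1)) = 0.6931471805599453
```

With the default smoothing_factor=1, a term present in every document gets ln(4/5) rather than exactly zero with a 5-document corpus; here 'is' appears in 3 of 4 documents, so its smoothed IDF happens to be ln(1) = 0.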