calculate_idf
- pyhelpers.text.calculate_idf(documents, lowercase=True, ignore_punctuation=True, stop_words=None, smoothing_factor=1, log_base=None, **kwargs)
Calculate Inverse Document Frequency (IDF) for a sequence of textual documents.
- Parameters:
documents (Iterable | Sequence) – A sequence of textual data.
lowercase (bool) – Whether to convert the documents to lowercase before calculating IDF; defaults to True.
ignore_punctuation (bool) – Whether to exclude punctuation marks from the textual data; defaults to True.
stop_words (list[str] | bool | None) – List of words to be excluded from the IDF calculation. If stop_words=None (default), no words are excluded. If stop_words=True, NLTK's built-in stopwords are used.
smoothing_factor (int | float) – Factor added to the denominator in the IDF formula. For smaller corpora, use smoothing_factor=1 (default) to prevent IDF values from becoming too extreme (e.g. zero for terms that appear in every document); this adjustment yields more stable IDF values that still reflect term rarity. For larger corpora, use smoothing_factor=0 for the standard IDF calculation, which measures how terms are distributed across documents and is generally sufficient.
log_base (int) – The logarithm base in the IDF formula; when log_base=None (default), math.e is used.
kwargs – [Optional] Additional parameters for the function nltk.word_tokenize(); see also the function count_words(), which calculates term frequencies (TF) for each document.
- Returns:
Tuple containing:
Term frequency (TF) of each document as a list of dictionaries, where each dictionary represents TF for one document.
Inverse document frequency (IDF) as a dictionary, where keys are unique terms across all documents and values are their IDF scores.
- Return type:
tuple[list[dict], dict]
Examples:
>>> from pyhelpers.text import calculate_idf
>>> documents = [
...     'This is an apple.',
...     'That is a pear.',
...     'It is human being.',
...     'Hello world!']
>>> docs_tf, corpus_idf = calculate_idf(documents)
>>> docs_tf
[{'this': 1, 'is': 1, 'an': 1, 'apple': 1},
 {'that': 1, 'is': 1, 'a': 1, 'pear': 1},
 {'it': 1, 'is': 1, 'human': 1, 'being': 1},
 {'hello': 1, 'world': 1}]
>>> corpus_idf
{'this': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'that': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'it': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'hello': 0.6931471805599453,
 'world': 0.6931471805599453}
>>> docs_tf, corpus_idf = calculate_idf(documents, ignore_punctuation=False)
>>> docs_tf
[{'this': 1, 'is': 1, 'an': 1, 'apple': 1, '.': 1},
 {'that': 1, 'is': 1, 'a': 1, 'pear': 1, '.': 1},
 {'it': 1, 'is': 1, 'human': 1, 'being': 1, '.': 1},
 {'hello': 1, 'world': 1, '!': 1}]
>>> corpus_idf
{'this': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 '.': 0.0,
 'that': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'it': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'hello': 0.6931471805599453,
 'world': 0.6931471805599453,
 '!': 0.6931471805599453}
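The scores above are consistent with an IDF formula of the form idf(t) = log(N / (df(t) + smoothing_factor)), where N is the number of documents and df(t) is the number of documents containing term t. The following is a minimal sketch of that calculation, assuming this formula; the function name naive_idf and its whitespace tokenisation are illustrative only and are not part of pyhelpers, which delegates tokenisation to nltk.word_tokenize():

```python
import math
import string
from collections import Counter


def naive_idf(documents, smoothing_factor=1, log_base=None):
    """Toy IDF sketch: idf(t) = log(N / (df(t) + smoothing_factor))."""
    # Naive tokenisation: lowercase, split on whitespace, strip punctuation.
    tokenised = [
        [w.strip(string.punctuation) for w in doc.lower().split()]
        for doc in documents
    ]
    # Term frequency per document, dropping empty strings left by stripping.
    docs_tf = [dict(Counter(w for w in doc if w)) for doc in tokenised]

    n_docs = len(documents)
    log = math.log if log_base is None else (lambda x: math.log(x, log_base))
    corpus_idf = {}
    for tf in docs_tf:
        for term in tf:
            if term not in corpus_idf:
                # Document frequency: how many documents contain this term.
                df = sum(1 for d in docs_tf if term in d)
                corpus_idf[term] = log(n_docs / (df + smoothing_factor))
    return docs_tf, corpus_idf


documents = [
    'This is an apple.', 'That is a pear.', 'It is human being.', 'Hello world!']
docs_tf, corpus_idf = naive_idf(documents)
print(corpus_idf['is'])     # in 3 of 4 docs: ln(4 / (3 + 1)) = 0.0
print(corpus_idf['apple'])  # in 1 of 4 docs: ln(4 / (1 + 1)) = 0.6931471805599453
```

With the default smoothing_factor=1, a term present in every document gets ln(4/5) rather than exactly zero with a 5-document corpus; here 'is' appears in 3 of 4 documents, so its smoothed IDF happens to be ln(1) = 0.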