calculate_idf
- pyhelpers.text.calculate_idf(raw_documents, rm_punc=False)
Calculate the term frequency (TF) of each document and the inverse document frequency (IDF) of each term.
- Parameters:
raw_documents (Iterable | Sequence) – a sequence of textual data
rm_punc (bool) – whether to remove punctuation from the input textual data; defaults to False
- Returns:
term frequency (TF) of each input document, and the inverse document frequency (IDF) of each term across all documents
- Return type:
tuple[list[dict], dict]
Examples:
>>> from pyhelpers.text import calculate_idf
>>> raw_doc = [
...     'This is an apple.',
...     'That is a pear.',
...     'It is human being.',
...     'Hello world!']
>>> docs_tf_, corpus_idf_ = calculate_idf(raw_doc, rm_punc=False)
>>> docs_tf_
[{'This': 1, 'is': 1, 'an': 1, 'apple': 1, '.': 1},
 {'That': 1, 'is': 1, 'a': 1, 'pear': 1, '.': 1},
 {'It': 1, 'is': 1, 'human': 1, 'being': 1, '.': 1},
 {'Hello': 1, 'world': 1, '!': 1}]
>>> corpus_idf_
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 '.': 0.0,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453,
 '!': 0.6931471805599453}
>>> docs_tf_, corpus_idf_ = calculate_idf(raw_doc, rm_punc=True)
>>> docs_tf_
[{'This': 1, 'is': 1, 'an': 1, 'apple': 1},
 {'That': 1, 'is': 1, 'a': 1, 'pear': 1},
 {'It': 1, 'is': 1, 'human': 1, 'being': 1},
 {'Hello': 1, 'world': 1}]
>>> corpus_idf_
{'This': 0.6931471805599453,
 'is': 0.0,
 'an': 0.6931471805599453,
 'apple': 0.6931471805599453,
 'That': 0.6931471805599453,
 'a': 0.6931471805599453,
 'pear': 0.6931471805599453,
 'It': 0.6931471805599453,
 'human': 0.6931471805599453,
 'being': 0.6931471805599453,
 'Hello': 0.6931471805599453,
 'world': 0.6931471805599453}
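
The IDF values in the example above are consistent with the smoothed formula ln(N / (1 + df)), where N is the number of documents (4) and df is the number of documents containing the term: a term found in one document gets ln(4 / 2) ≈ 0.693, and 'is', found in three, gets ln(4 / 4) = 0. This is inferred from the output only and is not confirmed against the pyhelpers source. The sketch below reproduces those numbers under that assumption; `sketch_idf` is a hypothetical name, and the documents are tokenized by hand because the library's tokenizer is not shown here.

```python
import math
from collections import Counter


def sketch_idf(tokenized_docs):
    """Return (per-document term counts, corpus IDF) for pre-tokenized documents.

    Hypothetical helper, not part of pyhelpers.
    """
    n_docs = len(tokenized_docs)
    # Term frequency: raw count of each token within its own document
    docs_tf = [dict(Counter(tokens)) for tokens in tokenized_docs]
    # Document frequency: number of documents in which each term occurs
    doc_freq = Counter(term for tf in docs_tf for term in tf)
    # Assumed formula, inferred from the example output: ln(N / (1 + df))
    corpus_idf = {
        term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()
    }
    return docs_tf, corpus_idf


# Hand-tokenized version of raw_doc with punctuation kept (rm_punc=False)
docs = [
    ['This', 'is', 'an', 'apple', '.'],
    ['That', 'is', 'a', 'pear', '.'],
    ['It', 'is', 'human', 'being', '.'],
    ['Hello', 'world', '!'],
]
docs_tf, corpus_idf = sketch_idf(docs)
print(corpus_idf['This'], corpus_idf['is'])
```

With this smoothing, a term that occurred in every document would get a negative IDF, which may be worth keeping in mind when the values are used as weights.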