text

Manipulation of textual data.

Textual data (pre)processing

`clean_html_text`(input_text)	Cleans and normalizes text extracted from HTML content.
`remove_punctuation`(input_text[, ...])	Removes punctuation from textual data.
`get_acronym`(input_text[, only_capitals, ...])	Generates an acronym (in capital letters) from textual data.
`split_on_uppercase`(input_text[, join_with, ...])	Extracts words from a string by splitting it at occurrences of uppercase letters.
`numeral_english_to_arabic`(input_text)	Converts a number written in English words into its value in Arabic numerals.
`count_words`(input_text[, lowercase, ...])	Counts the occurrences of each word in the given text.
`calculate_idf`(documents[, lowercase, ...])	Calculates Inverse Document Frequency (IDF) for a sequence of textual documents.
`calculate_tfidf`(documents, **kwargs)	Calculates TF-IDF (Term Frequency-Inverse Document Frequency) for the given textual documents.
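
The summaries above do not say how words are tokenized or which IDF variant is used, so the following is only a minimal, standard-library sketch of the general idea behind word counting, IDF, and TF-IDF. It does not call this module: the names `sketch_count_words`, `sketch_idf`, and `sketch_tfidf` are hypothetical, and the formulas (`idf = log(N / df)`, `tf = count / document length`) and the `\w+` tokenizer are assumptions that may differ from what `count_words`, `calculate_idf`, and `calculate_tfidf` actually do.

```python
import math
import re
from collections import Counter


def sketch_count_words(text, lowercase=True):
    """Count word occurrences; a word is assumed to be a run of \\w characters."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"\w+", text))


def sketch_idf(documents):
    """Return {word: log(N / df)}, where df is the number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Each word is counted at most once per document.
        doc_freq.update(set(sketch_count_words(doc)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def sketch_tfidf(documents):
    """Return one {word: tf * idf} dict per document (tf = count / document length)."""
    idf = sketch_idf(documents)
    result = []
    for doc in documents:
        counts = sketch_count_words(doc)
        total = sum(counts.values())
        result.append({w: (c / total) * idf[w] for w, c in counts.items()})
    return result


docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = sketch_tfidf(docs)
```

Under these assumptions, a word that occurs in every document (here "the") gets an IDF of zero, so it contributes nothing to any document's TF-IDF score, while a word confined to one document (such as "dog") scores highest.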

`euclidean_distance_between_texts`(txt1, txt2)	Computes the Euclidean distance between two sentences.
`cosine_similarity_between_texts`(txt1, txt2)	Calculates the cosine similarity between two sentences.
`find_matched_str`(input_str, lookup_list[, ...])	Finds all strings (in a sequence) that match a given string or regex pattern.
`find_similar_str`(input_str, lookup_list[, ...])	Finds `n` strings that are similar to `input_str` from a sequence of candidates.
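
The listing does not state how texts are turned into vectors or how string similarity is scored, so the sketch below only illustrates the underlying ideas with the standard library. It assumes a plain bag-of-words count vector per text and uses `difflib.get_close_matches` for fuzzy lookup. The function names are hypothetical and are not this module's API; the real `euclidean_distance_between_texts`, `cosine_similarity_between_texts`, and `find_similar_str` may behave differently.

```python
import difflib
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word to its count (assumed vectorization)."""
    return Counter(re.findall(r"\w+", text.lower()))


def sketch_euclidean_distance(txt1, txt2):
    """Euclidean distance between the two word-count vectors."""
    a, b = bag_of_words(txt1), bag_of_words(txt2)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))


def sketch_cosine_similarity(txt1, txt2):
    """Cosine of the angle between the two word-count vectors (0.0 if either is empty)."""
    a, b = bag_of_words(txt1), bag_of_words(txt2)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def sketch_find_similar(input_str, lookup_list, n=3):
    """Return up to n candidates ranked by difflib's similarity ratio."""
    return difflib.get_close_matches(input_str, lookup_list, n=n, cutoff=0.6)


distance = sketch_euclidean_distance("the cat sat", "the dog sat")
similarity = sketch_cosine_similarity("the cat sat", "the dog sat")
matches = sketch_find_similar("appel", ["ape", "apple", "peach", "puppy"])
```

With count vectors, identical texts have distance 0 and cosine similarity 1, while texts that share no words have cosine similarity 0. Cosine similarity ignores text length, whereas Euclidean distance grows with it, which is one reason the two measures can rank the same pair of texts differently.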