Questions and Answers

Summary : Using Python, Gensim, and NLTK to find articles on COVID-19

Posted by : Wilson Fok on May 3, 2020

Category : NLP

This post is a follow-up to my earlier post on topic modelling, analyzing scientific publications to gain knowledge about COVID-19.

Here, I want to delve a bit deeper into how relatively simple, classic natural language processing techniques can help find semantically similar documents on some subtopics of COVID-19.

Specifically, let’s try to find documents pertaining to:

Effectiveness of personal protective equipment and its usefulness to reduce risk of transmission in health care and community settings

This topic is taken from a list of topics in the COVID-19-research-challenge.

Prepare document Corpus

The preprocessing step for the texts is largely the same as before. However, I have added more tokens to the dictionary by using unigrams, bigrams, and trigrams together. This combination helps take phrases such as “crystal structure” into account.

    import pickle
    from collections import defaultdict

    import gensim

    data_raw = load_data() # raw data from Kaggle
    with open("data_list.p", "rb") as f:
        data_list = pickle.load(f) # with preprocessing done already

    # Build the bigram models
    bigram = gensim.models.phrases.Phrases(data_list['title'], min_count=20, threshold=10)
    # Build the trigram models
    trigram = gensim.models.phrases.Phrases(bigram[data_list['title']], threshold=10)
    
    breakdown = trigram[data_list['title']]
    
    # Count n-gram frequencies
    frequency = defaultdict(int)
    for text in breakdown:
        for token in text:
            frequency[token] += 1
    
    processed_corpus = [[token for token in text if frequency[token] > 60] for text in data_list['abstract']]

Some token examples:

unigram_examples[0:10]
Out[11]: 
['sequence',
 'rna',
 'transfer',
 'synthesis',
 'murine',
 'coronavirus',
 'receptor',
 'family',
 'novel',
 'protein']

bigram_examples[0:10]
Out[12]: 
['crystal_structure',
 'hepatitis_virus',
 'rt_pcr',
 'severe_acute',
 'respiratory_syndrome',
 'sars_cov',
 'rna_synthesis',
 'vaccinia_virus',
 'sars_coronavirus',
 'intensive_care']

trigram_examples[0:10]
Out[16]: 
['ucc_uuu_cgu',
 '002_05_2015',
 'x_xxy_yyz',
 'x_xxz_zzn',
 'run_tiled_primers',
 'g_guu_uuu',
 'dme1_chrx_2630566',
 'ssc_circ_009380',
 'vb_bbrs_phb09',
 'vb_bbrm_phb04']
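
To make the phrase-detection step more concrete, here is a minimal pure-Python sketch of collocation scoring. It is loosely modelled on the default scoring that Gensim's Phrases is documented to use, (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)), but it is not Gensim's implementation. The helper names `find_phrases` and `merge_phrases` and the toy titles are made up for illustration.

```python
from collections import Counter

def find_phrases(sentences, min_count=1, threshold=1.0):
    """Return adjacent token pairs whose collocation score exceeds threshold."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs that co-occur more often than their parts suggest score higher
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases

def merge_phrases(sentence, phrases):
    """Join detected pairs with an underscore, scanning left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

# Hypothetical, already-preprocessed titles
toy_titles = [
    ["crystal", "structure", "protein"],
    ["crystal", "structure", "rna"],
    ["rna", "synthesis", "protein"],
]
phrases = find_phrases(toy_titles, min_count=1, threshold=1.0)
merged = [merge_phrases(title, phrases) for title in toy_titles]
print(merged[0])
```

Running the same two functions a second time on `merged` is how a bigram pass followed by another pass can produce trigrams such as a_b_c.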

Query

I create a dictionary, convert the documents to bag-of-words vectors, build a term frequency-inverse document frequency (TF-IDF) model, and index the corpus so that documents can be compared by cosine similarity.

    # Gensim
    from gensim import corpora, models, similarities

    dictionary = corpora.Dictionary(processed_corpus)
    print(dictionary)
    
    
    bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
    # train the model
    tfidf = models.TfidfModel(bow_corpus)
    corpus_tfidf = tfidf[bow_corpus]
    index = similarities.MatrixSimilarity(corpus_tfidf)
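
For readers who want to see what the TF-IDF weighting and cosine similarity amount to, here is a small standard-library sketch. It assumes raw term counts, an inverse document frequency of log2(N / df), and unit-length vectors, which is my understanding of the TfidfModel defaults. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Inverse document frequency, log2(N / df), for every token in the corpus."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log2(n_docs / n) for tok, n in df.items()}

def to_tfidf(tokens, idf):
    """Sparse, L2-normalised TF-IDF vector; tokens unknown to the corpus are dropped."""
    counts = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm > 0 else {}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

# Hypothetical toy corpus and query
docs = [
    ["mask", "protect", "worker"],
    ["mask", "transmission", "risk"],
    ["protein", "structure", "rna"],
]
idf = fit_idf(docs)
doc_vecs = [to_tfidf(doc, idf) for doc in docs]
query_vec = to_tfidf(["mask", "worker"], idf)
sims = [cosine(query_vec, vec) for vec in doc_vecs]
print(sims)
```

The first document shares two tokens with the query, the second shares only the common token "mask", and the third shares none, so the similarities decrease in that order.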

Once the similarity index covers the entire corpus, I can compare the query against every document in it.

    def parse(string, length=25):
        # Split a line of the form "token v1 v2 ..." into the token and its vector
        if isinstance(string, str):
            token, *vec = string.split(' ')
            vec = list(map(float, vec))  # materialise the lazy map object
            return token, vec
        else:
            print(string)
            return None

    def flatten_list(x):
        return [item for sublist in x for item in sublist]

    def preprocess_query(words, stopwords, wnl, dictionary):
        # Tokenise, lowercase, drop stop words, lemmatise, then convert to bag-of-words
        stop_set = set(stopwords.words('english'))  # build once, not once per token
        words = words.split(' ')
        words = [token.lower() for token in words]
        words = [token for token in words if token not in stop_set]

        words = [wnl.lemmatize(token) for token in words]
        query_vec = dictionary.doc2bow(words)
        return query_vec

    def search(index, tfidf, word, stopwords, wnl, dictionary, data_raw, total_doc_samples):

        query_vec = preprocess_query(word, stopwords, wnl, dictionary)

        # cosine similarity between the query and every document in the index
        sims = index[tfidf[query_vec]]

        # keep track of how similar each document is to the query
        rank = {i: value for i, value in enumerate(sims)}

        # ascending order, so the most similar documents sit at the end
        sorted_rank = sorted(rank.items(), key=lambda x: x[1])

        results = []
        for i in range(1, total_doc_samples + 1):
            string = data_raw['title'][sorted_rank[-i][0]]
            results.append((sorted_rank[-i], string))

        return results
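
As an aside, when only the top few documents are needed, a full sort is not required. The sketch below is an alternative to the sorting step, not the code used in this post: it uses `heapq.nlargest` from the standard library and returns results in the same ((doc_id, score), title) shape. The function name `top_k` and the sample data are hypothetical.

```python
import heapq

def top_k(sims, titles, k):
    """Return the k most similar documents as ((doc_id, score), title) tuples."""
    k = min(k, len(sims))  # guard against asking for more documents than exist
    best = heapq.nlargest(k, enumerate(sims), key=lambda pair: pair[1])
    return [((doc_id, score), titles[doc_id]) for doc_id, score in best]

# Hypothetical similarity scores and titles
sims = [0.1, 0.9, 0.4]
titles = ["a", "b", "c"]
results = top_k(sims, titles, k=2)
print(results)
```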

Search results

I rank all the documents in order of decreasing similarity and keep only the 1,000 most similar ones.

    from nltk.corpus import stopwords
    import nltk
    import numpy as np
    wnl = nltk.WordNetLemmatizer()
    
    words = 'Effectiveness of personal protective equipment and its usefulness to reduce risk of transmission in health care and community settings'
    
    total_doc_samples = 1000
    results = search(index, tfidf, words, stopwords, wnl, dictionary, data_raw, total_doc_samples)

    x = np.zeros((len(results), len(dictionary)), dtype=np.float16)
    sim_x =  np.zeros((len(results),), dtype=np.float16)
    sim_doc_titles = []
    
    for i, result in enumerate(results):
        doc_id = result[0][0]  # renamed so it does not shadow the similarity index
        doc = corpus_tfidf[doc_id]
        col = [item[0] for item in doc]
        v =  [item[1] for item in doc]
        x[i, col] = v
        sim_x[i] = result[0][1]
        sim_doc_titles.append(result[1])
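
The loop above scatters each sparse TF-IDF vector, a list of (term_id, weight) pairs, into one row of a dense matrix. A tiny sketch of the same idea with plain lists, using a hypothetical `densify` helper and made-up weights:

```python
def densify(sparse_docs, n_terms):
    """Turn lists of (term_id, weight) pairs into dense rows of length n_terms."""
    dense = [[0.0] * n_terms for _ in sparse_docs]
    for row, doc in zip(dense, sparse_docs):
        for term_id, weight in doc:
            row[term_id] = weight
    return dense

# Two hypothetical documents over a four-term dictionary
sparse_docs = [[(0, 0.6), (3, 0.8)], [(1, 1.0)]]
dense = densify(sparse_docs, n_terms=4)
print(dense)
```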
		

Since a dry list of documents may be boring to look at, I cluster the search results and arrange them in a tree. scikit-learn provides a very handy example script, which I adapt for this subtask.

    from sklearn.cluster import AgglomerativeClustering
    
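    # Note: scikit-learn 1.2 renamed `affinity` to `metric`;
    # on recent versions, pass metric='cosine' instead.
    model = AgglomerativeClustering(affinity='cosine',
                                    distance_threshold=0, n_clusters=None,
                                    linkage='complete')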
    model = model.fit(x)
    
    
    import os
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram
    
    def plot_dendrogram(model, **kwargs):
        # Create linkage matrix and then plot the dendrogram
    
        # create the counts of samples under each node
        counts = np.zeros(model.children_.shape[0])
        n_samples = len(model.labels_)
        for i, merge in enumerate(model.children_):
            current_count = 0
            for child_idx in merge:
                if child_idx < n_samples:
                    current_count += 1  # leaf node
                else:
                    current_count += counts[child_idx - n_samples]
            counts[i] = current_count
    
        linkage_matrix = np.column_stack([model.children_, model.distances_,
                                          counts]).astype(float)
    
        # Plot the corresponding dendrogram
        dendrogram(linkage_matrix, **kwargs)


    
    plt.title('Hierarchical Clustering Dendrogram')
    # plot only the top levels of the dendrogram (truncated at p=4)
    plot_dendrogram(model, truncate_mode='level', p=4)
    plt.xlabel("Number of points in node (or index of point if no parenthesis).")
    plt.ylabel("distances")
    plt.savefig(os.path.join(output_dir, 'dendrogram_similar_docs.png'), transparent=True , dpi=400, bbox_inches='tight')
    plt.show()
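
The counting loop inside plot_dendrogram can be checked by hand on a tiny example. In the sketch below, `children` plays the role of `model.children_`: row i records the two nodes merged at step i, ids below `n_samples` are leaves, and id `n_samples + i` refers to the cluster created at step i. The function name and the merge list are hypothetical.

```python
def subtree_sizes(children, n_samples):
    """Number of original samples under each merged node."""
    sizes = []
    for left, right in children:
        size = 0
        for child in (left, right):
            if child < n_samples:
                size += 1  # leaf node
            else:
                size += sizes[child - n_samples]  # previously merged cluster
        sizes.append(size)
    return sizes

# Four samples: (0, 1) merge into node 4, (2, 3) into node 5, then 4 and 5 merge
children = [(0, 1), (2, 3), (4, 5)]
sizes = subtree_sizes(children, n_samples=4)
print(sizes)
```

These sizes become the fourth column of the linkage matrix that scipy's dendrogram expects.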

[Figure: hierarchical clustering dendrogram of the most similar documents]

The similarity of documents decays gradually, and not many documents that are semantically similar to the query have been found. I group highly similar documents in red, moderately similar documents in blue, and not-quite-similar documents in green.

I personally would say that only those in red are reasonably relevant and therefore worth reading in full. Here is a list of some of the titles in each color group.

Effectiveness of handwashing in preventing SARS: a review
Response and role of palliative care during the COVID-19 pandemic: a national telephone survey of hospices in Italy
Disinfection efficiency of positive pressure respiratory protective hood using fumigation sterilization cabinet
Rapid De-Escalation and Triaging Patients in Community-Based Palliative Care
Face shields for infection control: A review
Helmet Modification to PPE with 3D Printing During the COVID-19 Pandemic at Duke University Medical Center: A Novel Technique
Personal Protective Equipment
Brief guideline for the prevention of COVID-19 infection in head and neck and otolaryngology surgeons
Taking the right measures to control COVID-19
Outbreaks in Health Care Settings
The Role of Managerial Epidemiology in Infection Prevention and Control
Elective surgery in the time of COVID-19
Evaluation of the Person Under Investigation
Appraisal of recommended respiratory infection control practices in primary care and emergency department settings
Methicillin-resistant Staphylococcus aureus, Clostridium difficile, and extended-spectrum β-lactamase–producing Escherichia coli in the community: Assessing the problem and controlling the spread
Variation in health care worker removal of personal protective equipment
How Should U.S. Hospitals Prepare for Coronavirus Disease 2019 (COVID-19)?
Head and neck oncology during the COVID-19 pandemic: Reconsidering traditional treatment paradigms in light of new surgical and other multilevel risks
Modeling layered non-pharmaceutical interventions against SARS-CoV-2 in the United States with Corvid
Protecting health care workers from SARS and other respiratory pathogens: A review of the infection control literature

Covid-19: What’s the current advice for UK doctors?
Cost-effectiveness analysis of N95 respirators and medical masks to protect healthcare workers in China from respiratory infections
Assessment of Temporary Community-Based Health Care Facilities During Arbaeenia Mass Gathering at Karbala, Iraq: Cross-Sectional Survey Study
23 Response to SARS as a prototype for bioterrorism Lessons in a Regional Hospital in Hong Kong
Recommendations and guidance for providing pharmaceutical care services during COVID-19 pandemic: A China perspective
Could influenza transmission be reduced by restricting mass gatherings? Towards an evidence-based policy framework
Equipment for Exotic Mammal and Reptile Diagnostics and Surgery
Infrastructure and Organization of Adult Intensive Care Units in Resource-Limited Settings
Risk factors for febrile respiratory illness and mono-viral infections in a semi-closed military environment: a case-control study
Contamination during doffing of personal protective equipment by healthcare providers
Facing the threat of influenza pandemic - roles of and implications to general practitioners
Supplies and equipment for pediatric emergency mass critical care
SARS Transmission among Hospital Workers in Hong Kong
The Demand for Health Care
Isolation Facilities for Highly Infectious Diseases in Europe – A Cross-Sectional Analysis in 16 Countries
A systematic risk-based strategy to select personal protective equipment for infectious diseases
4 Steps in the selection of protective clothing materials
Helen Salisbury: Is general practice prepared for a pandemic?
Performance of materials used for biological personal protective equipment against blood splash penetration
Role of viral bioaerosols in nosocomial infections and measures for prevention and control

Guideline for Antibiotic Use in Adults with Community-acquired Pneumonia
Hydroxychloroquine (HCQ): an observational cohort study in primary and secondary prevention of pneumonia in an at-risk population
On the 2-Row Rule for Infectious Disease Transmission on Aircraft
The outbreak of COVID-19: An overview
Hendra virus in Queensland, Australia, during the winter of 2011: Veterinarians on the path to better management strategies
Visibility and transmission: complexities around promoting hand hygiene in young children – a qualitative study
Mapping road network communities for guiding disease surveillance and control strategies
A COVID-19 Risk Assessment for the US Labor Force
Gut microbiome and the risk factors in central nervous system autoimmunity
Public health and medical care for the world's factory: China's Pearl River Delta Region
Global health goals: lessons from the worldwide effort to eradicate poliomyelitis
Population response to the risk of vector-borne diseases: lessons learned from socio-behavioural research during large-scale outbreaks
Reduced Risk of Importing Ebola Virus Disease because of Travel Restrictions in 2014: A Retrospective Epidemiological Modeling Study
Investigation of three clusters of COVID-19 in Singapore: implications for surveillance and response measures
The Perceived Threat of SARS and its Impact on Precautionary Actions and Adverse Consequences: A Qualitative Study Among Chinese Communities in the United Kingdom and the Netherlands
Association of COVID-19 Infections in San Francisco in Early March 2020 with Travel to New York and Europe
Harnessing the privatisation of China's fragmented health-care delivery
Toward a consensus view in the management of acute facial injuries during the Covid-19 pandemic
Selected nonvaccine interventions to prevent infectious acute respiratory disease
Prevalence of Psychiatric Disorders Among Toronto Hospital Workers One to Two Years After the SARS Outbreak
