Questions and Answers

Posted by : on

Category : NLP

This post follows up from my earlier post on topic modelling, analyzing scientific publications to gain knowledge on COVID-19.

Here, I want to delve into a bit deeper on how to use only relatively simple and classic natural language processing techniques to help find semantically similar documents on some subtopics of COVID-19.

Specifically, let’s try to find documents pertaining to:

Effectiveness of personal protective equipment and its usefulness to reduce risk of transmission in health care and community settings

This topic is taken out of a list of topics from COVID-19-research-challenge.

Prepare document Corpus

The preprocessing step for the texts is largely the same as before. However, I have added more tokens in the dictionary by using unigram, bigram and trigram together. This combination is helpful for taking phrases such as “crystal structure” into account.

    data_raw = load_data() # raw data from Kaggle
    data_list = pickle.load( open( "data_list.p", "rb" ) ) # with preprocessing done already
	

    # Build the bigram models
    bigram = gensim.models.phrases.Phrases(data_list['title'], min_count=20, threshold=10)
    # Build the trigram models
    trigram = gensim.models.phrases.Phrases(bigram[data_list['title']], threshold=10)
    
    breakdown = trigram[data_list['title']]
    
    # Count n-gram frequencies
    frequency = defaultdict(int)
    for text in breakdown:
        for token in text:
            frequency[token] += 1
    
    processed_corpus = [[token for token in text if frequency[token] > 60] for text in data_list['abstract']]

Some token examples:

unigram_examples[0:10]
Out[11]: 
['sequence',
 'rna',
 'transfer',
 'synthesis',
 'murine',
 'coronavirus',
 'receptor',
 'family',
 'novel',
 'protein']
 
 bigram_examples[0:10]
Out[12]: 
['crystal_structure',
 'hepatitis_virus',
 'rt_pcr',
 'severe_acute',
 'respiratory_syndrome',
 'sars_cov',
 'rna_synthesis',
 'vaccinia_virus',
 'sars_coronavirus',
 'intensive_care']
 
 trigram_examples[0:10]
Out[16]: 
['ucc_uuu_cgu',
 '002_05_2015',
 'x_xxy_yyz',
 'x_xxz_zzn',
 'run_tiled_primers',
 'g_guu_uuu',
 'dme1_chrx_2630566',
 'ssc_circ_009380',
 'vb_bbrs_phb09',
 'vb_bbrm_phb04']

Query

I create a dictionary, convert the documents to a bag-of-word vectors, construct a term-frequency-inverse-document-frequency matrix, and calculate the similarity of the corpus based on cosine distance.


############ Gensim

    dictionary = corpora.Dictionary(processed_corpus)
    print(dictionary)
    
    
    bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
    # train the model
    tfidf = models.TfidfModel(bow_corpus)
    corpus_tfidf = tfidf[bow_corpus]
    index = similarities.MatrixSimilarity(corpus_tfidf)

Once I have the similarity matrix of the entire corpus, I can start to compare the query against each one of them.


def parse(string, length=25):
    
    
    if isinstance(string, str):
        token, *vec = string.split(' ')
        vec = map(float, vec)
        return token, vec
    else:
        print (string)
        return None
    
    
def flatten_list(x):
    return [ item for sublist in x for item in sublist]


def preprocess_query(words, stopwords, wnl, dictionary):
    
    words = words.split(' ')
    words = [token.lower() for token in words]
    words = [token for token in words if token not in stopwords.words('english')]
       
    words = [wnl.lemmatize(token) for token in words]
    query_vec = dictionary.doc2bow(words)
    return query_vec


def search(index, tfidf, word, stopwords, wnl, dictionary, data_raw, total_doc_samples):
    
    query_vec = preprocess_query(word, stopwords, wnl, dictionary)
    
    sims = index[tfidf[query_vec]]

    
    # keep track of how similar each document is to the query
    rank = {i:value for i, value in enumerate(sims)}

    sorted_rank = sorted(rank.items(), key=lambda x:x[1])
    

    
    results = []
    for i in range(1,total_doc_samples+1):
        string = data_raw['title'][sorted_rank[-i][0]]
        results.append((sorted_rank[-i], string))
    
    return results
	

Search results

I rank all the documents in the order of decreasing similarity and pick only the top 1000 most similar documents.


    from nltk.corpus import stopwords
    import nltk
    wnl = nltk.WordNetLemmatizer()
    
    words = 'Effectiveness of personal protective equipment and its usefulness to reduce risk of transmission in health care and community settings'
    
    total_doc_samples = 1000
    results = search(index, tfidf, words, stopwords, wnl, dictionary, data_raw, total_doc_samples)

    x = np.zeros((len(results), len(dictionary)), dtype=np.float16)
    sim_x =  np.zeros((len(results),), dtype=np.float16)
    sim_doc_titles = []
    
    for i, result in enumerate(results):
        index = result[0][0]
        doc = corpus_tfidf[index]
        col = [item[0] for item in doc]
        v =  [item[1] for item in doc]
        x[i, col] = v
        sim_x[i] = result[0][1]
        sim_doc_titles.append(result[1])
		

Since a dry list of documents may be boring to look at, I cluster the search results and put them in a tree. Sklearn provides a very handy script which I adopt for this subtask.

    from sklearn.cluster import AgglomerativeClustering
    
    model = AgglomerativeClustering(affinity='cosine',
                                    distance_threshold=0, n_clusters=None,
                                    linkage='complete')
    model = model.fit(x)
    
    
    from scipy.cluster.hierarchy import dendrogram
    
    def plot_dendrogram(model, **kwargs):
        # Create linkage matrix and then plot the dendrogram
    
        # create the counts of samples under each node
        counts = np.zeros(model.children_.shape[0])
        n_samples = len(model.labels_)
        for i, merge in enumerate(model.children_):
            current_count = 0
            for child_idx in merge:
                if child_idx < n_samples:
                    current_count += 1  # leaf node
                else:
                    current_count += counts[child_idx - n_samples]
            counts[i] = current_count
    
        linkage_matrix = np.column_stack([model.children_, model.distances_,
                                          counts]).astype(float)
    
        # Plot the corresponding dendrogram
        dendrogram(linkage_matrix, **kwargs)


    
    plt.title('Hierarchical Clustering Dendrogram')
    # plot the top three levels of the dendrogram
    plot_dendrogram(model, truncate_mode='level', p=4)
    plt.xlabel("Number of points in node (or index of point if no parenthesis).")
    plt.ylabel("distances")
    plt.savefig(os.path.join(output_dir, 'dendrogram_similar_docs.png'), transparent=True , dpi=400, bbox_inches='tight')
    plt.show()

Picture description

Picture description

The similarity of documents decays gradually and there aren’t that many semantically similar documents to the query have been found. I group highly similar documents in red, moderately similar documents in blue, and not-quite similar documents in green.

I personally would say only those in red are relatively relevant and are therefore worth reading the full content of the articles. Here are the list of some of the titles in each color group.

  • Effectiveness of handwashing in preventing SARS: a review
  • Response and role of palliative care during the COVID-19 pandemic: a national telephone survey of hospices in Italy
  • Disinfection efficiency of positive pressure respiratory protective hood using fumigation sterilization cabinet
  • Rapid De-Escalation and Triaging Patients in Community-Based Palliative Care
  • Face shields for infection control: A review
  • Helmet Modification to PPE with 3D Printing During the COVID-19 Pandemic at Duke University Medical Center: A Novel Technique
  • Personal Protective Equipment
  • Brief guideline for the prevention of COVID-19 infection in head and neck and otolaryngology surgeons
  • Taking the right measures to control COVID-19
  • Outbreaks in Health Care Settings
  • The Role of Managerial Epidemiology in Infection Prevention and Control
  • Elective surgery in the time of COVID-19
  • Evaluation of the Person Under Investigation
  • Appraisal of recommended respiratory infection control practices in primary care and emergency department settings
  • Methicillin-resistant Staphylococcus aureus, Clostridium difficile, and extended-spectrum β-lactamase–producing Escherichia coli in the community: Assessing the problem and controlling the spread
  • Variation in health care worker removal of personal protective equipment
  • How Should U.S. Hospitals Prepare for Coronavirus Disease 2019 (COVID-19)?
  • Head and neck oncology during the COVID-19 pandemic: Reconsidering traditional treatment paradigms in light of new surgical and other multilevel risks
  • Modeling layered non-pharmaceutical interventions against SARS-CoV-2 in the United States with Corvid
  • Protecting health care workers from SARS and other respiratory pathogens: A review of the infection control literature
  • Covid-19: What’s the current advice for UK doctors?
  • Cost-effectiveness analysis of N95 respirators and medical masks to protect healthcare workers in China from respiratory infections
  • Assessment of Temporary Community-Based Health Care Facilities During Arbaeenia Mass Gathering at Karbala, Iraq: Cross-Sectional Survey Study
  • 23 Response to SARS as a prototype for bioterrorism Lessons in a Regional Hospital in Hong Kong
  • Recommendations and guidance for providing pharmaceutical care services during COVID-19 pandemic: A China perspective
  • Could influenza transmission be reduced by restricting mass gatherings? Towards an evidence-based policy framework
  • Equipment for Exotic Mammal and Reptile Diagnostics and Surgery
  • Infrastructure and Organization of Adult Intensive Care Units in Resource-Limited Settings
  • Risk factors for febrile respiratory illness and mono-viral infections in a semi-closed military environment: a case-control study
  • Contamination during doffing of personal protective equipment by healthcare providers
  • Facing the threat of influenza pandemic - roles of and implications to general practitioners
  • Supplies and equipment for pediatric emergency mass critical care
  • SARS Transmission among Hospital Workers in Hong Kong
  • The Demand for Health Care
  • Isolation Facilities for Highly Infectious Diseases in Europe – A Cross-Sectional Analysis in 16 Countries
  • A systematic risk-based strategy to select personal protective equipment for infectious diseases
  • 4 Steps in the selection of protective clothing materials
  • Helen Salisbury: Is general practice prepared for a pandemic?
  • Performance of materials used for biological personal protective equipment against blood splash penetration
  • Role of viral bioaerosols in nosocomial infections and measures for prevention and control
  • Guideline for Antibiotic Use in Adults with Community-acquired Pneumonia
  • Hydroxychloroquine (HCQ): an observational cohort study in primary and secondary prevention of pneumonia in an at-risk population
  • On the 2-Row Rule for Infectious Disease Transmission on Aircraft
  • The outbreak of COVID-19: An overview
  • Hendra virus in Queensland, Australia, during the winter of 2011: Veterinarians on the path to better management strategies
  • Visibility and transmission: complexities around promoting hand hygiene in young children – a qualitative study
  • Mapping road network communities for guiding disease surveillance and control strategies
  • A COVID-19 Risk Assessment for the US Labor Force
  • Gut microbiome and the risk factors in central nervous system autoimmunity
  • Public health and medical care for the world's factory: China's Pearl River Delta Region
  • Global health goals: lessons from the worldwide effort to eradicate poliomyelitis
  • Population response to the risk of vector-borne diseases: lessons learned from socio-behavioural research during large-scale outbreaks
  • Reduced Risk of Importing Ebola Virus Disease because of Travel Restrictions in 2014: A Retrospective Epidemiological Modeling Study
  • Investigation of three clusters of COVID-19 in Singapore: implications for surveillance and response measures
  • The Perceived Threat of SARS and its Impact on Precautionary Actions and Adverse Consequences: A Qualitative Study Among Chinese Communities in the United Kingdom and the Netherlands
  • Association of COVID-19 Infections in San Francisco in Early March 2020 with Travel to New York and Europe
  • Harnessing the privatisation of China's fragmented health-care delivery
  • Toward a consensus view in the management of acute facial injuries during the Covid-19 pandemic
  • Selected nonvaccine interventions to prevent infectious acute respiratory disease
  • Prevalence of Psychiatric Disorders Among Toronto Hospital Workers One to Two Years After the SARS Outbreak
About

Hello, My name is Wilson Fok. I love to extract useful insights and knowledge from big data. Constructive feedback and insightful comments are very welcome!