Text mining and manual data extraction in healthcare research


Text mining

In this exploratory research, we screened the campaign titles to develop categories for the condition, disease or disorder for which people asked for donations. During this screening, we developed a vocabulary of keywords (e.g., cancer, mobility, mental disorder) that allowed us to identify the health problem. When developing the vocabulary, we took into account that German very commonly forms compound words, that is, single words built by joining several words. There is no limit to how many words can be joined, and the resulting compound sometimes has a completely different meaning. We therefore first extracted all words that contained a keyword, then screened the list of extracted compound words and excluded the irrelevant ones (i.e., those whose meaning had changed). We added the relevant compound words to the vocabulary of keywords. Using the text mining package tm in R, we identified the campaigns that included any of the words in the extended vocabulary. To do so, we first ran some basic text transformation and text cleaning functions and then built a term-document matrix covering all the words in the vocabulary. Finally, we screened the text of the unclassified campaigns, added new health-related keywords to the vocabulary, and repeated the procedure described above.
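The loop described above (find compound words containing a keyword, screen them by hand, then match campaigns against the extended vocabulary) can be sketched roughly as follows. This is a minimal illustration in Python, not the tm-based R code used in the study. The function names, the example titles, and the choice of "krebs" (German for "cancer") as the keyword are all assumptions made for the sketch, and the small count table only stands in for a full term-document matrix.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens (keeps German umlauts and ß)."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def compound_candidates(titles, keyword):
    """Collect every distinct token that contains the keyword but is not the keyword itself."""
    return sorted({tok for title in titles for tok in tokenize(title)
                   if keyword in tok and tok != keyword})

def term_document_counts(titles, terms):
    """Count, per title, how often each vocabulary term occurs as a whole token."""
    ordered = sorted(terms)
    return [{term: tokens.count(term) for term in ordered}
            for tokens in (tokenize(t) for t in titles)]

# Invented example titles (hypothetical, not taken from the study's data).
titles = [
    "Spenden für Annas Kampf gegen Brustkrebs",
    "Krebstherapie für Papa",
    "Flusskrebse im Gartenteich retten",
    "Neuer Rollstuhl für Max",
]

keyword = "krebs"
candidates = compound_candidates(titles, keyword)

# Manual screening step: "Flusskrebse" (crayfish) contains the keyword
# but has a different meaning, so it is excluded by hand.
excluded = {"flusskrebse"}
vocabulary = {keyword} | (set(candidates) - excluded)

# Match every title against the extended vocabulary.
counts = term_document_counts(titles, vocabulary)
classified = [t for t, row in zip(titles, counts) if any(row.values())]
unclassified = [t for t, row in zip(titles, counts) if not any(row.values())]
```

In this sketch, the titles left in `unclassified` would be the ones to read through for new keywords (here, for instance, a mobility-related term) before repeating the procedure.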

In addition to the condition-specific vocabulary, we developed a second vocabulary to identify the health-related expenses that individuals aimed to cover with donations. It was developed in the same way as described above, albeit with different keywords (e.g., medication, cost of therapy, travel, accommodation, cost of living, holidays).

Finally, we extracted the funding needs stated in the textual description of each crowdfunding campaign by taking a substring based on its position in the text.
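The text does not say which positions or markers were used, so the following Python sketch only illustrates the general idea of position-based extraction. It assumes, hypothetically, that the amount appears directly before a euro sign and uses German-style thousands separators; the function name `extract_amount`, the marker, and the window size are all invented for the example.

```python
import re

def extract_amount(text, marker="€", window=12):
    """Return the integer amount found just before the marker, or None.

    Assumption: the amount sits within `window` characters before the marker
    and uses dots as thousands separators (e.g. "5.000").
    """
    pos = text.find(marker)              # position of the first marker
    if pos == -1:
        return None                      # no marker, so no amount to extract
    snippet = text[max(0, pos - window):pos]   # slice by position
    match = re.search(r"(\d[\d.]*)\s*$", snippet)
    if match is None:
        return None
    return int(match.group(1).replace(".", ""))

# Invented example descriptions (hypothetical, not taken from the study's data).
amount_a = extract_amount("Wir benötigen 5.000 € für die Therapie.")
amount_b = extract_amount("Ziel: 12000 € für den Rollstuhl")
amount_c = extract_amount("Kein Betrag genannt")
```

A rule this simple would miss amounts written out in words or given with decimal commas, so extracted values would still need to be checked against the campaign text.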

Manual data extraction

Once the health-related campaigns were identified, we extracted three kinds of information manually from the textual descriptions. First, we read each campaign text carefully and validated the condition, disease or disorder that motivated individuals to seek additional funding. Where the condition had been misclassified, we assigned a new motive for crowdfunding (manual validation). In total, we validated 35 health problems listed at least twice and 18 health problems listed only once. Second, we extracted the cost types that individuals aimed to cover with donations. The most important cost types identified were medication, surgery, therapy, medical equipment, treatment-related travel and accommodation, living expenses, holidays, medical research and patient education. Third, we identified whether individuals sought funding for treatment abroad or for non-residents.
