mdh.se Publications
Analysis of similarity and differences between articles using semantics
Mälardalen University, School of Innovation, Design and Engineering.
2017 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Adding semantic analysis to the process of comparing news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we compared, implemented, and evaluated three commonly used approaches to document-level similarity. The three similarity measures selected were keyword matching, TF-IDF vector distance, and Latent Semantic Indexing. Each method was evaluated on a coherent set of news articles in which the majority were written about Donald Trump and the American election on the 9th of November 2016; the set also contained several control articles about random topics. TF-IDF vector distance combined with cosine similarity, and Latent Semantic Indexing, gave the best results on the set by separating the control articles from the Trump articles. Keyword matching and TF-IDF vector distance using Euclidean distance did not separate the Trump articles from the control articles. We also implemented sentiment analysis, classifying the news articles as positive, negative, or neutral, and validated the results against human readers who classified the same articles. The sentiment analysis implementation achieved a high correlation with the human readers (100%).
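
As an illustration of the two TF-IDF comparisons named in the abstract, the following is a minimal sketch and not the thesis's implementation. The whitespace tokenisation, the raw-count TF with unsmoothed log IDF weighting, the toy documents, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenised document.

    Assumed weighting: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


# Hypothetical toy corpus: two documents with the same wording but different
# lengths, plus one unrelated control document.
docs = [
    "trump wins the election".split(),
    "trump wins the election trump wins the election".split(),
    "new recipe for apple pie".split(),
]
vectors = tfidf_vectors(docs)

print(cosine_similarity(vectors[0], vectors[1]))   # same direction, different length
print(cosine_similarity(vectors[0], vectors[2]))   # no shared terms
print(euclidean_distance(vectors[0], vectors[1]))  # non-zero despite identical wording
```

Cosine similarity depends only on the direction of the vectors, so the two documents with identical wording score as maximally similar regardless of length, whereas Euclidean distance also reflects the difference in vector magnitude.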

Place, publisher, year, edition, pages
2017. 41 p.
Keyword [en]
Natural language processing, similarity, semantic analysis, computer science
National Category
Computer Science; Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:mdh:diva-34843
OAI: oai:DiVA.org:mdh-34843
DiVA: diva2:1073195
Subject / course
Computer Science
Presentation
2017-01-26, Kappa, Västerås, 14:25 (English)
Supervisors
Examiners
Available from: 2017-02-17. Created: 2017-02-09. Last updated: 2017-02-17. Bibliographically approved.

Open Access in DiVA

fulltext (1940 kB), 80 downloads
File information
File name: FULLTEXT01.pdf
File size: 1940 kB
Checksum (SHA-512):
16dbbd485db9d3b9ef16936aecddc934b305b962cc6ac0e689146121c7d4b713e3edde9c118ddaeeedfd5fc0d69283a7ca8c7cdfe891b4bce095fd97a6e44efa
Type: fulltext
Mimetype: application/pdf

Total: 80 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 403 hits