Evaluating the Robustness of ML Models to Out-of-Distribution Data Through Similarity Analysis
Mälardalen University, School of Innovation, Design and Engineering, Embedded Systems. Saab AB, Linköping, Sweden.
Mälardalen University, School of Innovation, Design and Engineering, Embedded Systems.
Mälardalen University, School of Innovation, Design and Engineering, Embedded Systems.
Royal Institute of Technology, Stockholm, Sweden; Saab AB, Linköping, Sweden.
2023 (English). In: NEW TRENDS IN DATABASE AND INFORMATION SYSTEMS, ADBIS 2023, Springer Science+Business Media B.V., 2023, p. 348-359. Conference paper, Published paper (Refereed)
Abstract [en]

In Machine Learning systems, several factors impact the performance of a trained model. The most important include the model architecture, the amount of training time, and the size and diversity of the dataset. We present a method for analyzing datasets from a use-case scenario perspective, detecting and quantifying out-of-distribution (OOD) data at the dataset level. Our main contribution is the novel use of similarity metrics to evaluate the robustness of a model, by introducing relative Fréchet Inception Distance (FID) and relative Kernel Inception Distance (KID) measures. These measures are computed relative to a baseline in-distribution dataset and are used to estimate how the model will perform on OOD data (i.e., to estimate the drop in model accuracy). We find a correlation between our proposed relative FID/relative KID measures and the drop in Average Precision (AP) accuracy on unseen data.
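As a rough illustration of the idea in the abstract, the sketch below computes a Fréchet distance between two sets of feature vectors and then normalizes it by the distance to a baseline in-distribution set. The abstract does not give the exact definition of the relative measure, so the ratio used in `relative_fid` is an assumption made for this example, and the function names are hypothetical. In practice the feature vectors would come from an Inception network; here they are plain NumPy arrays.

```python
import numpy as np


def gaussian_stats(features):
    """Return the mean vector and covariance matrix of an (n_samples, n_dims) array."""
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    return mean, cov


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_a - mu_b||^2 + tr(C_a) + tr(C_b) - 2 * tr((C_a C_b)^(1/2))
    """
    mu_a, cov_a = gaussian_stats(feats_a)
    mu_b, cov_b = gaussian_stats(feats_b)
    mean_term = np.sum((mu_a - mu_b) ** 2)
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of C_a C_b, which are real and non-negative in theory.
    # Clipping guards against tiny negative values from rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    sqrt_trace = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * sqrt_trace)


def relative_fid(train_feats, baseline_feats, candidate_feats):
    """ASSUMPTION: relative FID taken as a ratio against the baseline distance.

    The baseline is an in-distribution set; the candidate is the set being
    checked for distribution shift. The source paper may define this differently.
    """
    baseline = frechet_distance(train_feats, baseline_feats)
    candidate = frechet_distance(train_feats, candidate_feats)
    return candidate / baseline


if __name__ == "__main__":
    # Random stand-ins for Inception features; not data from the paper.
    rng = np.random.default_rng(0)
    train = rng.normal(size=(500, 8))
    baseline = rng.normal(size=(500, 8))            # same distribution as train
    shifted = rng.normal(loc=0.5, size=(500, 8))    # mean-shifted "OOD" set
    print("relative FID:", relative_fid(train, baseline, shifted))
```

Under this assumed definition, a value near 1 would mean the candidate set is about as far from the training data as the in-distribution baseline is, and larger values would indicate a stronger shift.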

Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2023. p. 348-359
Series
Communications in Computer and Information Science, ISSN 1865-0929, E-ISSN 1865-0937
Keywords [en]
accuracy estimation, datasets, neural networks, similarity metrics, Learning systems, Dataset, Distance measure, Frechet, Machine learning systems, Modeling architecture, Neural-networks, Performance, Similarity analysis, Drops
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:mdh:diva-64446
DOI: 10.1007/978-3-031-42941-5_30
ISI: 001351054200030
Scopus ID: 2-s2.0-85171979824
ISBN: 9783031429408 (print)
OAI: oai:DiVA.org:mdh-64446
DiVA, id: diva2:1802686
Conference
27th European Conference on Advances in Databases and Information Systems (ADBIS), Barcelona, Spain, 4-7 September, 2023
Available from: 2023-10-05 Created: 2023-10-05 Last updated: 2024-12-18. Bibliographically approved
In thesis
1. Synthetic Data in Data-driven Systems
2025 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Dataset generation is cumbersome yet of great importance for successful training of machine learning models. Collecting real-world data is expensive and sometimes prohibited, considering e.g. safety aspects or legal restrictions. By generating the bulk of training data by synthetic means it is possible to impose arbitrary and extensive scene randomization for increased data diversity.

Methods that quantify the statistical similarity between datasets are important tools for detecting Out-of-Distribution (OOD) data and assessing domain alignment. We have studied how such measures correlate with the drop in model prediction accuracy when a model is exposed to OOD data.
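One way to check whether a dataset similarity measure tracks the accuracy drop is to compute a correlation coefficient across several test sets. The sketch below uses the Pearson coefficient; the abstract does not state which correlation statistic was used, so this choice is an assumption, and the function name and the numbers in the usage section are placeholders, not results from the thesis.

```python
import numpy as np


def pearson_correlation(similarity_scores, accuracy_drops):
    """Pearson correlation between per-dataset similarity scores and accuracy drops."""
    x = np.asarray(similarity_scores, dtype=np.float64)
    y = np.asarray(accuracy_drops, dtype=np.float64)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("expected two 1-D sequences of equal length >= 2")
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the coefficient between x and y.
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    # Placeholder values for illustration only (one entry per test dataset).
    scores = [1.0, 1.8, 2.5, 4.1]
    drops = [0.01, 0.06, 0.09, 0.20]
    print("Pearson r:", pearson_correlation(scores, drops))
```

A coefficient close to 1 would suggest that larger distances go together with larger accuracy drops; a rank-based statistic such as Spearman's could be substituted if the relationship is monotonic but not linear.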

Domain adaptation can be applied as an additional step to synthetic data to decrease the gap to real-world datasets. However, it can introduce inadvertent label-flipping, a sort of semantic inconsistency between the synthetic source and the domain-adapted output. Therefore, we pursue another way of reducing the domain gap: generating high-fidelity digital representations of real-world scenes and objects. We do this through the use of Neural Radiance Fields and Gaussian Splats. These methods allow us to render objects of interest for a detection problem with the perfect annotation of synthetically produced data and a high degree of realism, which we show improves detection accuracy compared to traditionally generated visual content.

Abstract [sv] (English translation)

Generating data for AI models is cumbersome but of great importance for well-functioning training of machine learning models. Collecting real sensor data is expensive and sometimes not possible, considering for example safety aspects or legal restrictions. By generating the bulk of the training data synthetically, it is possible to introduce extensive scene randomization, which leads to increased data diversity. Methods for quantifying similarities between datasets on a statistical level are important tools for identifying when data lies outside the intended distribution. We have studied how such methods can be used to correlate how a model's precision drops when it is exposed to unseen data. Domain adaptation can be applied as an additional step to synthetic data to reduce the gap to real sensor data, but this can introduce inadvertent annotation errors, a kind of semantic inconsistency between synthetic source data and domain-adapted output. We therefore take another route to reducing the domain gap, by generating high-quality digital representations of real scenes and objects. We do this by using Neural Radiance Fields (NeRF) and Gaussian Splats. These methods allow us to create objects of interest for a detection problem, with automatic annotation based on synthetically produced data, and a high degree of realism that we show improves detection accuracy compared with traditionally generated visual content.

Place, publisher, year, edition, pages
Västerås: Mälardalens Universitet, 2025. p. 186
Series
Mälardalen University Press Licentiate Theses, ISSN 1651-9256 ; 370
Keywords
datasets, neural networks, synthetic data generation, automatic annotation, dataset generation
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:mdh:diva-69154 (URN)
978-91-7485-689-7 (ISBN)
Presentation
2025-01-30, Delta, Mälardalens universitet, Västerås, 13:00 (English)
Opponent
Supervisors
Available from: 2024-11-18 Created: 2024-11-18 Last updated: 2025-01-09. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Lindén, Joakim; Forsberg, Håkan; Daneshtalab, Masoud
