Title: Visual Analytics for Understanding the Evolution of Large Software Products
1. Visual Analytics for Understanding the Evolution of Large Software Products
- Alexandru Telea
- University of Groningen, the Netherlands
2. Software life cycle
[Timeline diagram: the software life cycle from start through release, refactor, and migrate decisions to the end of the lifecycle]
[Charts: maintenance effort split into corrective (repair bugs) 25%, perfective (new features) 50%, adaptive (new framework) 25%; development effort split over design, implementation, testing, and analysis (10%, 15%, 30%, 45%)]
Goal: reduce development and maintenance costs, increase quality.
Focus: reduce testing time in development; support informed, efficient decision-making in releasing, refactoring, and migration.
3. Problems in the software industry
- software is outsourced, gets older and more complex → quality decreases
- software size, team size, and complexity all increase
- time-to-market decreases → quality decreases
- defect removal costs increase exponentially with the time elapsed since defect introduction
- management decisions are based on subjective information
4. Problem statement
- Maintenance facts
  - thousands of files, hundreds of developers, many years
  - knowledge is lost and bugs are created as software evolves
  - costs over 80% of the entire software lifecycle
  - 40% of maintenance is spent in understanding software
- Goal: support maintenance by analyzing evolution data
  - mine relevant facts from software repositories
  - analyze, correlate, and filter facts
  - support questions with easy-to-use tools
5. Software Analytics
- Visual Analytics: the science of analytical reasoning facilitated by interactive visual interfaces (Thomas, 2001)
- Software Analytics: the application of visual analytics to the understanding, maintenance, assessment, and evolution of software (Telea, 2008)
[Diagram: analysis tools in the client environment support research, development, and management, to increase productivity and quality and to support decision making]
6. Software Analytics Framework
[Diagram: in the client environment, fact extraction engines mine software repositories (CVS, Subversion, ...) into a central fact database, which query and mining engines expose to interactive visualization tools]
7. Involved Techniques
- graph layouts
- software metrics
- static analysis
- pixel-filling layouts
- treemaps
- code flows
Let us see all these next!
8. Trend Analyzer
- get data from a repository (SVN, CVS, CM/Synergy, ...)
- use a simple 2D layout to show version attributes
- answer questions by sorting, coloring, and clustering files
Let us see a simple demo!
9. Trend Analysis: Evolution at file level
[Image: file-by-version view; x = time (version), y = files]
- unit of analysis: the file (not finer-grained)
- shows the evolution of 1..3 per-file metrics
- correlates metric changes across files (a minimal mining sketch follows)
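To make the data behind such a view concrete, here is a minimal mining sketch in Python. It assumes a git mirror of the repository and uses git log --numstat to build a file-by-version matrix of churn (lines added plus deleted); the metric choice, the git mirror, and the helper names are illustrative assumptions, not the Trend Analyzer's actual implementation.

    # Minimal sketch: build a file x version matrix of code churn from a git mirror.
    # Churn (lines added + deleted) stands in for whatever per-file metric is shown.
    import subprocess
    from collections import defaultdict

    def churn_matrix(repo_path):
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--reverse", "--numstat", "--format=@%H"],
            capture_output=True, text=True, check=True).stdout
        versions, matrix = [], defaultdict(dict)        # matrix[file][version index] = churn
        for line in log.splitlines():
            if line.startswith("@"):
                versions.append(line[1:])                # one entry per commit (version)
            elif line.strip():
                parts = line.split("\t")
                if len(parts) == 3 and parts[0] != "-":  # skip binary files
                    added, deleted, path = parts
                    matrix[path][len(versions) - 1] = int(added) + int(deleted)
        return versions, matrix

Sorting the rows of this matrix by total churn, or coloring them by the metric value, gives the kind of file-level trend view the slide describes.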
10. Trend Analysis: Evolution at line level
[Image: line-by-version view; x = time (version), y = lines]
- unit of analysis: individual lines (not coarser-grained)
- shows insertions, deletions, and constant code blocks
- cannot show drift/merge; sensitive to syntax details
11. Trend Analysis: Evolution at block level
[Image: the WinDiff tool - line groups per version, with a detail view]
- unit of analysis: line blocks (as detected by diff)
- shows insertions, deletions, constant blocks, and drift
- cannot handle more than 2 versions
12-15. Trend Analysis: Evolution at syntax level?
Goal: we would like a technique that
1. handles all events: inserts, deletes, constants, merges, splits, drifts
2. can handle more versions of a file (2..20)
3. is insensitive to small or irrelevant program changes, e.g. comments, identifier renaming, declaration order
4. works between the line and file level-of-detail, as specified by the user
Let's see next how to do this!
16. Code Matching
Idea: use code matching techniques (Auber et al., 2007; Chevalier et al., 2007)
- given N versions of a file f1 ... fN
- extract their syntax trees T1 ... TN
- construct correspondences between all consecutive pairs Ti, Ti+1
  1. hash all nodes u ∈ Ti, v ∈ Tj into equivalence classes, using a distance d(u,v) that combines
     - dtyp(u,v), the type distance between u, v: 0 if u, v have the same type, else 1
     - dstr(u,v) = sqrt( (d(u)-d(v))² + (m(u)-m(v))² + (s(u)-s(v))² ), the structural distance between the subtrees at u, v
  2. find the best matches between subtrees in the same class (a minimal matching sketch follows)
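To make the matching step concrete, here is a minimal Python sketch on a toy syntax tree. The Node type, the greedy best-match, the choice of depth / child count / subtree size for d, m, s, and the max_dist threshold are all illustrative assumptions: they follow the distances sketched on this slide but are not the actual Code Flows implementation.

    # Minimal sketch of the per-version-pair matching step on a toy syntax tree.
    import math
    from collections import defaultdict

    class Node:
        def __init__(self, type_, children=()):
            self.type = type_
            self.children = list(children)

    def size(n):                      # number of nodes in the subtree rooted at n
        return 1 + sum(size(c) for c in n.children)

    def depth(n):                     # height of the subtree rooted at n
        return 1 + max((depth(c) for c in n.children), default=0)

    def d_typ(u, v):                  # type distance: 0 if same syntactic type, else 1
        return 0 if u.type == v.type else 1

    def d_str(u, v):                  # structural distance between the subtrees at u, v
        return math.sqrt((depth(u) - depth(v)) ** 2 +
                         (len(u.children) - len(v.children)) ** 2 +
                         (size(u) - size(v)) ** 2)

    def nodes(t):                     # all nodes of a tree, preorder
        yield t
        for c in t.children:
            yield from nodes(c)

    def match(t1, t2, max_dist=2.0):
        """Step 1: hash nodes of T_i, T_{i+1} into classes; step 2: best matches per class."""
        classes = defaultdict(lambda: ([], []))
        for u in nodes(t1): classes[u.type][0].append(u)
        for v in nodes(t2): classes[v.type][1].append(v)
        pairs, used = [], set()
        for left, right in classes.values():
            for dist, u, v in sorted(((d_typ(u, v) + d_str(u, v), u, v)
                                      for u in left for v in right),
                                     key=lambda c: c[0]):
                if dist <= max_dist and id(u) not in used and id(v) not in used:
                    used.update((id(u), id(v)))
                    pairs.append((u, v))
        return pairs                  # nodes left unmatched are insertions / deletions

Nodes that remain unmatched after this step are reported as insertions or deletions, which is what the example on the next slide shows.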
17. Code Matching
Example:
- two matches are found: class A and the for loop
- matches between already-matched children are not considered
- unmatched nodes represent insertions and deletions (E, H)
18. Visualization
- OK, now we have the matches; how to visualize them?
- draw syntax trees using a cushioned icicle plot
  - compact usage of screen space
  - good for correspondence visualization (next; a minimal layout sketch follows)
[Images: classical tree drawing vs. cushioned icicle plot]
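As a rough illustration of the layout itself (not the tool's renderer), a minimal icicle-plot sketch: every node gets a horizontal interval proportional to its subtree size, children partition their parent's interval, and depth maps to the vertical band. It assumes the toy Node (a type plus a children list) from the matching sketch above; cushion shading is omitted.

    # Minimal icicle-plot layout sketch: node -> (x0, x1, level) rectangles.
    def subtree_size(n):
        return 1 + sum(subtree_size(c) for c in n.children)

    def icicle_layout(node, x0=0.0, x1=1.0, level=0, rects=None):
        if rects is None:
            rects = []
        rects.append((node, x0, x1, level))                 # one rectangle per node
        total = sum(subtree_size(c) for c in node.children)
        x = x0
        for c in node.children:
            w = (x1 - x0) * subtree_size(c) / total         # width ~ subtree size
            icicle_layout(c, x, x + w, level + 1, rects)
            x += w
        return rects

A cushioned variant would additionally shade each rectangle with a parabolic luminance profile; mirroring two such plots and connecting matched nodes with tubes gives the correspondence view of the next slides.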
19. Correspondence drawing
- mirror icicle plots against the previous and next versions
- connect matched nodes with spline tubes
20. Correspondence drawing
- use a translucent, cushion-like texture along the tubes
  - diminishes visual clutter
- draw an opaque, 3-pixel fixed-width tube axis
  - guarantees visibility
[Images: transparency texture, luminance texture]
21. Structure tracking
- how to follow the evolution of a code fragment over N versions?
- code tracking algorithm (a minimal propagation sketch follows)
  - connect each matched node with its children (----)
  - together with the correspondences, we now have a flow graph G
  - assign a color to each node n ∈ Ti which is not matched in Ti-1
  - propagate colors downstream in G
  - at merges, mix colors weighted by tree size
  - repeat the process upstream from the sinks
[Diagram: downstream and upstream color propagation over the versions]
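A minimal sketch of the downstream pass, under simplifying assumptions: the flow graph is given as a predecessor map over node ids listed in version order, colors are RGB triples, and the mixing weights are subtree sizes; these data structures are illustrative, not the tool's actual model.

    # Minimal sketch of downstream color propagation in the flow graph G.
    # nodes: node ids in version (topological) order; preds[n]: matched predecessors
    # of n in the previous version; size[n]: subtree size used as mixing weight;
    # seed[n]: color assigned to nodes that are unmatched in T_{i-1}.
    def propagate_downstream(nodes, preds, size, seed):
        color = {}
        for n in nodes:
            if not preds[n]:                            # new fragment: gets its own color
                color[n] = seed[n]
            else:                                       # merge: mix predecessor colors,
                total = sum(size[p] for p in preds[n])  # weighted by subtree size
                color[n] = tuple(
                    sum(color[p][c] * size[p] for p in preds[n]) / total
                    for c in range(3))
        return color

A fragment born in some version keeps its color while it survives; at a merge, the larger incoming fragment dominates the mixed color. The upstream pass repeats the same mixing from the sinks against the successors.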
22. Results
[Images: original method (Chevalier et al.) vs. improved method]
23. Results
[Images: original method (Chevalier et al.) vs. improved method]
24. Visualizing events of interest
- Insertions and deletions
  - appear as white gaps between the tubes
- Splits and merges
  - define a labeling of the matched code fragments over the versions 1 .. N
  - a split occurs from version i to i+1 based on this labeling:
    - f > 5 means n, m are split apart
    - kmin ∈ {1, 2} means n, m are in the same code fragment
25. Example application
- real-world C++ code base (6000 lines), 45 versions
- zoom-in on 6 versions of interest
- complex constructions (e.g. templates) and evolution changes
- added noise: random identifier renaming, spaces, layout changes
- Visual enhancements
  - color matched code fragments in gray
  - mark splits and merges with icons
26. Example application
27. Example application
- the code shrinks by 10%
28. Example application
- a method f gets split
- a small fragment drifts to the end
29. Example application
- surviving code of f
- a method f gets split
30. Example application
- f now undergoes many changes, but stays constant from now on
- surviving code of f
- a method f gets split
31. Example application
- two fragments get swapped
32. Example application
- ... and they get swapped once more
33. Example application
- ... and there is a third swap (if you look carefully)
34. Code flows: Summary
- visualization and detection of code evolution events
  - emphasis on structure at the syntax level (between lines and files)
  - detects and shows evolution events (split, merge, drift, ...)
  - scales to thousands of lines, 10..20 versions
- Future work?
  - multiscale visualization
  - add code metrics atop the structure
  - show more than just correspondence relations
35. Now, a large-scale application
- Situation
  - client: an established embedded software producer
  - product: 8 years of evolution (2002-2008)
  - 3.5 MLOC of source code (1881 files)
  - 1 MLOC of headers (2454 files)
  - 15 releases
  - 3 teams, 60 people (2 x EU, 1 x India)
  - in the end, the product failed to meet requests
- Questions
  - what happened right/wrong?
  - what can the software archive tell us? (post-mortem)
  - can such lessons be used in the future?
36. Methodology
- Create a number of data visualizations
  - try to spot attribute correlations and data trends
  - discuss the relevant images with the project team
- For each visualization
  - the team is invited to derive one or more findings: what can you read from it?
  - we present our own findings
  - we discuss the differences
37. a1. Team structure: Code ownership (findings)
[Image: per-file view, color = number of developers (1..8)]
- red modules: contributions from more than 8 developers (a minimal counting sketch follows)
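A minimal sketch of how such an ownership count can be mined, again assuming a git mirror of the archive; the more-than-8-developers threshold is the one from the slide, while the paths and helper names are illustrative.

    # Minimal sketch: count distinct committers per file from a git mirror.
    import subprocess
    from collections import defaultdict

    def developers_per_file(repo_path):
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--name-only", "--format=@%ae"],
            capture_output=True, text=True, check=True).stdout
        owners, author = defaultdict(set), None
        for line in log.splitlines():
            if line.startswith("@"):
                author = line[1:]                    # author e-mail of the current commit
            elif line.strip():
                owners[line.strip()].add(author)     # file path touched by this commit
        return {path: len(devs) for path, devs in owners.items()}

    # Files touched by more than 8 developers are the "red" ones in the view:
    # red = [p for p, n in developers_per_file("/path/to/mirror").items() if n > 8]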
38. a1. Team structure: Code ownership (findings)
[Image: zoom-in; highly shared files include OSPR1C1.c, PRDT1C3.c, INSCL1C2.c, SNSAD1C1.c]
39. a2. Team structure: Team assignment
[Image: per-file view, color = number of Modification Requests (MRs), 1..30]
- some modules have many red(dish) files
40. a2. Team structure: Team assignment
[Image: modules grouped by Team A, Team B, Team C]
- 7 of the 11 red(dish) modules are assigned to the red team
41. a2. Team structure: Team assignment
[Image: modules grouped by Team A, Team B, Team C]
- many strategic/problematic components (70%) are outsourced (to India, Team A)
- this team is responsible for many MRs!
42. a1. Product requirements: Impacted areas
[Image: 329 files over time; marks = MR-related check-ins; start of R1.3 indicated]
- little increase in the file curve
- many check-ins in files that existed before R1.3 started
43. a1. Product requirements: Impacted areas
- few new files added in R1.3; most activity/changes are in old files
- indication of (too) long maintenance closure of requirements
44. a2. Product requirements: MR duration
[Image: commits over time, grouped by MR id range (4000-5000, grouped in hundreds)]
- example: number of file commits referring to MRs with IDs in the range 4700-4800
- in mid 2008, activity related to MRs addressed in 2006-2008 still takes place (a minimal grouping sketch follows)
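A minimal sketch of the grouping behind this view, assuming commit messages reference modification requests as "MR 4711" or similar; the regular expression and the input format are assumptions made for illustration.

    # Minimal sketch: bucket MR-related commits by MR id range (hundreds), so the
    # time span per bucket shows how long MR-related activity keeps going.
    import re
    from collections import defaultdict

    MR_RE = re.compile(r"\bMR\s*#?(\d+)", re.IGNORECASE)     # assumed message convention

    def mr_buckets(commits):
        """commits: iterable of (date, message) pairs; returns bucket -> sorted dates."""
        buckets = defaultdict(list)
        for date, message in commits:
            for mr_id in MR_RE.findall(message):
                buckets[int(mr_id) // 100 * 100].append(date)   # e.g. 4712 -> bucket 4700
        return {b: sorted(dates) for b, dates in buckets.items()}

The span between the first and last date per bucket approximates MR closure time, which is what the next slide uses to predict closure of ongoing and future requests.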
45. a2. Product requirements: MR duration
- MRs have historically had a (too) long duration
- this helps us empirically predict closure time for ongoing/future requirements
46. b1. SW Architecture: Dependency graph
[Graph: uses / is-used relations; uses = call, type, variable, macro, ...]
- most use dependencies go to files within the IFACE module, basicfunctions, and platform (system headers)
47. b1. SW Architecture: Dependency graph
[Graph: uses / is-used relations, without the IFACE, basicfunctions, and platform modules]
- we discovered several unwanted dependencies
48. b1. SW Architecture: Dependency graph
- most module interaction takes place via the interface package (IFACE module) and via the basicfunctions and platform packages
- yet, some modules are accessed directly, outside the interface domain (this is not desired)
49. b2. SW Architecture: Call graph
- many connections between each package and most of all the other packages
50. b2. SW Architecture: Call graph
- show only call relations between modules that are mutually call-dependent
- many modules (in different packages) are mutually call-dependent; this is not an ideal situation (a minimal detection sketch follows)
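A minimal sketch of that filter, assuming the module-level call graph is already available as a set of directed (caller, callee) edges; the data structure is an assumption, not the tool's actual model.

    # Minimal sketch: keep only module pairs that are mutually call-dependent,
    # i.e. module A calls module B and B calls A.
    def mutual_dependencies(call_edges):
        """call_edges: set of (caller_module, callee_module) pairs."""
        return {tuple(sorted((a, b)))
                for (a, b) in call_edges
                if a != b and (b, a) in call_edges}

    # Example (the edges are invented):
    # edges = {("OSPR", "IFACE"), ("IFACE", "OSPR"), ("PRDT", "IFACE")}
    # mutual_dependencies(edges) == {("IFACE", "OSPR")}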
51. b2. SW Architecture: Call graph (findings)
- high coupling at the package level
- no strict layering in the system; for example, the OSPR module is highly entangled with the rest of the system
52. b3. SW Architecture: metrics
53. b3. SW Architecture: metrics (findings)
- at the function level, evolution stability is achieved in terms of fan-in / fan-out
- also, other typical complexity-related metrics grow (sub)linearly in time
- exploding size is likely not the cause of the maintenance/evolution difficulties
- this strengthens our belief that a suboptimal team structure and SW architecture are to blame
54. c1. Source code: testing complexity
- the average complexity per method is higher than 20
- the total complexity increased by 20% in R1.3
55. c1. Source code: testing complexity (findings)
- module testing requires high effort to obtain good code coverage
- new tests have to be added / old tests updated; testing complexity increases
56. c2. Source code: external duplication
[Graph: connections = module pairs that contain blocks of near-similar code of over 25 LOC]
- few connections, so little external duplication
57. c2. Source code: internal duplication
[Image: per-file view, color = number of duplicated blocks (1..60)]
- few modules have some red files: little internal duplication (a minimal detection sketch follows)
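A minimal sketch of how such duplication counts can be obtained, by hashing whitespace-normalized windows of 25 lines (the block size used for the external-duplication view); fixed-size windows and whitespace collapsing are simplifications compared to a real clone detector.

    # Minimal sketch: count duplicated 25-line blocks in one file by comparing
    # whitespace-normalized sliding windows of lines.
    from collections import defaultdict

    BLOCK = 25                                    # block size from the slides (LOC)

    def duplicated_blocks(lines):
        """Return how many 25-line windows occur more than once in `lines`."""
        norm = [" ".join(l.split()) for l in lines]           # collapse whitespace
        seen = defaultdict(list)
        for i in range(len(norm) - BLOCK + 1):
            seen[tuple(norm[i:i + BLOCK])].append(i)
        return sum(len(pos) for pos in seen.values() if len(pos) > 1)

    # Applied per file, this gives the internal-duplication metric of this view;
    # applied across file pairs, it yields the external-duplication connections.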
58. c3. Source code: metrics (LOT)
[Image: per-file view, color = number of lines of text (1..2500)]
- some modules have a high percentage of red files
59. c3. Source code: metrics (LOC)
[Image: per-file view, color = number of lines of code (1..1500)]
- the same modules have a high percentage of red files
60. c3. Source code: metrics (McCabe)
[Image: per-file view, color = McCabe complexity (1..500)]
- the same modules have a high percentage of red files
61. c3. Source code: metrics (findings)
- the file size in LOT is a good indication of the file size in LOC,
- which in turn is a good indication of complexity, for relative assessments
- this was noticed by other researchers too
[Scatter plots: LOT vs. LOC vs. McCabe complexity] (a minimal correlation sketch follows)
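A minimal sketch for checking that claim on one's own data: given per-file LOT, LOC, and McCabe values, compute the pairwise correlations; the plain Pearson correlation and the dictionary input format are illustrative choices, not the study's exact procedure.

    # Minimal sketch: check how well LOT predicts LOC, and LOC predicts McCabe
    # complexity, per file, using a plain Pearson correlation (no dependencies).
    import math

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def metric_correlations(per_file):
        """per_file: dict path -> (lot, loc, mccabe); returns the two correlations."""
        lot, loc, mcc = zip(*per_file.values())
        return pearson(lot, loc), pearson(loc, mcc)

    # Values close to 1 support using LOT as a cheap proxy for LOC and for
    # complexity in relative assessments, as the finding above states.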
62. c4. Source code: criticality
[Images: a) number of MRs per file (1..30); b) average MR closure time (1..90 days); c) average MR propagation over files; d) average MR propagation over teams]
63. d1. Documentation
[Chart: documentation files over time - 854 doc/HTML files, 1688 other supporting files]
64. d1. Documentation
65. d2. Documentation
[Image: activity heat map - files over time]
66. d2. Documentation
67. Conclusions
- Software Evolution Analysis
  - an extremely rich field, still only at its beginnings
  - wealth of information sources
  - clear interest from both researchers and industry
- Important points
  - scalable, easy-to-use tools are absolutely essential
  - integrating multiple information types is hard
  - visual analytics is an excellent aid!
Thank you! a.c.telea_at_rug.nl