DocTree Analyzer

Intelligent Document Tree Analysis & DMS Migration Platform

9 Projects • 5 ML Models • 8 Views • 8 Anomaly Detectors

Legality Software • .NET 10 • WPF • ML.NET

The Problem

Manual effort: 40-80 hours per firm
With DocTree: 15-30 minutes

The Solution

1. Input - TREE file or CSV
2. Analyze - Classify + Extract + Detect
3. Review - Human-in-the-loop refinement
4. Migrate - Plan → Validate → Execute
5. Export - Excel / Word / CSV / PS1

Input - Parse DOS TREE /F output or flat CSV path lists into structured FolderNode trees
Analyze - Classify roles (8 types) • Extract entities (companies, people, matters) • Detect 8 anomaly types • Discover structural patterns
Review - Active learning questions • User confirms/corrects roles • Validate/dismiss anomalies • Models retrain in real time
Migrate - Map to NetDocuments (Cabinet → Workspace → Folder) • Generate actions (Rename, Merge, Skip, Flatten) • Validate constraints
Export - Excel (5 sheets) • Word (formatted report) • CSV • JSON migration plan • PowerShell scripts with -WhatIf
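
The Input step can be illustrated with a minimal sketch, written in Python for brevity rather than the product's C#. It folds a flat list of backslash-separated paths into a nested tree. The FolderNode class here is a hypothetical stand-in with assumed fields (name, depth, children), not the product's actual type, and TREE /F parsing is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class FolderNode:
    """Hypothetical stand-in for the product's FolderNode type."""
    name: str
    depth: int = 0
    children: dict[str, "FolderNode"] = field(default_factory=dict)


def build_tree(paths: list[str], root_name: str = "ROOT") -> FolderNode:
    """Fold backslash-separated folder paths into one nested tree."""
    root = FolderNode(root_name)
    for path in paths:
        node = root
        # Skip empty segments caused by leading or doubled separators.
        for part in filter(None, path.strip().split("\\")):
            if part not in node.children:
                node.children[part] = FolderNode(part, node.depth + 1)
            node = node.children[part]
    return root


def count_nodes(node: FolderNode) -> int:
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


tree = build_tree([
    r"Acme Ltd\Purchase 2021\Correspondence",
    r"Acme Ltd\Purchase 2021\Pleadings",
    r"Smith, John\Estate",
])
```

Shared prefixes such as "Acme Ltd\Purchase 2021" are created once, so repeated rows in a path list do not duplicate folders.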

Architecture Overview

WPF Desktop App
CLI Console App
Core (Domain + Services)
ML.NET (5 Models)
FileStorage (SQLite + EF Core)
Excel / Word / CSV Exporters
JsonAdapter

Pattern | Implementation
MVVM | CommunityToolkit.Mvvm - ObservableProperty, RelayCommand
Dependency Injection | Microsoft.Extensions.DependencyInjection (all projects)
Repository | IAnalysisRepository → SQLite via EF Core
Strategy | IAnalysisPipelineStrategy (Sequential / Parallel / Progressive)
Factory | DynamicDbContextFactory (per-project DB switching)

Solution Structure (9 Projects)

Project | Type | Purpose
DocTreeAnalyzer.Core | Class Library | Domain models, interfaces, services, business logic
DocTreeAnalyzer.ML | Class Library | 5 ML.NET classifiers (role, entity, severity, duplicates, dismiss)
DocTreeAnalyzer.Wpf | WPF App | Full desktop GUI - 8 views, MVVM architecture
DocTreeAnalyzer | Console App | CLI for headless analysis & scripting
DocTreeAnalyzer.FileStorage | Class Library | SQLite persistence (EF Core), multi-project DBs
DocTreeAnalyzer.Excel | Class Library | Excel export via ClosedXML (5 sheets)
DocTreeAnalyzer.Word | Class Library | Word export via OpenXml (formatted report)
DocTreeAnalyzer.JsonAdapter | Class Library | JSON serialization adapter
DocTreeAnalyzer.Tests | MSTest | 130+ unit tests

Analysis Pipeline

Two-Phase Progressive Architecture

Phase 1 - Fast Rule-Based (instant)

Parse Tree → Split Combined Client+Matter → Rule-Based Classification → Entity Extraction → Pattern Discovery → UI Update

Phase 2 - ML Enrichment (background, live updates)

Bootstrap Training Data → Train 5 ML Models → ML Predictions → Anomaly Detection → Question Generation → Live UI Refresh

Performance Mode - Single-pass parallel pipeline for large trees (10,000+ nodes)
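
A minimal sketch of the two-phase idea, in Python for brevity: a fast rule pass returns results at once, then a background worker refines them and reports each change through a callback. The function names, the trivial rule, and the fixed "Client" placeholder are all assumptions for illustration and do not reflect the product's pipeline strategies.

```python
import threading


def phase1_rules(names):
    """Fast pass: a trivial rule assigns a provisional role per folder name."""
    return {n: ("TemporalGroup" if n.isdigit() else "Unknown") for n in names}


def phase2_enrich(results, on_update):
    """Slow pass: stands in for ML predictions refining Phase 1 output."""
    for name, role in list(results.items()):
        if role == "Unknown":
            results[name] = "Client"        # placeholder for a model prediction
            on_update(name, results[name])  # e.g. trigger a live UI refresh


updates = []
results = phase1_rules(["2021", "Acme Ltd"])   # available immediately
snapshot_after_phase1 = dict(results)

worker = threading.Thread(
    target=phase2_enrich,
    args=(results, lambda name, role: updates.append((name, role))),
    daemon=True,
)
worker.start()            # Phase 2 runs while the user reviews Phase 1 results
worker.join(timeout=5)    # the demo waits here; a GUI would keep running
```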

Folder Role Classification

8 Folder Roles

Root - Top-level tree root
Client - Client/company folder
Matter - Legal matter or engagement
DocumentCategory - Standard doc category (Correspondence, Pleadings...)
SubCategory - Further subdivision
AdHoc - Personal/temporary folders
TemporalGroup - Year/date-based groupings
Unknown - Unclassifiable

Classification Approach

  • Rule-based first - depth heuristics, keyword matching, legal term lists
  • ML second - SDCA multiclass on text features + numeric signals
  • Confidence scoring - High/Medium/Low based on prediction probability
  • Pattern inheritance - nodes inherit roles from detected structural patterns
  • Active learning - user corrections retrain the model
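
The rule-based pass and the confidence tiers might look roughly like the sketch below (Python for brevity). The keyword lists, the rule order, and the 0.8 / 0.5 probability cut-offs are assumptions chosen for illustration; the deck does not state the actual values.

```python
LEGAL_CATEGORIES = {"correspondence", "pleadings", "contracts"}   # sample list
CORP_IDENTIFIERS = ("ltd", "inc", "corp", "llp")                  # sample list


def classify_by_rules(name: str, depth: int) -> str:
    """Return a provisional role from depth and keyword heuristics."""
    lowered = name.lower().rstrip(".")
    words = lowered.split()
    if depth == 0:
        return "Root"
    if lowered in LEGAL_CATEGORIES:
        return "DocumentCategory"
    if name.isdigit() and len(name) == 4:
        return "TemporalGroup"          # looks like a year
    if depth == 1 and words and words[-1] in CORP_IDENTIFIERS:
        return "Client"
    return "Unknown"                    # left for the ML pass or a question


def confidence_tier(probability: float) -> str:
    """Bucket a prediction probability; the cut-offs are assumed."""
    if probability >= 0.8:
        return "High"
    if probability >= 0.5:
        return "Medium"
    return "Low"
```

Folders that fall through to Unknown are the natural candidates for the ML pass and for active learning questions.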

ML Features

  • Folder name (text featurized)
  • Depth, child count, sibling count
  • Contains person name / corp identifier
  • Matches legal category / temporal name
  • Parent folder name

Machine Learning (5 Models)

Model | Algorithm | Task | Training
FolderClassifier | SDCA Maximum Entropy | Predict folder role (8 classes) | Bootstrap from rule-based + active learning feedback
EntityTypeClassifier | SDCA Maximum Entropy | Classify entity type (Corp, Person, Partnership, Trust, Govt) | Synthetic data + regex-identified examples
SemanticDuplicateClassifier | SDCA Logistic Regression | Detect equivalent folder names | Known synonym groups + negative pairs
AnomalySeverityPredictor | SDCA Regression | Predict anomaly severity (0.0 - 1.0) | User validation feedback across sessions
AnomalyDismissClassifier | SDCA Logistic Regression | Predict if user will dismiss anomaly | Historical dismiss patterns

Key ML Patterns

Bootstrap Training

Models self-train from rule-based heuristics + synthetic data. No pre-labeled dataset required.

Model Bundle

All 5 models packaged into a .dtamodel ZIP archive for portability across machines.

Feedback Loop

User validations retrain severity/dismiss models. Accuracy improves with each session.
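
The Model Bundle idea can be sketched as below (Python for brevity): several serialized model files are written into one ZIP archive and listed back. The entry names and the byte payloads are placeholders; the real .dtamodel layout is not described in this deck.

```python
import io
import zipfile


def pack_bundle(models: dict[str, bytes]) -> bytes:
    """Write each serialized model as one entry of an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        for file_name, payload in sorted(models.items()):
            bundle.writestr(file_name, payload)
    return buffer.getvalue()


def list_bundle(data: bytes) -> list[str]:
    """Return the entry names stored in a bundle."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        return bundle.namelist()


# Placeholder payloads; real entries would be trained model files.
bundle_bytes = pack_bundle({
    "FolderClassifier.bin": b"model-bytes-1",
    "AnomalyDismissClassifier.bin": b"model-bytes-2",
})
```

Writing `bundle_bytes` to a file with the .dtamodel extension would give a single artifact that can be copied between machines.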

Anomaly Detection (8 Detectors)

Anomaly Type | Detection Method
SpellingError | Levenshtein distance vs. legal terms dictionary
CaseInconsistency | Same name, different casing across tree
SemanticDuplicate | ML binary classifier + synonym groups
RedundantNesting | Parent/child with near-identical names
AdHocPersonalFolder | Short first names at unexpected depths
RoleMismatch | Predicted role conflicts with tree position
CombinedClientMatter | Configurable delimiters detect merged names
PathTooLong | Full path exceeds Windows 260-char MAX_PATH limit
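
As an illustration of the SpellingError row, here is a minimal Levenshtein check against a term list (Python for brevity). The three-term dictionary and the maximum distance of 2 are assumptions; the product's dictionary and threshold are not given here.

```python
LEGAL_TERMS = ["Correspondence", "Pleadings", "Agreements"]   # sample dictionary


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def spelling_suggestion(name: str, max_distance: int = 2):
    """Return the closest legal term if the name is a near miss, else None."""
    lowered = name.lower()
    best = min(LEGAL_TERMS, key=lambda term: levenshtein(lowered, term.lower()))
    distance = levenshtein(lowered, best.lower())
    # Distance 0 means the name is already correct, so nothing is flagged.
    return best if 0 < distance <= max_distance else None
```

A folder named "Correspondance" is one substitution away from "Correspondence", so it would be flagged with that term as the suggested fix.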

User Workflow

  • Review - browse anomalies grouped by type
  • Validate - confirm anomaly is real
  • Dismiss - mark as false positive
  • Skip & Roll Up - skip folder, move children to parent
  • Auto-dismiss - ML predicts user preferences

Severity Scoring

  • Initial: rule-based heuristic (0.0 - 1.0)
  • Refined: ML regression from user feedback
  • Adapts per-user across sessions

Entity Extraction

Extracted Information

Field | Example
Client Name | Acme Corporation
Operating-As Name | o/a FastTrack Services
Person Name | Smith, John
Matter Description | Real Estate Purchase
Corporate Number | BC0123456
Entity Type | Corporation / Person / Partnership / Trust / Govt

Detection Methods

Rule-Based (FallbackEntityExtractor)

  • Regex patterns for "LastName, FirstName"
  • Corporate identifiers (Ltd, Inc, Corp, LLP...)
  • Operating-as patterns (o/a, dba, t/a)
  • Corporate number formats (BC/AB/ON + digits)
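
The rule-based patterns above could be approximated with regular expressions like these (Python for brevity). This is not the FallbackEntityExtractor itself: the exact expressions, the 6-9 digit range for corporate numbers, and the inclusion of "Corporation" as an identifier are assumptions.

```python
import re

PERSON_RE = re.compile(r"^(?P<last>[A-Z][\w'-]+),\s*(?P<first>[A-Z][\w'-]+)$")
CORP_RE = re.compile(r"\b(?:Ltd|Inc|Corp|Corporation|LLP)\b", re.IGNORECASE)
OPERATING_AS_RE = re.compile(r"\b(?:o/a|dba|t/a)\s+(?P<alias>.+)$", re.IGNORECASE)
CORP_NUMBER_RE = re.compile(r"\b(?:BC|AB|ON)\d{6,9}\b")   # digit count assumed


def extract_entity(folder_name: str) -> dict:
    """Pull name, operating-as alias, corporate number and a coarse type."""
    result = {"type": "Unknown", "name": folder_name.strip()}

    alias = OPERATING_AS_RE.search(result["name"])
    if alias:
        result["operating_as"] = alias.group("alias").strip()
        result["name"] = result["name"][:alias.start()].strip()

    number = CORP_NUMBER_RE.search(result["name"])
    if number:
        result["corporate_number"] = number.group(0)

    if PERSON_RE.match(result["name"]):
        result["type"] = "Person"
    elif CORP_RE.search(result["name"]):
        result["type"] = "Corporation"
    return result
```

For example, `extract_entity("Acme Corporation o/a FastTrack Services")` separates the legal name from the operating-as name before the type check runs.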

ML Classification

  • SDCA multiclass on text features
  • Classifies into 5 entity types
  • Confidence scoring per entity

Active Learning System

Identify Uncertainty → Generate Questions → Prioritize by Impact Score → User Answers → Retrain Models → Propagate Changes

Question Types

  • Ambiguous depth-1 folders (Client vs. Category?)
  • Uncertain document categories
  • Temporal folder strategy (keep as temporal or flatten?)
  • Unknown role resolution

Impact Scoring

  • Number of descendants affected
  • Confidence gap (how uncertain)
  • Position in tree hierarchy
  • Anomaly count on subtree

Answers persist across sessions • Export/Import feedback as JSON
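
One way the four impact factors could combine into a single ranking score is sketched below (Python for brevity). The formula and its weights are invented for illustration only; the deck lists the inputs but not how they are weighted. "Confidence gap" is taken here to mean one minus the top class probability, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    depth: int
    descendants: int
    top_probability: float   # highest class probability from the classifier
    anomaly_count: int


def impact_score(c: Candidate) -> float:
    """Higher score = answering this question resolves more of the tree."""
    uncertainty = 1.0 - c.top_probability    # confidence gap (assumed meaning)
    position = 1.0 / (1 + c.depth)           # shallower folders weigh more
    return c.descendants * uncertainty * position + 0.5 * c.anomaly_count


def prioritize(candidates: list[Candidate]) -> list[Candidate]:
    """Sort questions so the highest-impact one is asked first."""
    return sorted(candidates, key=impact_score, reverse=True)


queue = prioritize([
    Candidate(r"Docs\Archive\2019\Misc", 3, 10, 0.9, 0),
    Candidate(r"Docs\Acme", 1, 47, 0.5, 2),
])
```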

DMS Migration (NetDocuments)

Mapping Strategy

// Source tree hierarchy
Root → Client → Matter → Category → Documents

// NetDocuments target
Cabinet → Workspace → Folders

// Mapping
Client → Workspace.ClientName
Matter → Workspace.MatterName
Category → Folder structure

Migration Actions

  • Rename - normalize names, fix spelling
  • Merge - combine semantic duplicates
  • Skip - exclude ad-hoc folders
  • Flatten - remove redundant nesting
  • Create - generate missing structure
  • Move - relocate misplaced folders

Validation Rules

  • Name length limits (NetDocuments max)
  • Reserved character detection
  • Max depth constraints
  • 500 subfolder limit (with UI warning panel)
  • Duplicate workspace name detection
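
The validation rules might be checked along these lines (Python for brevity). Only the 500-subfolder limit comes from the list above; the name length, depth limit, and reserved character set are placeholders and must not be read as NetDocuments' real constraints.

```python
from collections import Counter

MAX_SUBFOLDERS = 500                  # limit stated in the validation rules
MAX_NAME_LENGTH = 200                 # placeholder, not the real NetDocuments value
MAX_DEPTH = 10                        # placeholder
RESERVED_CHARS = set('\\/:*?"<>|')    # placeholder set


def validate_folder(name: str, depth: int, subfolder_count: int) -> list[str]:
    """Return the names of all rules this folder violates."""
    issues = []
    if len(name) > MAX_NAME_LENGTH:
        issues.append("NameTooLong")
    if RESERVED_CHARS & set(name):
        issues.append("ReservedCharacter")
    if depth > MAX_DEPTH:
        issues.append("TooDeep")
    if subfolder_count > MAX_SUBFOLDERS:
        issues.append("TooManySubfolders")
    return issues


def duplicate_workspaces(names: list[str]) -> list[str]:
    """Report workspace names that collide when compared case-insensitively."""
    counts = Counter(name.casefold() for name in names)
    return sorted(key for key, count in counts.items() if count > 1)
```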

Export Formats

  • JSON - Full plan with workspaces & actions
  • CSV - Source/target mapping rows
  • PowerShell - Executable script with -WhatIf safe mode

# Generated PowerShell
Rename-Item "$src\Corr" "Correspondence" -WhatIf
Move-Item "$src\Misc\*" "$dst" -WhatIf
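
Script generation from a list of migration actions could be sketched as follows (Python for brevity). The action dictionary shape with "kind", "source", "new_name", and "destination" keys is hypothetical; Rename-Item, Move-Item, -LiteralPath, -NewName, -Destination, and -WhatIf are standard PowerShell.

```python
def ps_quote(value: str) -> str:
    """Single-quote a PowerShell string, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_action(action: dict, what_if: bool = True) -> str:
    """Turn one migration action into a PowerShell command line."""
    suffix = " -WhatIf" if what_if else ""
    source = ps_quote(action["source"])
    if action["kind"] == "Rename":
        return f"Rename-Item -LiteralPath {source} -NewName {ps_quote(action['new_name'])}{suffix}"
    if action["kind"] == "Move":
        return f"Move-Item -LiteralPath {source} -Destination {ps_quote(action['destination'])}{suffix}"
    raise ValueError(f"unsupported action kind: {action['kind']}")


script_lines = [
    render_action({"kind": "Rename", "source": r"C:\Docs\Corr",
                   "new_name": "Correspondence"}),
    render_action({"kind": "Move", "source": r"C:\Docs\O'Neil",
                   "destination": r"D:\Staging"}),
]
```

Keeping -WhatIf on by default means a generated script only reports what it would do until the flag is removed deliberately.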

WPF Desktop Application - Views

Dashboard

Summary statistics, confidence breakdown pie chart, role distribution with counts, entity extraction summary, quick navigation to other views.

Tree Explorer

Full interactive tree browser with 15+ features - filtering, search, multi-select, anomaly inline popup, preview mode, skip/roll-up, and more. (Detailed under "Tree Explorer - Features Deep Dive".)

Anomalies

All detected anomalies grouped by type with severity. Validate, dismiss, or skip & roll-up. Batch "Validate All" per folder. Navigate directly to tree node. Synced bidirectionally with Tree Explorer.

Patterns

Discovered structural patterns (e.g., Client → Matter → Category). Shows prevalence percentage, match count, example paths. Click to highlight pattern nodes in tree.

Active Learning

Prioritized questions sorted by impact score. Multiple-choice answers. Feedback statistics (answered/remaining). Export/import answers as JSON for sharing across team members.

DMS Migration

Generate NetDocuments migration plan. Browse workspaces with actions. Validation issues panel (name length, reserved chars, depth, 500-subfolder limit). Export to JSON, CSV, or PowerShell.

Settings & Projects

Edit keyword lists (legal categories, corp identifiers, semantic groups, delimiters). Multi-project support with isolated SQLite databases. Create, switch, delete projects.

History

Browse saved analysis sessions. Load previous analysis with full tree restoration (self-healing orphan repair). Delete old sessions. Re-analyze from saved state.

Tree Explorer - Features Deep Dive

Filtering & Search

  • Confidence filter - All / High / Medium / Low
  • Skipped filter - All / Active / Skipped
  • Show Files toggle (when tree has file entries)
  • Show Empty toggle (grey text for empty folders)
  • Anomalies Only filter
  • AI Anomalies Only filter
  • Text search with clear button
  • Live node count (visible / total)

Navigation

  • Root level offset - skip wrapper folders
  • Path breadcrumb - clickable root levels
  • Navigate to path - from other views
  • Open in Explorer - jump to real folder on disk
  • Source root mapping - map tree to local filesystem

Tree Interaction

  • Expand / Collapse All
  • Preview Mode - configurable depth & child limit with "Click for more..." placeholders
  • Double-click expand/collapse branch toggle
  • Multi-select - Shift+click range, Ctrl+click toggle
  • Skip & Roll Up - right-click to skip folders
  • Skip Range - Shift+right-click for batch skip
  • Confirm AI role - accept suggested classification
  • Accept All Suggestions - batch confirm

Visual Indicators

  • Role color dot - per folder role
  • Confidence badge (High/Med/Low)
  • AI badge - unconfirmed role indicator
  • Anomaly count badge - clickable
  • SKIP label + strikethrough + opacity
  • SPLIT badge - combined name was split
  • Grey text - empty folders (no files in subtree)

Warning Panels

  • 500 subfolder warning - collapsible banner for NetDocuments limit violations
  • Click to navigate - jump to offending folder
  • Unconfirmed roles banner - count of AI suggestions pending review
  • Dismissable - X button to hide warnings

Details Panel (Right)

  • Selected node info - name, path, depth, role, confidence, child count
  • Anomaly list - per-node anomalies with type, description, suggestion
  • Actions - Open in Explorer, confirm role

Anomaly Popup

  • Right-click / badge click opens popup
  • Per-anomaly actions - Validate, Skip & Roll Up, Dismiss
  • Validate All button per folder
  • Resolution badge - shows resolved state

Dashboard - Features

Summary Cards

Total Folders: 3,438
Max Depth: 7
Anomalies →: 17
Patterns →: 4

Anomalies & Patterns cards are clickable — navigate to their tabs

Classification Confidence

High: 83% (2,841)
Med: 12% (412)
Low: 5% (185)

Click any bar → Tree Explorer filtered to that confidence tier

Role Distribution

Client: 412
Matter: 356
DocCategory: 1,240
SubCategory: 890
AdHoc: 89

Horizontal bar chart — each role color-coded with count

Entity Browser (split pane)

  • Search/filter entities by name
  • Grouped by type: People, Corporations, Partnerships, Trusts, Government
  • Expandable groups with count badges
  • Entity detail pane — shows display name, operating-as, type, folder path
  • Folder tree preview — hierarchical view of entity's children
  • "View in Tree" button — navigates to Tree Explorer

Anomalies View - Features

3-Panel Layout

  • Left — Anomaly type tabs with colored dots & count badges
  • Center — Filtered & grouped anomaly list
  • Right — Folder contents detail pane
  • Resizable splitters between panels

Anomaly Types (9)

  • SemanticDuplicate
  • DuplicateFolder
  • CaseInconsistency
  • SpellingError
  • RedundantNesting
  • AdHocPersonalFolder
  • UnusualDepth
  • RoleMismatch
  • CombinedClientMatter

Filtering

  • Type tab filter — click type to focus
  • "Hide validated" checkbox
  • Text search across description, path, suggestion
  • ML auto-dismiss banner — shows count of ML-hidden anomalies with toggle to reveal

Grouping Modes

  • By Suggestion — group by fix action
  • By Folder — group by parent folder
  • By Severity — High / Medium / Low
  • Smart defaults per anomaly type
  • Expand All / Collapse All controls

Anomaly Cards

  • Folder name + severity badge (colored)
  • Expandable: description + path + suggestion
  • Validated items at 45% opacity
  • Resolution badge after action

Per-Anomaly Actions

  • Validate — confirm anomaly is real
  • Skip & Roll Up — skip folder, move children to parent
  • Dismiss — mark as false positive
  • Validate Split — confirm combined name was correctly split

Bulk Actions

  • Validate All per group
  • Consolidate — merge semantic duplicate pairs
  • Dismiss Group — bulk dismiss

Cross-View Integration

  • "View in Tree" button — navigates to folder in Tree Explorer
  • Bidirectional sync — validate from tree popup updates this view
  • Folder detail pane — shows children tree for selected anomaly
  • Auto-select when tree node is clicked

Active Learning - Features

Question List (Left Panel)

  • Grouped by client/path — collapsible sections with count badges
  • Numbered badges — sequential priority (1, 2, 3...)
  • Question preview — question text + folder path
  • AI suggestion hint — "Suggests: Client" shown inline
  • Answered indicator — green checkmark
  • "Hide Answered" toggle to focus on remaining

Toolbar Actions

  • Apply Answers — submit answers & retrain ML classifier
  • Export Feedback — save answers as JSON
  • Import Feedback — load previously saved answers
  • Accept All Suggestions — bulk accept with confidence threshold

Stats & Feedback

  • Progress counter — "12 of 24 questions answered"
  • Post-apply feedback — "[N] classifications improved, low confidence: 185 → 42"
  • Busy overlay during retraining

Question Detail (Right Panel)

  • Question card — full question text
  • Context card — explains why this folder is ambiguous
  • Folder path — full path display
  • Impact score — "Resolves ~47 folders"

AI Suggestion Panel

  • Suggested answer (bold, highlighted)
  • Rationale — "ML confidence 78%, depth-1 folder, 12 children, parent is Root"
  • "Accept" button — one-click accept suggestion

Answer Selection

  • Role buttons — Client, Matter, DocumentCategory, SubCategory, AdHoc, TemporalGroup, Unknown
  • Current answer display (green) with "Clear" option

Accept All Dialog

  • Confidence threshold slider (0-100%)
  • Preview count — "47 suggestions will be accepted"
  • Confirm / Cancel buttons

Patterns View - Features

Pattern Cards (Left Panel)

  • Pattern name + match count + prevalence badge
  • Description text
  • Role sequence flow - colored pills with arrows (e.g., Client → Matter → Category)
  • Example paths showing real matches
  • Tag/Untag chip for export selection
  • Visual states: selected (blue border), tagged (green border)

Filtering & Search

  • Text search — filters by name, description, role values
  • Filter modes: All / Top N / Min Prevalence %
  • Numeric threshold input for TopN / MinPercent

Tag & Export

  • Tag All / Untag All buttons
  • Tagged count badge (green)
  • Export Tagged to CSV — saves Categories_Export_[date].csv
  • Columns: Pattern, Description, Roles, MatchCount, Prevalence, ExamplePaths

Detail Panel (Right)

  • Pattern name (large, accent-colored)
  • Description (full text, wrapped)
  • Stats card — match count + prevalence %

Editable Role Sequence

  • ComboBox per level — dropdown with all 8 roles
  • Change a role → reclassifies all matching folders
  • Auto-merge detection — if edited sequence matches another pattern, they merge automatically (counts sum, examples union)

This is the primary tool for correcting ML classifications at scale
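
The auto-merge rule (counts sum, examples union) can be sketched like this (Python for brevity). The Pattern class and its fields are hypothetical stand-ins for whatever the product stores per pattern.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:
    roles: tuple[str, ...]
    match_count: int
    examples: set[str] = field(default_factory=set)


def merge_patterns(patterns: list[Pattern]) -> list[Pattern]:
    """Collapse patterns that share a role sequence into one entry."""
    merged: dict[tuple[str, ...], Pattern] = {}
    for pattern in patterns:
        existing = merged.get(pattern.roles)
        if existing is None:
            merged[pattern.roles] = Pattern(
                pattern.roles, pattern.match_count, set(pattern.examples))
        else:
            existing.match_count += pattern.match_count   # counts sum
            existing.examples |= pattern.examples         # examples union
    return list(merged.values())


a = Pattern(("Client", "Matter", "DocumentCategory"), 120, {r"Acme\Purchase\Corr"})
b = Pattern(("Client", "Matter", "SubCategory"), 30, {r"Smith\Estate\Wills"})
b.roles = a.roles            # the user edits the last level in the ComboBox
merged = merge_patterns([a, b])
```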

Example Paths (clickable)

  • Underlined links — click to navigate to Tree Explorer
  • Shows real folder paths matching this pattern
  • Helps verify pattern correctness

Cross-View Integration

  • Tree → Patterns: selecting a patterned node highlights its pattern
  • Patterns → Tree: clicking example path navigates to branch
  • Bidirectional sync with Tree Explorer

CLI Application

Usage

# Analyze a tree file
> doctreeanalyzer "Docs Folder Tree.txt"

# Auto-discovers file in current directory
> doctreeanalyzer

Output Sections

  • Parse statistics (nodes, depth, files)
  • Rule-based classification results
  • ML training accuracy metrics
  • Extracted entities
  • Anomalies grouped by type
  • Discovered patterns
  • Active learning questions
  • Sample classifications with confidence

Sample Output

== PARSE RESULTS ==
Total folders: 3,438
Max depth: 7
File entries: 0

== CLASSIFICATION ==
High confidence: 2,841 (82.6%)
Medium confidence: 412 (12.0%)
Low confidence: 185 (5.4%)

== ENTITIES (24 found) ==
[Corp] Acme Ltd (0.94)
[Person] Smith, John (0.88)
[Trust] Family Trust (0.91)

== ANOMALIES (17) ==
SpellingError (3): Correspondance...
SemanticDuplicate (5): Emails/Email
RedundantNesting (4): Docs/Documents

Data Storage & Persistence

SQLite + EF Core

  • Per-project databases - isolated via DynamicDbContextFactory
  • WAL mode - concurrent reads during analysis
  • Memory-mapped I/O - 256MB mmap for performance
  • Self-healing - FullPath fallback for broken tree FKs
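
The WAL and memory-mapped I/O settings correspond to standard SQLite pragmas, shown here through Python's sqlite3 module for brevity (the product applies them through EF Core). The helper name, the table columns, and the sample rows are illustrative assumptions; only the FolderNodes table name and the ParentEntityId key come from this deck.

```python
import os
import sqlite3
import tempfile

MMAP_BYTES = 256 * 1024 * 1024   # 256 MB, as stated above


def open_project_db(db_path: str):
    """Open a per-project database with WAL journaling and mmap enabled."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(f"PRAGMA mmap_size={MMAP_BYTES}")
    return conn, mode


with tempfile.TemporaryDirectory() as project_dir:
    conn, journal_mode = open_project_db(os.path.join(project_dir, "smith.db"))
    # Self-referencing tree table; column names other than ParentEntityId are assumed.
    conn.execute(
        "CREATE TABLE FolderNodes ("
        "Id INTEGER PRIMARY KEY, Name TEXT NOT NULL, "
        "ParentEntityId INTEGER REFERENCES FolderNodes(Id))"
    )
    conn.execute("INSERT INTO FolderNodes VALUES (1, 'Acme Ltd', NULL)")
    conn.execute("INSERT INTO FolderNodes VALUES (2, 'Purchase 2021', 1)")
    conn.commit()
    child_count = conn.execute(
        "SELECT COUNT(*) FROM FolderNodes WHERE ParentEntityId = 1"
    ).fetchone()[0]
    conn.close()
```

One database file per project keeps each firm's data isolated, and WAL lets readers continue while an analysis writes.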

Schema (8 Tables)

Sessions - Analysis metadata & summary stats
FolderNodes - Self-referencing tree (ParentEntityId FK)
Anomalies - Detected issues with validation state
Patterns - Structural patterns + JSON arrays
Entities - Extracted client/person/corp entities
Questions - Active learning questions
Answers - User responses to questions
MigrationRows - DMS source-to-target mappings

Performance Optimizations

  • Bulk insert - AutoDetectChanges disabled during Add
  • Split queries - AsSplitQuery() for large includes
  • Raw SQL deletes - bypass change tracker
  • No-tracking reads - AsNoTracking() for load operations

Multi-Project Architecture

// Each project gets its own DB
var project = new ProjectInfo
{
    Name = "Smith Law Firm",
    DbPath = "projects/smith.db"
};

// Factory switches connection at runtime
DynamicDbContextFactory.SetConnectionString(project.DbPath);

Export Capabilities

Excel (.xlsx)

ClosedXML

  • Summary sheet
  • Classifications sheet
  • Anomalies sheet
  • Patterns sheet
  • Entities sheet

Word (.docx)

DocumentFormat.OpenXml

  • Executive summary
  • Role distribution table
  • Anomalies by type
  • Pattern descriptions
  • Entity catalog

CSV

Pure C#

  • classifications.csv
  • anomalies.csv
  • patterns.csv
  • entities.csv
  • dms_mapping.csv

DMS Migration Exports

  • JSON - Complete plan with workspaces, actions, validations
  • CSV - Flat source/target mapping for import tools
  • PowerShell - Executable script with -WhatIf safety

Other Exports

  • Clipboard - Plain text summary for email/chat
  • Feedback JSON - Active learning answers for sharing
  • .dtamodel - ML model bundle for distribution

Key Differentiators

Self-Training ML

No pre-labeled dataset needed. Models bootstrap from rule-based heuristics and improve with every user interaction.

Progressive Analysis

Phase 1 delivers instant results. Phase 2 enriches with ML in the background while user starts reviewing.

Legal Domain Intelligence

Built-in knowledge of legal document hierarchies, common folder patterns, and DMS migration constraints.

Configurable

Keyword lists, identifiers, categories, and delimiters are fully customizable per-firm via Settings view.

Multi-Project Isolation

Each client/firm gets their own SQLite database. Switch between projects instantly. ML models persist per-project.

Actionable Output

Not just analysis - generates executable PowerShell scripts, migration plans, and formatted reports ready for client delivery.

Technology Stack

Runtime

  • .NET 10 (Preview)
  • C# 13
  • WPF (Windows)

ML

  • ML.NET 5.0
  • SDCA algorithms
  • Text featurization

Data

  • EF Core 10
  • SQLite (WAL mode)
  • Dynamic DB factory

UI Framework

  • CommunityToolkit.Mvvm
  • MVVM pattern
  • Observable properties

Export

  • ClosedXML (Excel)
  • OpenXml (Word)
  • System.Text.Json

Testing

  • MSTest v3
  • 130+ unit tests
  • coverlet (coverage)

Dependency Injection throughout • Interface-driven design • Zero external services required • Fully offline

From folder chaos to structured migration in minutes.