Design Document
Overview
This design addresses the bug in SecondaryCircuitInspectionResultAppService.FindDatas where daily deduplication of inspection results occurs after pagination instead of before. The solution involves refactoring the MongoDB aggregation pipeline to perform deduplication before pagination, ensuring that records with the same (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) are properly merged into a single record per day.
Architecture
The current architecture has a flawed query flow:
Current (Incorrect) Flow:
Query → Filter → Sort → Paginate (Skip/Limit) → Deduplicate → Return
New (Correct) Flow:
Query → Filter → Sort → Deduplicate (Group) → Count → Paginate (Skip/Limit) → Return
The key change is moving the deduplication step before pagination. This requires:
- Modifying the MongoDB aggregation pipeline to include a $group stage
- Calculating the total count after deduplication
- Applying pagination to the deduplicated result set
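The difference between the two flows can be illustrated with a small in-memory model. This is only a sketch: the helper names (`dedupe`, `oldFlow`, `newFlow`) and the sample records are hypothetical and stand in for the real MongoDB query.

```javascript
// Deduplication key from the design: (Year, Month, Day, ItemId, Status).
const dedupeKey = (r) =>
  JSON.stringify([r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status]);

// Keep the first record seen for each key; input is assumed to be sorted already.
function dedupe(records) {
  const firstByKey = new Map();
  for (const r of records) {
    const key = dedupeKey(r);
    if (!firstByKey.has(key)) firstByKey.set(key, r);
  }
  return [...firstByKey.values()];
}

// Current (incorrect) flow: paginate first, then deduplicate the page.
function oldFlow(records, skipCount, pageSize) {
  return dedupe(records.slice(skipCount, skipCount + pageSize));
}

// New flow: deduplicate, count, then paginate.
function newFlow(records, skipCount, pageSize) {
  const unique = dedupe(records);
  return { totalCount: unique.length, data: unique.slice(skipCount, skipCount + pageSize) };
}

// Hypothetical sample: three duplicates for day 1, then one record each for days 2-4.
const rec = (day) => ({
  Year: 2025, Month: 1, Day: day,
  SecondaryCircuitInspectionItemId: "item-A", Status: "Normal",
});
const sample = [rec(1), rec(1), rec(1), rec(2), rec(3), rec(4)];

console.log(oldFlow(sample, 0, 3).length);       // 1: the page shrinks after deduplication
const page = newFlow(sample, 0, 3);
console.log(page.totalCount, page.data.length);  // 4 3: a full page and an accurate count
```

With a page size of 3, the old flow returns a single record for page 1 because all three raw records on that page share one key, while the new flow fills the page and reports the deduplicated total.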
Components and Interfaces
Modified Methods
1. FindDatas Method
- Current Behavior: Calls QueryPagedResultsAcrossCollectionsAsync, then deduplicates in memory
- New Behavior: Calls a new method that performs deduplication in the aggregation pipeline
2. New Method: QueryPagedDeduplicatedResultsAcrossCollectionsAsync
private async Task<(List<SecondaryCircuitInspectionResult> Results, long TotalCount)>
    QueryPagedDeduplicatedResultsAcrossCollectionsAsync(
        List<string> collectionNames,
        FilterDefinition<SecondaryCircuitInspectionResult> filter,
        string sortField,
        bool isDescending,
        int skipCount,
        int pageSize,
        CancellationToken cancellationToken = default)
Purpose: Performs cross-collection query with deduplication before pagination
Returns: Tuple containing both the paginated results and the total count of deduplicated records
3. Modified Method: CountAcrossCollectionsAsync
- Current Behavior: Counts all records matching the filter
- New Behavior: Should be replaced by the count returned from the new query method
Data Models
No changes to data models are required. The existing SecondaryCircuitInspectionResult entity already contains all necessary fields:
- Year (int)
- Month (int)
- Day (int)
- SecondaryCircuitInspectionItemId (Guid)
- Status (string)
MongoDB Aggregation Pipeline Design
Pipeline Stages
The new aggregation pipeline will have the following stages:
- $match: Filter records based on search conditions
- $project: Project necessary fields for processing
- $unionWith: Combine data from multiple time-sharded collections
- $sort: Sort records before grouping (to ensure consistent "first" record selection)
- $group: Group by (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) and take first record
- $replaceRoot: Restore the preserved document as the top-level document
- $sort: Sort again after grouping, because $group does not preserve the input order
- $facet: Split into two pipelines:
  - totalCount: Count total deduplicated records
  - data: Apply skip/limit for pagination
- $project: Format the final output (not shown in the detailed structure below)
Detailed Pipeline Structure
[
  // Stage 1: Match filter conditions
  { $match: <filterDocument> },
  // Stage 2: Project fields
  { $project: <projectionDocument> },
  // Stage 3: Union with other collections (repeated for each collection)
  { $unionWith: {
      coll: "SecondaryCircuitInspectionResult_YYYY_MM",
      pipeline: [
        { $match: <filterDocument> },
        { $project: <projectionDocument> }
      ]
    }
  },
  // Stage 4: Sort before grouping
  { $sort: { <sortField>: <sortDirection> } },
  // Stage 5: Group by deduplication keys
  { $group: {
      _id: {
        Year: "$Year",
        Month: "$Month",
        Day: "$Day",
        SecondaryCircuitInspectionItemId: "$SecondaryCircuitInspectionItemId",
        Status: "$Status"
      },
      doc: { $first: "$$ROOT" }
    }
  },
  // Stage 6: Replace root with the preserved document
  { $replaceRoot: { newRoot: "$doc" } },
  // Stage 7: Sort again after grouping (to maintain user-requested sort order)
  { $sort: { <sortField>: <sortDirection> } },
  // Stage 8: Facet for count and pagination
  { $facet: {
      totalCount: [{ $count: "count" }],
      data: [
        { $skip: <skipCount> },
        { $limit: <pageSize> }
      ]
    }
  }
]
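The template above could be assembled programmatically along the following lines. This is a minimal sketch rather than the service's actual implementation: `buildDedupPipeline` and its parameter object are hypothetical, and it assumes the aggregation is run against the first entry of `collectionNames`, with one $unionWith stage for each remaining entry.

```javascript
// Builds the eight-stage pipeline as plain objects (no driver required).
// Assumption: aggregate() is called on collectionNames[0]; the rest are unioned in.
function buildDedupPipeline({
  collectionNames, filter, projection, sortField, isDescending, skipCount, pageSize,
}) {
  const otherCollections = collectionNames.slice(1);
  // Return a fresh object each time so the two $sort stages do not share a reference.
  const sortStage = () => ({ $sort: { [sortField]: isDescending ? -1 : 1 } });

  return [
    { $match: filter },
    { $project: projection },
    ...otherCollections.map((coll) => ({
      $unionWith: { coll, pipeline: [{ $match: filter }, { $project: projection }] },
    })),
    sortStage(), // sort before grouping so $first is well defined
    {
      $group: {
        _id: {
          Year: "$Year",
          Month: "$Month",
          Day: "$Day",
          SecondaryCircuitInspectionItemId: "$SecondaryCircuitInspectionItemId",
          Status: "$Status",
        },
        doc: { $first: "$$ROOT" },
      },
    },
    { $replaceRoot: { newRoot: "$doc" } },
    sortStage(), // $group does not preserve order, so sort again
    {
      $facet: {
        totalCount: [{ $count: "count" }],
        data: [{ $skip: skipCount }, { $limit: pageSize }],
      },
    },
  ];
}

// Hypothetical inputs for illustration only.
const pipeline = buildDedupPipeline({
  collectionNames: [
    "SecondaryCircuitInspectionResult_2025_01",
    "SecondaryCircuitInspectionResult_2025_02",
  ],
  filter: { Status: "Normal" },
  projection: { Year: 1, Month: 1, Day: 1, SecondaryCircuitInspectionItemId: 1, Status: 1, ExecutionTime: 1 },
  sortField: "ExecutionTime",
  isDescending: true,
  skipCount: 0,
  pageSize: 10,
});

console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
// $match -> $project -> $unionWith -> $sort -> $group -> $replaceRoot -> $sort -> $facet
```

Building the stages as plain objects keeps the stage order easy to unit test without a database connection.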
Implementation Details
Key Changes in FindDatas Method
// OLD CODE (Lines 704-730):
var pagedResults = await QueryPagedResultsAcrossCollectionsAsync(
    existingCollections, filter, sortField, isDescending, skipCount, pageSize, default);
var deduplicatedResults = pagedResults
    .GroupBy(x => new { ExecutionTime = x.ExecutionTime.Date, x.SecondaryCircuitInspectionItemId, x.Status })
    .Select(g => g.First())
    .ToList();
// NEW CODE:
var (deduplicatedResults, totalCount) = await QueryPagedDeduplicatedResultsAcrossCollectionsAsync(
    existingCollections, filter, sortField, isDescending, skipCount, pageSize, default);
Grouping Key Structure
The grouping key must include:
- Year: Integer year value
- Month: Integer month value (1-12)
- Day: Integer day value (1-31)
- SecondaryCircuitInspectionItemId: GUID of the inspection item
- Status: String status value
Handling $first Operator
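The $first operator in the $group stage preserves the first document in each group. Because the pipeline sorts immediately before grouping, $first returns the record that ranks first under the requested sort: when sorting by ExecutionTime, that is the earliest record in each group for ascending order and the latest for descending order. Records that tie on the sort field have no guaranteed relative order, so which of them is kept may vary between runs.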
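The effect of sort direction on the record that $first keeps can be modelled in memory. The sketch below is an illustration only: `firstPerGroup` and the two sample records are hypothetical, and ExecutionTime is represented as an ISO 8601 string so that plain string comparison orders it correctly.

```javascript
// Sorts a copy of the records, then keeps the first record of each group,
// mirroring a $sort stage followed by $group with $first.
function firstPerGroup(records, sortField, isDescending) {
  const direction = isDescending ? -1 : 1;
  const sorted = [...records].sort((a, b) => {
    if (a[sortField] < b[sortField]) return -direction;
    if (a[sortField] > b[sortField]) return direction;
    return 0; // ties keep their input order here; MongoDB gives no such guarantee
  });

  const groups = new Map();
  for (const r of sorted) {
    const key = JSON.stringify([
      r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status,
    ]);
    if (!groups.has(key)) groups.set(key, r);
  }
  return [...groups.values()];
}

// Hypothetical sample: two inspections of the same item on the same day.
const readings = [
  { Year: 2025, Month: 1, Day: 5, SecondaryCircuitInspectionItemId: "item-A",
    Status: "Normal", ExecutionTime: "2025-01-05T08:00:00Z" },
  { Year: 2025, Month: 1, Day: 5, SecondaryCircuitInspectionItemId: "item-A",
    Status: "Normal", ExecutionTime: "2025-01-05T17:30:00Z" },
];

console.log(firstPerGroup(readings, "ExecutionTime", true)[0].ExecutionTime);
// 2025-01-05T17:30:00Z (descending keeps the latest record)
console.log(firstPerGroup(readings, "ExecutionTime", false)[0].ExecutionTime);
// 2025-01-05T08:00:00Z (ascending keeps the earliest record)
```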
Error Handling
Potential Issues and Solutions
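- Empty Result Sets
  - Issue: The $facet stage always returns one document, even when no records match
  - Solution: Check whether the totalCount array is empty and treat that case as a count of 0
- Collection Not Found
  - Issue: Querying non-existent collections throws exceptions
  - Solution: Use FilterExistingCollectionsAsync before building the pipeline (already implemented)
- Memory Limits
  - Issue: Sorting and grouping the full filtered set before pagination could exceed memory
  - Solution: $sort and $group are blocking stages, each limited to 100 MB of RAM by default; run the aggregation with allowDiskUse enabled so they can spill to disk instead of failing (MongoDB 6.0 and later enable this by default)
- Index Performance
  - Issue: Grouping without proper indexes could be slow
  - Solution: Ensure a compound index exists on (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) in every collection. Note that $sort and $group run on the merged output of $unionWith, where MongoDB cannot use a collection index, so the index can only speed up the per-collection $match stages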
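The empty-result case from the list above can be handled when unpacking the $facet output. A minimal sketch, assuming the aggregation output has already been read into an array; `unpackFacetResult` is a hypothetical helper name.

```javascript
// $facet yields exactly one document shaped like { totalCount: [...], data: [...] }.
// When nothing matches, totalCount is an empty array rather than [{ count: 0 }].
function unpackFacetResult(aggregateOutput) {
  const facet = aggregateOutput[0] ?? { totalCount: [], data: [] };
  const totalCount = facet.totalCount.length > 0 ? facet.totalCount[0].count : 0;
  return { totalCount, data: facet.data };
}

// No matching records: $count emits nothing, so the array is empty.
console.log(unpackFacetResult([{ totalCount: [], data: [] }]));
// { totalCount: 0, data: [] }

// Matching records: the count sits in the first element of the array.
const result = unpackFacetResult([{ totalCount: [{ count: 25 }], data: [{ Day: 1 }, { Day: 2 }] }]);
console.log(result.totalCount, result.data.length);
// 25 2
```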
Testing Strategy
Unit Tests
- Test Deduplication Logic
  - Create multiple records with same (Year, Month, Day, ItemId, Status)
  - Verify only one record is returned per group
  - Verify the correct record is selected (first after sorting)
- Test Pagination After Deduplication
  - Create 25 unique day-item-status combinations
  - Query with PageSize=10
  - Verify Page 1 has 10 records, Page 2 has 10 records, Page 3 has 5 records
  - Verify TotalCount = 25
- Test Total Count Accuracy
  - Create 100 raw records that deduplicate to 30 unique combinations
  - Verify TotalCount returns 30, not 100
- Test Cross-Collection Deduplication
  - Insert records in multiple month-sharded collections
  - Query across collections
  - Verify deduplication works across collection boundaries
- Test Sort Order Preservation
  - Create records with different ExecutionTime values on the same day
  - Sort by ExecutionTime descending
  - Verify the latest record is selected for each day
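Fixtures for the count and pagination tests above could be generated as follows. This is a sketch under stated assumptions: `makeRecords`, `countUnique`, and `pageLengths` are hypothetical helpers, and the field values (year 2025, month 1, status "Normal") are arbitrary test data.

```javascript
const dedupeKey = (r) =>
  JSON.stringify([r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status]);

// Builds rawCount records that cycle through uniqueCount distinct keys.
// Requires rawCount >= uniqueCount so that every key appears at least once.
function makeRecords(uniqueCount, rawCount) {
  const records = [];
  for (let i = 0; i < rawCount; i++) {
    const combo = i % uniqueCount;
    records.push({
      Year: 2025,
      Month: 1,
      Day: 1 + (combo % 28),                                           // days 1..28
      SecondaryCircuitInspectionItemId: `item-${Math.floor(combo / 28)}`, // next item after 28 days
      Status: "Normal",
    });
  }
  return records;
}

// Number of distinct deduplication keys in a record set.
function countUnique(records) {
  return new Set(records.map(dedupeKey)).size;
}

// Page lengths a client should see for a given total and page size.
function pageLengths(totalCount, pageSize) {
  const lengths = [];
  for (let skip = 0; skip < totalCount; skip += pageSize) {
    lengths.push(Math.min(pageSize, totalCount - skip));
  }
  return lengths;
}

console.log(countUnique(makeRecords(30, 100)));                  // 30
console.log(pageLengths(countUnique(makeRecords(25, 25)), 10));  // [ 10, 10, 5 ]
```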
Edge Cases
- Empty Input: No records match filter → Return empty list with TotalCount=0
- Single Record Per Day: No deduplication needed → Return all records
- All Records Same Day: Multiple items/statuses on same day → Deduplicate correctly
- Null Status Values: Handle null status in grouping key
- Large Page Size: PageSize > TotalCount → Return all deduplicated records
Correctness Properties
A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.
Property 1: Deduplication Completeness
For any set of inspection results with matching (Year, Month, Day, SecondaryCircuitInspectionItemId, Status), the query result should contain at most one record per unique combination of these fields.
Validates: Requirements 1.2
Property 2: Pagination Consistency
For any query with pagination parameters (PageIndex, PageSize), the total number of records across all pages should equal the TotalCount returned in the first page response.
Validates: Requirements 2.2
Property 3: Sort Order Preservation
For any sort field and direction, the records within each deduplicated group should be selected according to the sort order (first record after sorting).
Validates: Requirements 1.3
Property 4: Count Accuracy
For any query filter, the TotalCount returned should equal the number of unique (Year, Month, Day, SecondaryCircuitInspectionItemId, Status) combinations in the filtered dataset.
Validates: Requirements 2.1
Property 5: Cross-Collection Consistency
For any time range spanning multiple collections, deduplication should produce the same result as if all data were in a single collection.
Validates: Requirements 4.2
Property 6: Idempotent Deduplication
For any dataset, applying the deduplication logic multiple times should produce the same result as applying it once.
Validates: Requirements 1.1
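Properties 1, 2, 4, and 6 lend themselves to a generated-data check. The sketch below runs them against an in-memory reference model rather than the real query; `query`, `makeGenerator`, `randomRecords`, and `checkProperties` are hypothetical names, and in a real test `query` would be replaced by a call to the service.

```javascript
const dedupeKey = (r) =>
  JSON.stringify([r.Year, r.Month, r.Day, r.SecondaryCircuitInspectionItemId, r.Status]);

function dedupe(records) {
  const firstByKey = new Map();
  for (const r of records) {
    const key = dedupeKey(r);
    if (!firstByKey.has(key)) firstByKey.set(key, r);
  }
  return [...firstByKey.values()];
}

// Reference model of the new flow: deduplicate, count, then paginate.
function query(records, skipCount, pageSize) {
  const unique = dedupe(records);
  return { totalCount: unique.length, data: unique.slice(skipCount, skipCount + pageSize) };
}

// Small deterministic generator (a linear congruential generator) for repeatable runs.
function makeGenerator(seed) {
  let state = seed >>> 0;
  return (n) => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state % n;
  };
}

function randomRecords(next, count) {
  return Array.from({ length: count }, () => ({
    Year: 2025,
    Month: 1 + next(2),
    Day: 1 + next(5),
    SecondaryCircuitInspectionItemId: `item-${next(3)}`,
    Status: next(2) === 0 ? "Normal" : null, // null exercises the null-status edge case
  }));
}

function checkProperties(records, pageSize) {
  const unique = dedupe(records);
  const keys = unique.map(dedupeKey);
  // Property 1: at most one record per key.
  const p1 = new Set(keys).size === keys.length;
  // Property 4: TotalCount equals the number of distinct keys in the input.
  const p4 = query(records, 0, pageSize).totalCount === new Set(records.map(dedupeKey)).size;
  // Property 2: page lengths add up to TotalCount.
  const total = query(records, 0, pageSize).totalCount;
  let seen = 0;
  for (let skip = 0; skip < total; skip += pageSize) {
    seen += query(records, skip, pageSize).data.length;
  }
  const p2 = seen === total;
  // Property 6: deduplicating twice changes nothing.
  const p6 = JSON.stringify(dedupe(unique)) === JSON.stringify(unique);
  return p1 && p2 && p4 && p6;
}

const next = makeGenerator(42);
let allHold = true;
for (let run = 0; run < 50; run++) {
  allHold = allHold && checkProperties(randomRecords(next, next(200)), 1 + next(20));
}
console.log(allHold); // true
```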
Performance Considerations
Index Requirements
Create a compound index to optimize the grouping operation:
db.SecondaryCircuitInspectionResult_YYYY_MM.createIndex({
  Year: 1,
  Month: 1,
  Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1
});
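This index is intended to support:
- Efficient grouping by (Year, Month, Day, ItemId, Status)
- Efficient sorting by ExecutionTime within groups
In the pipeline above, however, $sort and $group run on the merged output of $unionWith, where MongoDB cannot use a collection index. In practice the index can only speed up the per-collection $match stages, and only when the filter constrains its leading fields. Confirm the actual plan with explain() before relying on it.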
Memory Usage
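The $group stage accumulates documents in memory. However, since we're only keeping the first document per group (using $first), its memory usage is bounded by the number of unique groups, not the total number of documents. The preceding $sort is also a blocking stage and must buffer every filtered document, so it is the stage more likely to reach the memory limit on large result sets.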
Estimated Memory: ~1KB per unique day-item-status combination
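The per-group estimate can be turned into a rough sizing check. The workload figures below (31 days, 200 inspection items, 2 statuses) are hypothetical, `estimateGroupMemory` is an illustrative helper, and the limit constant assumes MongoDB's default 100 MB cap for a blocking stage.

```javascript
const BYTES_PER_GROUP = 1024;                // the ~1KB per-group estimate
const STAGE_LIMIT_BYTES = 100 * 1024 * 1024; // assumed default RAM cap for a blocking stage

// Worst case: every day-item-status combination occurs in the filtered range.
function estimateGroupMemory(days, items, statuses) {
  const groups = days * items * statuses;
  const bytes = groups * BYTES_PER_GROUP;
  return {
    groups,
    bytes,
    mebibytes: bytes / (1024 * 1024),
    fitsInLimit: bytes <= STAGE_LIMIT_BYTES,
  };
}

const estimate = estimateGroupMemory(31, 200, 2);
console.log(estimate.groups);               // 12400
console.log(estimate.mebibytes.toFixed(1)); // 12.1
console.log(estimate.fitsInLimit);          // true
```

At 1 KB per group, the assumed cap is reached at roughly 102,400 groups, which gives a sense of when disk spill becomes relevant for the grouping stage.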
Query Performance
Expected Performance:
- Small datasets (<10K records): <100ms
- Medium datasets (10K-100K records): 100-500ms
- Large datasets (>100K records): 500-2000ms
Performance will be monitored using the existing LogQueryPerformance method.
Backward Compatibility
API Compatibility
The FindDatas method signature remains unchanged:
public async Task<RequestPageResult<SecondaryCircuitInspectionResultDetailOutput>> FindDatas(
PageSearchCondition<SecondaryCircuitInspectionResultSearchConditionInput> searchCondition)
Response Format
The response format remains identical. The only difference is that duplicate records are now properly eliminated before pagination.
Client Impact
Clients may observe:
- Fewer total records: TotalCount will be lower (reflecting deduplicated count)
- Different records per page: Since deduplication happens first, page contents may differ
- Same API contract: No code changes required in clients
Migration Strategy
Deployment Steps
- Deploy Code: Update the SecondaryCircuitInspectionResultAppService with the new implementation
- Monitor Performance: Watch query performance metrics for any degradation
- Verify Results: Spot-check query results to ensure deduplication is working correctly
Rollback Plan
If issues arise:
- Revert to the previous version of SecondaryCircuitInspectionResultAppService
- The old in-memory deduplication logic will resume (with its bugs)
- No data migration needed (this is a query-only change)
Index Creation
Indexes should be created during a maintenance window:
// For each existing collection
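db.SecondaryCircuitInspectionResult_2025_01.createIndex({
  Year: 1, Month: 1, Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1
}, { background: true });
The background: true option keeps the build from blocking other operations on MongoDB versions before 4.2. From 4.2 onward the option is ignored, and every index build holds an exclusive lock only briefly at the start and end of the build.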
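Applying the index to every monthly collection could be scripted roughly as follows. This is a sketch, not a tested migration script: `monthlyCollectionNames` and `createDedupIndexes` are hypothetical helpers, the zero-padded `_YYYY_MM` suffix is inferred from the collection name shown above, and `db` is assumed to behave like the mongosh `db` object.

```javascript
const INDEX_KEYS = {
  Year: 1, Month: 1, Day: 1,
  SecondaryCircuitInspectionItemId: 1,
  Status: 1,
  ExecutionTime: -1,
};

// Lists collection names for every month from start to end, inclusive.
function monthlyCollectionNames(startYear, startMonth, endYear, endMonth) {
  const names = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    names.push(`SecondaryCircuitInspectionResult_${year}_${String(month).padStart(2, "0")}`);
    month += 1;
    if (month > 12) {
      month = 1;
      year += 1;
    }
  }
  return names;
}

// Creates the compound index on each named collection.
// Assumption: `db` exposes getCollection(name).createIndex(keys), as in mongosh.
function createDedupIndexes(db, names) {
  for (const name of names) {
    db.getCollection(name).createIndex(INDEX_KEYS);
  }
}

console.log(monthlyCollectionNames(2024, 11, 2025, 2));
```

Because createIndex creates a collection that does not yet exist, the generated names should be filtered down to existing collections before calling `createDedupIndexes`.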