Evaluation
The NTCIR-19 Lifelog task employs a comprehensive evaluation methodology to assess the performance of participating systems across all sub-tasks. Post-evaluation is the primary approach: system performance is assessed after all submissions have been received, allowing thorough analysis and comparison of results across participants.
General Evaluation Metrics
The evaluation will utilize multiple metrics to assess system performance comprehensively (a worked sketch of these metrics follows the list):
- Precision and Recall: Standard information retrieval metrics to measure the accuracy and completeness of retrieved results.
- Mean Average Precision (MAP): A single-value metric that summarizes the precision-recall curve, providing an overall measure of retrieval quality.
- Normalized Discounted Cumulative Gain (NDCG): A ranking-based metric that considers the position of relevant items in the result list.
- Success at K (S@K): Measures whether at least one relevant item appears in the top K results.
- Semantic Similarity Scores: For tasks involving semantic understanding, the semantic relevance of retrieved items to the query is assessed.
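To make these definitions concrete, the sketch below computes each metric for a single topic with binary relevance. This is illustrative only and is not the official scoring code: the image identifiers and relevance set are hypothetical, and official scores are produced by the trec-eval programme described below.

```python
# Minimal sketch of the metrics listed above, for one topic with binary
# relevance. Illustrative only; official scores come from trec-eval.
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall(ranked, relevant):
    """Fraction of all relevant items that were retrieved."""
    return sum(1 for d in ranked if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Precision@i averaged over the ranks i of the relevant hits;
    the mean of this value over all topics is MAP."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary nDCG@k: log2-discounted gain, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

def success_at_k(ranked, relevant, k):
    """1 if at least one relevant item appears in the top k, else 0."""
    return int(any(d in relevant for d in ranked[:k]))

ranked = ["img3", "img7", "img1", "img9", "img4"]  # hypothetical system ranking
relevant = {"img1", "img4", "img8"}                # hypothetical judged-relevant set

print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall(ranked, relevant))             # 0.667 (2 of 3 relevant retrieved)
print(average_precision(ranked, relevant))  # (1/3 + 2/5) / 3 ≈ 0.244
print(ndcg_at_k(ranked, relevant, 5))       # ≈ 0.416
print(success_at_k(ranked, relevant, 5))    # 1
```

MAP is obtained by averaging average_precision over all topics; graded (non-binary) relevance would change only the nDCG gain term.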
Lifelog Task Evaluation
LSAT Evaluation
For the Lifelog Semantic Access Task (LSAT), the evaluation follows the same methodology as the NTCIR-18 Lifelog-6 task:
Evaluation Tool: The trec-eval programme is employed to generate result scores for each run.
Relevance Judgments: Relevance judgments are binary and are created through a pooled approach: human assessors manually evaluate each submitted image for each topic, up to a maximum of 100 images per topic, per run, per participant. Judgments take account of the immediate context of each image where that context is important for understanding its semantic relevance.
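For orientation, the sketch below scores a run against pooled judgments using pytrec_eval, a Python interface to the same measures that trec-eval reports. The topic and image identifiers are hypothetical, and the exact run and qrels formats will be specified by the organisers; this illustrates the scoring step, not the official pipeline.

```python
# Sketch of scoring a run with pytrec_eval (a Python interface to
# trec_eval). Topic/image identifiers are hypothetical; the official
# pipeline runs the trec-eval programme over submitted run files.
import pytrec_eval

# Pooled binary judgments: topic -> {image_id: 0 (not relevant) or 1 (relevant)}
qrels = {
    "LSAT-01": {"img001": 1, "img002": 0, "img003": 1},
}

# One run: topic -> {image_id: retrieval score}; higher scores rank earlier
run = {
    "LSAT-01": {"img001": 0.92, "img004": 0.55, "img002": 0.31},
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg"})
print(evaluator.evaluate(run))  # per-topic map and ndcg scores
```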
Evaluation Process:
- All participant runs are collected by the submission deadline.
- Expert assessors prepare relevance judgments using the pooled approach.
- The trec-eval programme computes evaluation metrics automatically.
- Selected results undergo manual review to ensure quality.
- Statistical analysis is performed to compare systems.
- Results are distributed to participants for analysis.
Topic Types: There are two types of topics:
- ADHOC: Topics that may have many relevant moments in the collection.
- KNOWN-ITEM: Topics with only one (or a few) relevant moments in the collection.
CASTLE Task Evaluation
CSAT Evaluation
The CASTLE Semantic Access Task (CSAT) evaluation focuses on assessing the retrieval of key interactions or events from multimodal collaborative session data.
Evaluation Approach: Similar to LSAT, CSAT uses post-evaluation with pooled relevance judgments. Human assessors evaluate retrieved instances based on semantic relevance to the query.
Evaluation Metrics: CSAT evaluation uses standard IR metrics including Precision, Recall, MAP, NDCG, and Success at K, adapted for multimodal collaborative session data.
CAST-Seg Evaluation
The CASTLE Conversation Segmentation task evaluation assesses the quality of segmentation boundaries identified in collaborative sessions.
Evaluation Metrics: Segmentation quality is evaluated based on the following (a sketch of boundary-matching metrics follows this list):
- Boundary detection accuracy
- Semantic coherence of segmented units
- Alignment with ground truth segmentation boundaries
- Completeness of segmentation coverage
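Since the task description does not fix a formula for boundary detection accuracy, the following is a minimal sketch under two assumptions: boundaries are reported as time offsets (in seconds) into a session, and a predicted boundary within a small tolerance of an as-yet-unmatched ground-truth boundary counts as correct. Window-based measures such as Pk and WindowDiff are common alternatives that give partial credit to near-misses; the official CAST-Seg criteria may differ.

```python
# Sketch of boundary-level precision/recall/F1 for conversation
# segmentation. Assumes boundaries are time offsets in seconds and that
# a predicted boundary within `tol` seconds of an unmatched reference
# boundary is correct. The official CAST-Seg criteria may differ.

def boundary_prf(reference, predicted, tol=5.0):
    """Greedy one-to-one matching of predicted to reference boundaries."""
    unmatched = sorted(reference)
    matched = 0
    for b in sorted(predicted):
        # Closest reference boundary that has not been matched yet.
        best = min(unmatched, key=lambda r: abs(r - b), default=None)
        if best is not None and abs(best - b) <= tol:
            unmatched.remove(best)
            matched += 1
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

ref = [62.0, 180.5, 341.0]         # hypothetical ground-truth boundaries
hyp = [60.0, 178.0, 290.0, 339.5]  # hypothetical system boundaries

print(boundary_prf(ref, hyp))  # (0.75, 1.0, ~0.857)
```

The one-to-one matching prevents a single predicted boundary from being credited against several reference boundaries.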
Recipe Generation Evaluation
Recipe Generation submissions are evaluated based on:
- Relevance and accuracy of generated recipes
- Coherence and completeness of cooking procedures
- Appropriate use of multimodal lifelog data
- Creativity and innovation in recipe generation
Note: Detailed evaluation criteria for Recipe Generation will be provided in the topic release package.
Evaluation Criteria
Systems are evaluated based on several criteria:
- Accuracy: The correctness of retrieved items in relation to the query.
- Completeness: The ability to retrieve all relevant items for a given query.
- Efficiency: The computational efficiency and response time of the system.
- Robustness: The system's performance across different types of queries and data conditions.
- Novelty: The innovation and contribution of the approach to the field.
Evaluation Timeline
The evaluation timeline aligns with the task schedule:
- Formal Run Phase (March–August 2026): Participants submit their formal runs during this period.
- Evaluation Period (September 2026): Post-evaluation is conducted and the draft overview paper is prepared.
- Results Release (October 2026): Evaluation results are returned to participants.
- Camera-Ready Papers (November 2026): Participants submit final camera-ready papers.
- Conference (December 2026): NTCIR-19 Conference in Tokyo, Japan.
Fairness and Reproducibility
To ensure fairness and reproducibility:
- All systems are evaluated using the same ground truth data and evaluation scripts.
- Evaluation procedures are documented and made available to participants.
- Participants are encouraged to provide detailed descriptions of their methods for reproducibility.
- Any issues or discrepancies are addressed through a transparent review process.