This is an experimental project. Feel free to send feedback!

Thesis Tide

Thesis Tide ranks papers by their relevance to their fields, making it easier to find the most relevant ones. It uses AI to analyze the content of each paper and rank it.

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has en...

Useful Fields:

The article presents a novel methodology for deriving high-quality 3D reconstructions from internet video sources, filling a significant gap in the supervision of 3D motion estimation. The use of sophisticated techniques such as camera pose and stereo depth estimation indicates strong methodological rigor. Furthermore, the ability to generate large-scale, world-consistent 3D point clouds could substantially advance fields like robotics and computer vision, making it highly applicable across various practical scenarios.

Learning Camera Movement Control from Real-World Drone Videos

This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels. We s...

Useful Fields:

This article presents a novel approach to automating camera movement control using real-world drone footage, addressing significant challenges faced in traditional AI videography. It shows methodological rigor in the collection of high-quality trajectory data and the development of a specific architecture (DVGFormer) tailored for the task. The implications for enhancing video quality and ease of filming are substantial, providing a strong foundation for future innovations in both content consumption and content creation technologies.

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these...

Useful Fields:

This article presents significant advancements in text-to-image generation, particularly through its focus on mobile deployment and efficiency. The introduction of a model that maintains high-quality outputs, such as 1024x1024 px images, at a reduced model size and with faster generation is novel and directly addresses a gap in the applicability of T2I models to mobile platforms. The methodological rigor is evident in the systematic architectural modifications and the use of knowledge distillation and adversarial guidance, which could inspire further developments in efficient AI model training and deployment in varied environments.

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the...

Useful Fields:

The proposed EasyRef methodology demonstrates significant novelty in effectively conditioning diffusion models on multiple reference images while incorporating multimodal language understanding. The introduction of MRBench as a benchmark adds further value, facilitating consistent evaluation in future work. The strong experimental results indicate its practicality and efficiency, making it highly relevant for both academics and practitioners.

NormalFlow: Fast, Robust, and Accurate Contact-based Object 6DoF Pose Tracking with Vision-based Tactile Sensors

Tactile sensing is crucial for robots aiming to achieve human-level dexterity. Among tactile-dependent skills, tactile-based object tracking serves as the cornerstone for many tasks, including manipul...

Useful Fields:

NormalFlow demonstrates significant advancements in tactile sensing and object pose tracking with the introduction of a novel and robust algorithm that leverages vision-based tactile sensors. Its methodological rigor is evidenced by the thorough comparison with baseline algorithms and the demonstration of its effectiveness in real-world tracking scenarios, particularly with challenging low-texture objects. The availability of resources such as code and datasets further enhances its applicability and potential impact on future research.

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resoluti...

Useful Fields:

The study presents a novel approach (V2PE) that addresses a key limitation in current Vision-Language Models regarding long-context capability, which is critical given the increasing complexity of multimodal tasks. The methodological rigor, through empirical analysis and fine-tuning on augmented datasets, enhances its robustness. Additionally, the applicability of this work to real-world scenarios involving extensive video or text data elevates its impact, making it highly relevant for both theoretical advancement and practical deployment.

Blister Test to Measure the Out-of-Plane Shear Modulus of Few-Layer Graphene

We measure the out-of-plane shear modulus of few-layer graphene (FLG) by a blister test. During the test, we employed a monolayer molybdenum disulfide (MoS2) membrane stacked onto FLG wells to facilit...

Useful Fields:

The article presents a novel method (blister test) for measuring a mechanical property (shear modulus) of few-layer graphene, showcasing methodological rigor and applicability to other 2D materials. The findings can significantly impact the understanding of interlayer interactions, which is crucial for the development of advanced nanodevices. The research demonstrates potential implications for flexible electronics and nanoelectromechanical systems, indicating its relevance in both fundamental research and practical applications.

Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational dat...

Useful Fields:

The article presents a novel approach, incorporating knowledge graphs and self-correction mechanisms into text-to-image diffusion models, which addresses existing limitations in these models and represents a significant advancement in the field. The methodological rigor evidenced by both qualitative and quantitative results, along with the explicit targeting of complex cultural representation, enhances its relevance. Its interdisciplinary approach is likely to inspire future research in both AI and creative industries.

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the...

Useful Fields:

The article presents a novel approach to visual token compression that addresses a significant limitation in existing Vision-Language Models by unifying the processing of images and videos. This matters because it improves efficiency and performance across various tasks, indicating high potential for practical application. The methodological rigor shown through state-of-the-art results on multiple benchmarks further supports its relevance and importance in advancing research in the field.

FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual genera...

Useful Fields:

The introduction of FluxSpace represents a significant advancement in image editing techniques, particularly for its innovative approach to disentangled editing using rectified flow transformers. The paper addresses a key limitation in current technologies, providing a novel solution that combines methodological rigor with practical applicability. The potential for semantically interpretable representations enhances its impact, making it relevant for both practical applications and future research developments in the field of image synthesis and editing.

The S-matrix bootstrap with neural optimizers I: zero double discontinuity

In this work, we develop machine learning techniques to study nonperturbative scattering amplitudes. We focus on the two-to-two scattering amplitude of identical scalar particles, setting the double d...

Useful Fields:

This article presents an innovative integration of machine learning techniques with traditional theoretical frameworks in scattering amplitude analysis, demonstrating both novelty and methodological rigor. The use of neural networks for parameterization and the bootstrap framework adds a modern computational angle to a well-established area of theoretical physics. The reported perfect agreement between neural network analyses and standard bootstrap methods reinforces the validity and potential impact of this research, paving the way for enhanced computational techniques in nonperturbative quantum field theory.

Axionic quantum criticality of generalized Weyl semimetals

We formulate a field theoretic description for $d$-dimensional interacting nodal semimetals, featuring dispersion that scales with the linear ($n$th) power of momentum along $d_L&#...

Useful Fields:

The article presents a novel approach to studying quantum critical phenomena in Weyl semimetals using a robust field theoretic framework. It provides a comprehensive analysis of the interactions in nodal semimetals and applies renormalization group techniques that could lead to significant insights into axionic insulation and quantum phase transitions. The use of different dimensional parameters and the exploration of multicriticality enhance its potential impact. However, the complexity of the mathematical models may limit accessibility for broader applications outside niche areas of condensed matter physics.

Representing Long Volumetric Video with Temporal Gaussian Hierarchy

This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature g...

Useful Fields:

This article presents a novel approach to a significant problem in volumetric video rendering: efficiently handling long-duration videos with reduced memory usage while maintaining quality. The introduction of the Temporal Gaussian Hierarchy is both innovative and methodologically rigorous, likely addressing a key limitation in current techniques. Its empirical validation against existing methods further enhances its credibility and relevance.

Spectral Image Tokenizer

Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial location...

Useful Fields:

The article presents a novel image tokenization approach based on spectral analysis using discrete wavelet transforms, which is a significant advancement in the field of image generation. The proposed method improves upon traditional raster scan methods by enabling multi-scale image representation and offering benefits in terms of image reconstruction and upsampling. The claims of enhanced conditioning for autoregressive models through this new tokenization method demonstrate methodological rigor and potential for practical application, although empirical results could further solidify its claims.

Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences i...

Useful Fields:

The article introduces an innovative framework (Feat2GS) to evaluate visual foundation models' understanding of 3D, addressing key limitations in existing methodologies. It offers a novel approach to probing both geometry and texture awareness using unposed images, significantly advancing the understanding of VFMs in three-dimensional contexts. Its extensive experiments and clear potential for practical application make it highly impactful. The availability of code and data further enhances reproducibility and encourages uptake in the community.

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such...

Useful Fields:

AgentTrek addresses a significant gap in automated GUI agent development by minimizing the reliance on labor-intensive human annotations, thereby offering a scalable solution with the potential to enhance the training of these agents significantly. The article effectively combines innovative data synthesis techniques with robust evaluation measures, showcasing a novel approach to a prevailing problem.

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified...

Useful Fields:

The article presents a novel approach in the rapidly evolving field of multimodal large language models by introducing token folding and a vision-expert-based pretraining strategy. This is particularly relevant due to the existing challenges in training complexity and model architecture, making it highly impactful for future developments in the domain. The promise of releasing models and code enhances its practical applicability and encourages community engagement.

Do Multimodal Large Language Models See Like Humans?

Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddre...

Useful Fields:

The article presents a novel benchmark, HVSBench, specifically designed to investigate the alignment of Multimodal Large Language Models with human visual perception. This addresses a critical gap in the field, and the benchmark's comprehensive design, including a large and diverse dataset, enhances its robustness. The findings indicate significant areas for improvement in current MLLMs, which can drive future research and development, making this work highly impactful.

TimeRefine: Temporal Grounding with Time Refining Video LLM

Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-to...

Useful Fields:

The proposed TimeRefine method represents a significant advancement in the area of video temporal grounding by introducing a novel approach to refine timestamp predictions iteratively. Its methodological rigor is underscored by the experimental results showing marked improvements over existing benchmarks. Additionally, the plug-and-play nature of the method increases its applicability across various models in the field, enhancing its potential impact.

Owl-1: Omni World Model for Consistent Long Video Generation

Video generation models (VGMs) have received extensive attention recently and serve as promising candidates for general-purpose large vision models. While they can only generate short videos each time...

Useful Fields:

Owl-1 presents a novel framework for overcoming the limitations of traditional video generation models, specifically addressing the issue of inconsistency in long video generation. The methodological rigor is evident from the extensive experiments that provide comparative performance metrics against state-of-the-art (SOTA) techniques. The impact on future research is significant, as it opens avenues for improving video generation technology, especially in areas requiring long-term narrative coherence. However, the reliance on an existing VGM infrastructure may limit the novelty for some researchers who are already working in this domain.