Last updated on October 1st, 2024 at 10:34 am
Arquivo.pt query logs are unique resources for research
Arquivo.pt provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Notice that Arquivo.pt search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites.
Analyzing user behavior is an important research topic to understand users’ information needs and enhance the quality of search results. To support this analysis, when a user interacts with a search engine, the system records the user’s actions in a file called the query log. Query logs from web archives are unique resources for research because they describe the real needs of web-archive users regarding historical information published online over time.
Research case study
Flavie Gallois and Adam Jatowt from the University of Innsbruck, and Ricardo Campos from the University of Beira Interior and INESC TEC analyzed user search behavior based on the Arquivo.pt search query log dataset collected over a period of 3 months from June to September 2021 (Analyzing User Search Behaviour in Temporal Web Repositories through Search Query Log Analysis).
This study analyzed query features such as length, type or frequency and compared the obtained results with previous work about user search behavior over web-archives and live-web search engines.
This study revealed interesting trends and patterns about how users search for information within web archives, with strong potential for future research work.
How do web-archive users search?
85.7% of the queries came from users in Portugal. However, automatic language identification detected Portuguese in only 37% of the queries. This suggests that users often search web archives in languages other than their own.
Users of Arquivo.pt tend to use longer queries with more words and characters in comparison to previous studies, both over web archives and live-web search engines. About 92% of the queries had 5 or fewer terms (average of 25 characters), with 3 being the most common number of submitted terms. In previous work about search behavior in web archives, it was observed that users tended to submit from 1 to 3 terms per query, with 1 term as the most common submission.
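Query-length figures of this kind can be reproduced on any list of query strings. The following is a minimal sketch, not the method used in the study: it assumes that a term is a whitespace-separated token, and the function name `term_count_stats` is hypothetical.

```python
from collections import Counter


def term_count_stats(queries):
    """Summarise query length for an iterable of query strings."""
    # Ignore empty or whitespace-only queries.
    cleaned = [q.strip() for q in queries if q.strip()]
    if not cleaned:
        raise ValueError("no non-empty queries to analyse")
    # Assumption: a "term" is a whitespace-separated token.
    lengths = [len(q.split()) for q in cleaned]
    counts = Counter(lengths)
    return {
        # Most frequent number of terms per query.
        "most_common_terms": counts.most_common(1)[0][0],
        # Fraction of queries with 5 or fewer terms.
        "share_5_or_fewer": sum(n <= 5 for n in lengths) / len(lengths),
        # Average query length in characters.
        "avg_chars": sum(len(q) for q in cleaned) / len(cleaned),
    }
```

A different tokenisation (for example, treating quoted phrases as a single term) would change the counts, so results are only comparable across studies that split queries the same way.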
Users tend to issue multiple queries within a session instead of a single query, possibly indicating a need for refining their search queries or exploring multiple options for inquiry.
87.7% of the queries submitted to Arquivo.pt came from desktop browsers, despite Arquivo.pt providing mobile-friendly user interfaces. Old web-archived pages are not responsive and render poorly on mobile devices. Thus, it is to be expected that users mostly access web archives from their desktops.
Users refined the time span of the search (using the datepickers) in about 50% of queries, which indicates awareness of the temporal needs peculiar to web-archive usage. Interestingly, users modified the From datepicker more frequently than the To datepicker. Note that keeping the default time span may simply fit the user’s information needs and does not necessarily indicate a lack of awareness of the time-span function (peculiar to web-archive search).
Only a small percentage of users included specific years in their query terms (4%), potentially suggesting that in these cases the time span function was insufficient, or unnoticed by some users.
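One simple way to flag queries that mention a year is a regular expression over the query text. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the accepted range (1980 to 2029) and the names `YEAR_PATTERN`, `has_year` and `share_with_year` are hypothetical choices.

```python
import re

# Assumption: a "year" is a standalone 4-digit number from 1980 to 2029.
YEAR_PATTERN = re.compile(r"\b(?:19[89]\d|20[0-2]\d)\b")


def has_year(query):
    """Return True if the query contains a standalone year in the assumed range."""
    return YEAR_PATTERN.search(query) is not None


def share_with_year(queries):
    """Return the fraction of queries that mention a year."""
    queries = list(queries)
    if not queries:
        raise ValueError("no queries to analyse")
    return sum(has_year(q) for q in queries) / len(queries)
```

The word boundaries keep the pattern from matching digits embedded in longer tokens such as `expo98` or `19955`, at the cost of missing two-digit year mentions.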
The results suggest that users are more conscious of their information needs and have improved their search techniques to use web archives more effectively, instead of just exploring them out of curiosity as first-time visitors.
What is searched in a web-archive?
The authors of the study applied automatic named entity recognition over the user queries and derived a set of word clouds that graphically provide a glimpse of the most common information needs of Arquivo.pt users:
Access to research Arquivo.pt query dataset
Arquivo.pt released a set of resources to support research studies over its
Query log dataset
- Query_Dataset_Sample.csv: sheet containing a sample of the query dataset.
- Query_Dataset_ArquivoPT.7z (in UTF-8): this file contains the full query log dataset available for research, collected over a period of 3 months from June to September 2021. Be careful when opening it, because some readers such as Microsoft Excel may use the wrong charset and damage the content, for instance in column L “QUERY”.
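To avoid charset problems, the CSV files can be read programmatically with an explicit encoding. The following is a minimal sketch using Python's standard `csv` module. The column name `QUERY` comes from the description above; the comma delimiter and the function name `read_queries` are assumptions and may need adjusting to the actual file layout. The `.7z` archive must be extracted first, which this sketch does not handle.

```python
import csv


def read_queries(path, column="QUERY", delimiter=","):
    """Read one column from a UTF-8 CSV file and return its values as a list."""
    # "utf-8-sig" decodes UTF-8 and also skips a byte-order mark if present.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        # Assumption: the first row holds the column names.
        reader = csv.DictReader(handle, delimiter=delimiter)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {path}")
        return [row[column] for row in reader]
```

Stating the encoding explicitly, rather than relying on the platform default, keeps accented characters in the queries intact.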
Original log files (samples)
- Query_Log_Page_Search_Log4j_Sample.txt: raw sample of the page search query log (Log4j format) randomly selected.
- Query_Log_Image_Search_Log4j_Sample.txt: raw sample of the image search query log (Log4j format) randomly selected.
- Query_Log_Apache_HTTPD_Sample.txt: raw query log sample of the Apache HTTPd
Documentation
- Arquivo.pt Query Dataset for Research (cheat sheet)
- An Analysis on a Query Dataset from Arquivo.pt Search Engine (technical report)
Evaluation Metrics for web-archive search
The first step to understanding user behavior is to define evaluation metrics. Metrics are a powerful tool for setting long- and short-term goals and for deciding which new products and features should be released to users.
We share a work-in-progress report which aggregates information about Web Archive Search Evaluation Metrics. This contributes to comparing users’ search behavior between live-web and web-archive search engines. Feel free to comment directly on the collaborative document or to contact us.
This report also summarizes references to previous work, the query workflows, and the structure of the corresponding query logs produced by Arquivo.pt, to make it easier for researchers to study these datasets.
Know more
- Analyzing User Search Behaviour in Temporal Web Repositories through Search Query Log Analysis
- A Search Log Analysis of a Portuguese Web Search Engine
- Understanding the Information Needs of Web Archive Users
- Characterizing Search Behavior in Web Archives
- Information Search in Web Archives (PhD thesis, Miguel Costa)