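Background
I wanted to analyze my site's traffic from Nginx's access.log, but reading the raw log file is not very intuitive. So I decided to parse the log file with code and store the result in a database, which makes the analysis much easier.
Implementation
Following the article nginx日志解析:java正则解析 (Nginx log parsing: parsing with Java regular expressions), the individual fields of each log line can be extracted with a regular expression.
For example, Nginx on my server writes log entries in the following format: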
```
203.208.60.89 - - [04/Jan/2019:16:06:38 +0800] "GET /atom.xml HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```
The corresponding Java regular expression is:
```
(?<ip>\d+\.\d+\.\d+\.\d+)( - - \[)(?<datetime>[\s\S]+)(?<t1>\][\s"]+)(?<request>[A-Z]+) (?<url>[\S]*) (?<protocol>[\S]+)["] (?<code>\d+) (?<sendbytes>\d+) ["](?<refferer>[\S]*)["] ["](?<useragent>[\S\s]+)["]
```
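To see which part of the line each named group captures, the expression can be tried against the sample entry. The following is only a minimal sketch: the class name `NginxRegexDemo` and the helper `extract` are hypothetical names introduced for illustration, and the pattern is the one shown above, escaped for a Java string literal.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
final class NginxRegexDemo {

    // The regular expression from the article, split across lines for readability.
    private static final Pattern LINE = Pattern.compile(
            "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+)( - - \\[)(?<datetime>[\\s\\S]+)(?<t1>\\][\\s\"]+)"
            + "(?<request>[A-Z]+) (?<url>[\\S]*) (?<protocol>[\\S]+)[\"] "
            + "(?<code>\\d+) (?<sendbytes>\\d+) "
            + "[\"](?<refferer>[\\S]*)[\"] [\"](?<useragent>[\\S\\s]+)[\"]");

    private static final String[] GROUPS = {
        "ip", "datetime", "request", "url", "protocol",
        "code", "sendbytes", "refferer", "useragent"
    };

    // Returns group name -> captured text, or an empty map if the line does not match.
    static Map<String, String> extract(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line);
        if (m.find()) {
            for (String name : GROUPS) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "203.208.60.89 - - [04/Jan/2019:16:06:38 +0800] "
                + "\"GET /atom.xml HTTP/1.1\" 200 273932 \"-\" "
                + "\"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"";
        // Print one "name = value" line per named group.
        extract(sample).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Note that the expression expects the literal ` - - ` after the IP address, so entries that carry a remote user name in that position would not match and would need a looser pattern.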
The complete code is as follows:
The LogEntity class holds the parsed log information:
```java
import lombok.Data;

import javax.persistence.*;
import java.time.LocalDateTime;

/** One parsed line of the Nginx access log. */
@Data
@Table
@Entity(name = "log")
public class LogEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Integer id;

    /** Client IP address. */
    private String ip;

    /** Request time as written in the log (zone offset dropped). */
    private LocalDateTime time;

    /** HTTP method, e.g. GET. */
    private String request;

    private String url;

    /** Protocol, e.g. HTTP/1.1. */
    private String protocol;

    /** HTTP status code. */
    private Integer code;

    /** Number of bytes sent to the client. */
    private Integer sendByteSize;

    /** Value of the Referer header. */
    private String refferer;

    /** Value of the User-Agent header. */
    private String useAgent;

    /** Whether the user agent looks like a crawler. */
    private boolean isBot;

    /** Whether the URL points to a static resource. */
    private boolean isResource;

    /** Name of the project (site) the log belongs to. */
    private String project;
}
```
The NginxLogConverter class implements the actual parsing logic:
```java
import lombok.extern.slf4j.Slf4j;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

@Slf4j
public class NginxLogConverter {

    // Compiled once and reused, instead of recompiling for every line.
    private static final Pattern PATTERN = Pattern.compile(
            "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+)( - - \\[)(?<datetime>[\\s\\S]+)(?<t1>\\][\\s\"]+)"
            + "(?<request>[A-Z]+) (?<url>[\\S]*) (?<protocol>[\\S]+)[\"] "
            + "(?<code>\\d+) (?<sendbytes>\\d+) "
            + "[\"](?<refferer>[\\S]*)[\"] [\"](?<useragent>[\\S\\s]+)[\"]");

    private static final DateTimeFormatter TIME_FORMAT =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);

    /** Parses one log line; returns null if the line does not match the pattern. */
    public static LogEntity parse(String text, String project) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            log.error("Failed to parse log line: {}", text);
            return null;
        }
        // Named "entity" so that it does not shadow the "log" logger generated by @Slf4j.
        LogEntity entity = new LogEntity();
        entity.setIp(m.group("ip"));
        entity.setProject(project);
        entity.setTime(convertTime(m.group("datetime")));
        entity.setRequest(m.group("request"));
        entity.setUrl(m.group("url"));
        entity.setProtocol(m.group("protocol"));
        entity.setCode(Integer.valueOf(m.group("code")));
        entity.setSendByteSize(Integer.valueOf(m.group("sendbytes")));
        entity.setRefferer(m.group("refferer"));
        entity.setUseAgent(m.group("useragent"));
        entity.setBot(isBot(entity.getUseAgent()));
        entity.setResource(isResource(entity.getUrl()));
        return entity;
    }

    /** Converts e.g. "04/Jan/2019:16:06:38 +0800"; the zone offset after the space is dropped. */
    private static LocalDateTime convertTime(String s) {
        String t = s.substring(0, s.indexOf(" "));
        return LocalDateTime.parse(t, TIME_FORMAT);
    }

    /** Treats any user agent containing "bot" or "spider" as a crawler. */
    private static boolean isBot(String userAgent) {
        String t = userAgent.toLowerCase();
        return t.contains("bot") || t.contains("spider");
    }

    /** Substring check on the URL, so ".js" also matches e.g. ".json". */
    private static boolean isResource(String url) {
        String t = url.toLowerCase();
        return t.contains(".js") || t.contains(".css") || t.contains(".png")
                || t.contains(".ico") || t.contains(".gif") || t.contains(".txt")
                || t.contains(".woff") || t.contains(".eot") || t.contains(".jpg");
    }
}
```
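The convertTime method above cuts the timestamp at the first space, so the `+0800` zone offset is lost. If the offset matters (for example when logs from servers in different time zones are merged), one possible alternative is to parse the whole field into an `OffsetDateTime`. This is a sketch under that assumption, not what the original code does; the class name `NginxTimeDemo` and the method `parseWithOffset` are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo class, not part of the original code.
final class NginxTimeDemo {

    // "Z" matches an offset without a colon, such as +0800.
    private static final DateTimeFormatter NGINX_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Parses the full "datetime" group, keeping the zone offset.
    static OffsetDateTime parseWithOffset(String s) {
        return OffsetDateTime.parse(s, NGINX_TIME);
    }

    public static void main(String[] args) {
        OffsetDateTime t = parseWithOffset("04/Jan/2019:16:06:38 +0800");
        System.out.println(t);             // prints 2019-01-04T16:06:38+08:00
        System.out.println(t.toInstant()); // prints 2019-01-04T08:06:38Z
    }
}
```

Storing this value would require changing the type of the `time` field in LogEntity, so whether it is worth it depends on how the data is queried later.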
Usage
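```java
// Assumes this method is placed inside NginxLogConverter, so parse(...) can be called directly.
public static void main(String[] args) {
    Path path = Paths.get("/xx/xxx/access.log");
    try {
        List<String> logs = Files.readAllLines(path);
        List<LogEntity> list = logs.stream()
                .map(s -> parse(s, "${projectName}"))
                // parse returns null for lines that do not match the pattern
                .filter(entity -> entity != null)
                .collect(Collectors.toList());
        System.out.println(list.size());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
```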
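Once the lines can be parsed, simple aggregations become straightforward. As a rough illustration of the kind of analysis mentioned in the background, the sketch below counts requests per HTTP status code. To stay self-contained it does not use LogEntity; instead it relies on a simplified, hypothetical pattern that only picks out the status code, and the class name `StatusCodeSummary` and the method `countByStatus` are likewise made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original code.
final class StatusCodeSummary {

    // Simplified assumption: the status code is the 3-digit number that follows
    // the quoted request and precedes the byte count and the quoted referrer.
    private static final Pattern STATUS = Pattern.compile("\" (?<code>\\d{3}) \\d+ \"");

    // Counts matching lines per status code; lines that do not match are skipped.
    static Map<Integer, Long> countByStatus(List<String> lines) {
        return lines.stream()
                .map(STATUS::matcher)
                .filter(Matcher::find)
                .map(m -> Integer.valueOf(m.group("code")))
                .collect(Collectors.groupingBy(c -> c, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "203.208.60.89 - - [04/Jan/2019:16:06:38 +0800] \"GET /atom.xml HTTP/1.1\" 200 273932 \"-\" \"Googlebot\"",
                "203.208.60.89 - - [04/Jan/2019:16:06:39 +0800] \"GET /missing HTTP/1.1\" 404 162 \"-\" \"Googlebot\"",
                "203.208.60.89 - - [04/Jan/2019:16:06:40 +0800] \"GET / HTTP/1.1\" 200 5120 \"-\" \"Googlebot\"");
        System.out.println(countByStatus(lines)); // prints {200=2, 404=1}
    }
}
```

With the data stored in a database, the same question would more naturally be answered by a grouped query on the `code` column.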
References
- nginx日志解析:java正则解析 (Nginx log parsing: parsing with Java regular expressions)