MOSIP deployment - DNS problem

Hi,

We are trying to deploy MOSIP 1.2.0 from scratch and we are facing problems with DNS.
We have deployed the following coredns pods:

kube-system coredns-5b969fccd9-b7wfd 1/1 Running 0 22m 10.244.214.51 mzworker0.sb
kube-system coredns-5b969fccd9-js9bd 1/1 Running 0 22m 10.244.119.133 mzworker5.sb
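In case it is useful, the cluster DNS service can be cross-checked against these pods with something like the following (assuming the default kube-dns service name):

kubectl -n kube-system get svc kube-dns -o wide
kubectl -n kube-system get endpoints kube-dns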

When we install the kernel chart, we get the following logs from the kernel-keys-generator pod:

{"@timestamp":"2023-12-06T07:08:37.760Z","@version":"1","message":"Fetching config from server at : http://config-server.default:80/config","logger_name":"org.springframework.cloud.config.client.ConfigServicePropertySourceLocator","thread_name":"main","level":"INFO","level_value":20000,"appName":"keys-generator"}
{"@timestamp":"2023-12-06T07:08:57.858Z","@version":"1","message":"Connect Timeout Exception on Url - http://config-server.default:80/config. Will be trying the next url if available","logger_name":"org.springframework.cloud.config.client.ConfigServicePropertySourceLocator","thread_name":"main","level":"INFO","level_value":20000,"appName":"keys-generator"}
{"@timestamp":"2023-12-06T07:08:57.858Z","@version":"1","message":"Could not locate PropertySource: I/O error on GET request for \"http://config-server.default:80/config/kernel/default/release-1.2.0\": config-server.default; nested exception is java.net.UnknownHostException: config-server.default","logger_name":"org.springframework.cloud.config.client.ConfigServicePropertySourceLocator","thread_name":"main","level":"WARN","level_value":30000,"appName":"keys-generator"}

Please note that if we instruct the kernel chart to be deployed on a specific different worker, then the config-server host is resolved.
We have already rolled out a restart of coredns (kubectl -n kube-system rollout restart deploy coredns) without solving the problem.
We observed that pods have to be deployed on the same node to communicate with each other properly.
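A quick way to reproduce the lookup failure from a given worker is something like the following (the node name and the busybox image are only examples):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"mzworker8.sb"}}' \
  -- nslookup config-server.default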
The network interface in our VMs is not eth0. It is ens192.

ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.2.48 netmask 255.255.255.0 broadcast 192.168.2.255
inet6 fe80::a75a:d052:c1ba:2f39 prefixlen 64 scopeid 0x20
ether 00:50:56:85:7c:05 txqueuelen 1000 (Ethernet)
RX packets 118659979 bytes 106394374777 (99.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 62527887 bytes 40414754087 (37.6 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

We have configured network_interface to be "ens192" in k8s.yml.
The coredns.yml playbook was also changed, as it was using the hardcoded value "ifcfg-eth0".
We had the same problem before changing coredns.yml as well, i.e. the behaviour is identical with both the "ifcfg-eth0" and "ifcfg-ens192" values.
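We are also not sure whether calico-node itself picks up ens192. This is the kind of check/override we have in mind (IP_AUTODETECTION_METHOD is from Calico's documentation; the interface value is just our assumption):

kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 IP_AUTODETECTION_METHOD
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=ens192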
If you have any idea how to tackle this issue, please help us.
Many thanks.

Hi @Dimitris_Fotiadis

Thank you for reaching out. It appears the issue is related to DNS and communication between pods when they are deployed on different workers. Based on your observations, deploying the pods on the same node works around the problem, but for a detailed resolution @syed.salman from the DevOps team will guide you.

Best Regards,
Team MOSIP

@Dimitris_Fotiadis

This looks like a communication issue with the cluster network pods. Can you please restart the Calico pods under the kube-system namespace?
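For example (assuming the default calico-node daemonset name and label):

kubectl -n kube-system rollout restart daemonset calico-node
# or delete the pods and let the daemonset recreate them
kubectl -n kube-system delete pod -l k8s-app=calico-node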


Hi Syed, thanks for your quick response.

Restarting the Calico pods did not fix the problem.
After reinstalling the kernel chart, the kernel-keys-generator pod was deployed to mzworker8.sb. Its output logs are the following:

[root@mzmaster ~]# kubectl logs kernel-keys-generator-9xmmb -f
Download the client from http://mz.ingress:30080/
Zip File Path: artifactory/libs-release-local/hsm/client.zip

Subsequently, it fails to enter the Running state.

default artifactory-service-69f4644bff-5vntr 1/1 Running 0 2d9h 10.244.214.23 mzworker0.sb
default config-server-5bf77d664-xnb24 1/1 Running 0 2d 10.244.10.131 mzworker6.sb
default kernel-keys-generator-9xmmb 0/1 Error 0 16s 10.244.145.11 mzworker8.sb
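The exit code and events behind that Error state should be visible with, for example:

kubectl describe pod kernel-keys-generator-9xmmb
kubectl get events --field-selector involvedObject.name=kernel-keys-generator-9xmmb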

If we instruct the kernel-keys-generator pod to be deployed on worker0 (where artifactory is), the above error is gone, but then it cannot find the config-server pod (deployed on worker6) and we get the first error mentioned in our initial message again.

Only if all of the above pods are deployed on the same worker (worker0) is the problem solved. This is not a real solution though.
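To narrow down whether this is DNS only or general pod-to-pod connectivity, the config-server service can also be hit by its ClusterIP (bypassing DNS) from a pod on another worker, for example (the image, node name and <ClusterIP> are placeholders):

kubectl get svc config-server -o wide
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"mzworker8.sb"}}' \
  -- wget -qO- http://<ClusterIP>/config/kernel/default/release-1.2.0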

BR,
Dimitris

@Dimitris_Fotiadis

It looks like a coredns issue. Please try redeploying coredns from the console machine.

Remove coredns:

an playbooks/reset/reset-dockers.yml

Deploy coredns:

an playbooks/coredns.yml
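After the redeploy, DNS health can be confirmed with something like the following (the busybox image is just an example):

kubectl -n kube-system get pods -o wide | grep coredns
kubectl -n kube-system logs deploy/coredns --tail=20
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup config-server.default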

Thanks for the new hint; we did this, but the problem persists.
